Part 2: Enabling additional instrumentation

This topic explains how to enable additional instrumentation on your gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:

  • Hive queries in Hadoop that are pushed to Unravel Server by the Hive Hook sensor, a JAR file.

  • Spark job performance metrics that are pushed to Unravel Server by the Spark sensor, a JAR file.

  • Impala queries that are pulled from Cloudera Manager.

  • Tez DAG information that is pushed to Unravel Server by the Tez sensor, a JAR file.

Sensor JARs are packaged in a parcel on Unravel Server.

1. Distribute the Unravel parcel
  1. In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.

  2. Click Configuration to see the Parcel Settings pop-up.

  3. In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section, and click the + glyph to add a new entry.

  4. In a new browser tab, copy the exact directory name for your CDH version from the http://unravel-host:3000/parcels/ directory.

    For example, the exact directory name might be cdh7.0 or cdh7.1.

  5. Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash).

    Where:

    cdh-version is your version of CDH. For example, cdh7.0 or cdh7.1.

    unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this would be the host where the unravel_lr daemon is running.

    Note

    If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.

    Tip

    If you're running more than one version of CDP (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.

  6. Click Save.

  7. Click Check for New Parcels.

  8. On the Parcels page, pick a target cluster in the Location box.

  9. In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.

  10. Click Distribute.

  11. If you have an old parcel from Unravel, deactivate it now.

  12. On the new parcel, click Activate.
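
As a quick check before step 5 above, the repository URL can be assembled and probed from a shell. This is a minimal sketch, not part of the Unravel tooling: the function name parcel_repo_url and the host name in the example are hypothetical, while port 3000 and the /parcels/ path come from the steps above.

```shell
#!/usr/bin/env bash
# Build the Remote Parcel Repository URL that is entered in Cloudera Manager.
# Both arguments are placeholders that you supply.
parcel_repo_url() {
  local unravel_host="$1" cdh_version="$2"
  # The trailing slash is required by the step above.
  printf 'http://%s:3000/parcels/%s/\n' "$unravel_host" "$cdh_version"
}

# Example (hypothetical host name):
#   parcel_repo_url unravel01.example.com cdh7.1
#   curl -sI "$(parcel_repo_url unravel01.example.com cdh7.1)"   # shows the HTTP status line
```

If the curl probe does not return a success status, recheck the directory name copied in step 4 before adding the URL.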

2. Put the Hive Hook JAR in AUX_CLASSPATH
  1. In Cloudera Manager, select the target cluster from the drop-down, click Hive on Tez > Configuration, and search for Service Environment.

  2. In Hive on Tez Service Environment Advanced Configuration Snippet (Safety Valve) enter the following exactly as shown, with no substitutions:

    AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
  3. Ensure that the user running the Hive server has read and execute access to the Unravel Hive Hook JAR.
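
One way to check the access described in step 3 is a small readability test run as the Hive service user. This is a sketch under assumptions: the function name jar_is_readable is hypothetical, and the service user name hive in the example may differ on your cluster.

```shell
#!/usr/bin/env bash
# Report whether a JAR is readable by the user running this function.
jar_is_readable() {
  local jar="$1"
  if [ -r "$jar" ]; then
    echo "readable: $jar"
  else
    echo "NOT readable: $jar" >&2
    return 1
  fi
}

# Example, run as the Hive service user (the user name "hive" is an assumption):
#   sudo -u hive test -r /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar && echo ok
```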

3. For Oozie, copy the Hive Hook and BTrace JARs to the HDFS shared library path

Copy the Hive Hook JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar, and the BTrace JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar, to the shared library path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
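
The copy can be done with the standard hdfs dfs -put command. The sketch below only prints the commands so that they can be reviewed first; the function name oozie_copy_commands is hypothetical, and the library path in the example is an assumed value, so substitute the actual value of oozie.libpath for your cluster.

```shell
#!/usr/bin/env bash
# Print (not run) the HDFS copy commands for the two sensor JARs.
# libpath is the value of oozie.libpath on your cluster.
oozie_copy_commands() {
  local libpath="$1"
  local jar_dir=/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java
  local jar
  for jar in unravel_hive_hook.jar btrace-agent.jar; do
    # -f overwrites an older copy of the JAR if one already exists
    printf 'hdfs dfs -put -f %s/%s %s/\n' "$jar_dir" "$jar" "$libpath"
  done
}

# Example (the path is an assumption, not a documented default):
#   oozie_copy_commands /user/oozie/share/lib
```

Run the printed commands as a user with write access to the target HDFS directory.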

4. Deploy the BTrace JAR for Tez service

In Cloudera Manager, go to Tez > Configuration and search for the following properties:

  • tez.am.launch.cmd-opts

  • tez.task.launch.cmd-opts

Append the following to the tez.am.launch.cmd-opts and tez.task.launch.cmd-opts properties, replacing <unravel_host> with the hostname of Unravel Server:

-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr,config=tez -Dunravel.server.hostport=<unravel_host>:4043
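
Because the agent options are appended to whatever value each property already holds, it can help to compose the final string before pasting it into Cloudera Manager. The following is a minimal sketch; the function name append_unravel_tez_opts and the existing option in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Append the Unravel agent options to an existing cmd-opts value.
# unravel_host is a placeholder; port 4043 comes from the line above.
append_unravel_tez_opts() {
  local existing="$1" unravel_host="$2"
  local agent="-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr,config=tez"
  local hostport="-Dunravel.server.hostport=${unravel_host}:4043"
  case "$existing" in
    # Agent already present: return the value unchanged so re-running is harmless.
    *btrace-agent.jar*) printf '%s\n' "$existing" ;;
    # Property currently empty: the agent options become the whole value.
    "") printf '%s %s\n' "$agent" "$hostport" ;;
    *) printf '%s %s %s\n' "$existing" "$agent" "$hostport" ;;
  esac
}

# Example (the existing option is hypothetical):
#   append_unravel_tez_opts "-XX:+PrintGCDetails" unravel01.example.com
```

Apply the same composed suffix to both tez.am.launch.cmd-opts and tez.task.launch.cmd-opts.
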
5. Set Hive Hook configuration
  1. In Cloudera Manager, click the Hive on Tez service and then click the Configuration tab.

  2. Search for hive-site.xml, which will lead to the Hive Client Advanced Configuration Snippet for hive-site.xml section.

  3. Specify the Hive Hook configuration, using either the XML text field or the Editor.

    • XML text field:

      Click View as XML to open the XML text field and copy-paste the following:

      <property>
        <name>com.unraveldata.host</name>
        <value> <UNRAVEL HOST NAME> </value>
        <description>Unravel hive-hook processing host</description>
      </property>
      <property>
        <name>com.unraveldata.hive.hook.tcp</name>
        <value>true</value>
      </property>
      <property>
        <name>com.unraveldata.hive.hdfs.dir</name>
        <value>/user/unravel/<HOOK_RESULT_DIR></value>
        <description>destination for hive-hook, Unravel log processing</description>
      </property>
      <property>
        <name>hive.exec.driver.run.hooks</name>
        <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
        <description>for Unravel, from unraveldata.com</description>
      </property>
      <property>
        <name>hive.exec.pre.hooks</name>
        <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
        <description>for Unravel, from unraveldata.com</description>
      </property>
      <property>
        <name>hive.exec.post.hooks</name>
        <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
        <description>for Unravel, from unraveldata.com</description>
      </property>
      <property>
        <name>hive.exec.failure.hooks</name>
        <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
        <description>for Unravel, from unraveldata.com</description>
      </property>

      Replace the following placeholders with the appropriate values:

      • UNRAVEL HOST NAME

      • HOOK_RESULT_DIR

      Note

      You can also copy these details from the hive-hook/hive-site.xml.snip file, which is located in the <Unravel installation directory>/services/snippet directory.

      In a multi-cluster deployment, the hive-hook/hive-site.xml.snip file is located on the edge node.

      To indicate that the property value cannot be overridden, specify <final>true</final>.

    • Editor:

      Click + and enter the configuration name, value, and description (optional). Select the Final check box to indicate that the value of the configuration cannot be overridden.

      For example:

      Name

      hive.exec.pre.hooks

      Value

      com.unraveldata.dataflow.hive.hook.UnravelHiveHook

      Description

      From unraveldata.com

  4. Add the same Hive Hook configuration to HiveServer2 Advanced Configuration Snippet for hive-site.xml.

  5. Optionally, add a comment in Reason for change and then click Save Changes.

  6. From the Cloudera Manager page, deploy the Hive client configuration and restart the Hive services using the Actions drop-down.

  7. Check Unravel UI to confirm that Hive queries run and appear there.

    • If queries run without errors and appear in Unravel UI, you have successfully added the Hive Hook configuration.

    • If queries are failing with a class not found error or permission problems:

      • Undo the hive-site.xml changes in Cloudera Manager.

      • Deploy the Hive client configuration.

      • Restart the Hive service.

      • Follow the steps in Troubleshooting.
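
After the client configuration is deployed, one rough way to confirm that the hook settings reached a node is to count the occurrences of the hook class in the deployed hive-site.xml. This is only a sketch: the function name count_unravel_hooks is hypothetical, and the file path in the example is an assumption about where the client configuration lands. With the four hook properties listed above, the expected count is 4.

```shell
#!/usr/bin/env bash
# Count how many times the Unravel hook class appears in a hive-site.xml.
count_unravel_hooks() {
  local file="$1"
  # grep exits non-zero when nothing matches; "|| true" keeps the pipeline
  # usable in scripts that run with "set -e -o pipefail".
  { grep -oF 'com.unraveldata.dataflow.hive.hook.UnravelHiveHook' "$file" || true; } \
    | wc -l | tr -d '[:space:]'
}

# Example (the path is an assumption):
#   count_unravel_hooks /etc/hive/conf/hive-site.xml
```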

6. Set Kafka configuration
  1. In Cloudera Manager, select the target cluster, click Kafka service > Configuration, and search for broker_java_opts.

  2. In Additional Broker Java Options enter the following:

    -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+DisableExplicitGC -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.local.only=true -Djava.rmi.server.useLocalHostname=true -Dcom.sun.management.jmxremote.rmi.port=9393
  3. Click Save Changes.

7. Deploy the Spark JAR
  1. In Cloudera Manager, select the target cluster, then select the Spark service.

  2. Select Configuration.

  3. Search for spark-defaults.

  4. In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:

    spark.unravel.server.hostport=unravel-host:4043 
    spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
    spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
    spark.eventLog.enabled=true

    On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.

    For spark-version, use a Spark version that is compatible with this version of Unravel. For example, spark-2.4 for Spark 2.4.x.

  5. Save changes.

  6. Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. Once the configuration is deployed, spark-shell creates new JVM containers with the necessary extraJavaOptions for the Spark drivers and executors.

  7. Enable Spark streaming.

  8. Check Unravel UI to confirm that Spark jobs run and appear there.

    • If jobs run without errors and appear in Unravel UI, you're done.

    • If jobs are failing with a class not found error or permission problems:

      • Undo the spark-defaults.conf changes in Cloudera Manager.

      • Deploy the client configuration.

      • Investigate and fix the issue.

      • Follow the steps in Troubleshooting.

Note

If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through the SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling).
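
For per-application profiling in client mode, the driver options have to be passed on the command line, for which spark-submit provides the --driver-java-options flag. The sketch below composes the driver-side agent string from the values used in step 4. The function names are hypothetical, the label rule assumes the major.minor pattern of the "spark-2.4 for Spark 2.4.x" example, and the spark-submit line in the comments is an illustration, not Unravel's documented per-application procedure.

```shell
#!/usr/bin/env bash
# Derive the libs label from a Spark version, keeping only major.minor
# (assumed from the "spark-2.4 for Spark 2.4.x" example).
spark_libs_label() {
  local version="$1"
  local major="${version%%.*}"   # text before the first dot
  local rest="${version#*.}"     # text after the first dot
  local minor="${rest%%.*}"      # text before the next dot, if any
  printf 'spark-%s.%s\n' "$major" "$minor"
}

# Print the driver-side agent option for one application.
unravel_driver_java_options() {
  local spark_version="$1"
  # "--" stops printf from treating the leading "-javaagent" as an option.
  printf -- '-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=%s\n' \
    "$(spark_libs_label "$spark_version")"
}

# Illustrative use (application JAR and host name are placeholders):
#   spark-submit --master yarn --deploy-mode client \
#     --driver-java-options "$(unravel_driver_java_options 2.4.7)" \
#     --conf spark.unravel.server.hostport=unravel-host:4043 \
#     your-app.jar
```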

8. Retrieve Impala data from Cloudera Manager

Configure Unravel Server to retrieve Impala query data from Cloudera Manager as follows:

  1. Add com.unraveldata.data.source=cm in /usr/local/unravel/etc/unravel.properties on Unravel Server.

  2. Give Unravel Server the connection details for Cloudera Manager (URL, port number, login credentials, and so on) by adding the following properties to /usr/local/unravel/etc/unravel.properties on Unravel Server. For example:

    com.unraveldata.data.source=cm 
    com.unraveldata.cloudera.manager.url=http://my-cm-url  
    com.unraveldata.cloudera.manager.username=mycmname 
    com.unraveldata.cloudera.manager.password=mycmpassword
  3. Ensure that the Cloudera Manager user in com.unraveldata.cloudera.manager.username has read access to Cloudera Manager REST APIs.

    You can verify this by running a curl command such as the following, substituting your local values for the variables:

    curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters'
    curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters/cluster-name/services'

    Note

    By default, the Impala sensor task is enabled. To disable it, specify the following option in /usr/local/unravel/etc/unravel.properties on Unravel Server:

    com.unraveldata.sensor.tasks.disabled=iw
  4. (Optional) Change the Impala lookback window.

    By default, when Unravel Server starts, it retrieves the last 5 minutes of Impala queries. To change this, do the following:

    1. On Unravel Server, change com.unraveldata.cloudera.manager.impala.look.back.minutes in /usr/local/unravel/etc/unravel.properties.

      For example, to set the lookback to seven minutes:

      com.unraveldata.cloudera.manager.impala.look.back.minutes=-7

      Note

      Include a minus sign in front of the new value.

    2. Restart the unravel_us daemon.
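
The lookback change can also be scripted. The following is a minimal sketch that rewrites the property in a given file and adds the minus sign automatically; the function name set_impala_lookback is hypothetical, and it is worth trying the function on a copy of unravel.properties first.

```shell
#!/usr/bin/env bash
# Set the Impala lookback window (in minutes) in a properties file.
# The property takes a negative number, so the minus sign is added here.
set_impala_lookback() {
  local file="$1" minutes="$2"
  local key=com.unraveldata.cloudera.manager.impala.look.back.minutes
  local tmp
  tmp="$(mktemp)"
  # Keep every line that does not start with "key=", then append the new value.
  awk -v k="${key}=" 'index($0, k) != 1' "$file" > "$tmp"
  printf '%s=-%s\n' "$key" "$minutes" >> "$tmp"
  # Write back through the original path so the file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example: set the lookback to seven minutes
#   set_impala_lookback /usr/local/unravel/etc/unravel.properties 7
```

As in step 2 above, restart the unravel_us daemon afterwards for the change to take effect.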
