
Part 2: Enabling additional instrumentation

This topic explains how to enable additional instrumentation on your gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:

  • Hive queries in Hadoop pushed to Unravel Server by the Hive Hook sensor, a JAR file

  • Spark job performance metrics pushed to Unravel Server by the Spark sensor, a JAR file

  • Impala queries pulled from Cloudera Manager

Sensor JARs are packaged in a parcel on Unravel Server.

1. Distribute the Unravel parcel
  1. In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.

  2. Click Configuration to see the Parcel Settings pop-up.

  3. In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section, and click the + glyph to add a new entry.

  4. In a new browser tab, open http://unravel-host:3000/parcels/ and copy the exact directory name for your CDH version.

    For example, the exact directory name might be cdh5.16 or cdh6.0.

  5. Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash).

    Where:

    cdh-version is your version of CDH. For example, cdh5.16 or cdh6.0.

    unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this would be the host where the unravel_lr daemon is running.

    Note

    If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.

    Tip

    If you're running more than one version of CDH (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.

  6. Click Save.

  7. Click Check for New Parcels.

  8. On the Parcels page, pick a target cluster in the Location box.

  9. In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.

  10. Click Distribute.

  11. If you have an old parcel from Unravel, deactivate it now.

  12. On the new parcel, click Activate.
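The repository URL entered in step 5 can also be assembled in a script, which may help when several clusters run different CDH versions. The following Python sketch is illustrative only: the function name parcel_repo_url is hypothetical, while port 3000, the /parcels/ path, and the trailing slash come from the steps above.

```python
def parcel_repo_url(unravel_host: str, cdh_version: str, port: int = 3000) -> str:
    """Build a Remote Parcel Repository URL, including the trailing slash."""
    host = unravel_host.strip()
    # Tolerate a directory name copied with surrounding slashes.
    version = cdh_version.strip().strip("/")
    if not host or not version:
        raise ValueError("both unravel_host and cdh_version are required")
    return f"http://{host}:{port}/parcels/{version}/"


if __name__ == "__main__":
    # One repository entry per CDH version in use; the hostname is a placeholder.
    for version in ("cdh5.16", "cdh6.0"):
        print(parcel_repo_url("unravel-host", version))
```

Each printed URL is one entry for the Remote Parcel Repository URLs section.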

2. Put the Hive Hook JAR in AUX_CLASSPATH
  1. In Cloudera Manager, select the target cluster, click Hive | Configuration, and search for hive-env.

  2. In Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh enter the following exactly as shown, with no substitutions:

    AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
  3. If Sentry is enabled, grant privileges on the JAR files to the Sentry roles that run Hive queries.

    To do so, log in to Beeline as user hive and use the Hive SQL GRANT statement.

    For example (substitute your role name for role):

    GRANT ALL ON URI 'file:///opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar' TO ROLE role
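If several Sentry roles run Hive queries, the GRANT statement has to be repeated for each one. A minimal Python sketch that generates the statements is shown here; the function name grant_statements and the role names in the usage example are hypothetical, and the JAR path is the one used in this topic.

```python
HIVE_HOOK_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar"


def grant_statements(roles, jar_path=HIVE_HOOK_JAR):
    """Return one GRANT ... ON URI statement per Sentry role."""
    # jar_path is absolute, so "file://" + "/opt/..." yields "file:///opt/...".
    uri = f"file://{jar_path}"
    return [f"GRANT ALL ON URI '{uri}' TO ROLE {role};" for role in roles]


if __name__ == "__main__":
    # "analyst" and "etl" are placeholder role names.
    for statement in grant_statements(["analyst", "etl"]):
        print(statement)
```

The printed statements can be pasted into a Beeline session opened as user hive.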
3. For Oozie, copy the Hive Hook and BTrace JARs to the HDFS shared library path

Copy the Hive Hook JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar) and the BTrace JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar) to the shared library path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
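One way to perform the copy is with hdfs dfs -put. The Python sketch below only builds and prints the commands so that you can review them before running anything. The function name oozie_copy_commands is hypothetical, /user/oozie/share/lib is a placeholder for your own oozie.libpath value, and the -f (overwrite) flag is an assumption that replacing older copies of the JARs is what you want.

```python
import shlex

SENSOR_DIR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java"
SENSOR_JARS = ("unravel_hive_hook.jar", "btrace-agent.jar")


def oozie_copy_commands(oozie_libpath):
    """Return one `hdfs dfs -put` argument list per sensor JAR."""
    target = oozie_libpath.rstrip("/") + "/"
    return [
        ["hdfs", "dfs", "-put", "-f", f"{SENSOR_DIR}/{jar}", target]
        for jar in SENSOR_JARS
    ]


if __name__ == "__main__":
    # Placeholder path; substitute the value of oozie.libpath on your cluster.
    for cmd in oozie_copy_commands("/user/oozie/share/lib"):
        print(shlex.join(cmd))
```

Run the printed commands as a user with write access to the shared library path.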

4. Deploy the Hive Hook JAR
  1. Copy the Hive Hook snippet into hive-site.xml:

    On the Unravel Server host, run cat /usr/local/unravel/hive-hook/hive-site.xml.snip, copy the output, and paste it into hive-site.xml.

    Note

    On a multi-host Unravel Server deployment, use the /usr/local/unravel/hive-hook/hive-site.xml.snip snippet from host2.

  2. In Cloudera Manager, go to the Hive service.

  3. Select the Configuration tab.

  4. Search for hive-site.xml in the middle of the page.

  5. Add the snippet to Hive Client Advanced Configuration Snippet for hive-site.xml (Gateway Default Group).

    To edit, click View as XML.

    If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, adding the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook to each one with no spaces.

    For example:

    <property>  
    <name>hive.exec.post.hooks</name>  
    <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>  
    <description>for Unravel, from unraveldata.com</description>
    </property>

    Note

    Hook classes are separated by commas, with no spaces.

    Add the custom property com.unraveldata.host and set its value to unravel-gateway-internal-IP-hostname.

  6. Add the snippet to HiveServer2 Advanced Configuration Snippet for hive-site.xml.

    To edit, click View as XML.

    If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, adding the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook to each one with no spaces.

    For example:

    <property>
    <name>hive.exec.post.hooks</name>  
    <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>  
    <description>for Unravel, from unraveldata.com</description>
    </property>

    Note

    Hook classes are separated by commas, with no spaces.

    Add the custom property com.unraveldata.host and set its value to unravel-gateway-internal-IP-hostname.

  7. Save the changes with optional comment Unravel snippet in hive-site.xml.

  8. Deploy the Hive client configuration by clicking the deploy glyph or by using the Actions pull-down menu.

  9. Restart the Hive service.

    Tip

    Cloudera Manager recommends a restart, which is not necessary for activating these changes. Don't restart now; you can restart later.

  10. Check Unravel UI to see if all Hive queries are running.

    • If queries are running fine and appearing in Unravel UI, you are done.

    • If queries are failing with a class not found error or permission problems:

      • Undo the hive-site.xml changes in Cloudera Manager.

      • Deploy the Hive client configuration.

      • Restart the Hive service.

      • Follow the steps in Troubleshooting.
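When Cloudera Navigator hooks are already configured, the Unravel hook class has to be merged into the existing comma-separated value without introducing spaces. The Python sketch below shows one way to do that. The function name add_unravel_hook is hypothetical, and placing the Unravel class first simply mirrors the example in this topic; whether the order matters is not stated here.

```python
UNRAVEL_HOOK = "com.unraveldata.dataflow.hive.hook.UnravelHiveHook"


def add_unravel_hook(existing_value: str, hook: str = UNRAVEL_HOOK) -> str:
    """Merge the Unravel hook into a comma-separated hook list, with no spaces."""
    # Split on commas and drop stray whitespace and empty entries.
    classes = [c.strip() for c in existing_value.split(",") if c.strip()]
    if hook not in classes:
        classes.insert(0, hook)
    return ",".join(classes)


if __name__ == "__main__":
    navigator_hooks = (
        "com.cloudera.navigator.audit.hive.HiveExecHookContext, "
        "org.apache.hadoop.hive.ql.hooks.LineageLogger"
    )
    print(add_unravel_hook(navigator_hooks))
```

Applying the function to a value that already contains the Unravel hook leaves the list of classes unchanged, so it is safe to run more than once.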

5. Deploy the Spark JAR
  1. In Cloudera Manager, select the target cluster, then click Spark.

  2. Select Configuration.

  3. Search for spark-defaults.

  4. In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:

    spark.unravel.server.hostport=unravel-host:4043 
    spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
    spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
    spark.eventLog.enabled=true

    On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.

  5. Save changes.

  6. Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. Your spark-shell will ensure new JVM containers are created with the necessary extraJavaOptions for the Spark drivers and executors.

  7. Enable Spark streaming.

    The Spark streaming probe is disabled by default and you must enable it manually by editing spark-defaults.conf.

    Note

    Unravel supports the Spark streaming feature for Spark 1.6.x, 2.0.x, 2.1.x, and 2.2.x only.

    Note

    Support for Spark apps using the Structured Streaming API introduced in Spark 2 is limited.

    Search for spark.driver.extraJavaOptions and set it to the following value. Substitute the correct version of Spark for spark-version and the directory that contains the sensor JARs (in this topic, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java) for unravel-sensor-path.

    -javaagent:unravel-sensor-path/btrace-agent.jar=script=DriverProbe.class:SQLProbe.class:StreamingProbe.class,libs=spark-spark-version
  8. Check Unravel UI to see if all Spark jobs are running.

    • If jobs are running fine and appearing in Unravel UI, you're done.

    • If jobs are failing with a class not found error or permission problems:

      • Undo the spark-defaults.conf changes in Cloudera Manager.

      • Deploy the client configuration.

      • Investigate and fix the issue.

      • Follow the steps in Troubleshooting.
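The four spark-defaults.conf lines from step 4 differ only in the Unravel host and the Spark version, so they can be rendered from those two values. The Python sketch below is illustrative: the function name spark_defaults is hypothetical, and the value spark-2.3 in the usage example is an assumed stand-in for the spark-version placeholder, not a value confirmed by this topic.

```python
AGENT = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def spark_defaults(unravel_host: str, spark_version: str, port: int = 4043) -> str:
    """Render the Unravel sensor lines for spark-defaults.conf."""
    lines = [
        f"spark.unravel.server.hostport={unravel_host}:{port}",
        f"spark.driver.extraJavaOptions=-javaagent:{AGENT}=config=driver,libs={spark_version}",
        f"spark.executor.extraJavaOptions=-javaagent:{AGENT}=config=executor,libs={spark_version}",
        "spark.eventLog.enabled=true",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Both arguments are placeholders; substitute your own values.
    print(spark_defaults("unravel-host", "spark-2.3"))
```

The output can be pasted into the Spark Client Advanced Configuration Snippet for spark-conf/spark-defaults.conf.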

Note

If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through the SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling).

6. Configure YARN-MapReduce JVM sensor cluster-wide
  1. In Cloudera Manager, go to YARN service.

  2. Select the Configuration tab.

  3. Search for Application Master Java Opts Base and append the following options to the existing value, separated from it by a space.

    Note

    Make sure that each "-" is a minus sign. Replace unravel-host with your Unravel Server IP address or fully qualified domain name. For a multi-host Unravel installation, use the IP address of host2.

    -javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043 
  4. Search for MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml in the middle of the page.

  5. Enter the following four XML property blocks in Gateway Default Group. (Click View as XML.) The value of mapreduce.task.profile.params must be on one line.

    <property>
    <name>mapreduce.task.profile</name>
    <value>true</value>
    </property>
    <property>
    <name>mapreduce.task.profile.maps</name>
    <value>0-5</value>
    </property>
    <property>
    <name>mapreduce.task.profile.reduces</name>
    <value>0-5</value>
    </property>
    <property>
    <name>mapreduce.task.profile.params</name>
    <value>-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043</value>
    </property>
  6. Save the changes.

  7. Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu.

  8. Cloudera Manager recommends a restart, but one isn't required for these changes to take effect. If Restart Stale Services is visible, you can click it now or restart later during planned maintenance.

Tip

A restart is needed, however, for the MR sensor to be picked up by queries submitted through HiveServer2.

Use the Unravel UI to monitor the situation. When you view the MapReduce APM page for any completed MapReduce job, you should see mappers and reducers in the Resource Usage tab.
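The four mapred-site.xml property blocks can be generated rather than typed by hand, which keeps the long mapreduce.task.profile.params value on a single line. The Python sketch below uses the standard library's xml.etree.ElementTree; the function name mapred_properties is hypothetical, and the property names and values are the ones listed in step 5.

```python
import xml.etree.ElementTree as ET

AGENT = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def mapred_properties(unravel_host: str, port: int = 4043) -> str:
    """Render the four <property> blocks, one block per output line."""
    params = f"-javaagent:{AGENT}=libs=mr -Dunravel.server.hostport={unravel_host}:{port}"
    props = {
        "mapreduce.task.profile": "true",
        "mapreduce.task.profile.maps": "0-5",
        "mapreduce.task.profile.reduces": "0-5",
        "mapreduce.task.profile.params": params,
    }
    blocks = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # encoding="unicode" returns a str instead of bytes.
        blocks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(blocks)


if __name__ == "__main__":
    # "unravel-host" is a placeholder for your Unravel Server address.
    print(mapred_properties("unravel-host"))
```

Paste the output into the safety valve after clicking View as XML.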

7. Retrieve Impala data from Cloudera Manager

Configure Unravel Server to retrieve Impala query data from Cloudera Manager as follows:

  1. Add com.unraveldata.data.source=cm in /usr/local/unravel/etc/unravel.properties on Unravel Server.

  2. Give Unravel Server your Cloudera Manager URL, port number, and login credentials by adding the following properties to /usr/local/unravel/etc/unravel.properties on Unravel Server.

    For example:

    Prior to Unravel v4.5.4

    com.unraveldata.data.source=cm 
    com.unraveldata.cloudera.manager.url=http://my-cm-url 
    com.unraveldata.cloudera.manager.port=9997 
    com.unraveldata.cloudera.manager.username=mycmname 
    com.unraveldata.cloudera.manager.password=mycmpassword

    Unravel v4.5.4.x and later

    com.unraveldata.data.source=cm 
    ## include the port with the url
    com.unraveldata.cloudera.manager.url=http://my-cm-url:port
    com.unraveldata.cloudera.manager.username=mycmname 
    com.unraveldata.cloudera.manager.password=mycmpassword
  3. Make sure that the Cloudera Manager user in com.unraveldata.cloudera.manager.username has read access to Cloudera Manager REST APIs.

    You can verify this by running a curl command such as the following, substituting your local values for the variables:

    curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters'
    curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters/cluster-name/services'

    Note

    By default, the Impala sensor task is enabled. To disable it, specify the following option in /usr/local/unravel/etc/unravel.properties on Unravel Server:

    com.unraveldata.sensor.tasks.disabled=iw
  4. (Optional) Change the Impala lookback window.

    By default, when Unravel Server starts, it retrieves the last 5 minutes of Impala queries. To change this, do the following:

    1. On Unravel Server, change com.unraveldata.cloudera.manager.impala.look.back.minutes in /usr/local/unravel/etc/unravel.properties.

      For example, to set the lookback to 7 minutes:

      com.unraveldata.cloudera.manager.impala.look.back.minutes=-7

      Note

      Include a minus sign in front of the new value.

    2. Restart the unravel_us daemon.
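The properties from steps 1 through 4 can be rendered together, which makes it harder to forget the minus sign on the lookback value. The Python sketch below targets the Unravel v4.5.4.x and later layout, where the port is part of the URL. The function name impala_cm_properties is hypothetical, and port 7180 in the usage example is an assumed placeholder for your Cloudera Manager port.

```python
def impala_cm_properties(cm_url: str, username: str, password: str,
                         lookback_minutes: int = 5) -> str:
    """Render unravel.properties lines for pulling Impala data from Cloudera Manager."""
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be a positive number of minutes")
    lines = [
        "com.unraveldata.data.source=cm",
        f"com.unraveldata.cloudera.manager.url={cm_url}",
        f"com.unraveldata.cloudera.manager.username={username}",
        f"com.unraveldata.cloudera.manager.password={password}",
        # The property expects a minus sign in front of the number of minutes.
        f"com.unraveldata.cloudera.manager.impala.look.back.minutes=-{lookback_minutes}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All argument values are placeholders.
    print(impala_cm_properties("http://my-cm-url:7180", "mycmname", "mycmpassword", 7))
```

Add the output to /usr/local/unravel/etc/unravel.properties, then restart the unravel_us daemon.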
