Part 2: Enabling additional instrumentation

This topic explains how to enable additional instrumentation on your gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:

  • Hive queries in Hadoop that are pushed to Unravel Server by the Hive Hook sensor, a JAR file

  • Spark job performance metrics that are pushed to Unravel Server by the Spark sensor, a JAR file

  • Impala queries that are pulled from the Cloudera Manager

Sensor JARs are packaged in a parcel on Unravel Server.

1. Distribute the Unravel parcel
  1. In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.

  2. Click Configuration to see the Parcel Settings pop-up.

  3. In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section, and click the + glyph to add a new entry.

  4. In a new browser tab, open http://unravel-host:3000/parcels/ and copy the exact directory name for your CDH version.

    For example, the exact directory name might be cdh5.16 or cdh6.0.

  5. Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash).

    Where:

    cdh-version is your version of CDH. For example, cdh5.16 or cdh6.0.

    unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this would be the host where the unravel_lr daemon is running.

    Note

    If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.

    Tip

    If you're running more than one version of CDH (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.

  6. Click Save.

  7. Click Check for New Parcels.

  8. On the Parcels page, pick a target cluster in the Location box.

  9. In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.

  10. Click Distribute.

  11. If you have an old parcel from Unravel, deactivate it now.

  12. On the new parcel, click Activate.
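
The repository URL entered in step 5 is easy to get wrong, for example by dropping the trailing slash. As a minimal sketch, assuming Python is available on an administrator's workstation, the helper below assembles the URL from the two placeholders. The function name parcel_repo_url is hypothetical and is not part of Unravel or Cloudera Manager.

```python
def parcel_repo_url(unravel_host: str, cdh_version: str, port: int = 3000) -> str:
    """Build the Remote Parcel Repository URL, including the trailing slash."""
    host = unravel_host.strip()
    version = cdh_version.strip().strip("/")
    if not host or not version:
        raise ValueError("unravel_host and cdh_version must be non-empty")
    return f"http://{host}:{port}/parcels/{version}/"


if __name__ == "__main__":
    # One entry per CDH version when several clusters share one Unravel Server.
    for version in ("cdh5.16", "cdh6.0"):
        print(parcel_repo_url("unravel-host", version))
```

Replace unravel-host and the version strings with the values for your environment before pasting the result into the Parcel Settings pop-up.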

2. Put the Hive Hook JAR in AUX_CLASSPATH
  1. In Cloudera Manager, select the target cluster, click Hive | Configuration, and search for hive-env.

  2. In Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh enter the following exactly as shown, with no substitutions:

    AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
  3. If Sentry is enabled, grant privileges on the JAR files to the Sentry roles that run Hive queries.

    Sentry commands may be needed to enable access to the Hive Hook JAR file. To grant the privileges, log in to Beeline as user hive and use the Hive SQL GRANT statement.

    For example (substitute role as appropriate):

    GRANT ALL ON URI 'file:///opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar' TO ROLE role
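
When several Sentry roles run Hive queries, the same statement has to be repeated for each role. The sketch below generates one GRANT statement per role, assuming Python is available; the role names in the usage note are hypothetical examples, and grant_statements is an illustrative helper rather than an Unravel tool. Each statement ends with a semicolon so it can be pasted into Beeline.

```python
HIVE_HOOK_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar"


def grant_statements(roles):
    """Return one GRANT statement per Sentry role for the Hive Hook JAR URI."""
    statements = []
    for role in roles:
        name = role.strip()
        # Accept only simple identifiers so the generated SQL is safe to paste.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected role name: {role!r}")
        statements.append(
            f"GRANT ALL ON URI 'file://{HIVE_HOOK_JAR}' TO ROLE {name};"
        )
    return statements
```

For example, grant_statements(["analyst", "etl_role"]) returns two statements, one for each of those hypothetical roles.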
3. For Oozie, copy the Hive Hook and BTrace JARs to the HDFS shared library path

Copy the Hive Hook JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar) and the BTrace JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar) to the shared library path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
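
One way to perform the copy is with the standard hdfs dfs -put command. The sketch below only builds the command lines and does not run them; it assumes the hdfs client is on the PATH of the host where you execute the commands, and the path passed in the usage note is a placeholder for your own oozie.libpath value. The helper name oozie_copy_commands is hypothetical.

```python
import posixpath

SENSOR_DIR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java"
SENSOR_JARS = ("unravel_hive_hook.jar", "btrace-agent.jar")


def oozie_copy_commands(oozie_libpath):
    """Return one `hdfs dfs -put -f` argument list per sensor JAR."""
    # A trailing slash makes it explicit that the target is a directory.
    target = oozie_libpath.rstrip("/") + "/"
    return [
        ["hdfs", "dfs", "-put", "-f", posixpath.join(SENSOR_DIR, jar), target]
        for jar in SENSOR_JARS
    ]
```

For example, oozie_copy_commands("/user/oozie/share/lib") returns two argument lists that can be reviewed and then passed to subprocess.run. The -f flag overwrites a JAR that already exists at the target.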

4. Deploy the Hive Hook JAR
  1. Copy the Hive Hook snippet into hive-site.xml:

    On the Unravel Server host, run cat <Unravel installation directory>/hive-hook/hive-site.xml.snip and copy the output. You paste it into hive-site.xml in the steps that follow.

    Note

    On a multi-host Unravel Server deployment, use the <Unravel installation directory>/hive-hook/hive-site.xml.snip snippet from host2.

  2. In Cloudera Manager, go to the Hive service.

  3. Select the Configuration tab.

  4. Search for hive-site.xml in the middle of the page.

  5. Add the snippet to Hive Client Advanced Configuration Snippet for hive-site.xml (Gateway Default Group).

    To edit, click View as XML.

    If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, and append the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook to each, without any spaces.

    For example:

    <property>  
    <name>hive.exec.post.hooks</name>  
    <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>  
    <description>for Unravel, from unraveldata.com</description>
    </property>

    Note

    Hook classes are separated by commas, without any spaces.

    Add custom property com.unraveldata.host and set the value to unravel-gateway-internal-IP-hostname.

  6. Add the snippet to HiveServer2 Advanced Configuration Snippet for hive-site.xml.

    To edit, click View as XML.

    If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, and append the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook to each, without any spaces.

    For example:

    <property>
    <name>hive.exec.post.hooks</name>  
    <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>  
    <description>for Unravel, from unraveldata.com</description>
    </property>

    Note

    Hook classes are separated by commas, without any spaces.

    Add custom property com.unraveldata.host and set the value to unravel-gateway-internal-IP-hostname.

  7. Save the changes, optionally with the comment "Unravel snippet in hive-site.xml".

  8. Deploy the Hive client configuration by clicking the deploy glyph or by using the Actions pull-down menu.

  9. Restart the Hive service.

    Tip

    Cloudera Manager recommends a restart, which is not necessary for activating these changes. Don't restart now; you can restart later.

  10. Check Unravel UI to see if all Hive queries are running.

    • If queries are running fine and appearing in Unravel UI, you are done.

    • If queries are failing with a class not found error or permission problems:

      • Undo the hive-site.xml changes in Cloudera Manager.

      • Deploy the Hive client configuration.

      • Restart the Hive service.

      • Follow the steps in Troubleshooting.
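
When hive.exec.post.hooks, hive.exec.pre.hooks, or hive.exec.failure.hooks already contains other hook classes (as in the Cloudera Navigator case in steps 5 and 6), the merged value must stay comma-separated with no spaces. The following is a minimal sketch of that merge, assuming Python is available; with_unravel_hook is a hypothetical helper, and it places the Unravel hook first only because the example value in steps 5 and 6 does.

```python
UNRAVEL_HOOK = "com.unraveldata.dataflow.hive.hook.UnravelHiveHook"


def with_unravel_hook(existing_value):
    """Return a comma-separated hook list that includes the Unravel hook once, with no spaces."""
    # Split on commas and drop surrounding whitespace and empty entries.
    hooks = [h.strip() for h in existing_value.split(",") if h.strip()]
    if UNRAVEL_HOOK not in hooks:
        hooks.insert(0, UNRAVEL_HOOK)
    return ",".join(hooks)
```

Calling the function on a value that already contains the Unravel hook returns the cleaned list without adding a duplicate, so it is safe to apply more than once.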

5. Deploy the Spark JAR
  1. In Cloudera Manager, select the target cluster and then click Spark.

  2. Select Configuration.

  3. Search for spark-defaults.

  4. In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:

    spark.unravel.server.hostport=unravel-host:4043 
    spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
    spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
    spark.eventLog.enabled=true

    On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.

  5. Save changes.

  6. Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. After deployment, spark-shell creates new JVM containers with the necessary extraJavaOptions for the Spark drivers and executors.

  7. Enable Spark streaming.

    The Spark streaming probe is disabled by default. To enable it, edit spark-defaults.conf: search for spark.driver.extraJavaOptions and set it to the following value, substituting the correct version of Spark for spark-version.

    -javaagent:unravel-sensor-path/btrace-agent.jar=script=DriverProbe.class:SQLProbe.class:StreamingProbe.class,libs=spark-spark-version

    Note

    Unravel supports the Spark streaming feature for Spark 1.6.x, 2.0.x, 2.1.x, and 2.2.x only.

    Note

    Support for Spark apps using the Structured Streaming API introduced in Spark 2 is limited.

  8. Check Unravel UI to see if all Spark jobs are running.

    • If jobs are running and appearing in Unravel UI, you have deployed the Spark JAR successfully.

    • If jobs are failing with a class not found error or permission problems:

      • Undo the spark-defaults.conf changes in Cloudera Manager.

      • Deploy the client configuration.

      • Investigate and fix the issue.

      • Follow the steps in Troubleshooting.
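
The four spark-defaults.conf lines in step 4 differ only in the host and the libs value, so they can be generated rather than edited by hand. The sketch below assumes Python is available; spark_defaults_snippet is a hypothetical helper, and the libs argument is passed through unchanged, so supply whatever value you would substitute for the spark-version placeholder.

```python
AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def spark_defaults_snippet(unravel_host, libs, port=4043):
    """Return the four spark-defaults.conf lines from step 4 with placeholders filled in."""

    def agent(role):
        # role is "driver" or "executor", matching the config= values in step 4.
        return f"-javaagent:{AGENT_JAR}=config={role},libs={libs}"

    return "\n".join([
        f"spark.unravel.server.hostport={unravel_host}:{port}",
        f"spark.driver.extraJavaOptions={agent('driver')}",
        f"spark.executor.extraJavaOptions={agent('executor')}",
        "spark.eventLog.enabled=true",
    ])
```

Print the returned text and paste it into the Spark Client Advanced Configuration Snippet. On a multi-host Unravel Server deployment, pass host2's FQDN or logical hostname as unravel_host.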

Note

If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through the SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling).

6. Configure YARN-MapReduce JVM sensor cluster-wide
  1. In Cloudera Manager, go to YARN service.

  2. Select the Configuration tab.

  3. Search for Application Master Java Opts Base and append the following options to the existing value. Start the appended text with a space to separate it from the existing options.

    Note

    Make sure that each "-" is a plain minus sign (hyphen). Replace unravel-host with your Unravel Server's IP address or fully qualified domain name. For a multi-host Unravel installation, use the IP address of host2.

    -javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043 
  4. Search for MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml in the middle of the page.

  5. Enter the following four XML property blocks in Gateway Default Group. (Click View as XML.)

    <property>
    <name>mapreduce.task.profile</name>
    <value>true</value>
    </property> 
    <property>
    <name>mapreduce.task.profile.maps</name>
    <value>0-5</value>
    </property> 
    <property>
    <name>mapreduce.task.profile.reduces</name>
    <value>0-5</value>
    </property> 
    <!-- the value below is one line -->
    <property>
    <name>mapreduce.task.profile.params</name>
    <value>-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043</value>
    </property>
  6. Save the changes.

  7. Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu.

  8. Cloudera Manager prompts for a restart, which is not necessary for these changes to take effect. If Restart Stale Services is visible, you can click it now, or you can restart later during planned maintenance.

Tip

The restart is important for the MR sensor to be picked up by queries submitted via HiveServer2.

Use the Unravel UI to monitor the situation. When you view the MapReduce APM page for any completed MR job, you should see mappers and reducers in the Resource Usage tab.
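
Because the mapreduce.task.profile.params value must stay on a single line, generating the four property blocks can be less error-prone than copying them. The sketch below uses Python's standard xml.etree.ElementTree module; mr_sensor_options and mapred_site_snippet are hypothetical helper names, and the property names and values are the ones shown in step 5.

```python
import xml.etree.ElementTree as ET

AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def mr_sensor_options(unravel_host, port=4043):
    """Return the JVM options used in steps 3 and 5, on a single line."""
    return (
        f"-javaagent:{AGENT_JAR}=libs=mr "
        f"-Dunravel.server.hostport={unravel_host}:{port}"
    )


def mapred_site_snippet(unravel_host):
    """Return the four <property> blocks for mapred-site.xml, one block per line."""
    props = [
        ("mapreduce.task.profile", "true"),
        ("mapreduce.task.profile.maps", "0-5"),
        ("mapreduce.task.profile.reduces", "0-5"),
        ("mapreduce.task.profile.params", mr_sensor_options(unravel_host)),
    ]
    blocks = []
    for name, value in props:
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # Serialising each element separately keeps every block on one line.
        blocks.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(blocks)
```

The same mr_sensor_options value, preceded by a space, is what step 3 appends to Application Master Java Opts Base.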

7. Retrieve Impala data from Cloudera Manager

The following properties can be set to retrieve Impala query data from Cloudera Manager:

Property/Description                                  Set by user   Unit   Defaults

com.unraveldata.data.source                                                cm
  Can be cm or impalad.

For example:

com.unraveldata.data.source=cm 
com.unraveldata.cloudera.manager.url=http://my-cm-url  
com.unraveldata.cloudera.manager.username=mycmname 
com.unraveldata.cloudera.manager.password=mycmpassword

For multi-cluster deployments, use the following format:

# cloudera manager CM1
com.unraveldata.cloudera.manager.CM1.url=http://my-cm-url1 
com.unraveldata.cloudera.manager.CM1.username=mycmname1 
com.unraveldata.cloudera.manager.CM1.password=mycmpassword1
 
# cloudera manager CM2
com.unraveldata.cloudera.manager.CM2.url=http://my-cm-url2
com.unraveldata.cloudera.manager.CM2.username=mycmname2 
com.unraveldata.cloudera.manager.CM2.password=mycmpassword2 
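
With more than a couple of Cloudera Manager instances, the multi-cluster block above becomes repetitive. The sketch below generates it from a dictionary, assuming Python is available; cm_properties is a hypothetical helper, and the labels, URLs, and credentials are the placeholder values from the example, not real settings.

```python
def cm_properties(clusters):
    """Render multi-cluster properties.

    clusters maps a label such as "CM1" to a (url, username, password) tuple.
    """
    lines = []
    for label, (url, username, password) in clusters.items():
        prefix = f"com.unraveldata.cloudera.manager.{label}"
        lines.append(f"# cloudera manager {label}")
        lines.append(f"{prefix}.url={url}")
        lines.append(f"{prefix}.username={username}")
        lines.append(f"{prefix}.password={password}")
        lines.append("")  # blank line between clusters
    return "\n".join(lines).rstrip("\n")


if __name__ == "__main__":
    print(cm_properties({
        "CM1": ("http://my-cm-url1", "mycmname1", "mycmpassword1"),
        "CM2": ("http://my-cm-url2", "mycmname2", "mycmpassword2"),
    }))
```

Avoid keeping real passwords in a script like this; read them from a secure source at run time instead.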

Note

By default, the Impala sensor task is enabled. To disable it, set the following property using manager config:

com.unraveldata.sensor.tasks.disabled=iw

Optionally, you can change the Impala lookback window. By default, when Unravel Server starts, it retrieves the last 5 minutes of Impala queries. To change this, use manager config to set the value of com.unraveldata.cloudera.manager.impala.look.back.minutes.

For example, to set the lookback to seven minutes:

com.unraveldata.cloudera.manager.impala.look.back.minutes=-7

Note

Include a minus sign in front of the new value.
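
The minus sign is easy to forget when the value is produced by a script. As a small sketch, assuming Python is available, the hypothetical helper below takes the lookback as a positive number of minutes and writes the property with the leading minus sign described in the note.

```python
def lookback_property(minutes):
    """Format the Impala lookback property; the value is written with a leading minus sign."""
    if minutes <= 0:
        raise ValueError("minutes must be a positive number of minutes to look back")
    return f"com.unraveldata.cloudera.manager.impala.look.back.minutes=-{int(minutes)}"
```

For example, lookback_property(7) produces the seven-minute setting shown above.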

References

For more information on creating permanent functions, see Cloudera documentation.