Part 2: Enabling additional instrumentation
This topic explains how to enable additional instrumentation on your gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:
Hive queries in Hadoop that are pushed to Unravel Server by the Hive Hook sensor, a JAR file.
Spark job performance metrics that are pushed to Unravel Server by the Spark sensor, a JAR file.
Impala queries that are pulled from Cloudera Manager.
Tez DAG information that is pushed to Unravel Server by the Tez sensor, a JAR file.
Sensor JARs are packaged in a parcel on Unravel Server.
1. Distribute the Unravel parcel
In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.
Click Configuration to see the Parcel Settings pop-up.
In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section, and click the + glyph to add a new entry.
In a new browser tab, copy the exact directory name for your CDH version from the http://unravel-host:3000/parcels/ directory. For example, the exact directory name might be cdh7.0 or cdh7.1.
Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash).
Where:
cdh-version is your version of CDH. For example, cdh7.0 or cdh7.1.
unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this would be the host where the unravel_lr daemon is running.
Note
If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.
Tip
If you're running more than one version of CDP (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.
Click Save.
Click Check for New Parcels.
On the Parcels page, pick a target cluster in the Location box.
In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.
Click Distribute.
If you have an old parcel from Unravel, deactivate it now.
On the new parcel, click Activate.
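The repository URL entered in Parcel Settings follows a fixed pattern, so it can be assembled and checked before you paste it into Cloudera Manager. The following is a minimal Python sketch; the helper name parcel_repo_url and the sample host unravel.example.com are illustrative assumptions, not part of Unravel.

```python
def parcel_repo_url(unravel_host: str, cdh_version: str, port: int = 3000) -> str:
    """Build the Remote Parcel Repository URL, always ending in a slash."""
    host = unravel_host.strip()
    version = cdh_version.strip().strip("/")
    if not host or not version:
        raise ValueError("unravel_host and cdh_version must be non-empty")
    return f"http://{host}:{port}/parcels/{version}/"


if __name__ == "__main__":
    # Hypothetical host; use the host where the unravel_lr daemon runs.
    print(parcel_repo_url("unravel.example.com", "cdh7.1"))
```

The helper strips stray slashes from the directory name and then adds exactly one trailing slash, which is the detail that is easiest to miss when typing the URL by hand.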
2. Put the Hive Hook JAR in AUX_CLASSPATH
In Cloudera Manager, select the target cluster from the drop-down, click Hive on Tez > Configuration, and search for Service Environment.
In Hive on Tez Service Environment Advanced Configuration Snippet (Safety Valve), enter the following exactly as shown, with no substitutions:
AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
Ensure that the user running the Hive server has read and execute access to the Unravel Hive Hook JAR.
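One way to confirm the access requirement is to run a small check as the account that runs the Hive server. The following is a minimal Python sketch; the function name check_jar_access is an illustrative assumption, and the script only reports on the user who executes it.

```python
import os
from typing import List

HIVE_HOOK_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar"


def check_jar_access(path: str = HIVE_HOOK_JAR) -> List[str]:
    """Return the access problems found for the current user; empty means OK."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]
    problems = []
    # The step above asks for both read and execute access on the JAR.
    if not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable by the current user")
    if not os.access(path, os.X_OK):
        problems.append(f"{path} is not executable by the current user")
    return problems


if __name__ == "__main__":
    for problem in check_jar_access():
        print(problem)
```

Run the script as the user that owns the Hive server process on each host where the parcel is activated; no output means the check passed for that user.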
3. For Oozie, copy the Hive Hook and BTrace JARs to the HDFS shared library path
Copy the Hive Hook JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar, and the BTrace JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar, to the shared library path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
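The copy can be done with the standard hdfs dfs -put command. The following is a minimal Python sketch that only builds the two commands so you can review them before running anything; the function name oozie_copy_commands and the sample path /user/oozie/share/lib are illustrative assumptions, so substitute the value of oozie.libpath from your own cluster.

```python
import shlex
from typing import List

SENSOR_DIR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java"
JARS = ("unravel_hive_hook.jar", "btrace-agent.jar")


def oozie_copy_commands(oozie_libpath: str) -> List[List[str]]:
    """Return one `hdfs dfs -put` command per sensor JAR, targeting oozie_libpath."""
    target = oozie_libpath.strip().rstrip("/")
    if not target:
        raise ValueError("oozie_libpath must be a non-root HDFS directory")
    # -f overwrites an older copy of the JAR if one is already present.
    return [
        ["hdfs", "dfs", "-put", "-f", f"{SENSOR_DIR}/{jar}", f"{target}/"]
        for jar in JARS
    ]


if __name__ == "__main__":
    # Hypothetical libpath; read the real one from your Oozie configuration.
    for cmd in oozie_copy_commands("/user/oozie/share/lib"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; run the printed commands as a user with write access to the shared library path.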
4. Deploy the BTrace JAR for Tez service
In Cloudera Manager, go to Tez > Configuration and search for the following properties:
tez.am.launch.cmd-opts
tez.task.launch.cmd-opts
Append the following to the tez.am.launch.cmd-opts and tez.task.launch.cmd-opts properties:
-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr,config=tez -Dunravel.server.hostport=<unravel_host>:4043
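Because the options are appended to values that may already contain other JVM flags, it helps to build the final string deliberately. The following is a minimal Python sketch; the function names and the sample host unravel.example.com are illustrative assumptions.

```python
AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def unravel_tez_opts(unravel_host: str, port: int = 4043) -> str:
    """Return the Unravel options for the Tez cmd-opts properties."""
    return (
        f"-javaagent:{AGENT_JAR}=libs=mr,config=tez "
        f"-Dunravel.server.hostport={unravel_host}:{port}"
    )


def append_tez_opts(existing: str, unravel_host: str) -> str:
    """Append the Unravel options to an existing cmd-opts value, once only."""
    if AGENT_JAR in existing:
        return existing  # already instrumented; avoid a duplicate -javaagent
    return f"{existing.strip()} {unravel_tez_opts(unravel_host)}".strip()


if __name__ == "__main__":
    # Hypothetical existing value and host name.
    print(append_tez_opts("-XX:+UseG1GC", "unravel.example.com"))
```

Apply the same result to both tez.am.launch.cmd-opts and tez.task.launch.cmd-opts, starting from each property's current value.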
5. Set Hive Hook configuration
In Cloudera Manager, click the Hive on Tez service and then click the Configuration tab.
Search for hive-site.xml, which will lead to the Hive Client Advanced Configuration Snippet for hive-site.xml section.
Specify the Hive Hook configurations. You can use either the XML text field or the Editor.
XML text field:
Click View as XML to open the XML text field, then copy and paste the following:
<property>
  <name>com.unraveldata.host</name>
  <value><UNRAVEL HOST NAME></value>
  <description>Unravel hive-hook processing host</description>
</property>
<property>
  <name>com.unraveldata.hive.hook.tcp</name>
  <value>true</value>
</property>
<property>
  <name>com.unraveldata.hive.hdfs.dir</name>
  <value>/user/unravel/<HOOK_RESULT_DIR></value>
  <description>destination for hive-hook, Unravel log processing</description>
</property>
<property>
  <name>hive.exec.driver.run.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
<property>
  <name>hive.exec.pre.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
<property>
  <name>hive.exec.failure.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
Replace the following placeholders with the appropriate details:
UNRAVEL HOST NAME
HOOK_RESULT_DIR
Note
These details can also be copied from the hive-hook/hive-site.xml.snip file, which is located in the <Unravel installation directory>/services/snippet directory. In a multi-cluster deployment, the hive-hook/hive-site.xml.snip file is located on the edge node.
To indicate that the property value cannot be overridden, specify <final>true</final>.
Editor:
Click + and enter the configuration name, value, and description (optional). Select the Final check box to indicate that the value of the configuration cannot be overridden.
For example:
Name: hive.exec.pre.hooks
Value: com.unraveldata.dataflow.hive.hook.UnravelHiveHook
Description: From unraveldata.com
Similarly, add the same Hive Hook configurations in HiveServer2 Advanced Configuration Snippet for hive-site.xml.
Optionally, add a comment in Reason for change and then click Save Changes.
From the Cloudera Manager page, deploy the Hive client configuration and restart the Hive services using the Actions drop-down.
Check Unravel UI to see if all Hive queries are running.
If queries are running fine and appearing in Unravel UI, then you have successfully added the Hive Hook configurations.
If queries are failing with a class not found error or permission problems:
Undo the hive-site.xml changes in Cloudera Manager.
Deploy the Hive client configuration.
Restart the Hive service.
Follow the steps in Troubleshooting.
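The same seven properties go into both the Hive client and HiveServer2 snippets, so generating them from one place reduces copy-paste mistakes. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the function names and the sample values unravel.example.com and hive-hook-results are illustrative assumptions, and the optional description elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Dict

HOOK_CLASS = "com.unraveldata.dataflow.hive.hook.UnravelHiveHook"
HOOK_PROPERTIES = (
    "hive.exec.driver.run.hooks",
    "hive.exec.pre.hooks",
    "hive.exec.post.hooks",
    "hive.exec.failure.hooks",
)


def hive_hook_properties(unravel_host: str, hook_result_dir: str) -> Dict[str, str]:
    """Return the Unravel Hive Hook properties as a name -> value mapping."""
    props = {
        "com.unraveldata.host": unravel_host,
        "com.unraveldata.hive.hook.tcp": "true",
        "com.unraveldata.hive.hdfs.dir": f"/user/unravel/{hook_result_dir}",
    }
    for name in HOOK_PROPERTIES:
        props[name] = HOOK_CLASS
    return props


def to_snippet(props: Dict[str, str], final: bool = False) -> str:
    """Render the mapping as <property> elements, one per line."""
    lines = []
    for name, value in props.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        if final:
            # Marks the value as not overridable, as described above.
            ET.SubElement(prop, "final").text = "true"
        lines.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical host and directory names.
    print(to_snippet(hive_hook_properties("unravel.example.com", "hive-hook-results")))
```

Paste the printed output into the View as XML field of each snippet. Building the elements with ElementTree rather than string formatting means special characters in a value are escaped correctly.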
6. Set Kafka configuration
In Cloudera Manager, select the target cluster, click Kafka service > Configuration, and search for broker_java_opts.
In Additional Broker Java Options, enter the following:
-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+DisableExplicitGC -Djava.awt.headless=true -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.local.only=true -Djava.rmi.server.useLocalHostname=true -Dcom.sun.management.jmxremote.rmi.port=9393
Click Save Changes.
7. Deploy the Spark JAR
In Cloudera Manager, select the target cluster, then select the Spark service.
Select Configuration.
Search for spark-defaults.
In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:
spark.unravel.server.hostport=unravel-host:4043
spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
spark.eventLog.enabled=true
On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.
For spark-version, use a Spark version that is compatible with this version of Unravel. For example, spark-2.4 for Spark 2.4.x.
Save changes.
Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. Your spark-shell will ensure new JVM containers are created with the necessary extraJavaOptions for the Spark drivers and executors.
Check Unravel UI to see if all Spark jobs are running.
If jobs are running fine and appearing in Unravel UI, you're done.
If jobs are failing with a class not found error or permission problems:
Undo the spark-defaults.conf changes in Cloudera Manager.
Deploy the client configuration.
Investigate and fix the issue.
Follow the steps in Troubleshooting.
Note
If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through the SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling).
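The cluster-wide spark-defaults.conf entries from this step differ only in the host name and the libs value, so they can be generated from those two inputs. The following is a minimal Python sketch; the function name spark_defaults and the sample host unravel.example.com are illustrative assumptions.

```python
AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def spark_defaults(unravel_host: str, spark_version: str, port: int = 4043) -> str:
    """Return the spark-defaults.conf lines for the Unravel Spark sensor."""

    def agent(role: str) -> str:
        # role is "driver" or "executor"; libs names the sensor build, e.g. spark-2.4
        return f"-javaagent:{AGENT_JAR}=config={role},libs={spark_version}"

    return "\n".join(
        [
            f"spark.unravel.server.hostport={unravel_host}:{port}",
            f"spark.driver.extraJavaOptions={agent('driver')}",
            f"spark.executor.extraJavaOptions={agent('executor')}",
            "spark.eventLog.enabled=true",
        ]
    )


if __name__ == "__main__":
    # Hypothetical host; spark-2.4 is the example value given for Spark 2.4.x.
    print(spark_defaults("unravel.example.com", "spark-2.4"))
```

Paste the printed lines into the safety valve field. Each property must stay on its own line, because spark-defaults.conf is parsed line by line.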
8. Retrieve Impala data from Cloudera Manager
Configure Unravel Server to retrieve Impala query data from Cloudera Manager as follows:
Add com.unraveldata.data.source=cm in /usr/local/unravel/etc/unravel.properties on Unravel Server.
Provide Unravel Server with your Cloudera Manager URL, port number, and login credentials by adding the following properties to /usr/local/unravel/etc/unravel.properties on Unravel Server. For example:
com.unraveldata.data.source=cm
com.unraveldata.cloudera.manager.url=http://my-cm-url
com.unraveldata.cloudera.manager.username=mycmname
com.unraveldata.cloudera.manager.password=mycmpassword
Ensure that the Cloudera Manager user in com.unraveldata.cloudera.manager.username has read access to Cloudera Manager REST APIs.
You can verify this by running a curl command such as the following, substituting your local values for the variables:
curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters'
curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters/cluster-name/services'
Note
By default, the Impala sensor task is enabled. To disable it, specify the following option in /usr/local/unravel/etc/unravel.properties on Unravel Server:
com.unraveldata.sensor.tasks.disabled=iw
(Optional) Change the Impala lookback window.
By default, when Unravel Server starts, it retrieves the last 5 minutes of Impala queries. To change this, do the following:
On Unravel Server, change com.unraveldata.cloudera.manager.impala.look.back.minutes in /usr/local/unravel/etc/unravel.properties.
For example, to set the lookback to seven minutes:
com.unraveldata.cloudera.manager.impala.look.back.minutes=-7
Note
Include a minus sign in front of the new value.
Restart the unravel_us daemon.
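The properties from this section, including the negative lookback value, can be assembled in one place before you edit unravel.properties. The following is a minimal Python sketch; the function name impala_cm_properties and the sample credentials are illustrative assumptions, and the sketch only prints the lines rather than modifying any file.

```python
from typing import List, Optional


def impala_cm_properties(
    cm_url: str,
    username: str,
    password: str,
    look_back_minutes: Optional[int] = None,
) -> List[str]:
    """Return the unravel.properties lines for pulling Impala data from Cloudera Manager."""
    lines = [
        "com.unraveldata.data.source=cm",
        f"com.unraveldata.cloudera.manager.url={cm_url}",
        f"com.unraveldata.cloudera.manager.username={username}",
        f"com.unraveldata.cloudera.manager.password={password}",
    ]
    if look_back_minutes is not None:
        minutes = abs(int(look_back_minutes))
        if minutes == 0:
            raise ValueError("look_back_minutes must be non-zero")
        # The value is written with a leading minus sign, so 7 becomes -7.
        lines.append(
            f"com.unraveldata.cloudera.manager.impala.look.back.minutes=-{minutes}"
        )
    return lines


if __name__ == "__main__":
    # Sample values taken from the example above.
    for line in impala_cm_properties("http://my-cm-url", "mycmname", "mycmpassword", 7):
        print(line)
```

The helper accepts either 7 or -7 and always emits the minus sign, which guards against the most likely mistake in the lookback setting.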
References
For more information on creating permanent functions, see Cloudera documentation.