Part 2: Enabling additional instrumentation
This topic explains how to enable additional instrumentation on your gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:
Hive queries in Hadoop pushed to Unravel Server by the Hive Hook sensor, a JAR file
Spark job performance metrics pushed to Unravel Server by the Spark sensor, a JAR file
Impala queries pulled from Cloudera Manager
Sensor JARs are packaged in a parcel on Unravel Server.
1. Distribute the Unravel parcel
In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.
Click Configuration to see the Parcel Settings pop-up.
In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section, and click the + glyph to add a new entry.
In a new browser tab, copy the exact directory name for your CDH version from the http://unravel-host:3000/parcels/ directory. For example, the exact directory name might be cdh5.16 or cdh6.0.
Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash).
Where:
cdh-version is your version of CDH. For example, cdh5.16 or cdh6.0.
unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this would be the host where the unravel_lr daemon is running.
Note
If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.
Tip
If you're running more than one version of CDH (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.
Click Save.
Click Check for New Parcels.
On the Parcels page, pick a target cluster in the Location box.
In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.
Click Distribute.
If you have an old parcel from Unravel, deactivate it now.
On the new parcel, click Activate.
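The repository URL added in the steps above is easy to get wrong (missing trailing slash, wrong directory name). As a minimal sketch, the following helper assembles it from the two placeholders used in this section. The function name and the example hostname are illustrative assumptions, not part of Unravel.

```python
def parcel_repo_url(unravel_host: str, cdh_version: str, port: int = 3000) -> str:
    """Build the Remote Parcel Repository URL, including the trailing slash."""
    if not unravel_host or not cdh_version:
        raise ValueError("unravel_host and cdh_version are both required")
    # strip("/") guards against a directory name copied with slashes attached
    return f"http://{unravel_host}:{port}/parcels/{cdh_version.strip('/')}/"


# One entry per CDH version when several clusters use the same Unravel Server.
# "unravel.example.com" is a hypothetical hostname.
urls = [parcel_repo_url("unravel.example.com", v) for v in ("cdh5.16", "cdh6.0")]
```

Each string in urls can be pasted as a separate entry under Remote Parcel Repository URLs.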
2. Put the Hive Hook JAR in AUX_CLASSPATH
In Cloudera Manager, select the target cluster, click Hive | Configuration, and search for hive-env.
In Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh, enter the following exactly as shown, with no substitutions:
AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
If Sentry is enabled, grant privileges on the Hive Hook JAR file to the Sentry roles that run Hive queries. To do so, log in to Beeline as user hive and use the Hive SQL GRANT statement.
For example (substitute role as appropriate):
GRANT ALL ON URI 'file:///opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar' TO ROLE role
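When several Sentry roles run Hive queries, the GRANT statement has to be repeated once per role. The sketch below generates one statement per role for pasting into Beeline. The role names are hypothetical, and the trailing semicolon is added because Beeline expects statements to be terminated that way.

```python
HIVE_HOOK_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar"


def grant_statements(roles, jar_path=HIVE_HOOK_JAR):
    """Return one GRANT ALL ON URI statement per Sentry role."""
    statements = []
    for role in roles:
        # Reject anything that is not a plain identifier so the output
        # is safe to paste into Beeline unchanged.
        if not role.replace("_", "").isalnum():
            raise ValueError(f"unexpected characters in role name: {role!r}")
        statements.append(f"GRANT ALL ON URI 'file://{jar_path}' TO ROLE {role};")
    return statements


# "analyst" and "etl_role" are made-up role names for illustration.
for statement in grant_statements(["analyst", "etl_role"]):
    print(statement)
```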
3. For Oozie, copy the Hive Hook and BTrace JARs to the HDFS shared library path
Copy the Hive Hook JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar, and the BTrace JAR, /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar, to the shared lib path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
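One way to perform the copy is with hdfs dfs -put. The sketch below only builds and prints the two commands so they can be reviewed before running; it does not execute anything. The target path /user/oozie/share/lib is an assumed example, so use the value of oozie.libpath from your own configuration.

```python
import shlex

SENSOR_LIB = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java"
JARS = ("unravel_hive_hook.jar", "btrace-agent.jar")


def oozie_copy_commands(oozie_libpath):
    """Return one 'hdfs dfs -put' argument list per sensor JAR."""
    target = oozie_libpath.rstrip("/") + "/"
    # -f overwrites an older copy of the JAR if one is already in HDFS
    return [["hdfs", "dfs", "-put", "-f", f"{SENSOR_LIB}/{jar}", target] for jar in JARS]


# The libpath below is an assumption for illustration only.
for command in oozie_copy_commands("/user/oozie/share/lib"):
    print(shlex.join(command))
```

To run the commands instead of printing them, each argument list can be passed to subprocess.run(command, check=True) by a user with write access to the target directory.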
4. Deploy the Hive Hook JAR
Copy the Hive Hook snippet into hive-site.xml:
On the Unravel Server host, run the command cat /usr/local/unravel/hive-hook/hive-site.xml.snip, copy the contents, and paste them into hive-site.xml.
Note
On a multi-host Unravel Server deployment, use the /usr/local/unravel/hive-hook/hive-site.xml.snip snippet from host2.
In Cloudera Manager, go to the Hive service.
Select the Configuration tab.
Search for hive-site.xml in the middle of the page.
Add the snippet to Hive Client Advanced Configuration Snippet for hive-site.xml (Gateway Default Group).
To edit, click View as XML.
If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the following keys: hive.exec.post.hooks, hive.exec.pre.hooks, hive.exec.failure.hooks, and append the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook without any space.
For example:
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
Note
Hook classes are separated by commas, without any spaces.
Add custom property com.unraveldata.host and set the value to unravel-gateway-internal-IP-hostname.
Add the snippet to HiveServer2 Advanced Configuration Snippet for hive-site.xml.
To edit, click View as XML.
If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the following keys: hive.exec.post.hooks, hive.exec.pre.hooks, hive.exec.failure.hooks, and append the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook without any space.
For example:
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
Note
Hook classes are separated by commas, without any spaces.
Add custom property com.unraveldata.host and set the value to unravel-gateway-internal-IP-hostname.
Save the changes with optional comment Unravel snippet in hive-site.xml.
Deploy the Hive client configuration by clicking the deploy glyph or by using the Actions pull-down menu.
Restart the Hive service.
Tip
Cloudera Manager recommends a restart, which is not necessary for activating these changes. Don't restart now; you can restart later.
Check Unravel UI to see if all Hive queries are running.
If queries are running fine and appearing in Unravel UI, you are done.
If queries are failing with a class not found error or permission problems:
Undo the hive-site.xml changes in Cloudera Manager.
Deploy the Hive client configuration.
Restart the Hive service.
Follow the steps in Troubleshooting.
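When the hook keys already hold other classes (as with Cloudera Navigator), the new value has to be merged into a comma-separated list that contains no spaces. The following sketch shows one way to do that merge. It appends the Unravel class, following the wording of the step above; the example in the text lists the Unravel class first, and whether the order matters is not stated here, so treat the ordering as an assumption.

```python
UNRAVEL_HOOK = "com.unraveldata.dataflow.hive.hook.UnravelHiveHook"


def add_hook(existing_value, hook=UNRAVEL_HOOK):
    """Add a hook class to a comma-separated hook list, with no spaces."""
    # Split on commas and drop stray whitespace and empty entries.
    classes = [c.strip() for c in existing_value.split(",") if c.strip()]
    if hook not in classes:  # keep the operation idempotent
        classes.append(hook)
    return ",".join(classes)


navigator_hooks = (
    "com.cloudera.navigator.audit.hive.HiveExecHookContext, "
    "org.apache.hadoop.hive.ql.hooks.LineageLogger"
)
print(add_hook(navigator_hooks))
```

The same function can be applied to the values of hive.exec.pre.hooks, hive.exec.post.hooks, and hive.exec.failure.hooks in turn.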
5. Deploy the Spark JAR
In Cloudera Manager, select the target cluster, then click Spark.
Select Configuration.
Search for spark-defaults.
In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:
spark.unravel.server.hostport=unravel-host:4043
spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
spark.eventLog.enabled=true
On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.
Save changes.
Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. Your spark-shell will ensure new JVM containers are created with the necessary extraJavaOptions for the Spark drivers and executors.
Enable Spark streaming.
The Spark streaming probe is disabled by default; you must enable it manually by editing spark-defaults.conf.
Search for spark.driver.extraJavaOptions and set it to the following. Be sure to substitute the correct version of Spark for spark-version.
-javaagent:unravel-sensor-path/btrace-agent.jar=script=DriverProbe.class:SQLProbe.class:StreamingProbe.class,libs=spark-spark-version
Note
Unravel supports the Spark streaming feature for Spark 1.6.x, 2.0.x, 2.1.x, and 2.2.x only.
Note
Support for Spark apps using the Structured Streaming API introduced in Spark 2 is limited.
Check Unravel UI to see if all Spark jobs are running.
If jobs are running fine and appearing in Unravel UI, you're done.
If jobs are failing with a class not found error or permission problems:
Undo the spark-defaults.conf changes in Cloudera Manager.
Deploy the client configuration.
Investigate and fix the issue.
Follow the steps in Troubleshooting.
Note
If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through the SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling).
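The four spark-defaults.conf lines from this section differ only in the Unravel host and the Spark version placeholder, so they can be rendered from a small template. This is a sketch: the function name and example hostname are assumptions, and the value passed for spark_version is inserted verbatim because its exact format is not specified here.

```python
AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"


def spark_defaults(unravel_host, spark_version, port=4043):
    """Render the Unravel lines for the spark-defaults.conf safety valve."""

    def agent(role):
        # role is "driver" or "executor", matching the config= values above
        return f"-javaagent:{AGENT_JAR}=config={role},libs={spark_version}"

    return "\n".join([
        f"spark.unravel.server.hostport={unravel_host}:{port}",
        f"spark.driver.extraJavaOptions={agent('driver')}",
        f"spark.executor.extraJavaOptions={agent('executor')}",
        "spark.eventLog.enabled=true",
    ])


# "unravel.example.com" is hypothetical; "spark-version" is left as the
# placeholder from the text and must be replaced with your real value.
print(spark_defaults("unravel.example.com", "spark-version"))
```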
6. Configure YARN-MapReduce JVM sensor cluster-wide
In Cloudera Manager, go to YARN service.
Select the Configuration tab.
Search for Application Master Java Opts Base and append the following value to the existing setting. The appended text must start with a space.
Note
Make sure that "-" is a minus sign. Replace unravel-host with your Unravel Server IP address or fully qualified domain name. For a multi-host Unravel installation, use the IP address of host2.
-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043
Search for MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml in the middle of the page.
Enter the following XML snippet of four property blocks in Gateway Default Group. (Click View as XML.)
<property>
  <name>mapreduce.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.task.profile.maps</name>
  <value>0-5</value>
</property>
<property>
  <name>mapreduce.task.profile.reduces</name>
  <value>0-5</value>
</property>
<!-- the value below is one line -->
<property>
  <name>mapreduce.task.profile.params</name>
  <value>-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043</value>
</property>
Save the changes.
Cloudera Manager will prompt for a restart, which is not necessary for these changes to take effect. (Click Restart Stale Services if it is visible. You can also do this later, during planned maintenance.)
Tip
The restart is important for the MR sensor to be picked up by queries submitted via HiveServer2.
Use the Unravel UI to monitor the situation. When you view the MapReduce APM page for any completed MapReduce job, you should see mappers and reducers in the Resource Usage tab.
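Two details in this section are easy to miss when pasting: the appended Java options must be separated from the existing value by a space, and every "-" must be a real minus sign rather than a look-alike dash introduced by a word processor or web page. The sketch below builds the value and checks for those dashes. The function names are illustrative, and the existing option shown in the usage line is only an example of what the field might already contain.

```python
AGENT_JAR = "/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar"

# Unicode hyphens and dashes that look like "-" but are rejected by the JVM.
LOOKALIKE_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def mr_java_opts(unravel_host, port=4043):
    """Return the Unravel agent options for MapReduce JVMs."""
    return f"-javaagent:{AGENT_JAR}=libs=mr -Dunravel.server.hostport={unravel_host}:{port}"


def check_dashes(opts):
    """Raise ValueError if opts contains a dash that is not ASCII '-'."""
    bad = sorted({ch for ch in opts if ch in LOOKALIKE_DASHES})
    if bad:
        raise ValueError(f"non-ASCII dash characters found: {bad}")
    return opts


def append_opts(existing, unravel_host):
    """Append the agent options to an existing value, separated by one space."""
    return existing.rstrip() + " " + check_dashes(mr_java_opts(unravel_host))


# The existing value and the IP address here are examples only.
print(append_opts("-Djava.net.preferIPv4Stack=true", "10.0.0.5"))
```

The string returned by mr_java_opts is also the value used for mapreduce.task.profile.params in the XML snippet.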
7. Retrieve Impala data from Cloudera Manager
Configure Unravel Server to retrieve Impala query data from Cloudera Manager as follows:
Add com.unraveldata.data.source=cm in /usr/local/unravel/etc/unravel.properties on Unravel Server.
Tell Unravel Server some information about your Cloudera Manager's URL, port number, login credentials, and so on.
You do this by adding the following properties to /usr/local/unravel/etc/unravel.properties on Unravel Server.
For example:
Prior to Unravel v4.5.4
com.unraveldata.data.source=cm
com.unraveldata.cloudera.manager.url=http://my-cm-url
com.unraveldata.cloudera.manager.port=9997
com.unraveldata.cloudera.manager.username=mycmname
com.unraveldata.cloudera.manager.password=mycmpassword
Unravel v4.5.4.x and later
com.unraveldata.data.source=cm
## include the port with the url
com.unraveldata.cloudera.manager.url=http://my-cm-url:port
com.unraveldata.cloudera.manager.username=mycmname
com.unraveldata.cloudera.manager.password=mycmpassword
Make sure that the Cloudera Manager user in com.unraveldata.cloudera.manager.username has read access to Cloudera Manager REST APIs.
You can verify this by running a curl command such as the following, substituting your local values for the variables:
curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters'
curl --user clouderamanager-username:clouderamanager-password 'http://clouderamanager-url:clouderamanager-port/api/v13/clusters/cluster-name/services'
Note
By default, the Impala sensor task is enabled. To disable it, specify the following option in /usr/local/unravel/etc/unravel.properties on Unravel Server:
com.unraveldata.sensor.tasks.disabled=iw
(Optional) Change the Impala lookback window.
By default, when Unravel Server starts, it retrieves the last 5 minutes of Impala queries. To change this, do the following:
On Unravel Server, change com.unraveldata.cloudera.manager.impala.look.back.minutes in /usr/local/unravel/etc/unravel.properties.
For example, to set the lookback to 7 minutes:
com.unraveldata.cloudera.manager.impala.look.back.minutes=-7
Note
Include a minus sign in front of the new value.
Restart the unravel_us daemon.
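The Cloudera Manager properties in this section vary in two ways: the port is a separate property before Unravel v4.5.4 and part of the URL from v4.5.4 onward, and the lookback value is written with a leading minus sign. The sketch below renders the lines for unravel.properties under those two rules. The function name is an assumption, and the credentials in the usage line are the placeholders from the text.

```python
PREFIX = "com.unraveldata.cloudera.manager"


def cm_properties(unravel_version, cm_url, cm_port, username, password,
                  lookback_minutes=None):
    """Render the Cloudera Manager lines for unravel.properties."""
    # Compare on the first three numeric components, e.g. "4.5.4.1" -> (4, 5, 4)
    version = tuple(int(part) for part in unravel_version.split(".")[:3])
    lines = ["com.unraveldata.data.source=cm"]
    if version < (4, 5, 4):
        lines.append(f"{PREFIX}.url={cm_url}")
        lines.append(f"{PREFIX}.port={cm_port}")
    else:
        # From v4.5.4 on, the port is included with the URL
        lines.append(f"{PREFIX}.url={cm_url}:{cm_port}")
    lines.append(f"{PREFIX}.username={username}")
    lines.append(f"{PREFIX}.password={password}")
    if lookback_minutes is not None:
        # The value is always written with a leading minus sign
        lines.append(f"{PREFIX}.impala.look.back.minutes=-{abs(lookback_minutes)}")
    return "\n".join(lines)


print(cm_properties("4.5.4", "http://my-cm-url", 9997, "mycmname", "mycmpassword",
                    lookback_minutes=7))
```

After the rendered lines are added to /usr/local/unravel/etc/unravel.properties, the unravel_us daemon still needs the restart described above.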