CDH
This topic explains how to enable additional instrumentation on the gateway/edge/client nodes that are used to submit jobs to your big data platform. Additional instrumentation can include:
- Hive queries in Hadoop that are pushed to Unravel Server by the Hive Hook sensor, a JAR file
- Spark job performance metrics that are pushed to Unravel Server by the Spark sensor, a JAR file
- Impala queries that are pulled from Cloudera Manager
Sensor JARs are packaged in a parcel on the Unravel server.
In Cloudera Manager, go to the Parcels page by clicking the parcels glyph at the top of the page.
Click Configuration to see the Parcel Settings pop-up.
In the Parcel Settings pop-up, go to the Remote Parcel Repository URLs section and click the + glyph to add a new entry.
In a new browser tab, open http://unravel-host:3000/parcels/ and copy the exact directory name for your CDH version. For example, the exact directory name might be cdh5.16 or cdh6.0.
Add http://unravel-host:3000/parcels/cdh-version/ (including the trailing slash), where:
- cdh-version is your version of CDH. For example, cdh5.16 or cdh6.0.
- unravel-host is the hostname or LAN IP address of Unravel Server. On a multi-host Unravel Server, this is the host where the log_receiver daemon is running.
Note: If you're using Active Directory Kerberos, unravel-host must be a fully qualified domain name or IP address.
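For example, assuming a hypothetical Unravel host unravel.example.com and a target cluster running CDH 6.0, the entry would be:
http://unravel.example.com:3000/parcels/cdh6.0/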
Tip: If you're running more than one version of CDH (for example, you have multiple clusters), you can add more than one parcel entry for unravel-host.
Click Save.
Click Check for New Parcels.
On the Parcels page, pick a target cluster in the Location box.
In the list of Parcel Names, find the UNRAVEL_SENSOR parcel that matches the version of the target cluster and click Download.
Click Distribute.
If you have an old parcel from Unravel, deactivate it now.
On the new parcel, click Activate.
In Cloudera Manager, select the target cluster, click Hive | Configuration, and search for hive-env.
In Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hive-env.sh, enter the following exactly as shown, with no substitutions:
AUX_CLASSPATH=${AUX_CLASSPATH}:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar
If Sentry is enabled, grant privileges on the Hive Hook JAR file to the Sentry roles that run Hive queries. Log in to Beeline as user hive and use the Hive SQL GRANT statement to do so. For example (substitute role as appropriate):
GRANT ALL ON URI 'file:///opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar' TO ROLE role
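For instance, a minimal Beeline session might look like the following; the HiveServer2 connection URL and the role name etl_role are hypothetical placeholders for your own values:
beeline -u "jdbc:hive2://hiveserver2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM"
-- inside the Beeline session, connected as user hive:
GRANT ALL ON URI 'file:///opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar' TO ROLE etl_role;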
Copy the Hive Hook JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar) and the BTrace JAR (/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar) to the shared lib path specified by oozie.libpath. If you don't do this, jobs controlled by Oozie 2.3+ will fail.
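A minimal sketch of that copy, assuming oozie.libpath points to /user/oozie/share/lib (substitute your actual shared lib path):
hadoop fs -put /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/unravel_hive_hook.jar /user/oozie/share/lib/
hadoop fs -put /opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar /user/oozie/share/lib/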
Copy the Hive Hook snippet into hive-site.xml:
From the Unravel server OS, run cat <Unravel installation directory>/hive-hook/hive-site.xml.snip, copy the contents, and paste them into hive-site.xml.
Note: On a multi-host Unravel Server deployment, use the <Unravel installation directory>/hive-hook/hive-site.xml.snip snippet from host2.
In Cloudera Manager, go to the Hive service.
Select the Configuration tab.
Search for hive-site.xml in the middle of the page.
Add the snippet to Hive Client Advanced Configuration Snippet for hive-site.xml (Gateway Default Group). To edit, click View as XML.
If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, appending the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook without any space. For example:
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
Note: Hook classes are separated by commas without any spaces.
Add the custom property com.unraveldata.host and set its value to unravel-gateway-internal-IP-hostname.
Add the snippet to HiveServer2 Advanced Configuration Snippet for hive-site.xml. To edit, click View as XML.
If you configure CDH with Cloudera Navigator's safety valve setting, you must edit the keys hive.exec.post.hooks, hive.exec.pre.hooks, and hive.exec.failure.hooks, appending the value com.unraveldata.dataflow.hive.hook.UnravelHiveHook without any space. For example:
<property>
  <name>hive.exec.post.hooks</name>
  <value>com.unraveldata.dataflow.hive.hook.UnravelHiveHook,com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger</value>
  <description>for Unravel, from unraveldata.com</description>
</property>
Note: Hook classes are separated by commas without any spaces.
Add the custom property com.unraveldata.host and set its value to unravel-gateway-internal-IP-hostname.
Save the changes with the optional comment Unravel snippet in hive-site.xml.
Deploy the Hive client configuration by clicking the deploy glyph or by using the Actions pull-down menu.
Restart the Hive service.
Tip: Cloudera Manager recommends a restart, but a restart is not necessary to activate these changes. Don't restart now; you can restart later.
Check the Unravel UI to see if all Hive queries are running.
If queries are running and appearing in the Unravel UI, you are done.
If queries are failing with a class not found error or permission problems:
- Undo the hive-site.xml changes in Cloudera Manager.
- Deploy the Hive client configuration.
- Restart the Hive service.
- Follow the steps in Troubleshooting.
In Cloudera Manager, select the target cluster and then click Spark.
Select Configuration.
Search for spark-defaults.
In Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, enter the following text, replacing placeholders with your particular values:
spark.unravel.server.hostport=unravel-host:4043
spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-version
spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-version
spark.eventLog.enabled=true
On a multi-host Unravel Server deployment, use host2's FQDN or logical hostname for unravel-host.
Save changes.
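For example, assuming a hypothetical Unravel host unravel.example.com and Spark 2.4 (so libs=spark-2.4), the finished snippet would read:
spark.unravel.server.hostport=unravel.example.com:4043
spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-2.4
spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-2.4
spark.eventLog.enabled=true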
Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu. New spark-shell JVM containers are then created with the necessary extraJavaOptions for the Spark drivers and executors.
Check the Unravel UI to see if all Spark jobs are running.
If jobs are running and appearing in the Unravel UI, you have deployed the Spark JAR successfully.
If jobs are failing with a class not found error or permission problems:
- Undo the spark-defaults.conf changes in Cloudera Manager.
- Deploy the client configuration.
- Investigate and fix the issue.
- Follow the steps in Troubleshooting.
Note: If you have YARN-client mode applications, the default Spark configuration is not sufficient, because the driver JVM starts before the configuration set through SparkConf is applied. For more information, see Apache Spark Configuration. In this case, configure the Unravel Sensor for Spark to profile specific Spark applications only (in other words, per-application profiling rather than cluster-wide profiling), as in the sketch below.
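As a sketch of per-application profiling, the same settings can be passed at submit time with --conf flags; the Unravel host, Spark version, and application JAR below are hypothetical:
spark-submit \
  --conf spark.unravel.server.hostport=unravel.example.com:4043 \
  --conf "spark.driver.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=driver,libs=spark-2.4" \
  --conf "spark.executor.extraJavaOptions=-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=config=executor,libs=spark-2.4" \
  --conf spark.eventLog.enabled=true \
  --class com.example.MyApp my-app.jar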
In Cloudera Manager, go to the YARN service.
Select the Configuration tab.
Search for Application Master Java Opts Base and append the following snippet to the existing value (make sure it starts with a space):
-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043
Note: Make sure that "-" is a plain minus sign. Replace unravel-host with your Unravel Server IP address or fully qualified domain name. For a multi-host Unravel installation, use the IP address of host2.
Search for MapReduce Client Advanced Configuration Snippet (Safety Valve) for mapred-site.xml in the middle of the page.
Enter the following four-property XML snippet in the Gateway Default Group. (Click View as XML.)
<property>
  <name>mapreduce.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.task.profile.maps</name>
  <value>0-5</value>
</property>
<property>
  <name>mapreduce.task.profile.reduces</name>
  <value>0-5</value>
</property>
<!-- The value element below must be entered on one line. -->
<property>
  <name>mapreduce.task.profile.params</name>
  <value>-javaagent:/opt/cloudera/parcels/UNRAVEL_SENSOR/lib/java/btrace-agent.jar=libs=mr -Dunravel.server.hostport=unravel-host:4043</value>
</property>
Save the changes.
Deploy the client configuration by clicking the deploy glyph or by using the Actions pull-down menu.
Cloudera Manager will indicate that a restart is required; a restart is not necessary to effect these changes. (Click Restart Stale Services if it is visible, or perform the restart later during planned maintenance.)
Tip: The restart is important for the MR sensor to be picked up by queries submitted via HiveServer2.
Use the Unravel UI to monitor the situation. When you view the MapReduce APM page for any completed MapReduce job, you should see mappers and reducers in the Resource Usage tab.
The following properties can be set to retrieve Impala query data from Cloudera Manager:

Property/Description | Set by user | Unit | Default
---|---|---|---
com.unraveldata.data.source. Can be cm or impalad. | | | cm
For example:
com.unraveldata.data.source=cm
com.unraveldata.cloudera.manager.url=http://my-cm-url
com.unraveldata.cloudera.manager.username=mycmname
com.unraveldata.cloudera.manager.password=mycmpassword
For multi-cluster, use the following format:
# cloudera manager CM1
com.unraveldata.cloudera.manager.CM1.url=http://my-cm-url1
com.unraveldata.cloudera.manager.CM1.username=mycmname1
com.unraveldata.cloudera.manager.CM1.password=mycmpassword1
# cloudera manager CM2
com.unraveldata.cloudera.manager.CM2.url=http://my-cm-url2
com.unraveldata.cloudera.manager.CM2.username=mycmname2
com.unraveldata.cloudera.manager.CM2.password=mycmpassword2
Note: By default, the Impala sensor task is enabled. To disable it, set the following property using manager config:
com.unraveldata.sensor.tasks.disabled=iw
Optionally, you can change the Impala lookback window. By default, when Unravel Server starts, it retrieves the last five minutes of Impala queries. To change this, use manager config to change the value of the com.unraveldata.cloudera.manager.impala.look.back.minutes property. For example, to set the lookback to seven minutes:
com.unraveldata.cloudera.manager.impala.look.back.minutes=-7
Note: Include a minus sign in front of the new value.
For quick initial installation, you can use the hdfs principal and its keytab. However, for production use, you may want to create an alternate principal with restricted access to specific areas and use its corresponding keytab. This topic explains how to do this.
You can name the alternate principal whatever you prefer; in these steps, it is named unravel. Its name doesn't need to be the same as the local username.
The steps apply only to CDH and have been tested using Cloudera Manager with the recommended Sentry configuration.
Check the HDFS default umask.
For access via ACL, the group part of the HDFS default umask needs to have read and execute access. This allows Unravel to see subdirectories and read files. The default umask setting on HDFS for both CDH and HDP is 022. The middle digit controls the group mask, and ACLs are masked using this default group mode. You can check the HDFS umask setting either from Cloudera Manager or in hdfs-site.xml:
- In Cloudera Manager, check the value of dfs.umaskmode and make sure the middle digit is 2 or 0.
- In the hdfs-site.xml file, search for fs.permissions.umask-mode and make sure the middle digit is 2 or 0.
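As a quick command-line check (assuming the HDFS client configuration is deployed on the node where you run it), you can also read the effective value directly:
hdfs getconf -confKey fs.permissions.umask-mode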
Enable ACL inheritance.
In Cloudera Manager's HDFS configuration, search for namenode advanced configuration snippet and set the dfs.namenode.posix.acl.inheritance.enabled property to true in hdfs-site.xml. This is a workaround for an issue where HDFS was not compliant with the POSIX standard for ACL inheritance. For details, see Apache JIRA HDFS-6962. Cloudera backported the fix for this issue into CDH 5.8.4, CDH 5.9.1, and later, setting dfs.namenode.posix.acl.inheritance.enabled to false in Hadoop 2.x and true in Hadoop 3.x.
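A minimal sketch of the corresponding safety valve entry for hdfs-site.xml:
<property>
  <name>dfs.namenode.posix.acl.inheritance.enabled</name>
  <value>true</value>
</property>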
Restart the cluster to effect the change of dfs.namenode.posix.acl.inheritance.enabled to true.
Change the ACLs of the target HDFS directories.
Run the following commands as the hdfs user to change the ACLs of the following HDFS directories. Run these in the order presented.
Set the ACL for future directories.
Note: Be sure to set the permissions at the /user/history level. Files are first written to an intermediate_done folder under /user/history and then moved to /user/history/done.
hadoop fs -setfacl -R -m default:user:unravel:r-x /user/spark/applicationHistory
hadoop fs -setfacl -R -m default:user:unravel:r-x /user/history
hadoop fs -setfacl -R -m default:user:unravel:r-x /tmp/logs
hadoop fs -setfacl -R -m default:user:unravel:r-x /user/hive/warehouse
If you have Spark2 installed, set the ACL of the Spark2 application history folder:
hadoop fs -setfacl -R -m default:user:unravel:r-x /user/spark/spark2ApplicationHistory
Set the ACL for existing directories.
hadoop fs -setfacl -R -m user:unravel:r-x /user/spark/applicationHistory
hadoop fs -setfacl -R -m user:unravel:r-x /user/history
hadoop fs -setfacl -R -m user:unravel:r-x /tmp/logs
hadoop fs -setfacl -R -m user:unravel:r-x /user/hive/warehouse
If you have Spark2 installed, set the ACL of the Spark2 application history folder:
hadoop fs -setfacl -R -m user:unravel:r-x /user/spark/spark2ApplicationHistory
Verify the ACLs of the target HDFS directories.
hdfs dfs -getfacl /user/spark/applicationHistory
hdfs dfs -getfacl /user/spark/spark2ApplicationHistory
hdfs dfs -getfacl /user/history
hdfs dfs -getfacl /tmp/logs
hdfs dfs -getfacl /user/hive/warehouse
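Each listing should now include entries for the unravel user. Illustrative output for one directory (owner, group, and the other entries will vary by cluster):
# file: /user/history
# owner: mapred
# group: hadoop
user::rwx
user:unravel:r-x
group::r-x
mask::r-x
other::---
default:user:unravel:r-x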
On the Unravel Server, verify HDFS permissions on the folders as the target user (unravel, hdfs, mapr, or custom) with a valid Kerberos ticket corresponding to the keytab principal:
sudo -u unravel kdestroy
sudo -u unravel kinit -kt keytab-file principal
sudo -u unravel hadoop fs -ls /user/history
sudo -u unravel hadoop fs -ls /tmp/logs
sudo -u unravel hadoop fs -ls /user/hive/warehouse
Find and verify the keytab:
klist -kt keytab-file
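For example, using the keytab and principal from the Kerberos step below (paths on your system may differ):
sudo -u unravel kdestroy
sudo -u unravel kinit -kt /etc/security/keytabs/unravel.service.keytab unravel/server@example.com
sudo -u unravel hadoop fs -ls /user/history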
Warning: If you're using KMS and HDFS encryption and the hdfs principal, you might need to adjust kms-acls.xml permissions in Cloudera Manager for DECRYPT_EEK if access is denied. In particular, the "done" directory might not allow decryption of logs by the hdfs principal.
If you're using "JNI"-based groups for HDFS (a setting in Cloudera Manager), you need to add this line to /usr/local/unravel/etc/unravel.ext.sh:
export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
If Kerberos is enabled, set the new values for keytab-file and principal:
<Unravel installation directory>/manager config kerberos set --keytab /etc/security/keytabs/unravel.service.keytab --principal unravel/server@example.com
<Unravel installation directory>/manager config kerberos enable
Important: Whenever you change Kerberos tokens or the principal, restart all services:
<installation directory>/manager restart
References
For more information on creating permanent functions, see Cloudera documentation.