Data
The Data page displays KPIs and insights about the data that applications read, create, and edit. For this, Unravel collects data from metastores, filesystems, and applications that run on the clusters it monitors. Unravel currently supports getting data from one or more Hive metastores, connecting to each via a direct JDBC connection to the metastore database. This section describes how to connect to the Hive metastore and how to configure FSImage to collect data.
Connecting to Hive metastore
The Hive metastore connection can be set up either with auto-configuration or with manual configuration.
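Whichever option you use, the connection consists of the metastore database's JDBC driver, URL, user, and password. If you need to look these values up yourself, the sketch below is one minimal way to do it, assuming a typical hive-site.xml location (the path is an assumption; adjust it for your distribution).
# Hypothetical hive-site.xml path; the standard javax.jdo.option.Connection* properties
# hold the metastore database's JDBC driver class, connection URL, and user name.
grep -A1 'javax.jdo.option.Connection' /etc/hive/conf/hive-site.xml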
Option 1: Auto-configuration
Stop Unravel
<Unravel installation directory>/unravel/manager stop
Run auto-configuration.
<unravel_installation_directory>/unravel/manager config auto
Note
In a multi-cluster deployment, the command differs depending on whether the Hive metastore is on the core node or on an edge node. In both cases, the commands must be run only from the core node.
When the hive metastore is on the core node, run the following from the core node:
<unravel_installation_directory>/unravel/manager config auto
When the hive metastore is on the edge node, run the following command also from the core node:
<unravel_installation_directory>/unravel/manager config edge auto <edge-key>
Auto-configuration sets all the properties for the JDBC connection to the Hive metastore database on the current node, except for the password. If the Hive metastore password is not found during auto-configuration, you are prompted to set it.
Set the Hive metastore password.
Note
This step is required only if the password is not found during the auto-configuration.
Tip
Run the manager config edge show command to get the <HIVE_KEY> and <CLUSTER_KEY> details.
<HIVE_KEY> is the definition of the Hive service.
<CLUSTER_KEY> is the name of the cluster where you set the Hive configurations.
In a single cluster deployment, set the password as follows:
<Unravel installation directory>/unravel/manager config hive metastore password <CLUSTER_KEY> <password>
## For example: <Unravel installation directory>/unravel/manager config hive metastore password my-cluster password
In a multi-cluster deployment, where edge nodes are monitoring, set the password from the core node as follows:
<Unravel installation directory>/unravel/manager config hive metastore password <CLUSTER_KEY> <HIVE_KEY> <password>
(If core is monitoring) <Unravel installation directory>/unravel/manager config edge hive metastore password <CLUSTER_KEY> <HIVE_KEY> <password>
## For example: <Unravel installation directory>/unravel/manager config edge hive metastore password my-cluster hive password
Also, refer to Encrypting/Decrypting passwords.
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
Go to the Unravel UI's Jobs > Applications page to confirm that Hive queries are displayed. Approximately twenty-four hours after configuration, the Data page displays a list of your Hive metastore tables along with their KPIs and other details.
Option 2: Manual configuration
Stop Unravel
<Unravel installation directory>/unravel/manager stop
Run the following command to connect to the Hive metastore.
Tip
Run the manager config edge show command to get the <EDGE_KEY>, <CLUSTER_KEY>, and <HIVE_KEY> details.
<EDGE_KEY> is the label you provide to identify the edge node when you set up the cluster.
<CLUSTER_KEY> is the name of the cluster where you set the Hive configurations.
<HIVE_KEY> is the definition of the Hive service.
Single cluster deployment
manager config hive metastore set <CLUSTER_KEY> <HIVE_KEY> <DRIVER> <URL> <USER> <PASSWORD>
##For example: manager config hive metastore set my-cluster hive oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@prodHost:1521:ORCL user passcode
Multi-cluster deployment
manager config edge hive metastore set <EDGE_KEY> <CLUSTER_KEY> <HIVE_KEY> <DRIVER> <URL> <USER> <PASSWORD>
##For example: manager config edge hive metastore set my-edge my-cluster hive oracle.jdbc.driver.OracleDriver jdbc:oracle:thin:@prodHost:1521:ORCL user passcode
This will set the following metastore database information:
JDBC driver: JDBC Driver class name for the data store containing the metadata. For example:
MySQL: com.mysql.jdbc.Driver
Oracle: oracle.jdbc.driver.OracleDriver
Microsoft: com.microsoft.sqlserver.jdbc.SQLServerDriver
JDBC URL: JDBC connection string for the data store containing the metadata, of the form jdbc:<DB_Driver>://HOST:PORT/hive (a complete MySQL example is sketched after this procedure). For example:
Oracle: jdbc:oracle:thin:@prodHost:1521:ORCL
Microsoft: jdbc:sqlserver://jdbc_url
Username: Username used to access the data store.
Password: Password used to access the data store.
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
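The fields above map directly onto the manual set command. As a minimal sketch, the command below configures a MySQL-backed metastore in a single cluster deployment; the host, database name, and credentials are hypothetical placeholders.
# Hypothetical values; the MySQL driver class and the jdbc:mysql:// URL form are standard.
manager config hive metastore set my-cluster hive com.mysql.jdbc.Driver jdbc:mysql://metastore-db.example.com:3306/hive hiveuser hivepass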
Configuring FSImage
This topic explains how to configure FSImage processing, which is triggered by default. FSImage is only applicable on CDH, CDP, and HDP. See Toggling FSImage status for how to disable it.
Common configurations
FSImage must be configured for some Data page features and content, specifically to:
Automatically generate the Files report
Calculate and populate the partition and table size information on the Data page (refer to the Table details section)
Create the Small Files report upon user request.
Toggling FSImage status
Stop Unravel
<Unravel installation directory>/unravel/manager stop
Change the setting.
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.disable true
This property enables or disables Unravel's ability to generate the Small Files and Files reports. The default is false.
Note
false: enables the Small Files and Files reports in both the backend and the UI.
true: disables the Small Files and Files reports in both the backend and the UI.
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
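To re-enable the Small Files and Files reports later, repeat the same stop/apply/start procedure with the property set back to its default value:
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.disable false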
Define resources used to process FSImage
Stop Unravel
<Unravel installation directory>/unravel/manager stop
From the installation directory, set the properties listed in the table as follows. For example:
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.spark.cores 6
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.spark.driver.memory 8G
The following properties define the resources used to process FSImage.
Property/Description | Set by user | Unit | Default |
---|---|---|---|
unravel.python.reporting.files.spark.cores — The number of cores used to process FSImage. The default is the recommended value, in order not to overload the Unravel node with FSImage runs. Unit: count > 0 | | count | 4 |
unravel.python.reporting.files.spark.driver.memory — The amount of memory to allocate to the JVM that runs Spark. The value must be a positive number. To specify bytes: #, for example, 30. To specify megabytes: #M, for example, 30M. To specify gigabytes: #G, for example, 30G. | | count | 16G |
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
Configuring FSImage in a single cluster deployment
Accessing FSImage
Unravel can run as an HDFS admin: This allows Unravel to access the FSImage on the Namenode using the hdfs dfsadmin command.
Unravel cannot run as an HDFS admin: FSImage must be downloaded for Unravel to use it. Create a cron job to download it to the Unravel server, and delete the old image before each run. The download can take up to 10 hours depending on the size of the FSImage; this time determines how often the cron job should run.
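For the second case, a cron-driven fetch might look like the minimal sketch below. The schedule and target directory (which should match unravel.python.reporting.files.external_fsimage_dir), as well as the assumption that the cron user has HDFS dfsadmin privileges, are placeholders to adapt.
# Hypothetical crontab entry: delete the old image, then fetch a fresh FSImage every day at 08:00
0 8 * * * rm -f /srv/unravel/tmp/fsimages/reports/fsimage_* && hdfs dfsadmin -fetchImage /srv/unravel/tmp/fsimages/reports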
Set the following properties to access FSImage:
Stop Unravel
<Unravel installation directory>/unravel/manager stop
From the installation directory, set the properties listed in the table as follows:
<Unravel installation directory>/unravel/manager config properties set <property> <value>
For example:
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.skip_fetch_fsimage true
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.external_fsimage_dir /srv/unravel/tmp/fsimages/reports
Property/Description | Default |
---|---|
unravel.python.reporting.files.skip_fetch_fsimage — If HDFS admin privileges cannot be granted, set this to true to allow Unravel's OnDemand process to use an externally fetched FSImage. true: the OnDemand etl_fsimage process does not fetch the FSImage from the Namenode; instead, the FSImage is expected to be available in the directory specified by unravel.python.reporting.files.external_fsimage_dir. | false |
unravel.python.reporting.files.external_fsimage_dir — Directory for the FSImage when skip_fetch_fsimage=true. The externally fetched FSImage is expected to be in this directory; Unravel uses the latest file in this directory whose name starts with "fsimage_". This directory must be different from Unravel's internal directory, i.e., /srv/unravel/tmp/reports/fsimage. | - |
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
Importing etl_fsimage
The etl_fsimage task imports the latest FSImage from the Namenode. The etl_fsimage run time is proportional to the image size, for example:
FSImage Size | Run Time |
---|---|
19 GB | 24 hours |
9 GB | 14 hours |
4 GB | 7 hours |
After the Unravel Server is installed or upgraded, run the following command to trigger the FSImage import.
curl -v http://localhost:5000/small-files-etl
Important
FSImage is a snapshot that becomes outdated with the passage of time; in other words, the older the image, the more it diverges from the real-time structure.
Configuring FSImage in a multi-cluster deployment
Unravel Ondemand processes the HDFS FSImage as follows:
Fetches the raw FSImage from the HDFS Namenode:
hdfs dfsadmin -fetchImage <path to fsimage file on local machine>
Parses the raw FSImage into a tab-separated text file:
hdfs oiv <path to fsimage file on local machine>
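In practice the parse step uses the Delimited processor of hdfs oiv, exactly as in the template script later in this section; a minimal sketch with placeholder paths:
# Input image, delimited text output, and temporary working directory are placeholder paths
hdfs oiv -i /tmp/fsimage_0000000000000012345 -o /tmp/fsimage.txt -p Delimited -t /tmp/fsimage.tmp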
For this to work, FSImage must be fetched and parsed on a cluster gateway node.
In a single cluster configuration, the Unravel core node itself is the cluster gateway node, so the FSImage is fetched and parsed by the Unravel OnDemand process itself.
In a multi-cluster configuration, the Unravel edge nodes are the cluster gateway nodes. These nodes have a trivial Unravel footprint, and there is no way to fetch or process the FSImage using any Unravel process.
Thus, for FSImage processing to work, the image must be fetched, parsed, and uploaded to the Unravel core node by a non-Unravel process/script. A template script is provided for this purpose, which must be run on each of the Unravel edge nodes.
Configuring the template script
You must configure the following parameters in the template script (they correspond to the variables defined near the top of the script).
Parameters | Description |
---|---|
CLUSTER_UID | Set the Unravel-generated UID for the cluster attached to the edge node. |
UNRAVEL_CORE_NODE_HOSTNAME | Set the Unravel core node's fully qualified hostname. |
UNRAVEL_CORE_NODE_USER | Set the Unravel user name. |
FSIMAGE_DESTINATION_BASEDIR | If Unravel is installed in a directory other than the default installation directory, update this path accordingly. |
Any user can run the template script; however, that user must have HDFS dfsadmin privileges. In addition, if the cluster is Kerberized, an appropriate kinit statement should be added to the script.
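For example, on a Kerberized cluster the script could start with a kinit as a user that holds the dfsadmin privilege; the keytab path and principal below are assumptions:
# Hypothetical keytab and principal; substitute your cluster's values
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs@EXAMPLE.COM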
Template Script

#!/bin/bash
set -x

# This script should be set up as a cron job on the Unravel edge node to run once every 24 hours. Ondemand Fsimage is triggered
# every day at 00:00 UTC. This script should be set up in such a way that the parsed Fsimage is available some time
# before 00:00 UTC on the Unravel core node. To achieve that, the following times must be considered:
# 1. Time to fetch Fsimage from the Namenode using the hdfs dfsadmin -fetchImage command
# 2. Time to parse Fsimage into a TAB separated text file using the hdfs oiv command
# 3. Time to copy the parsed Fsimage to the Unravel core node (using rsync)
# The above three times should be found out manually, and the cron job should run at such a time that the parsed Fsimage is
# available a little before 00:00 UTC, so that Ondemand Fsimage deals with relatively fresh Fsimage data. Please note
# this is not a must for Ondemand Fsimage functionality, but just for seeing data that is as fresh as possible.

# Fetching Fsimage requires the dfsadmin privilege for the user running the command.
# 1. In a non-Kerberized cluster, we must run this script as that user (or use any other mechanism, like logging into that user
#    without a prompt just for running the dfsadmin command, or setting the setuid bit on this script so that it always runs as
#    that user, etc.)
# 2. In a Kerberized cluster, we must do kinit for the user/service having the dfsadmin privilege (typically hdfs has that
#    permission).

# As described above, the hdfs dfsadmin -fetchImage command requires that the user running it has the dfsadmin privilege.
# Thus this cron job
# 1. Must run as that user in a non-Kerberized environment
# 2. Must have kinit done as the dfsadmin user
# If a kinit needs to be done, please add it below

# Configure the Unravel edge node specific variables here. In order for rsync to work without a password prompt, the
# public SSH key of the user must be added to UNRAVEL_CORE_NODE_USER's $HOME/.ssh/authorized_keys file. In
# addition, the Unravel edge node hostname must be added as a known host to the Unravel core node. This user can
# be different than the dfsadmin user above.
CLUSTER_UID=default
UNRAVEL_CORE_NODE_HOSTNAME=xyz.unraveldata.com
UNRAVEL_CORE_NODE_USER=unravel
FSIMAGE_DESTINATION_BASEDIR=<Unravel_installation_dir>/unravel/data/tmp/reports/fsimage

mkdir /tmp/$$
if [ $? -ne 0 ]
then
    echo "Failed to mkdir /tmp/$$"
    exit 1
fi

hdfs dfsadmin -fetchImage /tmp/$$
if [ $? -ne 0 ]
then
    echo "Failed to fetch Fsimage"
    exit 1
fi

hdfs oiv -i /tmp/$$/fs* -o /tmp/$$/fsimage.txt -p Delimited -t /tmp/$$/fsimage.tmp
if [ $? -ne 0 ]
then
    echo "Failed to parse Fsimage"
    exit 1
fi

rsync /tmp/$$/fsimage.txt ${UNRAVEL_CORE_NODE_USER}@${UNRAVEL_CORE_NODE_HOSTNAME}:${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_UID}
if [ $? -ne 0 ]
then
    echo "Failed to upload Fsimage /tmp/$$/fsimage.txt to @${UNRAVEL_CORE_NODE_HOSTNAME} at ${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_UID}"
    exit 1
fi

rm -rf /tmp/$$
Running the template script and setting up a cron job
FSImage is processed by the Unravel OnDemand process every day at 00:00 UTC. To guarantee data freshness, the latest FSImage should be uploaded to the Unravel core node a short time before 00:00 UTC. Before setting up the cron job, do the following:
Fetch the FSImage using the hdfs dfsadmin -fetchImage command and note the time taken.
Parse the FSImage using the hdfs oiv command and note the time taken.
Upload the FSImage to a temporary location on the Unravel core node and note the time taken.
Set up the template script as a cron job that runs every day at a time such that these three steps finish before 00:00 UTC.
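For example, if fetching, parsing, and uploading together take about three hours, a crontab entry on the edge node such as the hypothetical one below (the script path and log file are assumptions) leaves a comfortable margin before 00:00 UTC:
# Hypothetical crontab entry; the cron daemon's timezone must be taken into account
0 20 * * * /opt/scripts/unravel_fetch_fsimage.sh >> /var/log/unravel_fetch_fsimage.log 2>&1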
Note
The FSImage is uploaded from the Unravel edge node to the Unravel core node using rsync. The appropriate SSH setup is required: the Unravel edge node must be added as a known SSH host, and the public SSH key of the uploading user (the user that runs the cron job) must be added to the authorized SSH keys on the Unravel core node.
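A minimal sketch of that SSH setup, run on the Unravel edge node as the user that runs the cron job, assuming the UNRAVEL_CORE_NODE_USER and UNRAVEL_CORE_NODE_HOSTNAME values from the template script:
# Generate a key pair if one does not already exist, then install the public key on the core node
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id unravel@xyz.unraveldata.com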
Verifying the FSImage configuration
After the FSImage has been successfully fetched, you can verify the following in the UI:
Tables have table and partition sizes.
The four data Files reports are populated.
You can generate a Small Files report.
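In a multi-cluster deployment you can additionally confirm that the parsed FSImage was uploaded by the template script; the path below is derived from the FSIMAGE_DESTINATION_BASEDIR and CLUSTER_UID values in that script:
# Run on the Unravel core node; adjust the installation directory and cluster UID to your values
ls -l <Unravel_installation_dir>/unravel/data/tmp/reports/fsimage/<CLUSTER_UID>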
Tip
The relevant log file is
<unravel-installation-directory>/logs/ondemand_tasks.out
Run one of the following commands to display the progress of the etl_fsimage task:
egrep 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out
grep etl_fsimage\(\) unravel_ondemand.out
Run one of the following commands to display the progress of run_small_files, which is started whenever the Small Files report is triggered from the UI:
egrep 'SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out
grep run_small_files\(\) ondemand_tasks.out
Connecting Databricks workspaces for Data page
Ensure that at least one of the workspaces is populated before you configure a workspace for the Data page.
To configure Databricks for the Data page, do the following:
Stop Unravel
<Unravel installation directory>/unravel/manager stop
Set the following property.
<Unravel installation directory>/unravel/manager config properties set hive.metastore.<X>.workspace.ids <Comma-separated list of Databricks workspace IDs>
##Replace <X> with the metastore variables listed in com.unraveldata.hive.metastore.list. (A hypothetical example is sketched after these steps.)
Apply the changes.
<Unravel installation directory>/unravel/manager config apply
Start Unravel
<Unravel installation directory>/unravel/manager start
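For illustration, if com.unraveldata.hive.metastore.list contains a metastore entry named databricks1 (a hypothetical name, as are the workspace IDs below), the command would look like:
<Unravel installation directory>/unravel/manager config properties set hive.metastore.databricks1.workspace.ids 1234567890123456,6543210987654321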