Data

The Data page displays KPIs and insights about the data that applications read, create, and edit. For this, Unravel collects data from metastores, filesystems, and applications that run on the clusters monitored by Unravel. Unravel currently supports getting data from one or more Hive metastores. It connects to a Hive metastore via a direct JDBC connection to the database of the Hive metastore. This section describes connecting to Hive metastore and configuring FSImage to collect data.

Connecting to Hive metastore

The Hive metastore connection can be set up through either auto-configuration or manual configuration.

Notice

In a multi-cluster environment, run the following steps on the core node.

Option 1: Auto-configuration

Stop Unravel

<Unravel installation directory>/unravel/manager stop

Run auto-configuration.
- Single cluster
```
<unravel_installation_directory>/unravel/manager config auto
```
- Multi-cluster
  Run the following command from the core node, where edge nodes are monitoring the Hadoop cluster:
```
<unravel_installation_directory>/unravel/manager config edge auto <EDGE_KEY>
##Example: /opt/unravel/manager config edge auto my-edge
```
  If the core node monitors the Hadoop cluster directly, run the following command from the core node:
```
<unravel_installation_directory>/unravel/manager config auto
```
Set the Hive metastore password. The Hive metastore database password can be recovered automatically only for a cluster manager with an administrative account. Otherwise, it must be set manually.
1. Run the manager config edge show command to get the <EDGE_KEY>, <HIVE_KEY>, and <CLUSTER_KEY>, which must be provided when you set the Hive metastore password.
  - <EDGE_KEY> is the label you provide to identify the edge node when you set the cluster.
  - <CLUSTER_KEY> is the name of the cluster where you set the Hive configurations.
  - <HIVE_KEY> is the definition of the Hive service. In the output of the manager config edge show command, this is shown as <SERVICE_KEY>
2. In a single cluster deployment, set the password as follows. If you omit the password, you are prompted for it without echo.
```
<Unravel installation directory>/unravel/manager config hive metastore password <CLUSTER_KEY> <password>
##Example: <Unravel installation directory>/unravel/manager config hive metastore password cluster1 password
```
  In a multi-cluster deployment, where edge nodes are monitoring, set the password on the core node as follows:
```
<Unravel installation directory>/unravel/manager config edge hive metastore password <EDGE_KEY> <CLUSTER_KEY> <HIVE_KEY> <password>
##Example: <Unravel installation directory>/unravel/manager config edge hive metastore password local-node cluster1 hive password
```
  If the core node monitors the Hadoop cluster directly, run the following command from the core node:
```
<Unravel installation directory>/unravel/manager config hive metastore password <CLUSTER_KEY> <HIVE_KEY> <password> 
##Example: <Unravel installation directory>/unravel/manager config hive metastore password cluster1 hive password
```

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel
```
<Unravel installation directory>/unravel/manager start
```
Go to the Jobs > Applications page in the Unravel UI to confirm that Hive queries are displayed. Approximately twenty-four hours after configuration, the Data page displays a list of your Hive metastore tables along with their KPIs and other details.

Option 2: Manual configuration

Stop Unravel

<Unravel installation directory>/unravel/manager stop

Run the manager config edge show command to get the <EDGE_KEY>, <HIVE_KEY>, and <CLUSTER_KEY>, which must be provided when you connect to the Hive metastore.
- <EDGE_KEY> is the label you provide to identify the edge node when you set the cluster.
- <CLUSTER_KEY> is the name of the cluster where you set the Hive configurations.
- <HIVE_KEY> is the definition of the Hive service. In the output of the manager config edge show command, this is shown as <SERVICE_KEY>
Run the following command to connect to Hive metastore.
- Single cluster deployment
```
<Unravel installation directory>/unravel/manager config hive metastore set <CLUSTER_KEY> <HIVE_KEY> DRIVER URL USER PASSWORD

##Example: /opt/unravel/manager config hive metastore set my-cluster hive com.mysql.jdbc.Driver jdbc:mysql://localhost:3306/database user passcode
```
- Multi-cluster deployment
```
<Unravel installation directory>/unravel/manager config edge hive metastore set <EDGE_KEY> <CLUSTER_KEY> <HIVE_KEY> DRIVER URL USER PASSWORD

##Example: /opt/unravel/manager config edge hive metastore set my-edge my-cluster hive com.mysql.jdbc.Driver jdbc:mysql://localhost:3306/database user passcode
```
This will set the following metastore database information:
- JDBC driver: JDBC Driver class name for the data store containing the metadata.
  For example:
  - MySQL: com.mysql.jdbc.Driver
  - PostgreSQL: org.postgresql.Driver
  - Oracle: oracle.jdbc.driver.OracleDriver
  - Microsoft: com.microsoft.sqlserver.jdbc.SQLServerDriver
- JDBC URL: JDBC connection string in the format expected by the driver.
  For example:
  - MySQL: jdbc:mysql://host:port/database
  - PostgreSQL: jdbc:postgresql://host:port/database
  - Oracle: jdbc:oracle:thin:@host:port:sid
  - Microsoft: jdbc:sqlserver://hostname:port;databaseName=database
- Username: Username used to access the data store.
- Password: Password used to access the data store.
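
As an illustration of the driver and URL formats listed above, the following shell sketch assembles the DRIVER and URL arguments for a given database type. The function names `jdbc_driver` and `jdbc_url` and the type labels (`mysql`, `postgresql`, `oracle`, `sqlserver`) are hypothetical helpers and not part of Unravel; only the printed strings come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Unravel): print the JDBC driver class and
# the connection URL for a database type, using the formats listed above.

jdbc_driver() {
  case "$1" in
    mysql)      echo "com.mysql.jdbc.Driver" ;;
    postgresql) echo "org.postgresql.Driver" ;;
    oracle)     echo "oracle.jdbc.driver.OracleDriver" ;;
    sqlserver)  echo "com.microsoft.sqlserver.jdbc.SQLServerDriver" ;;
    *)          echo "unsupported database type: $1" >&2; return 1 ;;
  esac
}

# Usage: jdbc_url KIND HOST PORT NAME
# NAME is the database name, or the SID for Oracle.
jdbc_url() {
  local kind="$1" host="$2" port="$3" name="$4"
  case "$kind" in
    mysql)      echo "jdbc:mysql://${host}:${port}/${name}" ;;
    postgresql) echo "jdbc:postgresql://${host}:${port}/${name}" ;;
    oracle)     echo "jdbc:oracle:thin:@${host}:${port}:${name}" ;;
    sqlserver)  echo "jdbc:sqlserver://${host}:${port};databaseName=${name}" ;;
    *)          echo "unsupported database type: $kind" >&2; return 1 ;;
  esac
}
```

For example, `jdbc_url mysql localhost 3306 database` prints `jdbc:mysql://localhost:3306/database`, the URL used in the example commands above.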

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel

<Unravel installation directory>/unravel/manager start

Configuring FSImage

In Hadoop, FSImage is a file stored on the NameNode's local file system that contains the complete directory structure (namespace) of HDFS, including the list of data blocks that make up each file. The NameNode reads this file when it starts.

FSImage must be configured in Unravel for some of the Data page features and content, specifically to:

- Automatically generate the Files Report.
- Calculate and populate the partition and table size information on the Data page. Refer to the Table details section.
- Create the Small Files report upon user request.

FSImage is applicable only to the CDH, CDP, and HDP platforms. This topic explains how to configure FSImage. The FSImage status is enabled by default; to disable the feature, see Disabling FSImage status.

The etl_fsimage task processes the FSImage for each connected cluster. It imports the latest FSImage from the NameNode, generates the file reports, and extracts the table sizes. The run time of etl_fsimage grows with the size of the FSImage, for example:

FSImage size	`etl_fsimage` run time
19 GB	24 hours
9 GB	14 hours
4 GB	7 hours
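
For a rough idea of the run time for sizes between these sample points, the values can be interpolated linearly. The sketch below does only that; `estimate_etl_hours` is a hypothetical helper, and the result is an approximation drawn from the three rows above, not a figure provided by Unravel. Outside the 4 GB to 19 GB range it simply extends the nearest segment.

```shell
#!/usr/bin/env bash
# Rough estimate only: linear interpolation between the three sample points in
# the table above (4 GB -> 7 h, 9 GB -> 14 h, 19 GB -> 24 h).
# estimate_etl_hours is a hypothetical helper, not an Unravel command.
# Usage: estimate_etl_hours SIZE_IN_GB
estimate_etl_hours() {
  awk -v size="$1" 'BEGIN {
    split("4 9 19", gb, " ")
    split("7 14 24", hours, " ")
    # Use the 4-9 GB segment up to 9 GB and the 9-19 GB segment above it.
    i = (size <= gb[2]) ? 1 : 2
    slope = (hours[i + 1] - hours[i]) / (gb[i + 1] - gb[i])
    printf "%.1f\n", hours[i] + (size - gb[i]) * slope
  }'
}
```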

Caution

FSImage is a snapshot that becomes outdated over time. The older the image, the more it diverges from the real-time structure.

In Unravel, FSImage can be configured on a single cluster as well as multi-cluster environments. The following sections are included here:

Defining resources used to process FSImage

You can set Unravel properties to define the resources that are used to process the FSImage. Run the following steps to define the resources. In a multi-cluster environment, you must perform the following steps on the core node.

Stop Unravel

<Unravel installation directory>/unravel/manager stop

FSImage processing uses a standalone Spark process. By default, this process runs with 4 cores and 16 GB of memory, which is suitable for a small FSImage file of less than 10 GB.

To support larger FSImage files, set the properties described in the following table with the manager config properties set command:

##For example:
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.spark.cores 6
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.spark.driver.memory 8G

The following properties define the resources used to process FSImage.

Property/Description	Set by user	Unit	Default
unravel.python.reporting.files.spark.cores The number of cores used to process FSImage. The default is the recommended value, chosen so that FSImage runs do not overload the Unravel node. The value must be greater than 0.		count	`4`
unravel.python.reporting.files.spark.driver.memory The amount of memory to allocate to the JVM that runs Spark. The value must be a positive number. To specify bytes: #, for example, 30. To specify megabytes: #M, for example, 30M. To specify gigabytes: #G, for example, 30G.		bytes (M or G suffix for megabytes or gigabytes)	`16G`
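
The memory value format described above (#, #M, or #G) can be checked before it is passed to the manager. The following sketch is a hypothetical helper, not part of Unravel. It assumes that M and G are binary multiples (1024-based), which is how the JVM interprets memory suffixes; this document does not state which convention Unravel applies.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Unravel): check a driver-memory value of the
# form #, #M, or #G and print the equivalent number of bytes.
# Assumption: M and G are 1024-based multiples, as the JVM treats them.
memory_to_bytes() {
  local value="$1"
  local number="${value%[MG]}"       # strip one trailing M or G, if present
  local suffix="${value#"$number"}"  # the stripped part: "", "M", or "G"
  case "$number" in
    ''|0*|*[!0-9]*)
      # Empty, zero or zero-padded, or containing a non-digit character.
      echo "invalid memory value: $value" >&2
      return 1 ;;
  esac
  case "$suffix" in
    M) echo $(( number * 1024 * 1024 )) ;;
    G) echo $(( number * 1024 * 1024 * 1024 )) ;;
    *) echo "$number" ;;
  esac
}
```

For example, `memory_to_bytes 16G` prints `17179869184`, and `memory_to_bytes 16GB` is rejected with a nonzero exit status.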

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel

<Unravel installation directory>/unravel/manager start
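
The stop, set, apply, and start sequence above can be wrapped in a small script. The following is a sketch only: the function name `set_unravel_properties` is hypothetical, it uses only the manager subcommands shown in this section, and it assumes that each manager call returns a nonzero exit status on failure, which this document does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Unravel) around the sequence shown above.
# Usage: set_unravel_properties MANAGER PROPERTY VALUE [PROPERTY VALUE ...]
# MANAGER is the path to <Unravel installation directory>/unravel/manager.
set_unravel_properties() {
  local manager="$1"
  shift
  if [ "$#" -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
    echo "expected PROPERTY VALUE pairs" >&2
    return 1
  fi
  "$manager" stop || return 1
  # If a later step fails, Unravel stays stopped so the problem can be fixed
  # before it is started again.
  while [ "$#" -ge 2 ]; do
    "$manager" config properties set "$1" "$2" || return 1
    shift 2
  done
  "$manager" config apply || return 1
  "$manager" start
}

# Example call, using the values from the example above:
# set_unravel_properties /opt/unravel/manager \
#   unravel.python.reporting.files.spark.cores 6 \
#   unravel.python.reporting.files.spark.driver.memory 8G
```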

Configuring FSImage in a single cluster deployment

FSImage is configured differently based on whether you can access the FSImage with DFS admin permissions or not.

Configuring FSImage with DFS admin permissions

With hdfs dfsadmin permissions, run the following command to trigger the FSImage import:

curl -v http://localhost:5000/small-files-etl

Configuring FSImage without DFS admin permissions

Download the FSImage.
You must download the FSImage for Unravel to use. Create a cron job that downloads it to the Unravel server. The old image must be deleted before the cron job runs.
The download can take up to 10 hours depending on the size of the FSImage. Use the observed download time to decide how often the cron job should run.
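
A minimal sketch of such a download job follows. It assumes that the job runs as a user with DFS admin privileges on a host where the `hdfs` client is available, and that the target directory is the one configured in unravel.python.reporting.files.external_fsimage_dir. The function name `refresh_external_fsimage`, the script path, and the schedule in the sample crontab entry are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the previous FSImage in DIR with a fresh one.
# Must run as a user that is allowed to call "hdfs dfsadmin -fetchImage".
# Usage: refresh_external_fsimage DIR
refresh_external_fsimage() {
  local dir="$1"
  if [ -z "$dir" ]; then
    echo "usage: refresh_external_fsimage DIR" >&2
    return 1
  fi
  mkdir -p "$dir" || return 1
  # Delete the old image first, as required above.
  rm -f "$dir"/fsimage_* || return 1
  # -fetchImage downloads the most recent FSImage from the NameNode into DIR.
  hdfs dfsadmin -fetchImage "$dir"
}

# Sample crontab entry (hypothetical path and schedule): run daily at 12:00.
# 0 12 * * * /usr/local/bin/refresh_fsimage.sh
```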

Stop Unravel

<Unravel installation directory>/unravel/manager stop

From the installation directory, set the properties listed in the following table:

<Unravel installation directory>/unravel/manager config properties set <property> <value>

##For example:
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.skip_fetch_fsimage true
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.external_fsimage_dir /srv/unravel/tmp/fsimages/reports

Property/Description	Default
unravel.python.reporting.files.skip_fetch_fsimage If DFS admin privileges cannot be granted, set this to true to allow Unravel's OnDemand process to use an externally fetched FSImage. `true`: The OnDemand etl_fsimage process does not fetch the FSImage from the NameNode. Instead, the FSImage is expected to be available in the directory specified by unravel.python.reporting.files.external_fsimage_dir.	false
unravel.python.reporting.files.external_fsimage_dir Directory for the FSImage when skip_fetch_fsimage=true. The externally fetched FSImage is expected to be in this directory. Unravel uses the latest file in this directory whose name starts with "fsimage_". This directory must be different from Unravel's internal directory, that is, /srv/unravel/tmp/reports/fsimage.	-
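
To check which file would be picked up from the external directory, the selection rule above can be approximated in the shell. This sketch assumes that "latest" means the most recently modified file whose name starts with `fsimage_`; the document does not say how Unravel orders the files, and `latest_fsimage` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently modified fsimage_* file in DIR.
# Assumption: "latest" refers to the file's modification time.
# Usage: latest_fsimage DIR
latest_fsimage() {
  local dir="$1" newest
  # ls -t sorts by modification time, newest first.
  newest="$(ls -t "$dir"/fsimage_* 2>/dev/null | head -n 1)"
  if [ -z "$newest" ]; then
    echo "no fsimage_* file found in $dir" >&2
    return 1
  fi
  echo "$newest"
}
```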

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel.

<Unravel installation directory>/unravel/manager start

Run the following command to trigger the FSImage import.
```
curl -v http://localhost:5000/small-files-etl
```

Configuring FSImage in a multi-cluster deployment

Unravel OnDemand processes the HDFS FSImage as follows:

Fetches the raw FSImage from HDFS Namenode:

hdfs dfsadmin -fetchImage <path to fsimage file on local machine>

Parses the raw FSImage into a tab-separated text file.
```
hdfs oiv -p Delimited -i <path to fsimage file on local machine> -o <path to output text file>
```

In a multi-cluster environment, the Unravel user does not have permissions to fetch and process the FSImage. Only a user with HDFS dfsadmin privileges can fetch, parse, and upload the FSImage. A template script is provided for this purpose.

Configuring the template script

The template script is used to fetch, parse, and upload the FSImage to the core node. Therefore, this script must be run on each of the Unravel edge nodes. The following must be considered before running the script.

Any user with HDFS dfsadmin privileges can run the script.

If the cluster is Kerberised, add an appropriate kinit statement to the script. This must be done as a dfsadmin user.

kinit -kt <keytab_path> <principal_name>
##For example:
kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-hdp99d53@UNRAVELDATA.COM

In a non-Kerberised cluster, run the script as the dfsadmin user, or use another mechanism with the same effect, such as logging in as the dfsadmin user without a prompt before running the dfsadmin command, or setting the setuid bit on the script so that it always runs as the dfsadmin user.
The FSImage is uploaded from the Unravel edge node to the Unravel core node using rsync. Set up the SSH access that rsync needs on the Unravel core node: register the Unravel edge node as a known SSH host, and add the public RSA key of the uploading user (the user that runs the cron job) to the authorized SSH keys.
For rsync to work without a password prompt, do the following:
1. Add the public SSH key of the user to the Unravel core node user's $HOME/.ssh/authorized_keys file.
2. Add the Unravel edge node hostname as a known host on the Unravel core node.
3. Run the following commands for SSH passwordless login for rsync command execution. You can skip the step to generate the keys, if you already have the public keys.
```
 ssh-keygen -t rsa   # Skip this step if you already have the public keys.
 ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> mkdir -p .ssh
 cat ~/.ssh/id_rsa.pub | ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> 'cat >> ~/.ssh/authorized_keys'
 ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> "chmod 700 ~/.ssh; chmod 640 ~/.ssh/authorized_keys"
```
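
Before scheduling the upload, it can help to confirm that the login really works without a prompt. The following sketch uses the standard OpenSSH `BatchMode` and `ConnectTimeout` options; the function name `check_passwordless_ssh` and the 10-second timeout are illustrative choices, not part of the Unravel template.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that SSH to the core node works without a prompt.
# Usage: check_passwordless_ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME>
check_passwordless_ssh() {
  local target="$1"
  # BatchMode=yes makes ssh fail instead of asking for a password or passphrase.
  if ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" true; then
    echo "passwordless SSH to $target works"
  else
    echo "passwordless SSH to $target failed" >&2
    return 1
  fi
}
```

Run it as the user that will own the cron job, because the key pair and the known hosts file are specific to that user.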

Running the template script and setting up a cron job

FSImage is processed by the Unravel OnDemand process every day at 00:00 UTC. To guarantee data freshness, upload the latest FSImage to the Unravel core node shortly before 00:00 UTC.

Before scheduling the upload, observe the total time taken to run the script, and set the cron job accordingly so that Unravel has access to the fresh FSImage before 00:00 UTC.

Set up the template script as a cron job that runs every day at a time that lets the three steps the script performs (fetch, parse, and upload) finish before 00:00 UTC.
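
The start time can be derived from the observed run time of the script. The sketch below computes the minute and hour fields of a daily crontab entry so that the job, plus a safety margin, ends before midnight. It assumes that the cron daemon on the edge node evaluates schedules in UTC; the function name `cron_start_before_midnight` and the default 30-minute margin are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "minute hour * * *" part of a crontab entry so
# that a job of RUN_MINUTES, plus a safety margin, ends before 00:00.
# Assumption: the cron daemon on the edge node evaluates schedules in UTC.
# Usage: cron_start_before_midnight RUN_MINUTES [MARGIN_MINUTES]
cron_start_before_midnight() {
  local run_minutes="$1" margin_minutes="${2:-30}"
  local total=$(( run_minutes + margin_minutes ))
  if [ "$total" -le 0 ] || [ "$total" -ge 1440 ]; then
    echo "run time plus margin must be between 1 and 1439 minutes" >&2
    return 1
  fi
  local start=$(( 1440 - total ))   # minutes after 00:00 on the day before
  echo "$(( start % 60 )) $(( start / 60 )) * * *"
}
```

For example, a script that takes 150 minutes with a 30-minute margin gives `cron_start_before_midnight 150 30`, which prints `0 21 * * *`, that is, a daily start at 21:00.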

You must configure the following parameters in the script:

Parameters	Description
`CLUSTER_ACCESS_ID`	Set cluster access id for the cluster attached to the edge node. Run the following command to get the cluster access ID: <Unravel_installation_directory>/unravel/manager support show cluster_access_id
`UNRAVEL_CORE_NODE_HOSTNAME`	Set Unravel core node’s fully qualified hostname.
`UNRAVEL_CORE_NODE_USER`	Set Unravel user name.
`FSIMAGE_DESTINATION_BASEDIR`	If Unravel is installed in a directory other than `/opt/unravel`, then this should be set to `<unravel_installation_dir>/data/tmp/reports/fsimage`.

Template script

#!/bin/bash

set -x

kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-hdp99d53@UNRAVELDATA.COM

CLUSTER_ACCESS_ID=<Cluster access ID>
UNRAVEL_CORE_NODE_HOSTNAME=<Hostname of Unravel core node>
UNRAVEL_CORE_NODE_USER=<Username of Unravel core node>
FSIMAGE_DESTINATION_BASEDIR=<Unravel_installation_dir>/unravel/data/tmp/reports/fsimage

mkdir /tmp/$$
if [ $? -ne 0 ]
then
 echo "Failed to mkdir /tmp/$$"
 exit 1
fi

hdfs dfsadmin -fetchImage /tmp/$$
if [ $? -ne 0 ]
then
 echo "Failed to fetch Fsimage"
 exit 1
fi

hdfs oiv -i /tmp/$$/fs* -o /tmp/$$/fsimage.txt -p Delimited -t /tmp/$$/fsimage.tmp
if [ $? -ne 0 ]
then
 echo "Failed to parse Fsimage"
 exit 1
fi

rsync /tmp/$$/fsimage.txt ${UNRAVEL_CORE_NODE_USER}@${UNRAVEL_CORE_NODE_HOSTNAME}:${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_ACCESS_ID}/
if [ $? -ne 0 ]
then
 echo "Failed to upload Fsimage /tmp/$$/fsimage.txt to ${UNRAVEL_CORE_NODE_USER}@${UNRAVEL_CORE_NODE_HOSTNAME} at ${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_ACCESS_ID}/"
 exit 1
fi

rm -rf /tmp/$$
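
The template above uses /tmp/$$ as its working directory and removes it only when every step succeeds. One possible variation, shown here as a sketch and not as part of the provided template, creates the directory with `mktemp -d` and removes it on any exit. The function name `with_workdir` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with a fresh working directory appended as
# its last argument, and delete that directory when the command finishes,
# whether it succeeded or failed. The body runs in a subshell, so the EXIT
# trap fires when the function returns and does not affect the caller.
# Usage: with_workdir COMMAND [ARGS ...]
with_workdir() (
  workdir="$(mktemp -d)" || exit 1
  trap 'rm -rf "$workdir"' EXIT
  "$@" "$workdir"
)
```

To use it with the template, the fetch, parse, and upload steps would go into one function that takes the working directory as its argument, and that function would be passed to `with_workdir`.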

Verifying the FSImage configuration

After the FSImage has been fetched successfully, verify the following in the UI:

- The four data File reports are populated.
- You can generate a Small Files report.

Important

The table worker daemon checks table sizes every 24 hours by default. Therefore, even after an FSImage run completes, it can take up to 24 hours for the sizes to appear. To see them sooner, restart the table_worker daemon.

Tip

The relevant log file is <unravel-installation-directory>/logs/ondemand_tasks.out

Run one of the following commands to display the progress of the etl_fsimage task.

egrep 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out

grep etl_fsimage\(\) unravel_ondemand.out

Run one of the following commands to display the progress of the run_small_files task, which starts whenever a Small Files report is triggered from the UI.
```
egrep 'SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out
```
```
grep run_small_files ondemand_tasks.out
```
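
The progress checks above can be combined into one helper that takes the task name and the log file. This is a sketch: `show_task_progress` is a hypothetical name, the search patterns are the ones from the commands above, `grep -E` is used as the equivalent of `egrep`, and limiting the output to the last 20 lines is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the most recent progress lines for a task.
# Usage: show_task_progress etl_fsimage|run_small_files LOG_FILE
show_task_progress() {
  local task="$1" log_file="$2" pattern
  case "$task" in
    etl_fsimage)     pattern='ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ;;
    run_small_files) pattern='SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ;;
    *) echo "unknown task: $task" >&2; return 1 ;;
  esac
  # Print only the last 20 matching lines.
  grep -E "$pattern" "$log_file" | tail -n 20
}
```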

Disabling FSImage status

The FSImage status is enabled by default. To disable FSImage, perform the following steps.

Note

In a multi-cluster environment, you must perform the following steps on the core node.

Stop Unravel

<Unravel installation directory>/unravel/manager stop

Change the setting.
```
<Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.disable true
```
This property enables or disables Unravel's ability to generate the Small Files and Files reports. The default is false.
Note
false: enables the Small Files and Files reports in both the backend and the UI.
true: disables the Small Files and Files reports in both the backend and the UI.

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel

<Unravel installation directory>/unravel/manager start

Connecting Databricks workspaces for Data page

Ensure that at least one workspace is populated before you configure a workspace for the Data page.

To configure Databricks for the Data page, do the following:

Stop Unravel

<Unravel installation directory>/unravel/manager stop

Set the following property.

<Unravel installation directory>/unravel/manager config properties set hive.metastore.<X>.workspace.ids <Comma-separated list of Databricks workspaces>

##Replace <X> with the metastore variables listed in com.unraveldata.hive.metastore.list.

Apply the changes.

<Unravel installation directory>/unravel/manager config apply

Start Unravel

<Unravel installation directory>/unravel/manager start
