Configuring FSImage (4.7.0.0)

Note

The FSImage is applicable only for CDH, CDP, and HDP platforms.

In Hadoop, FSImage is a file stored on the OS file system that contains the complete directory structure (namespace) of HDFS, including the blocks that make up each file. The NameNode reads this file when it starts.

FSImage must be configured in Unravel for some of the Data page features and content.

This topic explains how to configure FSImage. The FSImage status is enabled by default. To disable the feature, see Disable FSImage status.

The etl_fsimage task processes the FSImage for each of the connected clusters. The task imports the latest FSImage from the NameNode, generates file reports, and extracts table sizes. The run time of the task is proportional to the size of the FSImage.

Caution

FSImage is a snapshot that becomes outdated over time. In other words, the older the image, the more it diverges from the current structure of the file system.

In Unravel, FSImage can be configured in a single-cluster environment as well as in a multi-cluster environment.

Define resources used to process FSImage

You can set Unravel properties to define the resources that are used to process the FSImage. In a multi-cluster environment, perform the following steps on the core node.

  1. Stop Unravel

    <Unravel installation directory>/unravel/manager stop
    
  2. For FSImage processing, a standalone Spark process is used. By default, this process runs with 4 cores and 16 GB of memory, which is suitable for a small FSImage file of less than 10 GB.

    To support larger FSImage files, set the properties shown in the table as follows:

    ##For example:
    /opt/unravel/manager config properties set unravel.python.reporting.files.spark.cores 6
    /opt/unravel/manager config properties set unravel.python.reporting.files.spark.driver.memory 8G
    

    The following properties define the resources used to process FSImage.

  3. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    
  4. Start Unravel

    <Unravel installation directory>/unravel/manager start
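To judge whether the default resources are likely to be enough, the size of the FSImage file can be compared against the 10 GB figure from step 2. The following is a minimal sketch under stated assumptions: the function name fsimage_resource_hint is hypothetical, it is not an Unravel command, and 10 GB is interpreted here as 10 × 1024³ bytes.

```shell
#!/bin/bash
# Hypothetical helper; not part of Unravel.
# Takes a file size in bytes and prints "default" when the size is below
# 10 GB, otherwise "increase".
# Assumption: 10 GB is interpreted as 10 * 1024^3 bytes.
fsimage_resource_hint() {
  local size_bytes=$1
  local threshold=$((10 * 1024 * 1024 * 1024))
  if [ "$size_bytes" -lt "$threshold" ]; then
    echo "default"
  else
    echo "increase"
  fi
}

# Usage (replace the placeholder with the path of a fetched FSImage file):
#   fsimage_resource_hint "$(wc -c < "<path to fsimage file on local machine>")"
```

If the hint is "increase", set the Spark properties described in step 2 before applying the changes.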
Configure FSImage

If the Unravel user has hdfs dfsadmin permissions, the ondemand processes automatically fetch the raw FSImage from the HDFS NameNode and parse it into a tab-separated text file. However, if the Unravel user does not have these permissions, any other user with hdfs dfsadmin permissions can fetch, parse, and upload the FSImage manually.

Configure FSImage in a single cluster deployment

FSImage is configured differently depending on whether or not the Unravel user has hdfs dfsadmin permissions.

Configure FSImage with hdfs dfsadmin permissions

In a single cluster environment, if you are an Unravel user with hdfs dfsadmin permissions, then the ondemand processes will automatically fetch the raw FSImage from the HDFS Namenode and parse it into a tab-separated text file.

Run the following command with the hdfs dfsadmin permissions to trigger the FSImage import:

curl -v http://localhost:5000/small-files-etl
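When the import is triggered from a script rather than interactively, it can help to check whether the request succeeded. The following is a minimal sketch, not an Unravel utility. The function names are hypothetical, and this page does not state which HTTP status the endpoint returns, so treating any status from 200 to 399 as success is an assumption.

```shell
#!/bin/bash
# Hypothetical helpers; not part of Unravel.

# Assumption: any HTTP status from 200 to 399 means the request was accepted.
http_status_ok() {
  [ "$1" -ge 200 ] && [ "$1" -lt 400 ]
}

# Calls the endpoint shown above. When curl can connect, prints the HTTP
# status code. Returns non-zero if curl fails or the status is not accepted.
trigger_fsimage_import() {
  local url=${1:-http://localhost:5000/small-files-etl}
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$url") || return 1
  echo "$status"
  http_status_ok "$status"
}

# Usage:
#   trigger_fsimage_import || echo "FSImage import trigger failed" >&2
```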
Configure FSImage without hdfs dfsadmin permissions

In a single cluster environment, if the Unravel user does not have hdfs dfsadmin permissions, any other user with hdfs dfsadmin permissions must first fetch, parse, and upload the FSImage manually. Afterwards, you (the Unravel user without hdfs dfsadmin permissions) can download and configure the FSImage. The steps are as follows:

  1. As a user with hdfs dfsadmin permissions, fetch the raw FSImage from the HDFS Namenode:

    hdfs dfsadmin -fetchImage <path to fsimage file on local machine>
  2. Parse the raw FSImage into a tab-separated text file.

    hdfs oiv -i <path to fsimage file on local machine> -o <path to output text file> -p Delimited

    The FSImage must be downloaded to a directory that is accessible to the Unravel user. Place the externally fetched FSImage in this directory, and set the directory path in the unravel.python.reporting.files.external_fsimage_dir property.

    You can create a cron job to download the FSImage to the Unravel server. The old image must be deleted before the cron job runs. Depending on the size of the FSImage, the download can take up to 10 hours, so use the observed download time to decide how often the cron job should run.

  3. As an Unravel user, stop Unravel.

    <Unravel installation directory>/unravel/manager stop
    
  4. From the installation directory, set the properties listed in the table as follows:

    <Unravel installation directory>/unravel/manager config properties set <property> <value>
    
    ##For example:
    <Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.skip_fetch_fsimage true
    <Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.external_fsimage_dir /srv/unravel/tmp/fsimages/reports

    The following table provides more details of these properties:

  5. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    
  6. Start Unravel.

    <Unravel installation directory>/unravel/manager start
  7. Run the following command to trigger the FSImage import.

    curl -v http://localhost:5000/small-files-etl
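The cron job mentioned in step 2 can be sketched as a small wrapper that deletes the old image before fetching and parsing a new one. This is an illustration under stated assumptions, not an Unravel-provided script: the function names are hypothetical, the output file name fsimage.txt is borrowed from the multi-cluster template script, and the file name that Unravel expects in the external FSImage directory is not specified on this page.

```shell
#!/bin/bash
# Hypothetical helpers; not part of Unravel.

# Removes previously downloaded FSImage files (names starting with
# "fsimage") from the given directory. Other files are left untouched.
remove_old_fsimage() {
  local dir=$1
  [ -d "$dir" ] || return 1
  find "$dir" -maxdepth 1 -type f -name 'fsimage*' -exec rm -f {} +
}

# Intended to be run from cron by a user with hdfs dfsadmin permissions.
# Assumption: the parsed output is written as fsimage.txt in the same directory.
refresh_fsimage() {
  local dir=$1
  remove_old_fsimage "$dir" || return 1
  hdfs dfsadmin -fetchImage "$dir" || return 1
  hdfs oiv -i "$dir"/fsimage_* -o "$dir/fsimage.txt" -p Delimited
}

# Usage (the directory is the value of
# unravel.python.reporting.files.external_fsimage_dir):
#   refresh_fsimage /srv/unravel/tmp/fsimages/reports
```

Deleting the old files first means that only one raw image is present when hdfs oiv runs, so the fsimage_* pattern resolves to a single input file.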
Configure FSImage in a multi-cluster deployment

Unravel Ondemand processes the HDFS FSImage as follows:

  • Fetches the raw FSImage from HDFS Namenode:

    hdfs dfsadmin -fetchImage <path to fsimage file on local machine>
  • Parses the raw FSImage into a tab-separated text file.

    hdfs oiv -i <path to fsimage file on local machine> -o <path to output text file> -p Delimited

In a multi-cluster environment, the Unravel user cannot fetch and process the FSImage. Only a user with hdfs dfsadmin permissions can do this, so a template script is provided for this purpose. Before running the template script, you must configure the script as follows:

Configure the template script

The template script is used to fetch, parse, and upload the FSImage to the core node. Therefore, this script must be run on each of the Unravel edge nodes. The following must be considered before running the script.

  • Any user with hdfs dfsadmin privileges can run the script.

  • If the cluster is Kerberised, add an appropriate kinit statement to the script. The kinit statement must authenticate as a dfsadmin user.

    kinit -kt <keytab_path> <principal_name>
    ##For example:
    kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-hdp99d53@UNRAVELDATA.COM
    
  • In a non-Kerberised cluster, run the script as the dfsadmin user. Alternatively, use another mechanism that lets the script run the dfsadmin command without a prompt, such as a passwordless login to the dfsadmin user, or setting the setuid bit on the script so that it always runs as the dfsadmin user.

  • The FSImage is uploaded from the Unravel edge node to the Unravel core node using rsync. The SSH setup that rsync requires must be in place on the Unravel core node: the Unravel edge node must be added as a known SSH host, and the public RSA key of the uploading user (the user that runs the cron job) must be added to the authorized SSH keys.

    For rsync to work without a password prompt, do the following:

    1. Add the public SSH key of the user to the Unravel core node user's $HOME/.ssh/authorized_keys file.

    2. Add the Unravel edge node hostname as a known_host to Unravel core node.

    3. Run the following commands for SSH passwordless login for rsync command execution. You can skip the step to generate the keys, if you already have the public keys.

       ssh-keygen -t rsa    ## Skip this step if you already have the public keys.
       ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> mkdir -p .ssh
       cat ~/.ssh/id_rsa.pub | ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> 'cat >> ~/.ssh/authorized_keys'
       ssh <UNRAVEL_CORE_NODE_USER>@<UNRAVEL_CORE_NODE_HOSTNAME> "chmod 700 ~/.ssh; chmod 640 ~/.ssh/authorized_keys"

Run template script and set up a cron job

FSImage is processed by the Unravel ondemand process every day at 00:00 UTC. To keep the data fresh, upload the latest FSImage to the Unravel core node a short time before 00:00 UTC.

Before scheduling the upload, measure the total time that the script takes to run, and set the cron job accordingly so that Unravel has access to the fresh FSImage before 00:00 UTC.

Set up the template script as a cron job that runs every day at a time that allows the three steps the script performs (fetch, parse, and upload) to finish before 00:00 UTC.
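The start time for the cron job can be worked out by subtracting the measured run time, plus a safety margin, from 00:00 UTC. The following is a minimal sketch of that arithmetic. The function name cron_start_utc, the default 30-minute margin, and the script path in the usage comment are assumptions for illustration. The result is only valid if the cron daemon on the edge node evaluates schedules in UTC.

```shell
#!/bin/bash
# Hypothetical helper; not part of Unravel.
# Prints the "minute hour" fields of a crontab entry so that a job lasting
# duration_min minutes finishes buffer_min minutes before 00:00.
# Assumption: cron on this host evaluates schedules in UTC.
cron_start_utc() {
  local duration_min=$1
  local buffer_min=${2:-30}
  local start=$(( (1440 - (duration_min + buffer_min) % 1440) % 1440 ))
  printf '%d %d\n' $((start % 60)) $((start / 60))
}

# Usage (the script path is an example only):
#   echo "$(cron_start_utc 90 30) * * * /path/to/fsimage_upload.sh"
```

For example, with a measured run time of 90 minutes and a 30-minute margin, cron_start_utc 90 30 prints 0 22, which corresponds to a daily start at 22:00.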

You must configure the following parameters in the script:

  • CLUSTER_ACCESS_ID: Set cluster access id for the cluster attached to the edge node. Run the following command to get the cluster access ID:

    <Unravel_installation_directory>/unravel/manager support show cluster_access_id

  • UNRAVEL_CORE_NODE_HOSTNAME: Set Unravel core node’s fully qualified hostname.

  • UNRAVEL_CORE_NODE_USER: Set Unravel user name.

  • FSIMAGE_DESTINATION_BASEDIR: If Unravel is installed in a directory other than /opt/unravel, then this should be set to <unravel_installation_dir>/data/tmp/reports/fsimage.

Template script

#!/bin/bash

set -x

kinit -kt /etc/security/keytabs/hdfs.headless.keytab hdfs-hdp99d53@UNRAVELDATA.COM

CLUSTER_ACCESS_ID=<Cluster access ID>
UNRAVEL_CORE_NODE_HOSTNAME=<Hostname of Unravel core node>
UNRAVEL_CORE_NODE_USER=<Username of Unravel core node>
FSIMAGE_DESTINATION_BASEDIR=<Unravel_installation_dir>/unravel/data/tmp/reports/fsimage

mkdir /tmp/$$
if [ $? -ne 0 ]
then
 echo "Failed to mkdir /tmp/$$"
 exit 1
fi

hdfs dfsadmin -fetchImage /tmp/$$
if [ $? -ne 0 ]
then
 echo "Failed to fetch Fsimage"
 exit 1
fi

hdfs oiv -i /tmp/$$/fs* -o /tmp/$$/fsimage.txt -p Delimited -t /tmp/$$/fsimage.tmp
if [ $? -ne 0 ]
then
 echo "Failed to parse Fsimage"
 exit 1
fi

rsync /tmp/$$/fsimage.txt ${UNRAVEL_CORE_NODE_USER}@${UNRAVEL_CORE_NODE_HOSTNAME}:${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_ACCESS_ID}/
if [ $? -ne 0 ]
then
 echo "Failed to upload Fsimage /tmp/$$/fsimage.txt to ${UNRAVEL_CORE_NODE_USER}@${UNRAVEL_CORE_NODE_HOSTNAME} at ${FSIMAGE_DESTINATION_BASEDIR}/${CLUSTER_ACCESS_ID}/"
 exit 1
fi

rm -rf /tmp/$$
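Because the template ships with placeholder values, it can be useful to fail early when a parameter has not been filled in. The following is a minimal sketch of such a check. The function check_fsimage_params is hypothetical and is not part of the template script provided by Unravel. It only assumes the four parameter names listed above.

```shell
#!/bin/bash
# Hypothetical pre-flight check; not part of the Unravel template script.
# Returns non-zero if a required parameter is empty or still contains
# an unreplaced <placeholder>.
check_fsimage_params() {
  local name value rc=0
  for name in CLUSTER_ACCESS_ID UNRAVEL_CORE_NODE_HOSTNAME \
              UNRAVEL_CORE_NODE_USER FSIMAGE_DESTINATION_BASEDIR; do
    # The names are fixed above, so eval only ever sees known variable names.
    eval "value=\${$name:-}"
    case $value in
      ''|*'<'*|*'>'*)
        echo "Parameter $name is not configured: '$value'" >&2
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Usage, placed right after the four parameters are assigned:
#   check_fsimage_params || exit 1
```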
Verify the FSImage configuration

After the FSImage has been successfully fetched, you can verify the result in the Unravel UI.

Important

By default, the table_worker daemon checks table sizes every 24 hours. Therefore, even after the FSImage is processed, it can take up to 24 hours for the sizes to be reflected. To avoid the wait, restart the table_worker daemon.

Tip

  • The relevant log file is <unravel-installation-directory>/logs/ondemand_tasks.out

  • Run one of the following commands to display the progress of the etl_fsimage task.

    egrep 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out
    grep etl_fsimage\(\) unravel_ondemand.out
  • Run one of the following commands to display the progress of the run_small_files task, which is started whenever the Small Files report is triggered from the UI.

    egrep 'SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ondemand_tasks.out
    grep run_small_files\(\) ondemand_tasks.out
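The filters above can be wrapped in a small function so that the same pattern is applied to whichever log file is being inspected. This is a minimal sketch: the function name fsimage_progress is hypothetical, and it uses grep -E, which is equivalent to the egrep command shown above.

```shell
#!/bin/bash
# Hypothetical helper; not part of Unravel.
# Prints the lines of the given log file that relate to the etl_fsimage task.
# "grep -E" is used as the equivalent of "egrep".
fsimage_progress() {
  local log_file=$1
  grep -E 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' "$log_file"
}

# Usage:
#   fsimage_progress <unravel-installation-directory>/logs/ondemand_tasks.out
```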
Disable FSImage status

The FSImage status is enabled by default. To disable FSImage, perform the following steps:

Note

In a multi-cluster environment, you must perform the following steps on the core node.

  1. Stop Unravel

    <Unravel installation directory>/unravel/manager stop
    
  2. Change the setting.

    <Unravel installation directory>/unravel/manager config properties set unravel.python.reporting.files.disable true
    

    This property enables or disables Unravel's ability to generate Small Files and File Reports. The default is false.

    Note

    false: enables the Small Files and Files reports in both the backend and UI.

    true: disables the Small Files and Files reports in both the backend and UI.

  3. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    
  4. Start Unravel

    <Unravel installation directory>/unravel/manager start