Configuring small files and files reports

Although the Small Files and Files reports are enabled by default, they require extensive configuration. This topic explains how to configure them so that they are available in the Unravel UI.

Important

This feature runs as the hdfs user and uses Hive to insert FSimage data into tables.

The Small Files and Files reports can be turned off. See the Enable/Disable section to toggle the status of the reports.

1 - Determine how Unravel will access FSimage
  • Unravel can run as HDFS admin.

    This allows Unravel to access the FSimage on the NameNode using the hdfs dfsadmin command. See Running Unravel Daemons with Custom User to configure Unravel to run as HDFS admin.

    For CDH:

    1. Check whether the Unravel user has HDFS admin permission.

      hdfs dfsadmin -report
    2. If the Unravel user has HDFS admin permission, skip the remaining steps and go to section 2. Otherwise, look up the Unravel user's group name.

      id unravel
    3. Go to the CDH HDFS configuration page.

    4. Locate the Superuser Group property and set its value to the group name obtained in step 2 above.

  • Unravel cannot run as HDFS admin.

    This is the most common setup. The FSimage must be downloaded for Unravel to generate these reports. Create a cron job to download it to the Unravel server.

    The download can take up to 10 hours, depending on the size of the FSimage. Use that duration to decide how often the cron job should run. If the raw FSimage can be parsed into a tab-delimited file faster using the HDFS OIV utility, that step can also be part of the cron job. The old image must be deleted before the cron job runs.

    Unravel uses this image to create the reports. The Small Files and Files reports therefore analyze a snapshot, which becomes outdated over time.

    Edit /usr/local/unravel/etc/unravel.properties and set the following properties.

    Important

    You must restart the unravel_ondemand daemon for any changes to take effect.

    service unravel_ondemand restart

    Property/Description

    Default

    unravel.python.reporting.files.skip_fetch_fsimage

    If HDFS admin privileges cannot be granted, set this to true to allow Unravel's Ondemand process to use an externally fetched FSimage.

    true: The Ondemand etl_fsimage process does not fetch the FSimage from the NameNode. Instead, the FSimage is expected to be available in the directory specified by unravel.python.reporting.files.external_fsimage_dir.

    false

    unravel.python.reporting.files.external_fsimage_dir

    Directory for the FSimage when skip_fetch_fsimage=true. The externally fetched FSimage is expected to be in this directory. Unravel uses the latest file in this directory whose name starts with "fsimage_".

    This directory must be different from Unravel's internal directory, i.e., /srv/unravel/tmp/reports/fsimage.

    -

    unravel.python.reporting.files.skip_fetch_fsimage=true
    unravel.python.reporting.files.external_fsimage_dir=/srv/unravel/tmp/fsimages/reports
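For the externally fetched setup described above, the cron job might look like the following sketch. It assumes the job runs as a user that does have HDFS admin rights, that hdfs dfsadmin -fetchImage is used for the download, and that the target directory is the example value of unravel.python.reporting.files.external_fsimage_dir shown above. The function name refresh_fsimage and the FSIMAGE_DIR variable are illustrative and not part of Unravel.

```shell
#!/usr/bin/env bash
# Sketch only: refresh the externally fetched FSimage that Unravel reads.
# Assumption: the caller has HDFS admin rights; Unravel itself does not need them.

# Example value of unravel.python.reporting.files.external_fsimage_dir from this topic.
FSIMAGE_DIR="${FSIMAGE_DIR:-/srv/unravel/tmp/fsimages/reports}"

# refresh_fsimage [DIR]
# Deletes old images in DIR, downloads the latest FSimage from the NameNode,
# and prints the path of the newest fsimage_* file.
refresh_fsimage() {
  dir="${1:-$FSIMAGE_DIR}"
  mkdir -p "$dir" || return 1
  # The old image must be removed before a new one is downloaded.
  rm -f "$dir"/fsimage_*
  # -fetchImage saves the most recent FSimage into the given local directory.
  hdfs dfsadmin -fetchImage "$dir" || return 1
  # Unravel uses the latest file whose name starts with "fsimage_".
  ls -t "$dir"/fsimage_* | head -n 1
}
```

A script file that ends with a call to refresh_fsimage could then be scheduled from crontab. Because the download can take hours, the schedule should leave enough time for one run to finish before the next one starts.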
2 - Define the following

Edit /usr/local/unravel/etc/unravel.properties and set the following properties.

Property/Description

Default

unravel.hive.server2.host

FQDN or IP-Address of the HiveServer2 instance.

-

unravel.hive.server2.port

Port for the HiveServer2 instance.

You need to define this only if the HiveServer2 port is not 10000.

10000

unravel.hive.server2.authentication

Define the authentication type. Possible values are: KERBEROS, LDAP, NOSASL, NONE, or CUSTOM.

When set to KERBEROS, you must also set unravel.hive.server2.kerberos.service.name=hive.

-

unravel.hive.server2.kerberos.service.name

Set only when unravel.hive.server2.authentication=KERBEROS.

This must be set to hive to run the various reports in a Kerberos environment.

-
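Before triggering a report, it can help to confirm that the HiveServer2 properties above are actually present in unravel.properties. The following is a minimal sketch of such a check. The helper names get_prop and check_hive_props are hypothetical, and the check covers only the rules stated in this topic: the host is required, and Kerberos needs the service name hive.

```shell
# Default location of the properties file, as given in this topic.
UNRAVEL_PROPS="${UNRAVEL_PROPS:-/usr/local/unravel/etc/unravel.properties}"

# get_prop FILE KEY
# Prints the value of the last uncommented KEY=VALUE line (empty if absent).
get_prop() {
  awk -F= -v key="$2" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key && index($0, "=") > 0 {
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", v)
      found = v
    }
    END { print found }
  ' "$1"
}

# check_hive_props [FILE]
# Prints the first problem found and returns 1, or prints "ok".
check_hive_props() {
  file="${1:-$UNRAVEL_PROPS}"
  host="$(get_prop "$file" unravel.hive.server2.host)"
  auth="$(get_prop "$file" unravel.hive.server2.authentication)"
  svc="$(get_prop "$file" unravel.hive.server2.kerberos.service.name)"
  if [ -z "$host" ]; then
    echo "missing: unravel.hive.server2.host"
    return 1
  fi
  if [ "$auth" = "KERBEROS" ] && [ "$svc" != "hive" ]; then
    echo "missing: unravel.hive.server2.kerberos.service.name=hive"
    return 1
  fi
  echo "ok"
}
```

Calling check_hive_props with no argument checks the default file location. The port is not checked because it falls back to its default.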

Unravel can use Hive or Spark to process the FSImage; unravel.python.reporting.fsimage.run_mode defines which one to use. We recommend using Spark and have set it as the default. See the FSImage properties for a complete list of properties. When you are using Spark you must define the following properties.

3 - Configure your Sentry-Secured CDH or Ranger-Secured HDP cluster

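Each entry below lists the SSL/TLS setting, the value of unravel.hive.server2.authentication, whether the combination is supported, and its configuration notes.

  • SSL/TLS: Yes. unravel.hive.server2.authentication: KERBEROS. Supported: Yes. Configuration notes:

    unravel.hive.server2.use.SSL=true
    unravel.hive.server2.authentication=KERBEROS
    unravel.hive.server2.kerberos.service.name=hive
    com.unraveldata.kerberos.principal=unravel kerberos principal
    com.unraveldata.kerberos.keytab.path=keytab path

  • SSL/TLS: No. unravel.hive.server2.authentication: KERBEROS. Supported: Yes. Configuration notes:

    unravel.hive.server2.authentication=KERBEROS
    unravel.hive.server2.kerberos.service.name=hive
    com.unraveldata.kerberos.principal=unravel kerberos principal
    com.unraveldata.kerberos.keytab.path=keytab path

  • SSL/TLS: No. unravel.hive.server2.authentication: NONE (no authentication). Supported: Yes. Configuration notes: No security config properties needed.

  • SSL/TLS: Yes. unravel.hive.server2.authentication: NOSASL, LDAP, or CUSTOM. Supported: No. Configuration notes: N/A

  • SSL/TLS: No. unravel.hive.server2.authentication: NOSASL, LDAP, or CUSTOM. Supported: No. Configuration notes: N/A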
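The combinations listed above can also be expressed as a small lookup helper, which may be convenient when scripting a setup. This is only a sketch: hs2_security_props is a hypothetical name, the angle-bracket values are placeholders to fill in, and a combination that the table does not list (SSL/TLS with NONE) is reported as not listed instead of being guessed.

```shell
# hs2_security_props SSL AUTH
# SSL is "yes" or "no"; AUTH is a value of unravel.hive.server2.authentication.
# Prints the properties to set, or a short note when no supported setup is listed.
hs2_security_props() {
  ssl="$1"
  auth="$2"
  case "$auth" in
    KERBEROS)
      if [ "$ssl" = "yes" ]; then
        echo "unravel.hive.server2.use.SSL=true"
      fi
      echo "unravel.hive.server2.authentication=KERBEROS"
      echo "unravel.hive.server2.kerberos.service.name=hive"
      echo "com.unraveldata.kerberos.principal=<unravel kerberos principal>"
      echo "com.unraveldata.kerberos.keytab.path=<keytab path>"
      ;;
    NONE)
      if [ "$ssl" = "no" ]; then
        echo "# No security config properties needed."
      else
        echo "not listed"
        return 1
      fi
      ;;
    NOSASL|LDAP|CUSTOM)
      echo "not supported"
      return 1
      ;;
    *)
      echo "unknown authentication type: $auth"
      return 2
      ;;
  esac
}
```

For example, hs2_security_props no KERBEROS prints the four properties needed for Kerberos without SSL/TLS.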

4 - Run the run_small_files task
curl -v "http://localhost:5000/small-files-etl"
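When the trigger is scripted, it can be wrapped so that the caller gets an exit status. The following sketch assumes that a 2xx HTTP status means the request was accepted. This topic does not document the endpoint's responses, so treat that as an assumption. The function name trigger_small_files is hypothetical.

```shell
# trigger_small_files [BASE_URL]
# Calls the small-files-etl endpoint and prints the HTTP status code.
# Assumption: a 2xx status means the task was accepted.
trigger_small_files() {
  base_url="${1:-http://localhost:5000}"
  # -s: quiet, -o /dev/null: discard the body, -w: print only the status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' "$base_url/small-files-etl")" || {
    echo "could not reach $base_url" >&2
    return 1
  }
  echo "$status"
  case "$status" in
    2*) return 0 ;;
    *) return 1 ;;
  esac
}
```
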
Enable/Disable small files and files report status

Important

In order for the configuration changes to take effect, unravel_ondemand and unravel_ngui daemons need to be restarted.

/etc/init.d/unravel_ngui restart
/etc/init.d/unravel_ondemand restart

Set the following property in /usr/local/unravel/etc/unravel.properties.

Property/Description

Default

unravel.python.reporting.files.disable

Enables or disables Unravel's ability to generate the Small Files and Files reports.

Note

false: enables the Small Files and Files reports in both the backend and the UI.

true: disables the Small Files and Files reports in both the backend and the UI.

false
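Toggling the reports amounts to setting this one property and then restarting the two daemons. A minimal sketch of the edit follows. It assumes a hypothetical helper set_prop that rewrites a single key in a properties file and suits simple values such as true and false.

```shell
# set_prop FILE KEY VALUE
# Replaces the first uncommented KEY=... line with KEY=VALUE, drops later
# duplicates of KEY, and appends KEY=VALUE if the key is absent.
set_prop() {
  file="$1"
  key="$2"
  value="$3"
  tmp="$(mktemp)" || return 1
  awk -F= -v key="$key" -v value="$value" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key && index($0, "=") > 0 {
      if (!done) print key "=" value
      done = 1
      next
    }
    { print }
    END { if (!done) print key "=" value }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file owner and mode
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example (then restart unravel_ngui and unravel_ondemand as shown above):
#   set_prop /usr/local/unravel/etc/unravel.properties \
#     unravel.python.reporting.files.disable true
```
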

Debugging tips
  1. The relevant log file is /usr/local/unravel/ondemand/logs/unravel_ondemand.out

  2. The etl_fsimage task extracts the FSimage and runs the Files reports. The run_small_files task runs the ad hoc Small Files report. It is triggered every day at 00:00 UTC. Run the following command to trigger run_small_files manually. (You must trigger it manually after the initial install.)

    curl -v "http://localhost:5000/small-files-etl"
  3. Run one of the following commands to display the progress of the etl_fsimage task.

    egrep 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' unravel_ondemand.out
    grep etl_fsimage\(\) unravel_ondemand.out
  4. Run one of the following commands to display the progress of the run_small_files task, which is started whenever a Small Files report is triggered from the UI.

    egrep 'SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' unravel_ondemand.out
    grep run_small_files\(\) unravel_ondemand.out
  5. The FSimage file is present on the Unravel node at /srv/unravel/tmp/reports/fsimage/fsimage.txt.

  6. The FSimage file is present in HDFS at /tmp/fsimage/fsimage.txt.

  7. If problems occur, it may help to check the HiveServer2 and YARN logs.
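The two egrep commands in the tips above can be folded into one helper that takes the task name and shows only the most recent matching lines. This is a sketch: task_progress and the UNRAVEL_ONDEMAND_LOG variable are hypothetical names, and the patterns are the ones given above.

```shell
# Log file named in tip 1; override with UNRAVEL_ONDEMAND_LOG if it lives elsewhere.
ONDEMAND_LOG="${UNRAVEL_ONDEMAND_LOG:-/usr/local/unravel/ondemand/logs/unravel_ondemand.out}"

# task_progress TASK [LOG_FILE] [LINES]
# TASK is etl_fsimage or run_small_files. Prints the last LINES (default 20)
# log lines that match the patterns for that task.
task_progress() {
  case "$1" in
    etl_fsimage)     pattern='ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ;;
    run_small_files) pattern='SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ;;
    *)
      echo "unknown task: $1" >&2
      return 2
      ;;
  esac
  grep -E "$pattern" "${2:-$ONDEMAND_LOG}" | tail -n "${3:-20}"
}
```

For example, task_progress etl_fsimage shows the latest FSimage extraction messages from the default log file.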