Configuring small files and files reports
While the Small File Reports and File Reports features are enabled by default, extensive configuration is required. This topic explains how to configure them to make them available through the Unravel UI.
Important
This feature runs as the hdfs user and uses Hive to insert FSimage data into tables.
The Small Files and Files Reports can be turned off. See the Enable/Disable section to toggle the status of the reports.
1 - Determine how Unravel will access FSimage
Unravel can run as HDFS admin.
This allows Unravel to access the FSimage on the NameNode using the hdfs dfsadmin command. See Running Unravel Daemons with Custom User to configure Unravel to run as hdfs admin.
For CDH:
Check whether the Unravel user has HDFS admin permission.
hdfs dfsadmin -report
If Unravel has HDFS admin permission, go directly to Step 2. Otherwise, locate Unravel's group name.
id unravel
Go to the CDH HDFS configs page.
Locate the Superuser Group property and change its value to the group name returned by the id unravel command.
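The permission check and group lookup above can be sketched as two small shell functions. This is a minimal sketch, not part of Unravel: the function names are hypothetical, it assumes that hdfs dfsadmin -report exits with a non-zero status when permission is denied, and it assumes the group you want is the user's primary group.

```shell
# Hypothetical helpers; run them as the Unravel user on a node with the hdfs client.

# Succeeds (exit status 0) if the current user can run HDFS admin commands.
# Assumption: "hdfs dfsadmin -report" exits non-zero when access is denied.
has_hdfs_admin() {
    hdfs dfsadmin -report >/dev/null 2>&1
}

# Prints the primary group name of a user (default: unravel).
# "id unravel" lists all groups; "id -gn" prints only the primary one.
primary_group() {
    id -gn "${1:-unravel}"
}

# Example:
#   if has_hdfs_admin; then
#       echo "HDFS admin permission OK"
#   else
#       echo "Set Superuser Group to: $(primary_group unravel)"
#   fi
```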
Unravel cannot run as HDFS admin.
This is the most common setup. The FSimage must be downloaded for Unravel to generate these reports. Create a cron job to download it to the Unravel server.
The download can take up to 10 hours depending on the size of the FSimage. The download time determines how often the cron job should be run. If the raw FSimage can be parsed faster into a tab-delimited file using the HDFS OIV utility, that can also be part of the cron job. The old image must be deleted before running the cron job.
Unravel uses this image to create the report. Therefore, the Small Files and Files reports analyze a snapshot, which becomes outdated over time.
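A cron job for the download could be built around a function like the one below. This is a sketch under stated assumptions: the function name, the script path in the crontab comment, and the schedule are all hypothetical, and the job must run as a user that is allowed to run hdfs dfsadmin. The target directory is whatever you set in unravel.python.reporting.files.external_fsimage_dir.

```shell
# Hypothetical helper: downloads the latest FSimage into the given directory.
fetch_fsimage() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    # Delete the old image first, as required before each run.
    rm -f "$dir"/fsimage_*
    # Download the most recent FSimage from the NameNode into "$dir".
    hdfs dfsadmin -fetchImage "$dir"
}

# Hypothetical crontab entry (daily at 01:00), assuming a wrapper script
# that defines the function above and then calls:
#   fetch_fsimage /srv/unravel/tmp/fsimages/reports
#
#   0 1 * * * /path/to/fetch_fsimage.sh
```

Because the old image is removed before the download starts, a failed download leaves the directory empty until the next successful run; choose the schedule with the download time in mind.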
Edit /usr/local/unravel/etc/unravel.properties and set the following properties.
Important
You must restart the unravel_ondemand daemon for any changes to take effect.
service unravel_ondemand restart
Property/Description | Default |
---|---|
unravel.python.reporting.files.skip_fetch_fsimage If HDFS admin privileges cannot be granted, set this to true to allow Unravel's Ondemand process to use an externally fetched FSimage. true: The Ondemand etl_fsimage process does not fetch the FSimage from the NameNode. Instead, the FSimage is expected to be available in the directory specified by unravel.python.reporting.files.external_fsimage_dir. | false |
unravel.python.reporting.files.external_fsimage_dir Directory for the FSimage when skip_fetch_fsimage=true. The externally fetched FSimage is expected to be in this directory. Unravel uses the latest file in this directory whose name starts with "fsimage_". This directory must be different from Unravel's internal directory, i.e., /srv/unravel/tmp/reports/fsimage. | - |
For example:
unravel.python.reporting.files.skip_fetch_fsimage=true
unravel.python.reporting.files.external_fsimage_dir=/srv/unravel/tmp/fsimages/reports
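To check which file in the external directory would likely be picked up, a small helper can list the newest fsimage_ file. This is a sketch: both function names are hypothetical, and it assumes "latest" means most recently modified, which the text above does not spell out.

```shell
# Hypothetical helper: prints the most recently modified file whose name
# starts with "fsimage_" in the given directory (prints nothing if none exist).
# Assumption: "latest" is interpreted as newest modification time.
latest_fsimage() {
    local dir="$1"
    ls -t "$dir"/fsimage_* 2>/dev/null | head -n 1
}

# Hypothetical helper: succeeds only if the directory is not Unravel's
# internal FSimage directory, which must not be reused.
is_external_dir_ok() {
    [ "${1%/}" != "/srv/unravel/tmp/reports/fsimage" ]
}

# Example:
#   is_external_dir_ok /srv/unravel/tmp/fsimages/reports && \
#       latest_fsimage /srv/unravel/tmp/fsimages/reports
```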
2 - Define the following
Edit /usr/local/unravel/etc/unravel.properties and set the following properties.
Property/Description | Default |
---|---|
unravel.hive.server2.host FQDN or IP-Address of the HiveServer2 instance. | - |
unravel.hive.server2.port Port for the HiveServer2 instance. You need to define this only if the HiveServer2 port is not 10000. | 10000 |
unravel.hive.server2.authentication Define the authentication type. Possible values are: When set to | - |
unravel.hive.server2.kerberos.service.name Set only when unravel.hive.server2.authentication= This must be set to | - |
Unravel can use Hive or Spark to process the FSImage; unravel.python.reporting.fsimage.run_mode defines which one to use. We recommend using Spark and have set it as the default. See the FSImage properties for a complete list of properties. When you are using Spark you must define the following properties.
3 - Configure your Sentry-Secured CDH or Ranger-Secured HDP cluster
SSL/TLS | unravel.hive.server2.authentication | Supported | Configuration Notes |
---|---|---|---|
Yes | KERBEROS | Yes | unravel.hive.server2.use.SSL=true unravel.hive.server2.authentication=KERBEROS unravel.hive.server2.kerberos.service.name=hive com.unraveldata.kerberos.principal= com.unraveldata.kerberos.keytab.path= |
No | KERBEROS | Yes | unravel.hive.server2.authentication=KERBEROS unravel.hive.server2.kerberos.service.name=hive com.unraveldata.kerberos.principal= com.unraveldata.kerberos.keytab.path= |
No | NONE (No authentication) | Yes | No security config properties needed. |
Yes | NOSASL, LDAP, or CUSTOM | No | N/A |
No | NOSASL, LDAP, or CUSTOM | No | N/A |
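Before restarting, it can help to confirm that every property listed for your row of the table has a value. The helper below is a minimal sketch: the function name is hypothetical, and it assumes each property is written as key=value on one line with no spaces around the equals sign.

```shell
# Hypothetical helper: prints each KEY that has no non-empty "KEY=value"
# line in FILE. Usage: missing_props FILE KEY [KEY ...]
missing_props() {
    local file="$1" key
    shift
    for key in "$@"; do
        # awk exits 0 only if some line has exactly this key and a non-empty value.
        awk -F= -v k="$key" \
            '$1 == k && length($2) > 0 { found = 1 } END { exit !found }' \
            "$file" || echo "$key"
    done
}

# Example for a Kerberos cluster without SSL/TLS:
#   missing_props /usr/local/unravel/etc/unravel.properties \
#       unravel.hive.server2.authentication \
#       unravel.hive.server2.kerberos.service.name \
#       com.unraveldata.kerberos.principal \
#       com.unraveldata.kerberos.keytab.path
```

An empty result means all the listed properties are set; any printed name still needs a value.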
4 - Run the run_small_files task
curl -v "http://localhost:5000/small-files-etl"
Enable/Disable small files and files report status
Important
For the configuration changes to take effect, the unravel_ondemand and unravel_ngui daemons must be restarted.
/etc/init.d/unravel_ngui restart
/etc/init.d/unravel_ondemand restart
Set the following property in /usr/local/unravel/etc/unravel.properties.
Property/Description | Default |
---|---|
unravel.python.reporting.files.disable Enables or disables Unravel's ability to generate Small Files and File Reports. | false |
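One way to change the property without editing the file by hand is a small update-or-append helper. This is a sketch: the function name is hypothetical, and it assumes the property is written as key=value with no spaces around the equals sign.

```shell
# Hypothetical helper: sets KEY=VALUE in FILE, replacing any existing
# definition of KEY and leaving all other lines untouched.
# Usage: set_prop FILE KEY VALUE
set_prop() {
    local file="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)" || return 1
    # Copy every line except existing definitions of KEY.
    awk -F= -v k="$key" '$1 != k' "$file" > "$tmp" || return 1
    # Append the new definition.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example: disable the reports.
#   set_prop /usr/local/unravel/etc/unravel.properties \
#       unravel.python.reporting.files.disable true
```

The change only takes effect once the unravel_ngui and unravel_ondemand daemons have been restarted.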
Debugging tips
The relevant log file is /usr/local/unravel/ondemand/logs/unravel_ondemand.out.
The etl_fsimage task extracts the FSimage and runs File Reports. The run_small_files task runs the ad hoc Small Files Report. It is triggered every day at 00:00 UTC. Run the following command to trigger run_small_files manually. (You must trigger it manually after the initial install.)
curl -v "http://localhost:5000/small-files-etl"
Run one of the following commands to display the progress of the etl_fsimage task.
egrep 'ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' unravel_ondemand.out
grep etl_fsimage\(\) unravel_ondemand.out
Run one of the following commands to display the progress of the run_small_files task, which is started whenever a Small Files Report is triggered from the UI.
egrep 'SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' unravel_ondemand.out
grep run_small_files\(\) unravel_ondemand.out
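The two egrep commands can be wrapped in one helper that shows only the most recent matching lines. This is a minimal sketch: the function name and the 20-line limit are arbitrary choices, and the default log path is the one given in this section.

```shell
# Hypothetical helper: prints the last 20 log lines for a task.
# Usage: task_progress etl_fsimage|run_small_files [LOGFILE]
task_progress() {
    local task="$1"
    local log="${2:-/usr/local/unravel/ondemand/logs/unravel_ondemand.out}"
    local pattern
    case "$task" in
        etl_fsimage)     pattern='ETL_FSIMAGE|FSIMAGE_REPORTS_UTILS' ;;
        run_small_files) pattern='SMALL_FILES_REPORT|FSIMAGE_REPORTS_UTILS' ;;
        *) echo "unknown task: $task" >&2; return 2 ;;
    esac
    # Same patterns as the egrep commands, limited to the newest matches.
    grep -E "$pattern" "$log" | tail -n 20
}

# Example:
#   task_progress run_small_files
```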
The FSimage file is present on the Unravel node at /srv/unravel/tmp/reports/fsimage/fsimage.txt.
The FSimage file is present in HDFS at /tmp/fsimage/fsimage.txt.
In case of problems, it may be helpful to look at the HiveServer2 and YARN logs.