Skip to main content

Home

Configure metadata processing for Delta tables

You must schedule a job on your workspace to run the Delta table extractor job and extract the metadata into a JSON file. The metadata extractor job supports single and multiple databases.

  1. From Databricks, schedule a job to extract metadata.

    1. Download the metadata_extractor_dbr_10_1.jar file from Unravel Customer Support and upload the JAR file to your respective Databricks DBFS location.

    2. On your workspace, click Create > Job.

      deltalake-create-job.png
    3. In the Task tab, specify the following parameters:

      deltalake-metadatjob.png
      • From the Type list, select Spark Submit.

      • From the Cluster list, select a cluster. The cluster should be of the latest (Databricks, Spark, or Scala) version. For better performance, select the cluster with the following options:

        Cluster Mode

        DBX runtime version

        Node type

        Cost (DBU/Hour)

        Time

        Delta Table

        Single Node

        10.4

        Standard_E8_v3(64,8)

        2

        50 minutes

        10000

        Single Node

        10.4

        Standard_E16_v3(128,16)

        4

        2.5 hours

        25000

      • In the Parameters text box, specify the following parameters as mentioned in the example:

        Example for a single database:

        ["--class","org.apache.spark.sql.hive.DeltaTableMetadataExtractor","dbfs:/FileStore/metadata_extractor_dbr_10_1.jar","customer_db","/dbfs/FileStore/delta/","25000"]

        Parameters

        Description

        "dbfs:/FileStore/metadata_extractor_dbr_10_1.jar"

        Specify the location of the JAR file that you uploaded to DBFS in Step 1.

        For a single database name:

        "customer_db"

        For multiple database names:

        "customer_db1","customer_db2","customer_db3"

        Specify a valid database name. (Mandatory)

        • Specify multiple database names in a comma-separated value.

        • Specify all database names using the ALL_DB parameter.

        After you specify multiple database names or the ALL_DB parameter, the following actions are triggered:

        • A separate JSON file is created for each database.

        • Databases are processed until the Delta table max counter is reached. The maximum Delta table limit value processed by the metadata extractor job is 25,000 across all databases. If the Delta table count is less than 25,000, then the job processes all the delta tables.

        • Databases are processed based on the order of database names specified in the database name parameter.

        "<path-of-the-output-directory>"

        Specify the output directory path where the metadata files are stored. (Optional)

        If you do not specify the path, by default, the metadata files are stored in the /dbfs/FileStore/delta directory.

        "25000"

        Specify the number of Delta tables. (Optional) You can extract up to 25,000 delta tables metadata in a single job.

        Note

        You cannot extract more than 25,000 delta tables in a database. For example, if the database contains 30,000 delta tables, then only the first 25,000 delta tables are processed, and the remaining are ignored.

    A metadata file is generated in DBFS.

  2. Set up Unravel to process Delta table metadata

    1. Copy the metadata file from DBFS to any location in Unravel node.

    2. Restart the table_worker daemon to synchronize all Delta tables to the Unravel database and Elasticsearch index.

      <Unravel/installation/directory>/manager restart table_worker
    3. Run the delta_file_handoff.sh utility.

      <Unravel Installation directory>/unravel/manager run script delta_file_handoff.sh </path/to/metadata file>

      The utility processes a single file or a directory path containing multiple files according to the specified parameters in the metadata extractor job. After successful processing, a message is displayed.

      Note

      If Delta table processing fails, check the following error logs in the delta_file_handoff.log file located at /opt/unravel/logs/.

      INFO table.DeltaTableProcessing: DeltaTableProcessing.start() => Delta tabel processing has started.
      WARN Cluster info is missing for metastoreId:
      ERROR Failed to read file
      ERROR Failed to parse json data.
      WARN Unable to update table for table id
    4. Log in to the Unravel UI, navigate to Data page > Tables, and verify that the Storage Format column contains the Delta value for the Delta tables listing.