Skip to main content

Home

Amazon Web Services (AWS) Databricks

Follow the below instructions to install and configure Unravel for AWS Databricks:

1. Create an EC2 instance, connect Databricks, and install DBFS CLI
  1. On your AWS Console, go to the EC2 dashboard and click Launch Instance.

  2. Select the following options based on Unravel's instance requirements:

    • Base OS

    • Instance type and size

    • Ports

    • Networking

      The EC2 instance must be in the same region as the target clusters, which Unravel EC2 node will be monitoring.

    • Security groups or policies

      • Create a security group that allows port 3000 and port 4043 from the cluster nodes' IP address, and put the security group member used on the cluster in this rule.

      Sample inbound rule

      Type

      Protocol

      Port range

      Source

      All traffic

      All

      All

      Security group ID of this group or subnet IP block.

      For example, 10.10.0.0/16

      SSH

      TCP

      22

      0.0.0.0/0 or trusted public IP for SSH access

      Custom TCP Rule

      TCP

      443

      Security group ID used on the cluster or subnet IP block (if the IP block belongs to a different VPC). Required for VPC peering connection

      Custom TCP Rule

      TCP

      3000

      Security group ID used on the cluster or subnet IP block (if the IP block belongs to a different VPC). Required for VPC peering connection.

      Custom TCP Rule

      TCP

      4043

      Security group ID used on the cluster or subnet IP block (if the IP block belongs to a different VPC). Required for VPC peering connection.

Review the Virtual Private Cloud (VPC) Peering options to connect Databricks with the Unravel VM.

Workspace

VPC Peering Options

Workspace and Unravel VM are in the same VPC

-

Workspace VPC is in a different Region

Use VPC Peering:

Workspace VPC is in a different AWS account

Use VPC Peering:

Install the Databricks File System (DBFS) command-line interface using the following commands:

sudo bash 
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install python-pip
pip install databricks-cli
2. Download Unravel
3. Deploy Unravel
4. Run setup

You can run the setup command to install Unravel. The setup command allows you to do the following:

  • Runs Precheck automatically to detect possible issues that prevent a successful installation. Suggestions are provided to resolve issues. Refer to Precheck filters for the expected value for each filter.

  • Let you run extra parameters to integrate the database of your choice.

    The setup command allows you to use a managed database shipped with Unravel or an external database. The setup uses the Unravel managed PostgreSQL database when run without any additional parameters. Otherwise, you can specify one of the following types of databases in the setup command:

    • MySQL (Unravel managed as well as external MySQL database)

    • MariaDB (Unravel managed as well as external MariaDB database)

    • PostgreSQL (External PostgreSQL)

    Refer to Integrate database for details.

  • Let you specify a separate path for the data directory other than the default path.

    You can locate the Unravel data and configurations in the data directory. By default, the installer maintains the data directory under <Unravel installation directory>/data. You can also change the data directory's default location by running additional parameters with the setup command. To install Unravel with the setup command.

  • Provides more setup options.

Notice

The Unravel user who owns the installation directory should run the setup command to install Unravel.

To install Unravel with the setup command, do the following:

  1. Switch to Unravel user.

      su - <unravel user>
  2. Run setup command:

    Note

    Refer to setup Options for all the additional parameters that you can run with the setup command

    Refer to Integrate database topic and complete the pre-requisites before running the setup command with any other database other than Unravel managed PostgreSQL, which is shipped with the product. Extra parameters must be passed with the setup command when using another database.

    Tip

    Optionally, if you want to provide a different data directory, you can pass an extra parameter (--data-directory) with the setup command as shown below:

    <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks --data-directory /the/data/directory

    Similarly, you can configure separate directories for other unravel directories —contact support for assistance.

    • PostgreSQL

      • Unravel managed PostgreSQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks
      • External PostgreSQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks --external-database postgresql <HOST> <PORT> <SCHEMA> <USERNAME> <PASSWORD>/
        
        ##The HOST, PORT, SCHEMA, USERNAME, PASSWORD are optional fields and are prompted if missing.
        
        ##For example:
        /opt/unravel/versions/abcd.992/setup --enable-databricks --external-database postgresql xyz.unraveldata.com 5432 unravel_db_prod unravel unraveldata
        
    • MySQL

      • Unravel managed MySQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks --extra /tmp/mysql
      • External MySQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks --extra /tmp/<MySQL-directory> --external-database mysql <HOST> <PORT> <SCHEMA> <USERNAME> <PASSWORD>
        
        ##The HOST, PORT, SCHEMA, USERNAME, PASSWORD are optional fields and are prompted if missing.
        
    • MariaDB

      • Unravel managed MariaDB

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-databricks  --extra /tmp/mariadb
      • External MariaDB

        <unravel_installation_directory>unravel/versions/<Unravel version>/setup --enable-databricks  --extra /tmp/<MariaDB-directory> --external-database mariadb <HOST> <PORT> <SCHEMA> <USERNAME> <PASSWORD>
        
        ##The HOST, PORT, SCHEMA, USERNAME, PASSWORD are optional fields and are prompted if missing.
        

    Precheck is automatically run when you run the setup command. Refer to Precheck filters for the expected value for each filter.

  3. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    <Unravel installation directory>/unravel/manager refresh databricks
    
  4. Start all the services.

    <unravel_installation_directory>/unravel/manager start 
    
  5. Check the status of services.

    <unravel_installation_directory>/unravel/manager report 
    

    The following service statuses are reported:

    • OK: Service is up and running.

    • Not Monitored: Service is not running. (Has stopped or has failed to start)

    • Initializing: Services are starting up.

    • Does not exist: The process unexpectedly disappeared. A restart will be attempted ten times.

    You can also get the status and information for a specific service. Run the manager report command as follows:

    <unravel_installation_directory>/unravel/manager report <service> 
    ## For example: /opt/unravel/manager report auto_action
    

The Precheck output displays the issues that prevent a successful installation and provides suggestions to resolve them. You must resolve each of the issues before proceeding. See Precheck filters.

After resolving the precheck issues, you must re-login or reload the shell to execute the setup command again.

Note

You can skip the precheck using the setup --skip-precheck command in certain situations.

For example:

/opt/unravel/versions/<Unravel version>/setup --skip-precheck

You can also skip the checks that you know can fail. For example, if you want to skip the Check limits option and the Disk freespace option, pick the command within the parenthesis corresponding to these failed options and run the setup command as follows:

setup --filter-precheck ~check_limits,~check_freespace 

Tip

Run --help with the setup command and any combination of the setup command for complete usage details.

<unravel_installation_directory>/unravel/versions/<Unravel version>/setup --help
Precheck filters
Precheck Sample
/opt/unravel/versions/abcd.1004/setup 
2021-04-05 15:51:30 Sending logs to: /tmp/unravel-setup-20210405-155130.log
2021-04-05 15:51:30 Running preinstallation check...
2021-04-05 15:51:31 Gathering information ................. Ok
2021-04-05 15:51:51 Running checks .................. Ok
--------------------------------------------------------------------------------
system
 Check limits        : PASSED
 Clock sync          : PASSED
 CPU requirement     : PASSED, Available cores: 8 cores
 Disk access         : PASSED, /opt/unravel/versions/develop.1004/healthcheck/healthcheck/plugins/system is writable
 Disk freespace      : PASSED, 229 GB of free disk space is available for precheck dir.
 Kerberos tools      : PASSED
 Memory requirement  : PASSED, Available memory: 79 GB
 Network ports       : PASSED
 OS libraries        : PASSED
 OS release          : PASSED, OS release version: centos 7.6
 OS settings         : PASSED
 SELinux             : PASSED
--------------------------------------------------------------------------------
Healthcheck report bundle: /tmp/healthcheck-20210405155130-xyz.unraveldata.com.tar.gz
2021-04-05 15:51:53 Prepare to install with: /opt/unravel/versions/abcd.1004/installer/installer/../installer/conf/presets/default.yaml
2021-04-05 15:51:57 Sending logs to: /opt/unravel/logs/setup.log
2021-04-05 15:51:57 Instantiating templates ................................................................................................................................................................................................................................ Ok
2021-04-05 15:52:05 Creating parcels .................................... Ok
2021-04-05 15:52:20 Installing sensors file ............................ Ok
2021-04-05 15:52:20 Installing pgsql connector ... Ok
2021-04-05 15:52:22 Starting service monitor ... Ok
2021-04-05 15:52:27 Request start for elasticsearch_1 .... Ok
2021-04-05 15:52:27 Waiting for elasticsearch_1 for 120 sec ......... Ok
2021-04-05 15:52:35 Request start for zookeeper .... Ok
2021-04-05 15:52:35 Request start for kafka .... Ok
2021-04-05 15:52:35 Waiting for kafka for 120 sec ...... Ok
2021-04-05 15:52:37 Waiting for kafka to be alive for 120 sec ..... Ok
2021-04-05 15:52:42 Initializing pgsql ... Ok
2021-04-05 15:52:46 Request start for pgsql .... Ok
2021-04-05 15:52:46 Waiting for pgsql for 120 sec ..... Ok
2021-04-05 15:52:47 Creating database schema ................. Ok
2021-04-05 15:52:50 Generating hashes .... Ok
2021-04-05 15:52:52 Loading elasticsearch templates ............ Ok
2021-04-05 15:52:55 Creating kafka topics .................... Ok
2021-04-05 15:53:36 Creating schema objects ....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... Ok
2021-04-05 15:54:03 Request stop ....................................................... Ok
2021-04-05 15:54:16 Done
[unravel@xyz ~]$
5. Connect Databricks cluster to Unravel
  1. Register a workspace in Unravel using one of the following options

    • Option 1: From Unravel UI

      1. Go to the Unravel Instance UI and click Workspaces. The Workspaces Manager page is displayed.

      2. Click Add Workspace and enter the following details.

        Field

        Description

        Workspace Id

        Databricks workspace ID.

        Workspace Name

        Databricks workspace name.

        Instance (Region) URL

        Regional URL where the Databricks workspace is deployed.

        Tier

        Select a subscription option: Standard or Premium.

        Token

        Personal access token to access Databricks REST APIs. Refer to Authentication using Databricks personal access tokens to create personal access tokens.

        Note

        Personal access tokens can be created with admin or non-admin roles.

        Note

        After you click the Add button, it will take around 2-3 minutes to register the Databricks Workspace with Unravel.

    • Option 2: From command-line using manager command.

      1. Stop Unravel

        <Unravel installation directory>/unravel/manager stop
        
      2. Switch to Unravel user.

      3. Add the workspace details using the manager command as follows from the Unravel installation directory:

        source <path-to-python3-virtual environment-dir>/bin/activate
        <Unravel_installation_directory>/unravel/manager config databricks add --id <workspace-id> --name <workspace-name> --instance <workspace-instance> --access-token <workspace-token> --unravel-endpoint <unravel_server:port> 
        
        ##For example:
        /opt/unravel/manager config databricks add --id 0000000000000000 --name myworkspacename --instance https://adb-0000000000000000.16.azuredatabricks.net --access-token xxxx --unravel-endpoint 127.0.0.1:4043 --tier premium
        
      4. Apply the changes.

        <Unravel installation directory>/unravel/manager config apply
        
      5. Start Unravel

        <Unravel installation directory>/unravel/manager start
  2. Add Unravel configuration to Databricks clusters. Go to Unravel UI, click Manage > Workspaces > Cluster configuration to get the configuration details. Follow the instructions and update every cluster (Automated /Interactive) in your workspace.

    Note

    Alternatively, a Databricks Cluster Policy can be created with the following configurations and used with clusters

    databricks-cluster-config.png

    Note

    To add Unravel configurations to job clusters via API, use the JSON format as follows:

    {
        "settings": {
            "new_cluster": {
                "spark_conf": {
                    // Note: If extraJavaOptions is already in use, prepend the Unravel values. Also, for Databricks Runtime with spark 2.x.x, replace "spark-3.0" with "spark-2.4"
                    "spark.executor.extraJavaOptions": "-Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -Dcom.unraveldata.client.resolve.hostname=true -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=executor,libs=spark-3.0",
                    "spark.driver.extraJavaOptions": "-Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -Dcom.unraveldata.client.resolve.hostname=true -Dcom.unraveldata.agent.metrics.ganglia_enabled=true -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=driver,script=StreamingProbe.btclass,libs=spark-3.0",
                    "spark.unravel.server.hostport": "127.0.0.1:4043",
                    "spark.unravel.shutdown.delay.ms": "300",
    
                    // rest of your spark properties here ...
                    ... 
                },
                "init_scripts": [
                    {
                        "dbfs": {
                            "destination": "dbfs:/databricks/unravel/unravel-db-sensor-archive/dbin/install-unravel.sh"
                        }
                    },
    
                    // rest of your init scripts here ...
                    ...
                ],
                "cluster_log_conf": {
                    "dbfs": {
                        "destination": "dbfs:/cluster-logs"
                    }
                },
    
                // rest of your cluster properties here ...
                ...
            },
            ...
        }
    }
  3. Set additional configurations if required.

  4. Configure the Workspace for Data page.

  5. Optionally, you can run healthcheck, at this point, to verify that all the configurations and services are running successfully.

    <unravel_installation_directory>/unravel/manager healthcheck
    

    Healthcheck is run automatically on an hourly basis in the backend. You can set the healthcheck intervals and email alerts to receive the healthcheck reports.