Amazon Web Services (AWS) Databricks

This topic explains how to deploy Unravel on Amazon Web Services (AWS) Databricks.

Create AWS components

Install Unravel

Configure and restart Unravel

Configure Databricks with Unravel

Install a compatible version of MySQL server and database.

  • On CentOS 6:

    wget https://dev.mysql.com/get/mysql80-community-release-el6-1.noarch.rpm
    sudo yum install yum-utils
    sudo rpm -ivh mysql80-community-release-el6-1.noarch.rpm
    sudo yum-config-manager --disable mysql80-community
    sudo yum-config-manager --enable mysql57-community
    sudo yum install mysql-community-server
  • On CentOS 7:

    wget https://dev.mysql.com/get/mysql80-community-release-el7-1.noarch.rpm
    sudo yum install yum-utils
    sudo rpm -ivh mysql80-community-release-el7-1.noarch.rpm
    sudo yum-config-manager --disable mysql80-community
    sudo yum-config-manager --enable mysql57-community
    sudo yum install mysql-community-server
  • On SELinux:

    If you are installing MySQL on an SELinux host and are not using the default datadir, see Deploying Unravel on security-enhanced Linux.

Note

PostgreSQL is supported from Unravel version 4.6.1.6.

You can use either the PostgreSQL bundled with Unravel or an external PostgreSQL version 12.

To use the bundled PostgreSQL:

  1. Install Unravel. The Unravel bundled PostgreSQL is automatically installed.

  2. Run the following commands in the given sequence to set up and connect the bundled PostgreSQL:

    sudo /usr/local/unravel/bin/emdb_enable.sh
    sudo /etc/init.d/unravel_all.sh restart

To use an external PostgreSQL 12:

  1. Install the PostgreSQL yum repository package.

    sudo yum install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
  2. Run the following commands to install:

    sudo yum install postgresql12-server
    sudo /usr/pgsql-12/bin/postgresql-12-setup initdb
    sudo systemctl start postgresql-12
    sudo systemctl enable postgresql-12

Review the Virtual Private Cloud (VPC) Peering options to connect Databricks with the Unravel VM.

  Workspace                                       VPC Peering Options
  Workspace and Unravel VM are in the same VPC    -
  Workspace VPC is in a different Region          Use VPC Peering
  Workspace VPC is in a different AWS account     Use VPC Peering

Install the following Unravel prerequisites on the EC2 instance:

  1. Install ntpd.

    sudo su -
    yum install ntp
    ntpd -u ntp:ntp
  2. Prepare the data disk. Set permissions for Unravel and symlink Unravel's directories to the /srv mount.

    mkdir -p /srv/local/unravel
    chmod -R 755 /srv/local
    ln -s /srv/local/unravel /usr/local/unravel
    chmod 755 /usr/local/unravel
  3. Install MySQL if not done already.

    yum install mysql
  4. Install the Databricks File System (DBFS) command-line interface.

    sudo bash 
    yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    yum install python-pip
    pip install databricks-cli

    Note

    You can test connectivity using the DBFS command-line interface. If you see errors such as Error: ValueError: Timeout value connect was Timeout, reinstall the DBFS command-line interface in a Python virtualenv as follows:

    pip install virtualenv
    yum install python3
    virtualenv -p /usr/bin/python3 mypy3
    source mypy3/bin/activate
    pip install databricks-cli
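The note above mentions testing connectivity with the DBFS command-line interface but does not show a command. The following is a minimal sketch, assuming the databricks-cli package has already been configured with the workspace URL and a token (for example with databricks configure --token). The function name check_dbfs is hypothetical and is not part of Unravel or Databricks.

```shell
# Hypothetical helper: list the DBFS root to confirm the CLI can reach the workspace.
# Assumes the `dbfs` command installed by databricks-cli is on PATH and configured.
check_dbfs() {
    if dbfs ls dbfs:/ >/dev/null 2>&1; then
        echo "DBFS reachable"
    else
        echo "DBFS not reachable; consider reinstalling databricks-cli in a virtualenv" >&2
        return 1
    fi
}
```

Call check_dbfs after configuring the CLI; a non-zero exit status indicates that the listing failed.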
  1. Download the latest RPM for Cloud.

  2. Install the RPM using the following command:

    sudo rpm -ivh cloud_rpm 2> /tmp/rpm-install-log.txt
  1. Edit /usr/local/unravel/etc/unravel.properties.

  2. Set the following properties:

    com.unraveldata.cluster.type=DB
    com.unraveldata.tagging.enabled=true

    If these properties are not present, add them to the file.
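Editing the properties file by hand works, but the "update if present, otherwise add" rule can also be scripted. The sketch below is one possible approach, not an Unravel tool; the function name set_prop is hypothetical, and it assumes simple key=value lines with no continuation lines.

```shell
# Hypothetical helper: set key=value in a Java-style properties file.
# Every line that starts with "key=" is rewritten; if none exists, the pair is appended.
set_prop() {
    local file="$1" key="$2" value="$3" tmp status
    tmp="$(mktemp)" || return 1
    touch "$file"
    # Pass key and value through the environment so awk does not interpret backslashes.
    KEY="$key" VALUE="$value" awk '
        BEGIN { k = ENVIRON["KEY"]; v = ENVIRON["VALUE"]; found = 0 }
        index($0, k "=") == 1 { print k "=" v; found = 1; next }
        { print }
        END { if (!found) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the original owner and mode
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, set_prop /usr/local/unravel/etc/unravel.properties com.unraveldata.cluster.type DB, run as a user that can write to the file.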

  1. Using MySQL, create a database and user for Unravel. Enter the MySQL admin password when prompted:

    mysql> CREATE DATABASE unravel_mysql_prod;
    mysql> CREATE USER '<Unravel database user>'@'<MySQL server name>' IDENTIFIED BY '<Unravel database password>';
    mysql> GRANT ALL PRIVILEGES ON unravel_mysql_prod.* TO '<Unravel database user>'@'<MySQL server name>';
  2. Configure MySQL in /usr/local/unravel/etc/unravel.properties.

    unravel.jdbc.username=<Unravel database user>
    unravel.jdbc.password=<Unravel database password>
    unravel.jdbc.url=jdbc:mysql://<MySQL Server name>:3306/unravel_mysql_prod
    unravel.jdbc.url.params=useSSL=true&requireSSL=false
  3. Install MySQL JDBC connector driver in Unravel classpath.

    wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz -O /tmp/mysql-connector-java-5.1.47.tar.gz
    tar xvzf /tmp/mysql-connector-java-5.1.47.tar.gz -C /tmp
    sudo mkdir -p /usr/local/unravel/share/java
    sudo cp /tmp/mysql-connector-java-5.1.47/mysql-connector-java-5.1.47.jar /usr/local/unravel/share/java
    sudo chown unravel:unravel /usr/local/unravel/share/java/mysql-connector-java-5.1.47.jar
    
  4. Create database and tables for Unravel.

    /usr/local/unravel/dbin/db_schema_upgrade.sh
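Before running the schema upgrade, it can help to confirm that the JDBC properties and the driver jar from the previous steps are in place. The following is a rough pre-flight sketch under that assumption; check_jdbc_setup is a hypothetical name, and the check only looks for the three property keys shown above, not for valid values.

```shell
# Hypothetical pre-flight check: confirm the JDBC properties and the driver jar exist.
# Returns 0 when everything is found, 1 otherwise.
check_jdbc_setup() {
    local props="$1" jar="$2" key missing=0
    for key in unravel.jdbc.username unravel.jdbc.password unravel.jdbc.url; do
        if ! grep -q "^${key}=" "$props"; then
            echo "missing property: ${key}" >&2
            missing=1
        fi
    done
    if [ ! -f "$jar" ]; then
        echo "missing driver jar: ${jar}" >&2
        missing=1
    fi
    return "$missing"
}
```

For example, check_jdbc_setup /usr/local/unravel/etc/unravel.properties /usr/local/unravel/share/java/mysql-connector-java-5.1.47.jar.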

Note

PostgreSQL is supported from Unravel version 4.6.1.6.

  1. Run psql and create a database and user for Unravel. Enter the PostgreSQL admin password when prompted:

    sudo -u postgres psql
    create database unravel;
    create user unravel with encrypted password 'unraveldata';
    grant all privileges on database unravel to unravel;
    ALTER USER unravel WITH SUPERUSER;
    alter user unravel with createdb createrole inherit replication bypassrls;
    grant connect on database unravel to unravel;
    grant usage on schema public to unravel;
    grant all privileges on all tables in schema public to unravel;
    grant all privileges on all sequences in schema public to unravel;
    alter default privileges in schema public grant all privileges on tables to unravel;
    alter default privileges in schema public grant all privileges on sequences to unravel;
    grant pg_read_server_files to unravel;
    grant pg_write_server_files to unravel;
    grant pg_execute_server_program to unravel;
  2. Allow user/server to connect to the database.

    • Option 1: If PostgreSQL is installed on the same server as Unravel, add the following line to /var/lib/pgsql/12/data/pg_hba.conf as the first entry under the IPv4 local connections section:

      host    all             unravel         127.0.0.1/32            md5
    • Option 2: If PostgreSQL is installed on a different server, do the following:

      1. Add the following line in /var/lib/pgsql/12/data/pg_hba.conf:

        host    all             unravel            <Unravel Server Internal IP Address>/32            md5
      2. Update /var/lib/pgsql/12/data/postgresql.conf and set listen_addresses = '*' so that PostgreSQL listens on all network interfaces.

  3. Add the following properties in /usr/local/unravel/etc/unravel.properties:

    unravel.jdbc.username=unravel
    unravel.jdbc.password=unraveldata
    unravel.jdbc.url=jdbc:postgresql://127.0.0.1:5432/unravel
  4. Restart PostgreSQL.

    sudo systemctl restart postgresql-12.service
  5. Install PostgreSQL JDBC connector driver in Unravel classpath.

    sudo mkdir -p /usr/local/unravel/share/java
    sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.18.jar -O /usr/local/unravel/share/java/postgresql-42.2.18.jar
    sudo chown unravel:unravel /usr/local/unravel/share/java/postgresql-42.2.18.jar
  6. Test database access and update the database schema.

    /usr/local/unravel/install_bin/db_access.sh
    /usr/local/unravel/dbin/db_schema_upgrade.sh
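The two pg_hba.conf options in step 2 differ only in the client address. As an illustration, the sketch below prints the entry for a given IPv4 address; pg_hba_entry is a hypothetical helper name, and the address check is deliberately loose (it only verifies the dotted-quad shape).

```shell
# Hypothetical helper: print the pg_hba.conf entry that lets the unravel user
# connect from a single IPv4 address with md5 password authentication.
pg_hba_entry() {
    local ip="${1:-127.0.0.1}"   # default matches Option 1 (same server)
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: ${ip}" >&2
        return 1
    fi
    printf 'host    all    unravel    %s/32    md5\n' "$ip"
}
```

PostgreSQL reads pg_hba.conf from top to bottom and uses the first matching record, so paste the printed line above the existing IPv4 entries instead of appending it to the end of the file.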
    
  1. In Databricks, go to Workspace > Admin Console > Access Control and enable Personal Access Tokens. See Enable token-based authentication.

  2. Go to Workspace > User Settings > Access Tokens and click Generate New Token. Set the token lifetime to indefinite. See Authenticate using Databricks personal access tokens.

  3. Install Unravel agents on the Workspace and update the Unravel configuration with the Workspace details. See Running the Databricks_setup.sh script.

    Note

    Run the following commands first only if the Databricks command-line interface was installed using Python virtualenv.

    sudo bash
    virtualenv -p /usr/bin/python3 mypy3
    source mypy3/bin/activate

    Then run the setup script:

    /usr/local/unravel/install_bin/databricks_setup.sh --add-workspace -i <Workspace ID> -n <Workspace name> -t <Workspace token> -r https://<Workspace instance> -p <workspace_tier> -u <Unravel DNS or IP Address>:4043
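The setup command takes six placeholders, and a missing one is easy to overlook. The sketch below assembles the command from environment variables and prints it for review instead of running it. The variable names (WORKSPACE_ID and so on) and the function name build_setup_cmd are assumptions made for this example; the flags are the ones shown in the command above.

```shell
# Hypothetical wrapper: build the databricks_setup.sh command line from environment
# variables and print it, failing early if any value is missing.
# Note: the printed command includes the token, so avoid logging the output.
build_setup_cmd() {
    local var value
    for var in WORKSPACE_ID WORKSPACE_NAME WORKSPACE_TOKEN WORKSPACE_INSTANCE WORKSPACE_TIER UNRAVEL_HOST; do
        eval "value=\${$var:-}"   # portable indirect lookup of the variable named in $var
        if [ -z "$value" ]; then
            echo "missing required variable: ${var}" >&2
            return 1
        fi
    done
    printf '%s --add-workspace -i %s -n %s -t %s -r https://%s -p %s -u %s:4043\n' \
        /usr/local/unravel/install_bin/databricks_setup.sh \
        "$WORKSPACE_ID" "$WORKSPACE_NAME" "$WORKSPACE_TOKEN" \
        "$WORKSPACE_INSTANCE" "$WORKSPACE_TIER" "$UNRAVEL_HOST"
}
```

Once the printed command looks right, run it on the Unravel host as shown in step 3.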
  1. Restart all Unravel services.

    service unravel_all.sh restart
  2. Using a supported web browser (see the compatibility matrix for AWS Databricks), navigate to http://unravel-host:3000 and log in with username admin and password unraveldata.


In your Databricks workspace, update the following tabs under Advanced Options for every cluster (Automated/Interactive) that you want to monitor:

Spark

Copy the following snippet to Spark > Spark Conf. Replace <Unravel DNS or IP Address>. This snippet is also generated by the Databricks setup script on Unravel.

spark.eventLog.enabled true
spark.eventLog.dir dbfs:/databricks/unravel/eventLogs/
spark.unravel.server.hostport <Unravel DNS or IP Address>:4043
spark.unravel.shutdown.delay.ms 300
spark.executor.extraJavaOptions -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=executor,libs=spark-2.3
spark.driver.extraJavaOptions -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=driver,script=StreamingProbe.btclass,libs=spark-2.3

Note

For spark-submit jobs, click Configure spark-submit and copy the following snippet in the Set Parameters > Parameters text box as spark-submit parameters. Replace <Unravel DNS or IP Address>.

"--conf", "spark.eventLog.enabled=true",
"--conf", "spark.eventLog.dir=dbfs:/databricks/unravel/eventLogs/",
"--conf", "spark.unravel.shutdown.delay.ms=300",
"--conf", "spark.unravel.server.hostport=<Unravel DNS or IP Address>:4043",
"--conf", "spark.executor.extraJavaOptions= -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=executor,libs=spark-2.3",
"--conf", "spark.driver.extraJavaOptions= -Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000 -javaagent:/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar=config=driver,script=StreamingProbe.btclass,libs=spark-2.3"

Logging

Select DBFS as Destination, and copy the following as Cluster Log Path.

dbfs:/cluster-logs/

Init Scripts

In the Init Scripts tab, set Destination to DBFS. Copy the following as the Init script path and click Add.

dbfs:/databricks/unravel/unravel-db-sensor-archive/dbin/install-unravel.sh
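
When several clusters need the same Spark Conf entries, substituting the Unravel address by hand is error-prone. The sketch below prints the six Spark Conf lines from the Spark section with the address filled in. It is an illustration only: spark_conf_for is a hypothetical name, and the setup script's own generated snippet remains the authoritative source.

```shell
# Hypothetical helper: print the Spark Conf entries with the Unravel address substituted.
# The property names and values are copied from the Spark Conf snippet in this topic.
spark_conf_for() {
    local host="$1"
    local agent="/dbfs/databricks/unravel/unravel-agent-pack-bin/btrace-agent.jar"
    local timeouts="-Dcom.unraveldata.client.rest.request.timeout.ms=1000 -Dcom.unraveldata.client.rest.conn.timeout.ms=1000"
    if [ -z "$host" ]; then
        echo "usage: spark_conf_for <Unravel DNS or IP Address>" >&2
        return 1
    fi
    cat <<EOF
spark.eventLog.enabled true
spark.eventLog.dir dbfs:/databricks/unravel/eventLogs/
spark.unravel.server.hostport ${host}:4043
spark.unravel.shutdown.delay.ms 300
spark.executor.extraJavaOptions ${timeouts} -javaagent:${agent}=config=executor,libs=spark-2.3
spark.driver.extraJavaOptions ${timeouts} -javaagent:${agent}=config=driver,script=StreamingProbe.btclass,libs=spark-2.3
EOF
}
```

For example, spark_conf_for 10.0.0.5 prints a block that can be pasted into Spark > Spark Conf.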