Install Unravel in Amazon Elastic MapReduce (EMR)

Before installing Unravel on EMR, make sure that the Unravel installation requirements are met. Then follow these steps to install and configure Unravel:

1. Create and configure an EC2 instance

2. Download Unravel

3. Deploy Unravel binaries

4. Set up and install Unravel

5. Connect a new or existing EMR cluster to Unravel

6. Add AWS account details in Unravel for EMR chargeback data and cluster insights

1. Create and configure an EC2 instance

Run the following steps to create an EC2 instance:

Run the following steps to configure the EC2 instance:

  1. Set SELinux to permissive mode.

    sudo setenforce Permissive
  2. Edit /etc/selinux/config and set SELINUX=permissive so that the setting persists after a reboot.

    sudo vi /etc/selinux/config
  3. Install libaio.x86_64, lzop.x86_64, and ntp.x86_64.

    sudo yum install -y libaio.x86_64   ## Only required if you use Unravel managed MySQL
    sudo yum install -y lzop.x86_64
    sudo yum install -y ntp.x86_64
  4. Start ntpd and check the system time.

    sudo service ntpd start
    sudo ntpq -p
  5. Create a new user named hadoop.

    sudo useradd hadoop
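
The edit in step 2 can also be made without an interactive editor. The following is a minimal sketch, assuming GNU sed (the default on typical EC2 Linux images); set_selinux_mode is a hypothetical helper name and is not part of Unravel.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Unravel): set SELINUX=<mode> in a config file.
set_selinux_mode() {
    local config_file="$1" mode="$2"
    # Rewrite an existing SELINUX= line; SELINUXTYPE= lines do not match this pattern.
    sed -i -E "s/^SELINUX=.*/SELINUX=${mode}/" "$config_file"
    # If the file had no SELINUX= line at all, append one.
    grep -q "^SELINUX=${mode}\$" "$config_file" || echo "SELINUX=${mode}" >> "$config_file"
}

# Usage (run as root, because /etc/selinux/config is root-owned):
#   set_selinux_mode /etc/selinux/config permissive
```

The helper takes the file path as an argument so that it can be tried on a copy of the configuration file before it is applied to /etc/selinux/config.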
2. Download Unravel
3. Deploy Unravel binaries
4. Run setup

Run the setup command to install Unravel. The setup command does the following:

  • Runs Precheck automatically to detect possible issues that prevent a successful installation. Suggestions are provided to resolve issues. Refer to Precheck filters for the expected value for each filter.

  • Lets you pass extra parameters to integrate the database of your choice.

    The setup command allows you to use a managed database shipped with Unravel or an external database. When run without any additional parameters, the setup uses the Unravel managed PostgreSQL database. Otherwise, you can specify one of the following types of databases in the setup command:

    • MySQL (Unravel managed as well as external MySQL database)

    • MariaDB (Unravel managed)

    • PostgreSQL (Unravel managed)

    • Amazon RDS

    Refer to Integrate database for details.

  • Lets you specify a data directory path other than the default.

    The Unravel data and configurations are located in the data directory. By default, the installer maintains the data directory under <Unravel installation directory>/data. To change this location, pass additional parameters to the setup command.

  • Provides more setup options.

Notice

The Unravel user who owns the installation directory should run the setup command to install Unravel.

To install Unravel with the setup command, do the following:

  1. After deploying the binaries, if you are the root user, switch to the Unravel user.

      su - <unravel user>
  2. Run the setup command:

    Note

    Refer to setup Options for all the additional parameters that can be passed to the setup command.

    Before you run the setup command with any database other than the Unravel managed PostgreSQL that ships with the product, refer to the Integrate database topic and complete the prerequisites. You must pass extra parameters to the setup command when you use another database.

    Tip

    To use a different data directory, pass the --data-directory parameter to the setup command as shown below:

    <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --data-directory /the/data/directory

    You can similarly configure separate locations for other Unravel directories. Contact support for assistance.

    • PostgreSQL

      • Unravel managed PostgreSQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-emr
    • MySQL

      • Unravel managed MySQL

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-emr --extra /tmp/mysql
    • MariaDB

      • Unravel managed MariaDB

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-emr --extra /tmp/mariadb
    • Amazon Relational Database Service (RDS)

      Amazon RDS can be used optionally as an external database. To set up Amazon RDS with Unravel, do the following:

      1. Set up Amazon RDS.

      2. From the Unravel installation directory, run the following command to configure Amazon RDS with Unravel.

        <unravel_installation_directory>/unravel/versions/<Unravel version>/setup --enable-emr --extra /tmp/mysql --external-database mysql <HOST> <PORT> <SCHEMA> <USERNAME> <PASSWORD>
        ## HOST, PORT, SCHEMA, USERNAME, and PASSWORD are optional; setup prompts for any that are missing.
        
        ##Example:
        /opt/unravel/versions/abcd-1234/setup --enable-emr --extra /tmp/mysql --external-database mysql unravelmysqlprod.csfws86cxagh.us-east-1.xyz.amazonaws.com 3306 unravel_mysql_prod unravel 1234
        

    Precheck is automatically run when you run the setup command. Refer to Precheck filters for the expected value for each filter.

  3. Set the following property:

    <unravel_installation_directory>/unravel/manager config properties set com.unraveldata.process.event.log false
  4. Apply the changes.

    <unravel_installation_directory>/unravel/manager config apply 
  5. Start all the services.

    <unravel_installation_directory>/unravel/manager start 
    
  6. Check the status of services.

    <unravel_installation_directory>/unravel/manager report 
    

    The following service statuses are reported:

    • OK: Service is up and running.

    • Not Monitored: Service is not running (it has stopped or has failed to start).

    • Initializing: Services are starting up.

    • Does not exist: The process unexpectedly disappeared. Restarts will be attempted 10 times.

    You can also get the status and information for a specific service. Run the manager report command as follows:

    <unravel_installation_directory>/unravel/manager report <service> 
    ## For example: /opt/unravel/manager report auto_action
    
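
After starting the services, the status check in step 6 can be repeated in a loop until everything has come up. The following is a rough sketch only: it assumes that the text printed by manager report contains the status names listed above, which this page does not confirm, and services_all_ok and wait_for_services are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Assumption: the report text includes the status names listed above verbatim.
# Reads report text on stdin; succeeds only if no non-OK status name appears.
services_all_ok() {
    ! grep -q -E 'Not Monitored|Initializing|Does not exist'
}

# Polls "<manager> report" up to <attempts> times, sleeping <delay> seconds between tries.
wait_for_services() {
    local manager="$1" attempts="${2:-30}" delay="${3:-10}" i=0 report
    while [ "$i" -lt "$attempts" ]; do
        # Only trust the report if the manager command itself succeeded.
        if report=$("$manager" report) && printf '%s\n' "$report" | services_all_ok; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (path is an example):
#   wait_for_services /opt/unravel/manager 30 10 || echo "services did not reach OK"
```

If the real report format differs, only the pattern in services_all_ok needs to change.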

The Precheck output displays the issues that prevent a successful installation and also provides suggestions to resolve them. You must resolve each of the issues before proceeding. See Precheck filters.

After you resolve the Precheck issues, log in again or reload the shell before you run the setup command again.

Here is a sample of the Precheck run result:

/opt/unravel/versions/abcd.1004/setup 
2021-04-05 15:51:30 Sending logs to: /tmp/unravel-setup-20210405-155130.log
2021-04-05 15:51:30 Running preinstallation check...
2021-04-05 15:51:31 Gathering information ................. Ok
2021-04-05 15:51:51 Running checks .................. Ok
--------------------------------------------------------------------------------
system
 Check limits        : PASSED
 Clock sync          : PASSED
 CPU requirement     : PASSED, Available cores: 8 cores
 Disk access         : PASSED, /opt/unravel/versions/develop.1004/healthcheck/healthcheck/plugins/system is writable
 Disk freespace      : PASSED, 229 GB of free disk space is available for precheck dir.
 Kerberos tools      : PASSED
 Memory requirement  : PASSED, Available memory: 79 GB
 Network ports       : PASSED
 OS libraries        : PASSED
 OS release          : PASSED, OS release version: centos 7.6
 OS settings         : PASSED
 SELinux             : PASSED
--------------------------------------------------------------------------------
Healthcheck report bundle: /tmp/healthcheck-20210405155130-xyz.unraveldata.com.tar.gz
2021-04-05 15:51:53 Prepare to install with: /opt/unravel/versions/abcd.1004/installer/installer/../installer/conf/presets/default.yaml
2021-04-05 15:51:57 Sending logs to: /opt/unravel/logs/setup.log
2021-04-05 15:51:57 Instantiating templates ................................................................................................................................................................................................................................ Ok
2021-04-05 15:52:05 Creating parcels .................................... Ok
2021-04-05 15:52:20 Installing sensors file ............................ Ok
2021-04-05 15:52:20 Installing pgsql connector ... Ok
2021-04-05 15:52:22 Starting service monitor ... Ok
2021-04-05 15:52:27 Request start for elasticsearch_1 .... Ok
2021-04-05 15:52:27 Waiting for elasticsearch_1 for 120 sec ......... Ok
2021-04-05 15:52:35 Request start for zookeeper .... Ok
2021-04-05 15:52:35 Request start for kafka .... Ok
2021-04-05 15:52:35 Waiting for kafka for 120 sec ...... Ok
2021-04-05 15:52:37 Waiting for kafka to be alive for 120 sec ..... Ok
2021-04-05 15:52:42 Initializing pgsql ... Ok
2021-04-05 15:52:46 Request start for pgsql .... Ok
2021-04-05 15:52:46 Waiting for pgsql for 120 sec ..... Ok
2021-04-05 15:52:47 Creating database schema ................. Ok
2021-04-05 15:52:50 Generating hashes .... Ok
2021-04-05 15:52:52 Loading elasticsearch templates ............ Ok
2021-04-05 15:52:55 Creating kafka topics .................... Ok
2021-04-05 15:53:36 Creating schema objects ................... Ok
2021-04-05 15:54:03 Request stop ....................................................... Ok
2021-04-05 15:54:16 Done
[unravel@xyz ~]$
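
When the Precheck output is long, it can help to list only the checks that did not pass. The following sketch parses lines in the "name : status" layout of the sample above. The sample shows only PASSED results, so treating any other status text (for example, a hypothetical FAILED) as a problem is an assumption, and list_non_passed_checks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Reads Precheck output on stdin and prints the name of every check whose
# status does not start with PASSED. Check lines look like:
#   " Disk freespace      : PASSED, 229 GB of free disk space ..."
list_non_passed_checks() {
    awk '
        /^ [^ ].* : / {
            sep = index($0, " : ")            # first " : " separates name from status
            name = substr($0, 2, sep - 2)     # drop the leading space
            sub(/ +$/, "", name)              # trim the column padding
            status = substr($0, sep + 3)
            if (status !~ /^PASSED/) print name
        }
    '
}

# Usage: pass a saved copy of the setup output (file name is an example):
#   list_non_passed_checks < precheck-output.txt
```

Timestamped log lines are ignored because they do not start with a single leading space.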

Note

In certain situations, you can skip the Precheck by running the setup command with the --skip-precheck option.

For example:

/opt/unravel/versions/<Unravel version>/setup --skip-precheck

You can also skip individual checks that you know will fail. For example, to skip the Check limits and Disk freespace checks, take the filter name shown in parentheses for each of these checks and pass it to the setup command as follows:

setup --filter-precheck ~check_limits,~check_freespace 
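
The filter value in the example above is a comma-separated list in which each check name is prefixed with ~. The following sketch builds that value from a list of names; build_precheck_filter is a hypothetical helper, and the check names are the two taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join check names into "~name1,~name2,...".
build_precheck_filter() {
    local filter="" name
    for name in "$@"; do
        # Add a comma only when the filter already has an entry.
        filter="${filter:+${filter},}~${name}"
    done
    printf '%s\n' "$filter"
}

# Usage: quote the value so that the shell never attempts tilde expansion.
#   setup --filter-precheck "$(build_precheck_filter check_limits check_freespace)"
```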

Tip

Run the setup command with --help, alone or combined with other setup options, for complete usage details.

<unravel_installation_directory>/unravel/versions/<Unravel version>/setup --help
Precheck filters
5. Connect a new or existing EMR cluster to Unravel

This topic explains how to set up and configure your EMR cluster so that Unravel can begin monitoring jobs running on the cluster.

Assumptions

  • The EC2 instance for Unravel is created.

  • Unravel services are running.

  • The security group on the Unravel EC2 instance allows traffic to/from EMR cluster nodes on TCP port 3000.

  • The nodes in the EMR cluster allow all traffic from the Unravel EC2 instance. This implies either of the following configurations:

  • Network ACL on VPC allows all traffic.
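
The port 3000 assumption can be checked from an EMR node before you continue. The following is a minimal sketch that relies on bash's /dev/tcp pseudo-device and the coreutils timeout command; port_open is a hypothetical helper name and UNRAVEL-INSTANCE-IP is a placeholder.

```shell
#!/usr/bin/env bash
# Succeeds if a TCP connection to host:port can be opened within 5 seconds.
port_open() {
    local host="$1" port="$2"
    # /dev/tcp is a bash feature, so the probe runs in an explicit bash child process.
    timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage, from an EMR node:
#   port_open UNRAVEL-INSTANCE-IP 3000 && echo "reachable" || echo "blocked"
```

A failure here usually points to a security group or network ACL rule rather than to Unravel itself.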

Connect to a new EMR cluster

Follow the steps below to run Unravel's bootstrap script, unravel_emr_bootstrap.py, on all nodes in the cluster. The bootstrap script makes the following changes:

  • On the master node:

    • On Hive clusters, it updates /lib/hive/conf/hive-site.xml

    • On Spark clusters, it updates /lib/spark/conf/spark-defaults.conf

    • It updates /lib/hadoop/etc/hadoop/mapred-site.xml

    • It updates /lib/hadoop/etc/hadoop/yarn-site.xml

    • If Tez is installed, it updates /etc/tez/conf/tez-site.xml

    • It installs and starts the unravel_es daemon in /usr/local/unravel_es

    • It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars

    • It installs the Hive Hook sensor in /usr/lib/hive/lib/.

  • On all other nodes:

    • It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars.
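
After the bootstrap script has run, you can confirm on a node that the installation locations listed above exist. The following sketch checks only for the presence of the paths, not their contents; report_missing_paths is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Prints "missing: <path>" for each argument that does not exist.
# Returns 0 if every path exists, 1 otherwise.
report_missing_paths() {
    local path missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the master node (paths taken from the list above):
#   report_missing_paths /usr/local/unravel_es /usr/local/unravel-agent/jars
# Usage on all other nodes:
#   report_missing_paths /usr/local/unravel-agent/jars
```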

Follow these steps to connect Unravel to a new EMR cluster:

  1. Download Unravel's bootstrap script, unravel_emr_bootstrap.py.

    curl https://s3.amazonaws.com/unraveldatarepo/unravel_emr_bootstrap.py -o /tmp/unravel_emr_bootstrap.py    
  2. Upload the bootstrap script to an S3 bucket.

    Permissions needed

    You need write access to the S3 bucket to which you upload the bootstrap script. Also, the AWS account that you use to create the EMR cluster must have read access to the bootstrap script to execute it.

    To upload the bootstrap script to the default EMR logging bucket, s3://aws-logs-account_number-region/elasticmapreduce, execute the following command:

    aws s3 cp /tmp/unravel_emr_bootstrap.py s3://aws-logs-account_number-region/elasticmapreduce/
  3. On the AWS console, select the EMR service and click Create cluster.

  4. In the Create Cluster - Quick Options screen, click Go to advanced options.

  5. In Step 1: Software and Steps, select any release from emr-6.2.

  6. In Step 2: Hardware, enter a configuration for your EMR cluster and click Next.

  7. In Step 3: General Cluster Settings, go to Bootstrap Actions > Add Bootstrap Action, select Custom action, and click Configure and add. Then specify the following settings. The Amazon EMR cluster will start with this bootstrap action.

    • Name: Select Custom action.

    • Script location: Specify the following bootstrap location:

        s3://unraveldatarepo/unravel_emr_bootstrap.py

    • Optional arguments: Enter the following:

        --unravel-server UNRAVEL-INSTANCE-IP --all --bootstrap

      Note

      To monitor MR jobs, you must pass the additional optional argument --all.
  8. Click Configure and add.

  9. In Step 4: Security, edit the configuration for the cluster as required. For example:

    • Choose the EC2 key pair.

    • Select the EC2 security groups. AWS EMR service automatically applies additional rules that are required for EMR nodes.

  10. Click Create cluster. After the bootstrap process finishes, the new EMR cluster enters the Waiting state.
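
The console steps above can also be approximated with the AWS CLI. The following sketch only prints an aws emr create-cluster command so that you can review it before you run it. The cluster name, applications, instance type, and instance count are placeholder assumptions, emr_create_cluster_cmd is a hypothetical helper, and the bootstrap arguments are the ones given in step 7.

```shell
#!/usr/bin/env bash
# Prints (does not run) a create-cluster command that attaches the Unravel
# bootstrap action. Name, applications, and sizing below are placeholders.
emr_create_cluster_cmd() {
    local unravel_ip="$1" script_s3="$2"
    printf '%s ' aws emr create-cluster \
        --name unravel-monitored-cluster \
        --release-label emr-6.2.0 \
        --applications Name=Spark Name=Hive \
        --instance-type m5.xlarge --instance-count 3 \
        --use-default-roles \
        --bootstrap-actions \
        "Path=${script_s3},Name=unravel-bootstrap,Args=[--unravel-server,${unravel_ip},--all,--bootstrap]"
    printf '\n'
}

# Usage:
#   emr_create_cluster_cmd UNRAVEL-INSTANCE-IP s3://unraveldatarepo/unravel_emr_bootstrap.py
```

Security settings such as the EC2 key pair and security groups from step 9 are not included and would need to be added with --ec2-attributes.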

Connect to an existing EMR cluster

To connect the Unravel EC2 instance to an existing EMR cluster, follow the steps below to run the Unravel EMR Ansible playbook either on the EMR master node or on your Mac/Linux workstation.

Important

The following process is for existing clusters that were created without the Unravel bootstrap. Currently, only such clusters without auto-scaling enabled are supported.

Whenever you upgrade Unravel Server, repeat the steps below to upgrade Unravel Sensors as well.

6. Add AWS account details in Unravel for EMR chargeback data and cluster insights

To view the chargeback data and cluster insights for EMR clusters, you must add your AWS account details in Unravel.

  1. On the Unravel UI, click the Manage tab.

  2. Click the AWS Account Settings tab. The AWS Account Settings page is displayed.

  3. On the left, click the new-account icon.

  4. Follow the instructions provided to set up an IAM user. You can use key-based access to enable access to your Amazon account. Unravel uses the access keys to make secure requests to the AWS service API.

    You must generate an Access key ID and a Secret access key that Unravel can use to get AWS metrics.

    To set up the key access, do the following in the given sequence:

    • Create Policy

      The AWS monitoring policy defines the minimal scope of permissions that you need to give to Unravel to monitor the services running in your AWS account. Create it once and use it anytime when enabling Unravel access to your AWS account.

      1. On your Amazon console, go to Identity and Access Management (IAM).

      2. Go to Policies and click Create Policy.

      3. Select the JSON tab, and paste the following policy:

        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "pricing:GetProducts",
                        "elasticmapreduce:ListClusters",
                        "elasticmapreduce:DescribeCluster",
                        "elasticmapreduce:ListInstanceFleets",
                        "elasticmapreduce:ListInstanceGroups",
                        "elasticmapreduce:ListInstances",
                        "ec2:DescribeSpotPriceHistory"
                    ],
                    "Resource": "*"
                }
            ]
        }
      4. Provide a name for the policy and create it.

    • Create User

      1. On your Amazon console, click Users > Add User.

      2. Enter a name for the key you want to create.

      3. In Select AWS access type, select Programmatic access, and click Next: Permissions.

      4. Click Attach existing policies directly and choose the monitoring policy you defined earlier.

      5. Click Next: Review.

      6. Review the user details and click Create user.

        Store the Access Key ID name (AKID) and Secret access key values. These keys are used for setting up your Account in Unravel.

  5. Provide the following details in Step 2: AWS Account details:

    • Name: Provide a name.

    • Region: Select a region.

    • Account key: Specify the Access key ID name (AKID).

    • Secret key: Enter the Secret access key.

    • Namespace to Monitor: Select EMR as the namespace.

  6. Click Save.
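
The Create Policy and Create User steps above can also be carried out with the AWS CLI. The following sketch prints the four aws iam commands for review instead of running them. It assumes that the JSON policy shown earlier has been saved to a local file; the account ID, policy name, user name, and file name are placeholders, and print_iam_setup_commands is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Prints (does not run) the AWS CLI equivalents of the Create Policy and
# Create User steps. All argument values are supplied by the caller.
print_iam_setup_commands() {
    local account_id="$1" policy_name="$2" user_name="$3" policy_file="$4"
    # Customer-managed policies have ARNs of this form.
    local policy_arn="arn:aws:iam::${account_id}:policy/${policy_name}"
    echo "aws iam create-policy --policy-name ${policy_name} --policy-document file://${policy_file}"
    echo "aws iam create-user --user-name ${user_name}"
    echo "aws iam attach-user-policy --user-name ${user_name} --policy-arn ${policy_arn}"
    echo "aws iam create-access-key --user-name ${user_name}"
}

# Usage (all values are examples):
#   print_iam_setup_commands 123456789012 unravel-monitoring unravel-reader policy.json
```

The output of the last command contains the Access key ID and Secret access key that you enter in the Account key and Secret key fields.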