Amazon EMR cluster setup guide

This section provides instructions to connect an Amazon EMR cluster to Unravel.

  1. Log in to AWS and navigate to Identity and Access Management (IAM).

  2. Click Policies and select Create Policy.

  3. Click JSON and paste the following code:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "pricing:GetProducts",
                    "elasticmapreduce:ListClusters",
                    "elasticmapreduce:DescribeCluster",
                    "elasticmapreduce:ListInstanceFleets",
                    "elasticmapreduce:ListInstanceGroups",
                    "elasticmapreduce:ListInstances",
                    "ec2:DescribeSpotPriceHistory",
                    "ec2:DescribeInstances",
                    "cloudwatch:GetMetricData",
                    "cloudwatch:GetMetricStatistics"
                ],
                "Resource": "*"
            }
        ]
    }

    Note

    You can also find this JSON code in Unravel: log in to an Unravel EMR instance and navigate to Unravel > Settings > AWS Account Settings > New Account.

  4. Enter a name for the policy, and then create the policy.

  5. Assign the created policy to a specific user or create a new user and attach the policy.
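The policy above can also be generated and sanity-checked before you paste it into the JSON tab. The following is a minimal sketch, not part of Unravel's tooling: the function names `build_policy_document` and `services_used` are hypothetical, and the action list is copied from the policy shown in step 3.

```python
import json

# Read-only actions copied from the policy in step 3.
UNRAVEL_ACTIONS = [
    "pricing:GetProducts",
    "elasticmapreduce:ListClusters",
    "elasticmapreduce:DescribeCluster",
    "elasticmapreduce:ListInstanceFleets",
    "elasticmapreduce:ListInstanceGroups",
    "elasticmapreduce:ListInstances",
    "ec2:DescribeSpotPriceHistory",
    "ec2:DescribeInstances",
    "cloudwatch:GetMetricData",
    "cloudwatch:GetMetricStatistics",
]


def build_policy_document(actions=UNRAVEL_ACTIONS):
    """Return the IAM policy as a JSON string (hypothetical helper)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }
    return json.dumps(policy, indent=4)


def services_used(policy_json):
    """List the AWS service prefixes the policy touches, for example 'ec2'."""
    policy = json.loads(policy_json)
    prefixes = set()
    for statement in policy["Statement"]:
        for action in statement["Action"]:
            prefixes.add(action.split(":", 1)[0])
    return sorted(prefixes)


if __name__ == "__main__":
    document = build_policy_document()
    print(document)
    print(services_used(document))
```

Parsing the document back with `json.loads` catches a malformed paste early, and the service list gives a quick view of what the policy grants access to.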

To create a new user, follow these steps:

  1. In the IAM dashboard, click Users in the left panel.

  2. Select Create a New User.

  3. In the Set Permissions section, choose Attach existing policies directly.

  4. Click Add Permissions to attach the policy to the user.

To set the security credentials, follow these steps:

  1. In the Security credentials tab for the IAM user, click Create access key.

  2. In the Access key best practices & alternatives section, select the Application Running on AWS Compute Service option.

  3. (Optional) Under Set Description Tag, enter a value for the description tag.

  4. Click Create access key to generate the access key. Save the Access Key ID and Secret Access Key securely. You need both values for further configuration in the Unravel interface.
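One way to keep the key pair out of scripts and notes while you carry it over to Unravel is to hold it in environment variables and mask it in any output. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the helper names `load_aws_keys` and `mask` are hypothetical and are not part of Unravel or AWS tooling.

```python
import os


def load_aws_keys(env=None):
    """Read the key pair from environment variables instead of source code."""
    env = os.environ if env is None else env
    access_key_id = env.get("AWS_ACCESS_KEY_ID", "")
    secret_access_key = env.get("AWS_SECRET_ACCESS_KEY", "")
    if not access_key_id or not secret_access_key:
        raise ValueError(
            "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must both be set"
        )
    return access_key_id, secret_access_key


def mask(secret, visible=4):
    """Hide all but the last few characters so the value is safe to log."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    key_id, secret = load_aws_keys()
    print("Access Key ID:", mask(key_id))
    print("Secret Access Key:", mask(secret))
```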

Next, configure Unravel to connect to your AWS account settings by following these steps.

  1. Access the Unravel Home Page and navigate to AWS Account Settings.

  2. In the AWS Account details section, enter the user name for your AWS account.

  3. Choose the AWS region you want Unravel to monitor. This is the region where your Amazon EMR cluster and other resources are located.

  4. In the Account Key field, paste the Access Key ID that you obtained from the AWS Security Credentials section.

  5. In the Secret Key field, paste the Secret Access Key associated with the Access Key ID.

  6. Click Save. You have now configured Unravel to connect to your AWS account.

To connect your Amazon EMR clusters to Unravel for monitoring and management, follow the steps listed here:

Connect a new or existing EMR cluster to Unravel

This topic explains how to set up and configure your EMR cluster so that Unravel can begin monitoring jobs running on the cluster.

Assumptions

  • The EC2 instance for Unravel is created.

  • Unravel services are running.

  • The nodes in the EMR cluster allow all traffic from the Unravel EC2 instance. This implies either of the following configurations:

  • Network ACL on VPC allows all traffic.

Connect to a new EMR cluster

Follow the steps below to run Unravel's bootstrap script, unravel_emr_bootstrap.py, on all nodes in the cluster. The bootstrap script makes the following changes:

  • On the master node:

    • On Hive clusters, it updates /lib/hive/conf/hive-site.xml

    • On Spark clusters, it updates /lib/spark/conf/spark-defaults.conf

    • It updates /lib/hadoop/etc/hadoop/mapred-site.xml

    • It updates /lib/hadoop/etc/hadoop/yarn-site.xml

    • If Tez is installed, it updates /etc/tez/conf/tez-site.xml

    • It installs and starts the unravel_es daemon in /usr/local/unravel_es

    • It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars

    • It installs the Hive Hook sensor in /usr/lib/hive/lib/.

  • On all other nodes:

    • It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars.
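After the bootstrap script has run, you can confirm on a node that the install locations listed above exist. This is a minimal sketch under stated assumptions: the helper `missing_dirs` and its `root` parameter are hypothetical (the parameter exists only so the check can be exercised against a scratch directory), and the paths are copied from the list above.

```python
import os

# Install locations taken from the list above.
MASTER_NODE_DIRS = ["/usr/local/unravel_es", "/usr/local/unravel-agent/jars"]
WORKER_NODE_DIRS = ["/usr/local/unravel-agent/jars"]


def missing_dirs(expected, root="/"):
    """Return the expected directories that do not exist under root."""
    missing = []
    for path in expected:
        candidate = os.path.join(root, path.lstrip("/"))
        if not os.path.isdir(candidate):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Run on the master node; use WORKER_NODE_DIRS on the other nodes.
    gaps = missing_dirs(MASTER_NODE_DIRS)
    print("all present" if not gaps else "missing: " + ", ".join(gaps))
```

The check only confirms that the directories exist. It does not verify that the unravel_es daemon is running or that the configuration files were updated.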

Run the following steps to connect Unravel to a new EMR cluster:

  1. Download Unravel's bootstrap script, unravel_emr_bootstrap.py.

    curl https://s3.amazonaws.com/unraveldatarepo/unravel_emr_bootstrap.py -o /tmp/unravel_emr_bootstrap.py

  2. Upload the bootstrap script to an S3 bucket.

    Permissions needed

    You need write access to the S3 bucket to which you want to upload the bootstrap script. Also, the AWS account you use to create the EMR cluster must have read access to the bootstrap script so that it can be executed.

    To upload the bootstrap script to the default EMR logging bucket, s3://aws-logs-account_number-region/elasticmapreduce, execute the following command:

    aws s3 cp /tmp/unravel_emr_bootstrap.py s3://aws-logs-account_number-region/elasticmapreduce/

  3. On the AWS console, select the EMR service and click Create cluster.

  4. On the Create Cluster - Quick Options page, click Go to advanced options.

  5. In Step 1: Software and Steps, select emr-6.2 release.

  6. In Step 2: Hardware, enter the following configuration for your EMR cluster and click Next.

    • Instance group configuration: By default, the Uniform instance groups option is selected.

    • Network and EC2 Subnet: The default configuration works if you have a virtual machine hosted in AWS in the same subnet. If your virtual machine is hosted on a different subnet in AWS, or on GCP or Azure, then the virtual machine and the cloud platform must have access to the public IP.

  7. In Step 3: General Cluster Settings, go to Bootstrap Actions > Add Bootstrap Action, select Custom action, and then click Configure and add.

  8. In the Add Bootstrap Action window, enter the following details and click Add:

    • Name: Select Custom action.

    • Script location: Specify the following bootstrap location:

        s3://unraveldatarepo/unravel_emr_bootstrap.py

    • Optional arguments: Enter the following:

        --unravel-server UNRAVEL-INSTANCE-IP --all --bootstrap

      Note

      To monitor MapReduce (MR) jobs, you must include the optional --all argument.

    The Amazon EMR cluster starts with this bootstrap action.

  9. In Step 4: Security, edit the configuration for the cluster as required. For example:

    • Choose the EC2 key pair.

    • Select the EC2 security groups. The Amazon EMR service automatically applies the additional rules that EMR nodes require.

  10. Click Create cluster. When the bootstrap process finishes, the new EMR cluster enters the Waiting state.
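If you script cluster creation instead of using the console, the settings from steps 2 and 8 can be expressed as data. The sketch below is an illustration only, since this guide describes the console workflow: it builds the default logging location and a bootstrap action in the shape that the `BootstrapActions` parameter of boto3's `run_job_flow` accepts. The helper names and the sample values (account number, region, IP address) are hypothetical, and no AWS call is made.

```python
DEFAULT_SCRIPT = "s3://unraveldatarepo/unravel_emr_bootstrap.py"


def default_emr_log_prefix(account_number, region):
    """Build the default EMR logging location named in step 2."""
    return "s3://aws-logs-{}-{}/elasticmapreduce".format(account_number, region)


def unravel_bootstrap_action(unravel_server, script_path=DEFAULT_SCRIPT,
                             monitor_mr=True):
    """Return one bootstrap action entry mirroring the settings in step 8."""
    args = ["--unravel-server", unravel_server]
    if monitor_mr:
        # --all is the optional argument needed to monitor MR jobs.
        args.append("--all")
    args.append("--bootstrap")
    return {
        "Name": "Custom action",
        "ScriptBootstrapAction": {"Path": script_path, "Args": args},
    }


if __name__ == "__main__":
    # Placeholder values; replace them with your own account details.
    print(default_emr_log_prefix("123456789012", "us-east-1"))
    print(unravel_bootstrap_action("10.0.0.12"))
```

Keeping the argument list in one function makes it harder to drop `--all` by accident when MapReduce monitoring is required.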

Connect to an existing EMR cluster

To connect the Unravel EC2 instance to an existing EMR cluster, follow the steps below to run the Unravel EMR Ansible playbook on either the EMR master node or your Mac/Linux workstation.

Important

The following process is for existing clusters that were created without the Unravel bootstrap script. Of these, only clusters that do not have auto-scaling enabled are currently supported.

Repeat the steps below to upgrade Unravel Sensors whenever you upgrade Unravel Server.

Note

If you have to run unravel_emr_bootstrap.py manually, run it with the full path of the system default Python, /usr/bin/python, as follows:

sudo /usr/bin/python unravel_emr_bootstrap.py
