Install Unravel in Amazon Elastic MapReduce (EMR)
Before installing Unravel on EMR, make sure that the Unravel installation requirements are met, and then follow these steps to install and configure Unravel:
1. Create and configure an EC2 instance
2. Download Unravel
3. Deploy Unravel binaries
4. Install Unravel
5. Connect a new or existing EMR cluster to Unravel
6. Add AWS account details in Unravel for EMR chargeback data and cluster insights
1. Create and configure an EC2 instance
Run the following steps to create an EC2 instance:
Run the following steps to configure the EC2 instance:
Disable selinux.
sudo setenforce Permissive
Edit /etc/selinux/config and make sure that SELINUX=permissive is set, so that the setting persists after reboot.
sudo vi /etc/selinux/config
Install libaio.x86_64, lzop.x86_64, and ntp.x86_64.
sudo yum install -y libaio.x86_64   # Only required if you use Unravel managed MySQL
sudo yum install -y lzop.x86_64
sudo yum install -y ntp.x86_64
Start ntpd and check the system time.
sudo service ntpd start
sudo ntpq -p
Create a new user named hadoop.
sudo useradd hadoop
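The configuration steps above can be collected into one script. The following is a minimal sketch, not an official Unravel script: the function names set_selinux_permissive and configure_unravel_host are hypothetical, and the sketch assumes a yum-based image and that the script is run as root (for example through sudo), which is why the individual commands omit sudo. The sed edit is an assumed replacement for the manual vi step; everything else repeats the commands listed above.

```shell
#!/bin/sh
# Sketch only: function names are illustrative and not part of Unravel.

# Rewrite the SELINUX= line of an SELinux config file to "permissive".
# Goes through a temporary file so it works with both GNU and BSD sed,
# and writes back with cat so the original file's permissions are kept.
set_selinux_permissive() {
    config_file="$1"
    tmp_file="${config_file}.tmp.$$"
    sed 's/^SELINUX=.*/SELINUX=permissive/' "$config_file" > "$tmp_file" &&
        cat "$tmp_file" > "$config_file" &&
        rm -f "$tmp_file"
}

# Run the configuration steps in order, stopping at the first failure.
# Assumes root privileges.
configure_unravel_host() {
    setenforce Permissive &&
        set_selinux_permissive /etc/selinux/config &&
        yum install -y libaio.x86_64 &&   # only needed for Unravel managed MySQL
        yum install -y lzop.x86_64 &&
        yum install -y ntp.x86_64 &&
        service ntpd start &&
        ntpq -p &&
        useradd hadoop
}

# Uncomment to apply the changes when running this file as root:
# configure_unravel_host
```

The call is left commented out so that the file can be sourced and reviewed without changing the host.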
2. Download Unravel
3. Deploy Unravel binaries
4. Install Unravel
You can install Unravel with the Interactive Precheck utility. When you run Interactive Precheck, it generates a bootstrap configuration file that is used for the installation.
You can also install Unravel manually. Refer to Installing Unravel manually.
5. Connect a new or existing EMR cluster to Unravel
This topic explains how to set up and configure your EMR cluster, so Unravel can begin monitoring jobs running on the cluster.
Assumptions
The EC2 instance for Unravel is created.
Unravel services are running.
The nodes in the EMR cluster allow all traffic from the Unravel EC2 instance. This implies either of the following configurations:
The EMR cluster and Unravel EC2 instance are on the same VPC and same subnet, and their security group allows all traffic from the same subnet.
The EMR cluster is on a different VPC, and you've configured VPC peering, route table creation, and updated your security policy.
Network ACL on VPC allows all traffic.
Warning
If you encounter any EMR cluster configuration issues, see the Troubleshooting guide to resolve the issues.
Follow the steps below to run Unravel's bootstrap script, unravel_emr_bootstrap.py, on all nodes in the cluster. The bootstrap script makes the following changes:
On the master node:
On Hive clusters, it updates /lib/hive/conf/hive-site.xml.
On Spark clusters, it updates /lib/spark/conf/spark-defaults.conf.
It updates /lib/hadoop/etc/hadoop/mapred-site.xml.
It updates /lib/hadoop/etc/hadoop/yarn-site.xml.
If Tez is installed, it updates /etc/tez/conf/tez-site.xml.
It installs and starts the unravel_es daemon in /usr/local/unravel_es.
It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars.
It installs the Hive Hook sensor in /usr/lib/hive/lib/.
On all other nodes:
It installs the Spark and MapReduce sensors in /usr/local/unravel-agent/jars.
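One way to confirm what the bootstrap script installed on a node is to check for the directories listed above. The sketch below is not part of Unravel: report_paths is a hypothetical helper, and the example paths are copied from the list above, so adjust them if your cluster layout differs.

```shell
#!/bin/sh
# Sketch only: report_paths is an illustrative helper, not an Unravel tool.

# Print "present" or "missing" for each path given as an argument.
# Returns non-zero if at least one path is missing.
report_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf 'present  %s\n' "$p"
        else
            printf 'missing  %s\n' "$p"
            missing=1
        fi
    done
    return "$missing"
}

# Example for a master node (commented out so sourcing has no side effects):
# report_paths /usr/local/unravel_es /usr/local/unravel-agent/jars /usr/lib/hive/lib/

# Example for any other node:
# report_paths /usr/local/unravel-agent/jars
```

The check only looks at whether a path exists. It does not verify the contents of the updated configuration files.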
Run the following steps to connect Unravel to a new EMR cluster:
Download Unravel's bootstrap script, unravel_emr_bootstrap.py.
curl https://s3.amazonaws.com/unraveldatarepo/unravel_emr_bootstrap.py -o /tmp/unravel_emr_bootstrap.py
Upload the bootstrap script to an S3 bucket.
Permissions needed
You need write access to the S3 bucket to which you want to upload the bootstrap script. Also, the AWS account that you use to create the EMR cluster must have read access to the bootstrap script so that it can execute the script's directives.
To upload the bootstrap script to the default EMR logging bucket, s3://aws-logs-account_number-region/elasticmapreduce, execute the following command:
aws s3 cp unravel_emr_bootstrap.py s3://aws-logs-account_number-region/elasticmapreduce
On the AWS console, select the EMR service and click Create cluster.
On the Create Cluster - Quick Options page, click Go to advanced options.
In Step 1: Software and Steps, select emr-6.2 release.
In Step 2: Hardware, enter the following configuration for your EMR cluster and click Next.
Settings | Action
Instance group configuration | By default, the Uniform instance groups option is selected.
Network, EC2 Subnet | The default configuration works for Network and EC2 Subnet if you have a virtual machine hosted in AWS in the same subnet. If your virtual machine is hosted on a different subnet in AWS, or on GCP or Azure, then the virtual machine and the cloud platform must have access to the public IP.
In Step 3: General Cluster Settings, go to Bootstrap Actions > Add Bootstrap Action, select Custom Action, and then click Configure and add.
In the Add Bootstrap Action window, enter the following details and click Add:
Settings | Action
Name | Select Custom action.
Script location | Specify the following bootstrap location: s3://unraveldatarepo/unravel_emr_bootstrap.py
Optional arguments | Enter the following: --unravel-server UNRAVEL-INSTANCE-IP --all --bootstrap
Note
If you want to monitor MR jobs, you must pass the additional optional argument --all.
The Amazon EMR cluster starts with this bootstrap action.
In Step 4: Security, edit the configuration for the cluster as required. For example:
Choose the EC2 key pair.
Select the EC2 security groups. AWS EMR service automatically applies additional rules that are required for EMR nodes.
Click Create cluster. When the bootstrap process finishes, your new EMR cluster enters the Waiting state.
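The console steps above can also be scripted with the AWS CLI. The following is a hedged sketch rather than an Unravel-documented procedure: print_create_cluster_cmd is a hypothetical helper that only prints an aws emr create-cluster command for review. The cluster name, application list, instance type, instance count, bootstrap action name, and the emr-6.2.0 release label are assumptions to replace with your own values. The bootstrap script location and its arguments are the ones given in the bootstrap action settings above.

```shell
#!/bin/sh
# Sketch only: prints (does not run) an "aws emr create-cluster" command
# that attaches the Unravel bootstrap action. Every value except the script
# path and its arguments is a placeholder.
print_create_cluster_cmd() {
    unravel_ip="$1"   # IP address of the Unravel EC2 instance
    key_name="$2"     # EC2 key pair name
    subnet_id="$3"    # subnet that can reach the Unravel instance
    cat <<EOF
aws emr create-cluster \\
  --name "unravel-monitored-cluster" \\
  --release-label emr-6.2.0 \\
  --applications Name=Hadoop Name=Hive Name=Spark \\
  --instance-type m5.xlarge \\
  --instance-count 3 \\
  --use-default-roles \\
  --ec2-attributes KeyName=${key_name},SubnetId=${subnet_id} \\
  --bootstrap-actions Path=s3://unraveldatarepo/unravel_emr_bootstrap.py,Name=unravel-bootstrap,Args=[--unravel-server,${unravel_ip},--all,--bootstrap]
EOF
}

# Example (placeholder values):
# print_create_cluster_cmd 10.0.0.5 my-key-pair subnet-0123456789abcdef0
```

Printing the command instead of running it lets you review the values first. Paste the output into a shell that has AWS CLI credentials configured when you are ready to create the cluster.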
To connect the Unravel EC2 instance to an existing EMR cluster, follow the steps below to run the Unravel EMR Ansible playbook on either the EMR master node or your Mac/Linux workstation.
Important
The following process applies to existing clusters that were created without the Unravel bootstrap script. Of these, only clusters that do not have auto-scaling enabled are currently supported.
Repeat the steps below to upgrade Unravel Sensors whenever you upgrade Unravel Server.
Note
If you have to run unravel_emr_bootstrap.py manually, you must run it with the full path of the system default Python, that is, /usr/bin/python, as follows:
sudo /usr/bin/python unravel_emr_bootstrap.py
6. Add AWS account details in Unravel
Important
This is a mandatory step. If you do not add the AWS account details in Unravel, you cannot view the EMR clusters.
Configuring CloudWatch agent
You can configure the CloudWatch agent for Unravel monitoring of your EMR clusters.
If you already have the CloudWatch agent set up in your environment, you must add metrics and dimensions to the CloudWatch agent configuration file.
If you do not have the CloudWatch agent set up, you can run the Unravel bootstrap script, which sets up the CloudWatch agent in your environment.
If CloudWatch agent setup is available in your environment
Add the following metrics and dimensions to the CloudWatch agent configuration file:
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceId": "${aws:InstanceId}",
      "InstanceType": "${aws:InstanceType}",
      "InstanceName": "${aws:InstanceName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 60
      },
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_iowait",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "metrics_collection_interval": 60,
        "resources": [
          "*"
        ],
        "totalcpu": false
      }
    }
  }
}
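The agent configuration file must be valid JSON, which means, for example, that booleans are written in lowercase as true and false. It can therefore help to check the file after editing it. The sketch below is an assumption and not an Unravel or AWS tool: check_cwa_config is a hypothetical helper that parses the file with Python's standard json.tool module when python3 is available, and otherwise falls back to a rough scan for Python-style literals.

```shell
#!/bin/sh
# Sketch only: check_cwa_config is an illustrative helper.
# Returns 0 when the file passes the check, non-zero otherwise.
check_cwa_config() {
    config_file="$1"
    # A missing or unreadable file never passes.
    [ -r "$config_file" ] || return 1
    if command -v python3 >/dev/null 2>&1; then
        # Full parse using Python's standard-library JSON tool.
        python3 -m json.tool "$config_file" >/dev/null 2>&1
    else
        # Fallback heuristic: reject Python-style literals, which JSON
        # does not allow. This does not catch other syntax errors.
        ! grep -qwE 'True|False|None' "$config_file"
    fi
}

# Example (the file path is a placeholder):
# check_cwa_config ./cloudwatch-agent-config.json && echo "config looks valid"
```

The fallback is only a heuristic, so prefer a real JSON parser whenever one is installed.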
If CloudWatch agent setup is NOT available in your environment
The CloudWatch agent is packaged as part of the Unravel bootstrap script. To enable the setup of the CloudWatch agent, add the following optional argument to the bootstrap action:
Edit the bootstrap action and provide the following location to the bootstrap script:
s3://unraveldatarepo/unravel_emr_bootstrap.py
In the Optional arguments text box, add one of the following arguments:
-cwa or --cloud-watch-agent
The Unravel bootstrap script then automatically deploys the CloudWatch agent to all the EMR cluster nodes.