Running verification scripts and benchmarks

This topic explains how to run verification tests and benchmarks after you install or upgrade Unravel Server.

Why run verification tests or benchmarks?

Verification tests highlight the value of Unravel's application performance management/analysis.
Benchmarks verify that Unravel features are working correctly.

Running verification tests

Unravel provides verification tests for Spark jobs only. Follow the instructions in the section that matches your deployment.

On your Unravel Server host, run the spark_test_via_parcel.sh script. This script runs a Spark app, which makes it a good way to verify that Unravel Server captures the data (events) the app generates once the sensor jars have been deployed to the cluster via the parcel (see Installing the Unravel Parcel on CDH+CM). You should be able to see the data generated by this Spark app in the Unravel Web UI. Substitute the hostname or LAN IP address of Unravel Server for unravel-host.

/usr/local/unravel/install_bin/spark_test_via_parcel.sh --unravel-server unravel-host
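
The command above can be wrapped so that a missing host name or a missing script is reported before anything runs. This is a minimal sketch, not part of Unravel: the function name run_parcel_test and its optional second argument are assumptions made for illustration, and only the script path and the --unravel-server option come from this topic.

```shell
# Sketch only: run_parcel_test is an illustrative helper, not an Unravel command.
SCRIPT=/usr/local/unravel/install_bin/spark_test_via_parcel.sh

run_parcel_test() {
  local host="${1:-}"
  local script="${2:-$SCRIPT}"   # second argument overrides the path (assumption)
  if [ -z "$host" ]; then
    echo "usage: run_parcel_test unravel-host" >&2
    return 2
  fi
  if [ ! -x "$script" ]; then
    echo "not found or not executable: $script" >&2
    return 1
  fi
  "$script" --unravel-server "$host"
}

# Example (replace unravel-host with your hostname or LAN IP address):
#   run_parcel_test unravel-host
```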

Note: You can run this script before you configure the Gateway Automatic Deployment of Spark Instrumentation, which instruments the Spark configuration file "spark-defaults.conf".

After you configure the Gateway Automatic Deployment of Spark Instrumentation, which instruments "spark-defaults.conf", run a SparkPi job on your Unravel Server host to verify that the sensor is installed and configured correctly:

/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode client --master yarn /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples-*.jar 1000
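
One way to confirm that the job itself finished is to look for the result line that the stock SparkPi example prints ("Pi is roughly ..."). The sketch below assumes the CDH parcel paths shown above; SPARK_HOME_DIR, submit_sparkpi, pi_reported, and the sparkpi.log file name are illustrative names. It checks only that the job ran, not that Unravel captured it, so you still need to look for the app in the Unravel UI.

```shell
# Sketch only: variable and function names are illustrative.
SPARK_HOME_DIR=/opt/cloudera/parcels/CDH/lib/spark

submit_sparkpi() {
  # The jar glob is left unquoted on purpose so the shell expands it.
  "$SPARK_HOME_DIR"/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --deploy-mode client --master yarn \
    "$SPARK_HOME_DIR"/examples/lib/spark-examples-*.jar 1000
}

pi_reported() {
  # Succeeds if stdin contains the result line printed by the SparkPi example.
  grep -q 'Pi is roughly'
}

# Example:
#   submit_sparkpi > sparkpi.log 2>&1
#   pi_reported < sparkpi.log && echo "SparkPi completed"
```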

Running benchmarks

We provide sample Spark apps that you can download from preview.unraveldata.com. These apps are useful for verifying that an upgrade is successful. Please follow the instructions below for the app you want.

Spark

Available benchmark packages

Package name	Location
Benchmarks 1.6.x	https://preview.unraveldata.com/img/spark-benchmarks1.tgz
Benchmarks 2.0.x	https://preview.unraveldata.com/img/demo-benchmarks-for-spark-2.0.tgz

The .tgz file includes everything needed to run the benchmarks including both the datasets and scripts.

Executing the benchmarks

Go to the directory where you want to download and unpack the benchmark package. Download the file, where location is the full URL of the benchmark package (see the table above) and package-name is the file name to save it as.
```
curl location -o package-name
```
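
As a sketch of the download step, the helper below derives package-name from the last path component of location, so the saved file keeps the name used in the table above. The function names package_name_for and fetch_benchmark are assumptions for illustration; the curl options -f (fail on HTTP errors) and -L (follow redirects) are additions that the original command does not use.

```shell
# Sketch only: helper names are illustrative.
package_name_for() {
  # Strip everything up to and including the last "/" of the URL.
  printf '%s\n' "${1##*/}"
}

fetch_benchmark() {
  local location="$1"
  local package_name
  package_name=$(package_name_for "$location")
  # -f: do not save an HTTP error page as the package; -L: follow redirects
  curl -fL "$location" -o "$package_name"
}

# Example:
#   fetch_benchmark https://preview.unraveldata.com/img/spark-benchmarks1.tgz
```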
Once downloaded, run md5sum on package-name to ensure it's intact.
```
md5sum package-name 
```

Confirm that the output of md5sum exactly matches the value shown below for the package you just downloaded.

ff8e56b4d5abfb0fb9f9e4a624eeb771 Md5sum for spark-benchmarks1.tgz
71198901cedeadd7f8ebcf1bb1fd9779 Md5sum for demo-benchmarks-for-spark-2.0.tgz
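
The comparison can also be scripted instead of being done by eye. This is a minimal sketch: verify_md5 is an illustrative helper name, and the expected values passed to it are the checksums listed above.

```shell
# Sketch only: verify_md5 is an illustrative helper name.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<hash>  <file>"; keep only the hash.
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual, expected $expected)" >&2
    return 1
  fi
}

# Examples, using the checksums listed above:
#   verify_md5 spark-benchmarks1.tgz ff8e56b4d5abfb0fb9f9e4a624eeb771
#   verify_md5 demo-benchmarks-for-spark-2.0.tgz 71198901cedeadd7f8ebcf1bb1fd9779
```

If the function reports a mismatch, download the package again before unpacking it.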

Uncompress the package.
```
tar -zxvf package-name 
```
After unpacking, navigate to the created directory, demo_dir.
```
cd demo_dir
ls
benchmarks/   data/
```
The benchmarks folder includes the jar of the Spark examples, the source files, and the scripts used to execute the examples.
```
ls benchmarks
README          recommendations.png             src/
lib/            scripts/                        tpch-query-instances/
```
- lib contains the compiled jar of all examples.
- scripts contains all the scripts needed to run the examples. There are two scripts for each example: ./example*.sh is the initial execution and ./example*-after.sh is the re-execution of the same example after applying Unravel's recommendations.
- src contains the source files for the driver program.
- tpch-query-instances contains the queries for a TPC-H benchmark.
Navigate to the data folder which contains the datasets used by the examples.
```
cd data
ls
DATA.BIG.2G/    tpch10g/
```

Upload the datasets to HDFS. They require 12 GB of space.

hdfs dfs -put tpch10g/ /tmp/
hdfs dfs -put DATA.BIG.2G/ /tmp/
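
The upload can be guarded by a check that both dataset directories are present in the current directory, which catches the common mistake of running the commands from the wrong folder. This is a sketch under assumptions: DATASETS, missing_datasets, and upload_datasets are illustrative names, and the final hdfs dfs -ls is an added confirmation step that the original procedure does not include.

```shell
# Sketch only: names are illustrative. Run from inside the data directory.
DATASETS="tpch10g DATA.BIG.2G"

missing_datasets() {
  # Print the name of each dataset directory that is absent under $1 (default: .)
  local dir="${1:-.}" name
  for name in $DATASETS; do
    [ -d "$dir/$name" ] || echo "$name"
  done
}

upload_datasets() {
  local name
  if [ -n "$(missing_datasets .)" ]; then
    echo "dataset directories not found; run this from the data directory" >&2
    return 1
  fi
  for name in $DATASETS; do
    hdfs dfs -put "$name/" /tmp/ || return 1
  done
  # List the uploaded directories to confirm they arrived.
  hdfs dfs -ls /tmp/tpch10g /tmp/DATA.BIG.2G
}

# Example:
#   upload_datasets
```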

Go to the benchmarks/scripts folder and execute the benchmark script, where script-number is the number of the example you wish to run.
```
./examplescript-number.sh
```
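
Because every example follows the same naming pattern (for instance example5.sh and example5-after.sh), the script name can be built from the example number. The helpers below are a sketch; example_script and run_example are illustrative names, and they assume the current directory is the scripts folder.

```shell
# Sketch only: helper names are illustrative.
example_script() {
  # $1 = example number, $2 = "after" for the re-run with recommendations applied
  if [ "${2:-}" = "after" ]; then
    printf './example%s-after.sh\n' "$1"
  else
    printf './example%s.sh\n' "$1"
  fi
}

run_example() {
  local script
  script=$(example_script "$1" "${2:-}")
  if [ ! -x "$script" ]; then
    echo "no such script: $script" >&2
    return 1
  fi
  "$script"
}

# Examples:
#   run_example 1          # initial execution
#   run_example 1 after    # re-execution after editing the -after script
```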
When you issue the example script, the application metadata is displayed. Use the app id listed in the metadata to locate the app in the Unravel UI. After the run, Unravel's recommendations are shown on the application page.
Recommendations are deployment specific, so you need to edit the Spark properties in the matching examplescript-number-after.sh script as suggested in the Recommendations tab of the Unravel UI.
The categories of recommendations and insights are:
- actionable recommendations (examples 1, 2, and 5)*
- Spark SQL (example 3)
- error view and insights for failed applications (example 4)
- recommendations for caching (example 5)
*If you are running Benchmarks 2.0.x, example 6 also produces an actionable recommendation.
Execute the edited *-after script, which now includes the Spark configuration properties suggested in the Recommendations tab of the Unravel UI.
```
./examplescript-number-after.sh
```
After running the sample, check whether the re-execution of the script improved the performance or resource efficiency of the application. You can also check the Program and the Execution Graph tabs in the Unravel UI. Click an RDD in the Execution Graph to see the corresponding line of code in the app.
Example 5
To run this script, you must enable insights for caching. They are disabled by default because they consume additional heap from the memory allocation of the Spark worker daemon. Enable them only if you expect caching to improve the performance of your Spark application.
Add the following property to /usr/local/unravel/etc/unravel.properties on the Unravel server node:
```
com.unraveldata.spark.events.enableCaching=true
```
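
If you prefer to script the change, a small helper can update the property in place when it already exists and append it otherwise, which avoids duplicate entries when you later switch it back to false. This is a sketch, not an Unravel tool: set_property is an illustrative name, sed -i is used in its GNU form, and the file is assumed to end with a newline. You may need sudo to write to the properties file.

```shell
# Sketch only: set_property is an illustrative helper name.
set_property() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    # Key already present: rewrite its line in place (GNU sed syntax).
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Key absent: append it as a new line.
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Example:
#   set_property /usr/local/unravel/etc/unravel.properties \
#     com.unraveldata.spark.events.enableCaching true
```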
Restart the Spark worker daemon.
```
sudo /etc/init.d/unravel_sw_1 restart
```
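Repeat steps 9 through 12 above for example 5: run example5.sh, review its recommendations, edit example5-after.sh accordingly, and run example5-after.sh.
Once you have finished running example5.sh and example5-after.sh, reset the caching insight option to false.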
```
com.unraveldata.spark.events.enableCaching=false
```
Restart the Spark worker daemon.
```
sudo /etc/init.d/unravel_sw_1 restart
```

Benchmarks for 1.6.x

Example	Description	Demonstrates
example1	A Scala-based application which generates its input and applies multiple transformations to the generated data.	How Unravel helps select the number of partitions and container sizes for best performance, e.g., increasing the number of partitions and reducing per-container memory resources.
example2	A Scala-based application which generates its input and applies multiple transformations to the generated data.	How Unravel helps select the container sizes for best performance, in other words, reducing per-container memory resources.
example3	A Scala program containing a SparkSQL query. The program runs TPC-H Query #9 on a 10 GB database.	How Unravel helps select the number of executors for best performance when dynamic allocation is disabled. For example, increasing the number of executors.
example4	A Scala-based application. This application generates its input and applies multiple transformations to the generated data.	How Unravel helps to root-cause a failed application. For example, failure-related insights and Error View when the application runs out of memory.
example5	A Scala-based application. The application runs on an input of 2 GB and applies multiple join and co-group transformations to the input data. Certain RDDs are evaluated multiple times. Prerequisite: add the property com.unraveldata.spark.events.enableCaching=true to the `unravel.properties` file to enable insights for caching. This property is disabled by default because it consumes additional heap from the memory allocation of the Spark worker daemon. Enable it only if you want to use caching-related insights to tune the performance of Spark applications.	Unravel's insights for caching, which show the caching opportunities within the application, i.e., where in the program to use persist() to cache the corresponding RDD. In this example, dynamic allocation is disabled.

Benchmarks for 2.0.x

Example	Description	Demonstrates
example1	See example1 in Benchmarks for 1.6.x.
example2	A Scala-based application which generates its input and applies multiple transformations to the generated data, including the coalesce transformation, which reduces the level of parallelism to a suboptimal value.	How Unravel helps select the number of partitions and container sizes for best performance of a Spark application. For example, by increasing the number of partitions.
example3 - example5	See example3 - example5 in Benchmarks for 1.6.x.
example6	A Scala-based Spark application which generates its input and applies multiple transformations to the generated data.	How Unravel helps select the container sizes for best performance of a Spark application, e.g., reducing the memory requirements per executor.
