Unravel Spark pipeline and Table worker benchmarking

The Unravel 4.7.9.1 release focuses on scaling the performance of the Spark pipeline and Table workers. Extensive benchmark testing was conducted to compare the performance of the Spark worker and Table worker between versions 4.7.5.0 and 4.7.9.1. Based on this analysis, this document presents the results and the recommended settings for 4.7.9.1.

  • Spark worker

    The Spark worker was tested for the following:

    • Average message processing rate per consumer, that is, the number of Kafka messages processed per second

    • Average percentage of CPU utilization

    • Total memory allocation

    A comparative study was also done between Unravel 4.7.5.0 and 4.7.9.1 versions. This study used G1GC (Garbage First Garbage Collector) and ZGC (Z Garbage Collector) configurations for different JAVA versions. The study analyzed the extent of performance improvement that can be achieved with and without these configurations for Unravel versions 4.7.5.0 and 4.7.9.1.

    Based on the findings, Unravel recommends node sizing, database sizing, and GC settings to help you achieve enhanced performance.

  • Table worker

    The Table worker daemon, responsible for populating the Data page, was tested for the following with polling batch sizes of 1 million, 500 K, and 100 K against tables and partitions metadata.

    • Polling and processing time

    • Memory utilization

    Based on the findings, Unravel provides recommendations for the polling limits of tables and partitions that are best suited for achieving the desired polling and processing time as well as max memory consumption.

Spark worker

In this section, the Spark worker performance results are presented in terms of processing rate, CPU utilization, and memory allocation. These results were generated with simulated data without event log loading. The performance benchmark testing was done in a standalone (on-prem) environment, multi-cluster (on-prem) environment, and Databricks (cloud) environment.

spark-per-processing-rate-per-second.gif
cpu-utilization-memory-allocation.gif

The following configurations were used to survey the Spark worker performance:

  • G1GC configuration in a standalone (on-prem) environment with custom JDK JAVA 8.

  • G1GC and ZGC configurations in a multi-cluster (on-prem) environment with default JRE 17.

  • G1GC and ZGC configurations in a Databricks (cloud) environment with default JRE 17.
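The GC configurations above map to standard JVM flags. As an illustrative sketch only (how these flags are passed to the Unravel daemons is deployment-specific and not shown here):

```properties
# Standard HotSpot collector flags (illustrative, not Unravel-specific settings):
# G1GC - the default collector on JRE 9+, but must be selected explicitly on JAVA 8
-XX:+UseG1GC
# ZGC - available as a production collector from JDK 15 onward
-XX:+UseZGC
```

Only one collector flag can be active per JVM; the benchmark compares runs with one or the other enabled.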

| Configurations | Standalone (on-prem) environment | Multi-cluster (on-prem) environment | Databricks (cloud) environment |
| --- | --- | --- | --- |
| JAVA version | Custom JDK - JAVA 8 | Default JRE 17 | Default JRE 17 |
| Setup | Single node setup with 1 Spark worker and 8 consumers. Insights worker with a default memory of 6 GB. | Single node setup with 1 Spark worker and 8 consumers. Insights worker with a default memory of 6 GB. | Single node setup with 1 Spark worker and 32 consumers. Insights worker with a default memory of 6 GB. |
| Maximum allocated memory | 16 GB | 16 GB | 54 GB |
| Unravel node sizing | CPU: 8 cores. Memory: 64 GB. Disk space: 500 GB. | CPU: 8 cores. Memory: 64 GB. Disk space: 500 GB. | CPU: 32 cores. Memory: 125 GB. Disk space: 1 TB. |
| Cluster type | CDH | 4 edge nodes (2 CDH and 2 CDP clusters) | Azure Databricks |
| Database | External MySQL 8.0.33. Tuning parameters: max_connections - 3000, innodb_buffer_pool_size - 12 GB, query_cache_size - 1 MB. | External MariaDB 10.5. Tuning parameters: max_connections - 3000, innodb_buffer_pool_size - 12 GB, query_cache_size - 1 MB. | Azure MySQL 5.7. Parameters (dynamic): max_connections - 2731, query_cache_size - 0, IOPS - 4000, innodb_buffer_pool_size - 25.7 GB. Instance: 8 cores, 32 GB memory, 1479 GB storage. |
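The on-prem (MySQL/MariaDB) tuning parameters listed for the Unravel database can be applied in the database server's configuration file. A sketch under the benchmark's values (the file location varies by distribution):

```ini
[mysqld]
max_connections         = 3000
innodb_buffer_pool_size = 12G
# Note: the query cache was removed in MySQL 8.0, so this setting
# applies only to MariaDB 10.5 (where the cache is off by default).
query_cache_size        = 1M
```

On Azure MySQL, the dynamic parameters shown above are managed through the server parameter settings rather than a configuration file.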

Spark worker performance results

The following graphs present the Spark worker performance results in terms of message processing rate (the number of Kafka messages processed per second), CPU utilization, and memory allocation.

Note

These results are generated with com.unraveldata.process.event.log set to false, which means the results were generated without event log loading. Setting this parameter to true may result in a slight deviation in the performance numbers.
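For reference, the parameter named in the note is a simple boolean property. A key-value sketch (the exact file or mechanism used to set Unravel properties is deployment-specific):

```properties
# Disable event log loading for the benchmark runs.
# Setting this to true may slightly lower the measured throughput.
com.unraveldata.process.event.log=false
```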

The results showed that an average of 40 to 65 mps processing rate per consumer could be expected even with the event log loading.

The average CPU utilization was 25% to 30% against a total allocated memory of 16 GB for on-prem (standalone and multi-cluster) environments. For Databricks, the average CPU utilization was 10% against an allocated memory of 54 GB.

Findings

  • The Standalone environment (custom JDK JAVA 8), with the G1GC configuration, showed an average processing rate of 70 mps per consumer.

  • The multi-cluster environment (default JRE 17), with the ZGC configuration, showed an average processing rate of 80 mps, an improvement over the 65 mps average of the G1GC configuration.

  • The Databricks environment (default JRE 17), with the ZGC configuration, showed an average processing rate of 45 mps, an improvement over the 40 mps average of the G1GC configuration.

Comparative analysis of Spark worker performance in 4.7.9.1 vs. 4.7.5.0

A comparative analysis was conducted to study the per-consumer processing rate of the Spark worker in the standalone (on-prem) environment with custom JDK - JAVA 8. The following image shows the Spark worker performance in Unravel version 4.7.9.1 versus the performance in version 4.7.5.0.

spark-performance-compare-4791-vs-4750-hotfix.gif

Upgrading from Unravel version 4.7.5.0 to 4.7.9.1 with the G1GC configuration can achieve a 212% improvement in the average processing rate for the Spark worker. The average CPU utilization is reduced to 20-25%, and the average memory allocation is reduced to 16 GB for 8 consumers in a single Spark worker. The results are presented in the following tables:

| Configurations | Average processing rate | Improvement over 4.7.5.0 without G1GC |
| --- | --- | --- |
| 4.7.5.0 without G1GC | 25 mps | Baseline |
| 4.7.5.0 with G1GC | 40 mps | 60% |
| 4.7.9.1 without G1GC | 55 mps | 120% |
| 4.7.9.1 with G1GC | 78 mps | 212% |
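All improvement percentages are relative to the 25 mps baseline (4.7.5.0 without G1GC). A quick check of the arithmetic:

```python
def improvement_pct(baseline_mps: int, new_mps: int) -> int:
    """Percentage improvement of a processing rate over a baseline, rounded to whole percent."""
    return round((new_mps - baseline_mps) / baseline_mps * 100)

BASELINE = 25  # mps, 4.7.5.0 without G1GC

print(improvement_pct(BASELINE, 40))  # 4.7.5.0 with G1GC    -> 60
print(improvement_pct(BASELINE, 55))  # 4.7.9.1 without G1GC -> 120
print(improvement_pct(BASELINE, 78))  # 4.7.9.1 with G1GC    -> 212
```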

Findings

  • Upgrading from Unravel version 4.7.5.0 to Unravel version 4.7.9.1 yields a 120% improvement in the average message processing rate, per consumer, per second for the Spark worker.

  • Upgrading from Unravel version 4.7.5.0 to Unravel version 4.7.9.1 and applying the G1GC configurations yields a 212% improvement in the average message processing rate, per consumer, per second.

| Configurations | Average CPU utilization |
| --- | --- |
| 4.7.5.0 without G1GC | 40% |
| 4.7.5.0 with G1GC | 30% |
| 4.7.9.1 without G1GC | 23% |
| 4.7.9.1 with G1GC | 25% |

Findings

After upgrading from Unravel version 4.7.5.0 to 4.7.9.1, the CPU utilization is between 20 and 25 percent, with or without the G1GC configurations. With 4.7.5.0, with or without the G1GC configurations, the CPU utilization is between 30 and 40 percent.

| Configurations | Memory allocation |
| --- | --- |
| 4.7.5.0 without G1GC | 48 GB allocated for 8 Spark workers (6 GB per worker) |
| 4.7.5.0 with G1GC | 48 GB allocated for 8 Spark workers (6 GB per worker) |
| 4.7.9.1 without G1GC | 16 GB allocated for 8 consumers in a single Spark worker |
| 4.7.9.1 with G1GC | 16 GB allocated for 8 consumers in a single Spark worker |

Findings

Memory allocation for the Spark worker improves from 48 GB to 16 GB after upgrading to 4.7.9.1.

Learnings

  • For JRE version 8, using G1GC settings resulted in better performance.

  • JRE 17 already runs with G1GC settings by default. For JRE version 17, using ZGC settings resulted in better performance over G1GC.

Recommendations

Unravel recommends the following node sizing, DB sizing, and GC settings.

Node sizing recommendations

  • Memory: 128 GB VM

  • CPU: 16-24 cores

  • Disk: Approximately 2 TB for storing 6 months of data

DB sizing recommendations

  • Database: Use MySQL 8.0 or MariaDB 10.X for enhanced performance.

  • For the DB instance, Unravel recommends using a dedicated node separate from the Unravel node.

  • Disk: Approximately 2 TB for storing 6 months of data

GC settings recommendations

  • For JRE version 8, use G1GC settings (now enabled by default) for better performance.

  • For JRE versions 15 and above, use ZGC settings for better performance.

Table worker

A benchmark test was conducted to ensure the Table worker daemon was functioning efficiently with extensive scaling. The scaling test was done with Hive metastore using 2 million tables and 17 million partitions.

Two separate multi-cluster setups were used to measure the Table worker performance.

| Configurations | Multi-cluster setup 1 | Multi-cluster setup 2 |
| --- | --- | --- |
| Setup | Single node setup with the Table worker configured with 10 GB memory (the default) and 8 consumers in the Spark worker. The rest of the daemons run with their corresponding default configurations. | Single node setup with the Table worker configured with 10 GB memory and 8 consumers in the Spark worker. |
| Core node | Machines with 8 cores and 64 GB RAM | Machines with 8 cores and 64 GB RAM |
| Cluster type | CDP | CDH |
| Edge nodes | 3 | 3 |
| DB configurations | Unravel DB: MySQL 8.0.33, max_connections - 500, innodb_buffer_pool_size - 4 GB. Hive Metastore: MySQL 8.0.33 running on a dedicated node. | Unravel DB: MySQL 8.0.33, max_connections - 500, innodb_buffer_pool_size - 4 GB. Hive Metastore: MySQL 8.0.33 running on a dedicated node. |

The results for collective polling/processing time and memory utilization were derived using the following scenarios with batch sizes of 1 million, 500 K, and 100 K.

  • Scenario 1: Total tables: 110.6K, Total partitions: 16.5M

  • Scenario 2: Total tables: 1.6M, Total partitions: 7M

  • Scenario 3: Total tables: 9135, Total partitions: 15M

  • Scenario 4: Total tables: 74.9K, Total partitions: 27M

The results are listed in the following table:

table-worker-results.png
Recommendations

  • The Table worker should stay within a maximum memory utilization of 6 GB for any tables and partitions count, which is approximately 60% of the default total allocated memory of 10 GB.

  • Unravel recommends keeping the polling batch size at 1 million, the default batch size.

  • Configuration properties are available to modify the batch size for polling the table and partitions metadata based on the number of tables and partitions available in your Hive metastore. You can set different polling limits for tables and partition metadata.

  • Polling and processing time is inversely proportional to the polling limit, whereas the maximum memory consumption is directly proportional to it. If you modify the batch size below or above the default 1 million, you trade off polling and processing time against memory utilization.

    Note

    When the number of partition records in the Hive Metastore exceeds 27 million, there is a possibility that the data page may not display data in all the KPIs.
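The trade-off above follows from the number of polling round-trips: a smaller polling limit means more batches (longer polling time) but a smaller working set per batch (lower peak memory). A small sketch using the partition counts from the scenarios above:

```python
import math

def polling_batches(record_count: int, batch_size: int) -> int:
    """Number of polling round-trips needed to cover all metadata records."""
    return math.ceil(record_count / batch_size)

# Scenario 1: 16.5M partitions with the default 1 million polling limit
print(polling_batches(16_500_000, 1_000_000))  # -> 17
# Dropping the limit to 100 K multiplies the round-trips
print(polling_batches(16_500_000, 100_000))    # -> 165
```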