
Common keywords and error messages

Commonly searched keywords/terms and error messages organized by job type.

Spark keywords

Spark key term

Explanation

Deploy mode

Specifies where the driver runs. In "cluster" mode the driver runs on the cluster. In "client" mode the driver runs on the machine that submitted the application (for example, an edge node), outside the cluster.
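As a sketch of how the deploy mode is typically chosen at submission time (the `yarn` master and `app.py` application name are placeholder assumptions, not from this page):

```python
# Sketch: how deploy mode is typically passed to spark-submit.
# "yarn" and "app.py" are placeholder assumptions, not from this page.
def submit_command(deploy_mode: str) -> list:
    """Build a spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        # cluster: driver runs on the cluster; client: driver runs on the submitting node
        "--deploy-mode", deploy_mode,
        "app.py",
    ]
```
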

Driver

The process that coordinates the application execution.

Executor

A process launched for an application on a worker node; it runs tasks and keeps data in memory or on disk.

Resilient Distributed Dataset (RDD)

A fault-tolerant, distributed collection of elements that can be operated on in parallel.

spark.default.parallelism

Default number of partitions in RDDs returned by transformations such as join and reduceByKey when not set explicitly.

spark.dynamicAllocation.enabled

Enables dynamic allocation, which scales the number of executors up and down based on the workload.

spark.executor.memory

Amount of heap memory to use per executor process.

spark.io.compression.codec

Codec used to compress RDDs, the event log file, and broadcast variables.

spark.shuffle.service.enabled

Enables the external shuffle service to preserve shuffle files even when executors are removed. It is required by dynamic allocation.
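Because dynamic allocation depends on the external shuffle service, the two settings are typically enabled together. A minimal configuration sketch (the executor bounds are illustrative assumptions, not recommendations):

```python
# Minimal configuration sketch: dynamic allocation requires the external
# shuffle service so shuffle files survive executor removal.
# The min/max executor bounds below are illustrative, not recommendations.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",  # required by dynamic allocation
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
}
```
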

spark.shuffle.spill.compress

Specifies whether to compress data spilled to disk during shuffles.

spark.sql.shuffle.partitions

Number of partitions used when shuffling data for joins or aggregations in Spark SQL.

spark.yarn.executor.memoryOverhead

Amount of off-heap memory allocated per executor when running Spark on YARN.
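By default this overhead is 10 percent of the executor memory, with a 384 MB floor. A small sketch of that rule:

```python
def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Default spark.yarn.executor.memoryOverhead:
    max(384 MB, 10% of executor memory)."""
    return max(384, int(executor_memory_mb * 0.10))
```

For example, a 10 GB executor gets a 1 GB overhead, while a 1 GB executor falls back to the 384 MB minimum.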

SparkContext

Main Spark entry point; used to create RDDs, accumulators, and broadcast variables.

SparkConf

Spark configuration object.

SQLContext

Main Spark SQL entry point.

StreamingContext

Main Spark Streaming entry point.

Spark error messages

Error message

Explanation

Container killed by YARN for exceeding memory limits.

The amount of off-heap memory was insufficient at the container level. Increase "spark.yarn.executor.memoryOverhead" to a larger value.

java.io.IOException: Connection reset by peer

Connection reset by peer. Generally appears in the driver logs when some of the executors fail or are shut down unexpectedly.

java.lang.OutOfMemoryError

Out-of-memory error; insufficient Java heap space at the executor or driver level.

org.apache.hadoop.mapred.InvalidInputException

Input path does not exist.

org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow.

The Kryo serialization buffer overflowed; increase "spark.kryoserializer.buffer.max".
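The usual remedy is to raise the maximum Kryo buffer size. A hedged configuration sketch (the 512m value is illustrative, not a recommendation):

```python
# Raising the Kryo buffer ceiling is the usual fix for this overflow.
# The 512m value is illustrative; size it to the largest object serialized.
kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "512m",
}
```
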

org.apache.spark.sql.catalyst.errors.package$TreeNodeException

Exception observed when a Spark SQL query is executed on nonexistent data.

MapReduce/Hive keywords

Key term

Explanation

hive.exec.parallel

Determines whether to execute jobs in parallel.

hive.exec.reducers.bytes.per.reducer

Size of data processed per reducer; used to determine the number of reducers.

io.sort.mb

The total amount of buffer memory to use while sorting files, in megabytes.

io.sort.record.percent

The percentage of io.sort.mb dedicated to tracking record boundaries.
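In classic (pre-YARN) MapReduce, each record costs a fixed 16 bytes of accounting space in the sort buffer, so these two settings together bound how many records fit before a spill. A sketch of that arithmetic (the 16-byte constant is an assumption reflecting classic MapReduce behavior):

```python
def sort_buffer_record_capacity(io_sort_mb: int, record_percent: float) -> int:
    """Approximate record capacity of the map-side sort buffer.

    Assumes each record's metadata occupies 16 bytes of the accounting
    portion (classic, pre-YARN MapReduce behavior).
    """
    accounting_bytes = io_sort_mb * 1024 * 1024 * record_percent
    return int(accounting_bytes // 16)
```

With io.sort.mb at 100 and io.sort.record.percent at 0.05, roughly 327,680 records fit before the buffer forces a spill.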

mapreduce.input.fileinputformat.split.maxsize

Maximum chunk size that map input should be split into.

mapreduce.input.fileinputformat.split.minsize

Minimum chunk size that map input should be split into.

mapreduce.job.reduces

Default number of reduce tasks per job.

mapreduce.map.cpu.vcores

Number of virtual cores to request from the scheduler for each map task.

mapreduce.map.java.opts

JVM options for each map task; typically used to set the heap size (-Xmx).

mapreduce.map.memory.mb

The amount of memory to request from the scheduler for each map task.
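A common rule of thumb (an assumption here, not an official default) is to set the task JVM heap to roughly 80 percent of the container size, so that non-heap usage still fits within the YARN allocation:

```python
def map_task_memory_conf(container_mb: int, heap_fraction: float = 0.8) -> dict:
    """Pair mapreduce.map.memory.mb with a -Xmx sized below the container.

    The 80% heap fraction is a common rule of thumb, not an official default.
    """
    heap_mb = int(container_mb * heap_fraction)
    return {
        "mapreduce.map.memory.mb": str(container_mb),
        "mapreduce.map.java.opts": f"-Xmx{heap_mb}m",
    }
```

The same pairing applies to the reduce-side settings (mapreduce.reduce.memory.mb and mapreduce.reduce.java.opts).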

mapreduce.reduce.cpu.vcores

Number of virtual cores to request from the scheduler for each reduce task.

mapreduce.reduce.java.opts

JVM options for each reduce task; typically used to set the heap size (-Xmx).

mapreduce.reduce.memory.mb

The amount of memory to request from the scheduler for each reduce task.