# Common keywords and error messages
Commonly searched keywords/terms and error messages organized by job type.
## Spark keywords
| Spark key term | Explanation | 
|---|---|
| Deploy mode | Specifies where the driver runs. In “cluster” mode the driver runs on the cluster. In “client” mode the driver runs on the edge node, outside of the cluster. | 
| Driver | The process that coordinates the application execution. | 
| Executor | The process launched by the application on a worker node. | 
| Resilient Distributed Dataset (RDD) | Fault-tolerant distributed dataset. | 
| spark.default.parallelism | Default number of partitions. | 
| spark.dynamicAllocation.enabled | Enables dynamic allocation in Spark. | 
| spark.executor.memory | Amount of memory to use per executor process. | 
| spark.io.compression.codec | Codec used to compress RDDs, the event log file, and broadcast variables. | 
| spark.shuffle.service.enabled | Enables the external shuffle service to preserve shuffle files even when executors are removed. It is required by dynamic allocation. | 
| spark.shuffle.spill.compress | Specifies whether to compress the shuffle files. | 
| spark.sql.shuffle.partitions | Number of partitions to use when shuffling data for joins or aggregations in Spark SQL. | 
| spark.yarn.executor.memoryOverhead | Amount of off-heap memory, in megabytes, allocated per executor when running on YARN. | 
| SparkContext | Main Spark entry point; used to create RDDs, accumulators, and broadcast variables. | 
| SparkConf | Spark configuration object. | 
| SQLContext | Main Spark SQL entry point. | 
| StreamingContext | Main Spark Streaming entry point. | 
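Several of the properties above are typically passed to `spark-submit` as `--conf key=value` pairs. The sketch below is a hypothetical helper (not part of Spark) that formats a dictionary of such settings; the values shown are illustrative, not recommendations.

```python
# Hypothetical helper: format a dict of Spark settings as spark-submit
# "--conf key=value" arguments. Property names come from the table above;
# the values are examples only.
def to_submit_args(conf: dict) -> list:
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

spark_conf = {
    "spark.executor.memory": "4g",
    "spark.yarn.executor.memoryOverhead": "512",
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",  # required by dynamic allocation
}

print("spark-submit " + " ".join(to_submit_args(spark_conf)) + " app.py")
```

Note that `spark.shuffle.service.enabled` is set alongside `spark.dynamicAllocation.enabled`, since dynamic allocation requires the external shuffle service.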
## Spark error messages
| Spark error messages | Explanation | 
|---|---|
| Container killed by YARN for exceeding memory limits. | The amount of off-heap memory was insufficient at the container level. Increase "spark.yarn.executor.memoryOverhead" to a larger value. | 
| java.io.IOException: Connection reset by peer | Generally appears in the driver logs when some of the executors fail or are shut down unexpectedly. | 
| java.lang.OutOfMemoryError | Out of memory: insufficient Java heap space at the executor or driver level. | 
| org.apache.hadoop.mapred.InvalidInputException | Input path does not exist. | 
| org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. | The Kryo serialization buffer overflowed while serializing an object; increase spark.kryoserializer.buffer.max. | 
| org.apache.spark.sql.catalyst.errors.package$TreeNodeException | Exception observed when a Spark SQL query is executed on nonexistent data. | 
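For the "Container killed by YARN for exceeding memory limits" error above, a useful reference point is the commonly documented default for `spark.yarn.executor.memoryOverhead`: 10% of executor memory, with a floor of 384 MB. The sketch below computes that default; when containers are killed, the fix is to set the property explicitly to a value above this default.

```python
# Sketch of the commonly documented default for
# spark.yarn.executor.memoryOverhead: 10% of executor memory,
# with a minimum of 384 MB. If YARN kills containers for exceeding
# memory limits, set the property explicitly to a larger value.
def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    return max(384, int(executor_memory_mb * 0.10))

print(default_memory_overhead_mb(4096))  # 4 GB executor -> 409 MB overhead
print(default_memory_overhead_mb(1024))  # 1 GB executor -> 384 MB floor
```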
## MapReduce/Hive keywords
| Key Term | Explanation | 
|---|---|
| hive.exec.parallel | Whether to execute independent stages of a Hive job in parallel. | 
| hive.exec.reducers.bytes.per.reducer | Size of input data per reducer; used to calculate the number of reducers. | 
| io.sort.mb | The total amount of buffer memory to use while sorting files, in megabytes. | 
| io.sort.record.percent | The percentage of io.sort.mb dedicated to tracking record boundaries. | 
| mapreduce.input.fileinputformat.split.maxsize | Maximum chunk size map input should be split into. | 
| mapreduce.input.fileinputformat.split.minsize | Minimum chunk size map input should be split into. | 
| mapreduce.job.reduces | Default number of reduce tasks per job. | 
| mapreduce.map.cpu.vcores | Number of virtual cores to request from the scheduler for each map task. | 
| mapreduce.map.java.opts | JVM heap size for each map task. | 
| mapreduce.map.memory.mb | The amount of memory to request from the scheduler for each map task. | 
| mapreduce.reduce.cpu.vcores | Number of virtual cores to request from the scheduler for each reduce task. | 
| mapreduce.reduce.java.opts | JVM heap size for each reduce task. | 
| mapreduce.reduce.memory.mb | The amount of memory to request from the scheduler for each reduce task. |
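A simplified sketch of how `hive.exec.reducers.bytes.per.reducer` drives reducer count: total input bytes divided by the per-reducer size, rounded up. (Real Hive also caps the result with `hive.exec.reducers.max`; that cap is omitted here for brevity.)

```python
import math

# Simplified estimate of the Hive reducer count:
# total input bytes / hive.exec.reducers.bytes.per.reducer, rounded up,
# with at least one reducer. The hive.exec.reducers.max cap is omitted.
def estimate_reducers(input_bytes: int, bytes_per_reducer: int) -> int:
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

# e.g. 10 GB of input with a 256 MB-per-reducer setting:
print(estimate_reducers(10 * 1024**3, 256 * 1024**2))  # -> 40
```

Lowering `hive.exec.reducers.bytes.per.reducer` therefore increases parallelism; raising it reduces the number of reduce tasks launched.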