# Common keywords and error messages
Commonly searched keywords/terms and error messages organized by job type.
## Spark keywords
| Spark key term | Explanation | 
|---|---|
| Deploy mode | Specifies where the driver runs. In “cluster” mode the driver runs on the cluster. In “client” mode the driver runs on the edge node, outside of the cluster. | 
| Driver | The process that coordinates the application execution. | 
| Executor | The process launched by the application on a worker node. | 
| Resilient Distributed Dataset (RDD) | Fault-tolerant distributed dataset. | 
| spark.default.parallelism | Default number of partitions. | 
| spark.dynamicAllocation.enabled | Enables dynamic allocation in Spark. | 
| spark.executor.memory | Amount of memory to use per executor process. | 
| spark.io.compression.codec | Codec used to compress RDDs, the event log file, and broadcast variables. | 
| spark.shuffle.service.enabled | Enables the external shuffle service to preserve shuffle files even when executors are removed. It is required by dynamic allocation. | 
| spark.shuffle.spill.compress | Specifies whether to compress the shuffle files. | 
| spark.sql.shuffle.partitions | Number of partitions to use when shuffling data for joins or aggregations in Spark SQL. | 
| spark.yarn.executor.memoryOverhead | Amount of off-heap memory, in megabytes, allocated per executor when running on YARN. | 
| SparkContext | Main Spark entry point; used to create RDDs, accumulators, and broadcast variables. | 
| SparkConf | Spark configuration object. | 
| SQLContext | Main Spark SQL entry point. | 
| StreamingContext | Main Spark Streaming entry point. | 
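Several of the properties above are typically passed to `spark-submit` as `--conf key=value` pairs. The sketch below is a hypothetical helper (not part of Spark) that formats a dictionary of such settings; the values shown are illustrative, not recommendations.

```python
# Hypothetical helper: format a dict of Spark settings as spark-submit
# "--conf key=value" arguments. Property names come from the table above;
# the values are examples only.
def to_submit_args(conf: dict) -> list:
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

spark_conf = {
    "spark.executor.memory": "4g",
    "spark.yarn.executor.memoryOverhead": "512",
    "spark.sql.shuffle.partitions": "200",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",  # required by dynamic allocation
}

print("spark-submit " + " ".join(to_submit_args(spark_conf)) + " app.py")
```

Note that `spark.shuffle.service.enabled` is set alongside `spark.dynamicAllocation.enabled`, since dynamic allocation requires the external shuffle service.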
## Spark error messages
| Spark error messages | Explanation | 
|---|---|
| Container killed by YARN for exceeding memory limits. | The amount of off-heap memory was insufficient at the container level. Increase "spark.yarn.executor.memoryOverhead" to a larger value. | 
| java.io.IOException: Connection reset by peer | Generally appears in the driver logs when some of the executors fail or are shut down unexpectedly. | 
| java.lang.OutOfMemoryError | Out of memory: insufficient Java heap space at the executor or driver level. | 
| org.apache.hadoop.mapred.InvalidInputException | Input path does not exist. | 
| org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. | The Kryo serialization buffer overflowed while serializing an object; increase spark.kryoserializer.buffer.max. | 
| org.apache.spark.sql.catalyst.errors.package$TreeNodeException | Exception observed when a Spark SQL query is executed on nonexistent data. | 
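For the "Container killed by YARN for exceeding memory limits" error above, a useful reference point is the commonly documented default for `spark.yarn.executor.memoryOverhead`: 10% of executor memory, with a floor of 384 MB. The sketch below computes that default; when containers are killed, the fix is to set the property explicitly to a value above this default.

```python
# Sketch of the commonly documented default for
# spark.yarn.executor.memoryOverhead: 10% of executor memory,
# with a minimum of 384 MB. If YARN kills containers for exceeding
# memory limits, set the property explicitly to a larger value.
def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    return max(384, int(executor_memory_mb * 0.10))

print(default_memory_overhead_mb(4096))  # 4 GB executor -> 409 MB overhead
print(default_memory_overhead_mb(1024))  # 1 GB executor -> 384 MB floor
```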
## MapReduce/Hive keywords
| Key Term | Explanation | 
|---|---|
| hive.exec.parallel | Whether to execute independent stages of a Hive job in parallel. | 
| hive.exec.reducers.bytes.per.reducer | Size of input data per reducer; used to calculate the number of reducers. | 
| io.sort.mb | The total amount of buffer memory to use while sorting files, in megabytes. | 
| io.sort.record.percent | The percentage of io.sort.mb dedicated to tracking record boundaries. | 
| mapreduce.input.fileinputformat.split.maxsize | Maximum chunk size map input should be split into. | 
| mapreduce.input.fileinputformat.split.minsize | Minimum chunk size map input should be split into. | 
| mapreduce.job.reduces | Default number of reduce tasks per job. | 
| mapreduce.map.cpu.vcores | Number of virtual cores to request from the scheduler for each map task. | 
| mapreduce.map.java.opts | JVM heap size for each map task. | 
| mapreduce.map.memory.mb | The amount of memory to request from the scheduler for each map task. | 
| mapreduce.reduce.cpu.vcores | Number of virtual cores to request from the scheduler for each reduce task. | 
| mapreduce.reduce.java.opts | JVM heap size for each reduce task. | 
| mapreduce.reduce.memory.mb | The amount of memory to request from the scheduler for each reduce task. |
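A simplified sketch of how `hive.exec.reducers.bytes.per.reducer` drives reducer count: total input bytes divided by the per-reducer size, rounded up. (Real Hive also caps the result with `hive.exec.reducers.max`; that cap is omitted here for brevity.)

```python
import math

# Simplified estimate of the Hive reducer count:
# total input bytes / hive.exec.reducers.bytes.per.reducer, rounded up,
# with at least one reducer. The hive.exec.reducers.max cap is omitted.
def estimate_reducers(input_bytes: int, bytes_per_reducer: int) -> int:
    return max(1, math.ceil(input_bytes / bytes_per_reducer))

# e.g. 10 GB of input with a 256 MB-per-reducer setting:
print(estimate_reducers(10 * 1024**3, 256 * 1024**2))  # -> 40
```

Lowering `hive.exec.reducers.bytes.per.reducer` therefore increases parallelism; raising it reduces the number of reduce tasks launched.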