Spark
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.spark.live.pipeline.enabled Specifies whether Unravel processes the live job data coming from the sensor. true: live job data is processed as soon as it is received. false: live job data is not processed. | boolean | true | |
com.unraveldata.spark.live.pipeline.maxStoredStages Maximum number of jobs/stages stored in the DB. If an application has This setting affects only the live pipeline. When processing the event log file (after the application has completed its execution) this property is not considered. | count | 5000 | |
com.unraveldata.spark.master Default Spark master mode to use when it is not available from the sensor. Possible values: local, standalone, or yarn (default). | set member | yarn |
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.jobtime.to.apptime.ratio.threshold Defines the threshold of the ratio of jobtime to apptime and is used to determine if SQL analysis of queries should be done for an app. | percent | 0.6 | |
com.unraveldata.num.sql.queries.to.analyze This property defines the maximum number of queries to analyze in a Spark app. | count | 20 | |
spark.unravel.sql.op.timing.enabled This property is an Unravel-specific Spark configuration setting that can be defined in the spark-submit command to enable or disable the capture of SQL timing data. | boolean | true | |
com.unraveldata.sqloperator.to.query.ratio.threshold This property defines the threshold for the ratio operatorTime/queryTime and determines whether the operator is slow. | percent | 0.2 | |
com.unraveldata.gctime.to.querytime.ratio.threshold This property defines the threshold for the ratio gcTime/queryTime and determines whether the query is spending significant time in GC. | percent | 0.2 | |
com.unraveldata.query.to.app.ratio.threshold This property defines the ratio of queryTime/appTime and identifies the most significant query in the app. | percent | 0.2 |
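
As a rough illustration of how the ratio thresholds above might be applied, the following sketch compares a measured time against the time of its parent. The function name, the millisecond inputs, and the strict greater-than comparison are assumptions made for this example; how Unravel performs the comparison internally is not described here.

```python
# Hypothetical helper: the name and the ">" comparison are assumptions,
# not Unravel's implementation.
def exceeds_ratio(part_ms: float, whole_ms: float, threshold: float) -> bool:
    """Return True when part_ms / whole_ms is above the threshold."""
    if whole_ms <= 0:
        return False  # avoid division by zero for empty or unfinished runs
    return part_ms / whole_ms > threshold


# Default of com.unraveldata.sqloperator.to.query.ratio.threshold
SQLOPERATOR_TO_QUERY_THRESHOLD = 0.2

# An operator that takes 300 ms of a 1000 ms query has a ratio of 0.3,
# which is above the 0.2 default.
operator_is_slow = exceeds_ratio(300, 1000, SQLOPERATOR_TO_QUERY_THRESHOLD)
```

Note that the defaults are expressed as fractions (0.2, 0.6), so a value of 0.2 corresponds to 20% of the parent time.
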
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.spark.eventlog.location All the possible locations of the event log files. Multiple locations are supported as a comma-separated list of values. This property is used only when the Unravel sensor is not enabled. When the sensor is enabled, the event log path is taken from the application configuration at runtime. | string | | |
com.unraveldata.spark.eventlog.maxSize Maximum size of the event log file that will be processed by the Spark worker daemon. Event logs larger than | bytes | 1000000000 (~1GB) | |
com.unraveldata.spark.eventlog.appDuration.mins Maximum duration (in minutes) of an application for which the Spark event log is pulled. | min | 1440 (1 day) | |
com.unraveldata.spark.hadoopFsMulti.useFilteredFiles Specifies how to search the event log files. Prefix + suffix search is faster because it avoids the listFiles() API, which may take a long time for large directories on HDFS. This search requires that all the possible suffixes for the event log files are known. Possible suffixes are specified by com.unraveldata.spark.hadoopFsMulti.eventlog.suffixes. | boolean | false | |
com.unraveldata.spark.hadoopFsMulti.eventlog.suffixes Specifies the suffixes used for the prefix+suffix search of the event logs when com.unraveldata.spark.hadoopFsMulti.useFilteredFiles is enabled. NOTE: the empty suffix (,,) must be part of this value for uncompressed event log files. | CSL | ,,.lz4,.snappy,.inprogress | |
com.unraveldata.spark.appLoading.maxAttempts Maximum number of attempts for loading the event log file from HDFS, S3, ADL, WASB, etc. | count | 3 | |
com.unraveldata.spark.appLoading.delayForRetry Delay between consecutive retries when loading the event log files. The actual delay is not constant; it increases progressively as 2^attempt * delayForRetry. | ms | 2000 (2 s) | |
com.unraveldata.spark.tasks.inMemoryLimit Number of tasks to be kept in memory and DB per stage. All stats are calculated for all the task attempts but only the configured number of tasks will be kept in memory/DB. | count | 1000 | |
Events Related | |||
com.unraveldata.spark.events.enableCaching Enables logic for executing caching events. | boolean | false |
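
To make the retry formula 2^attempt * delayForRetry concrete, here is a minimal sketch that lists the delays for a given number of attempts. It assumes that attempts are counted from 1 and that no delay follows the final attempt; neither detail is stated in the table above, and the function name is hypothetical.

```python
# Sketch of the progressive delay 2^attempt * delayForRetry.
# Assumptions: attempts are numbered from 1, and there is no wait
# after the last attempt.
def retry_delays_ms(max_attempts: int = 3, delay_for_retry_ms: int = 2000) -> list:
    """Return the delay (in ms) applied after each failed attempt except the last."""
    return [(2 ** attempt) * delay_for_retry_ms for attempt in range(1, max_attempts)]


# With the defaults (maxAttempts=3, delayForRetry=2000 ms):
delays = retry_delays_ms()  # [4000, 8000]
```

Under these assumptions, the default settings wait 4 s after the first failed attempt and 8 s after the second.
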
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.spark.appLoading.maxConcurrentApps Number of applications for which Unravel keeps metadata in the Spark worker daemon's memory. | count | 5 | |
com.unraveldata.spark.time.histogram Specifies whether the timeline histogram is generated or not. Note: Timeline histogram generation is memory intensive. | boolean | false |
spark-defaults.conf
Property/Description | Set by user | Unit | Default |
---|---|---|---|
spark.unravel.shutdown.delay.ms Amount of time to delay shutdown so that the last messages are processed (allows the BTrace sensor to send all the data before the Spark driver exits). | ms | 300 | |
spark.unravel.live.update.interval.sec Interval, in seconds, after which live application data is updated. It allows Spark tasks to be tracked. The Spark APM updates on task completion, in addition to job start and job and stage completion. | s | 60 |
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.job.collector.running.load.conf When set to true | boolean | false | |
com.unraveldata.job.collector.hive.queries.cache.size Improves the Hive-MR pipeline by caching data so that it can be retrieved from the cache instead of an external API. You should not have to change this value. | count | 1000 | |
com.unraveldata.max.attempt.log.dir.size.in.bytes Maximum size of the aggregated executor logs that are imported and processed by the Spark worker for a successful application. | bytes | 500000000 (~500 MB) | |
com.unraveldata.max.failed.attempt.log.dir.size.in.bytes Maximum size of the aggregated executor logs that are imported and processed by the Spark worker for a failed application. | bytes | 2000000000 (~2 GB) | |
com.unraveldata.min.job.duration.for.attempt.log Minimum duration of a successful application for which executor logs are processed (in milliseconds). | ms | 600000 (10 min) | |
com.unraveldata.min.failed.job.duration.for.attempt.log Minimum duration of a failed or killed application for which executor logs are processed (in milliseconds). | ms | 60000 (1 min) | |
com.unraveldata.attempt.log.max.containers Maximum number of containers for the application. If the application has more than the configured number of containers, the aggregated executor log is not processed for the application. | count | 500 | |
com.unraveldata.spark.master Default master for Spark applications. (Used to download the executor log using the correct APIs.) Valid options: local, standalone, or yarn. | string | yarn | |
com.unraveldata.process.executor.log Set the flag to process the executor logs. | boolean | true |
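
The duration and size limits above can be read together as an eligibility check for executor-log processing. The sketch below shows one plausible reading, assuming inclusive comparisons and that both conditions must hold; the class and function names are hypothetical, and the container limit is left out because its exact effect is not clear from the table.

```python
from dataclasses import dataclass


@dataclass
class AttemptLogLimits:
    min_duration_ms: int    # minimum application duration
    max_log_dir_bytes: int  # maximum aggregated executor log size


# Defaults taken from the table above.
SUCCESSFUL = AttemptLogLimits(min_duration_ms=600_000, max_log_dir_bytes=500_000_000)
FAILED = AttemptLogLimits(min_duration_ms=60_000, max_log_dir_bytes=2_000_000_000)


def should_process_executor_log(failed: bool, duration_ms: int,
                                log_dir_bytes: int, enabled: bool = True) -> bool:
    """Assumed logic: process only when enabled and within both limits."""
    if not enabled:  # mirrors com.unraveldata.process.executor.log
        return False
    limits = FAILED if failed else SUCCESSFUL
    return (duration_ms >= limits.min_duration_ms
            and log_dir_bytes <= limits.max_log_dir_bytes)
```

With the defaults, a failed application that ran for two minutes would qualify under this reading, while a successful application of the same duration would not.
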
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.s3.profile.config.file.path The path to the s3 profile file, e.g., | string | - | |
com.unraveldata.spark.s3.profilesToBuckets Comma-separated list of profile-to-bucket mappings in the following format: <s3_profile>:<s3_bucket>, for example, com.unraveldata.spark.s3.profilesToBuckets=profile-prod:com.unraveldata.dev,profile-dev:com.unraveldata.dev. Important: Ensure that the profiles defined in the property above are actually present in the S3 properties file and that each profile has an associated pair of credentials. | CSL | - |
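
The `<s3_profile>:<s3_bucket>` list format can be checked before it is placed in the configuration. The following is a minimal sketch of such a check; the function names are hypothetical, and it does not read the S3 profile file itself, so the set of defined profiles must be supplied by the caller.

```python
def parse_profiles_to_buckets(value: str) -> dict:
    """Parse 'profile:bucket,profile:bucket' into a {profile: bucket} dict."""
    mapping = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        profile, sep, bucket = entry.partition(":")
        if not sep or not profile or not bucket:
            raise ValueError(f"expected <s3_profile>:<s3_bucket>, got {entry!r}")
        mapping[profile] = bucket
    return mapping


def missing_profiles(mapping: dict, defined_profiles: set) -> set:
    """Return profiles used in the mapping but absent from the profile file."""
    return set(mapping) - set(defined_profiles)


mapping = parse_profiles_to_buckets(
    "profile-prod:com.unraveldata.dev,profile-dev:com.unraveldata.dev"
)
```

A non-empty result from `missing_profiles` indicates a profile that is referenced in the property but has no entry, and therefore no credentials, in the profile file.
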
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.tagging.enabled Enables tagging functionality. | boolean | true | |
com.unraveldata.tagging.script.enabled Enables the tagging script. | boolean | false | |
com.unraveldata.app.tagging.script.path Specifies the path of the tagging script to use when com.unraveldata.tagging.script.enabled=true. | string (path) | /usr/local/unravel/etc/apptag.py | |
com.unraveldata.app.tagging.script.method.name The name of the method in the Python script that generates the tagging dictionary. | string | generate_unravel_tags |
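
For orientation, a tagging script could look roughly like the sketch below. Only the method name and the fact that it produces a dictionary come from the table above; the `app_obj` argument, its `queue` and `user` keys, and the tag names are hypothetical and chosen purely for illustration.

```python
# apptag.py (sketch). The app_obj argument and its "queue" and "user"
# keys are assumptions; check the actual input passed by Unravel.
def generate_unravel_tags(app_obj):
    """Return a dictionary of tag names to tag values for one application."""
    tags = {}
    queue = app_obj.get("queue")
    if queue:
        # Use the last component of a dotted queue name as the team tag.
        tags["team"] = queue.split(".")[-1]
    user = app_obj.get("user")
    if user:
        tags["owner"] = user
    return tags
```

If the method is renamed, com.unraveldata.app.tagging.script.method.name must be updated to match.
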
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.hdinsight.storage-account. Name of a storage account that an HDInsight cluster uses. You must define this property for each storage account. | Optional | string | Azure storage account name. |
com.unraveldata.hdinsight.access-key. Storage account key. Define this property for each storage account. | Optional | string | Azure storage account key. |
Property/Description | Set by user | Unit | Default |
---|---|---|---|
com.unraveldata.adl.accountFQDN The data lake's fully qualified domain name, for example, mydatalake.azuredatalakestore.net. | Optional | string | Azure storage account name. |
com.unraveldata.adl.clientId An application ID. An application registration has to be created in the Azure Active Directory. | Optional | string | Azure application id. |
com.unraveldata.adl.clientKey An application access key which can be created after registering an application. | Optional | string | Azure storage access key. |
com.unraveldata.adl.accessTokenEndpoint The OAuth 2.0 access token endpoint. It is obtained from the application registration tab on the Azure portal. | Optional | string | Azure OAuth 2.0 token endpoint |
com.unraveldata.adl.clientRootPath The path in the Data Lake Store to which the target cluster has been given access. | Optional | string URL | Azure CONTAINER/DIRECTORY path for storage account name. |