You can add two Unravel tags (<key, value> pairs) to mark queries and jobs that belong to a particular workflow:
unravel.workflow.name: a string that represents the name of the workflow. The recommended format is TenantName-ProjectName-WorkflowName.
unravel.workflow.utctimestamp: a timestamp in yyyyMMddThhmmssZ format representing the logical time of a run of the workflow in UTC/ISO format. In a UNIX/Linux bash shell, you can get a timestamp in this format with the command substitution $(date -u '+%Y%m%dT%H%M%SZ').
Note
Do not put quotes ("") or blank spaces in or around the tag keys or values. For example:
SET unravel.workflow.name="ETL-Workflow"; [Incorrect syntax]
SET unravel.workflow.name=ETL-Workflow; [Correct syntax]
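The rules above can be checked before a job is submitted. The following is a minimal sketch, not part of Unravel: the helper names is_valid_tag_part and is_valid_utctimestamp are hypothetical, and the timestamp check only verifies the yyyyMMddThhmmssZ shape, not that the date exists on the calendar.

```shell
# Hypothetical helper: succeeds if a tag key or value is non-empty and
# contains no double quotes, single quotes, or whitespace.
is_valid_tag_part() {
  case $1 in
    ''|*[\"\'[:space:]]*) return 1 ;;
  esac
  return 0
}

# Hypothetical helper: succeeds if the argument has the shape yyyyMMddThhmmssZ
# (8 digits, "T", 6 digits, "Z"). It does not check calendar validity.
is_valid_utctimestamp() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{8}T[0-9]{6}Z$'
}

is_valid_tag_part 'ETL-Workflow'        && echo "accepted: ETL-Workflow"
is_valid_tag_part '"ETL-Workflow"'      || echo "rejected: quoted value"
is_valid_utctimestamp 20160201T000000Z  && echo "accepted: 20160201T000000Z"
```

A top-level script could call these helpers on WORKFLOW_NAME and UTC_TIME_STAMP and stop early if either check fails.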
Different runs of the same workflow have the same value for unravel.workflow.name but different values for unravel.workflow.utctimestamp.
Different workflows have different values for unravel.workflow.name.
This is a Hive query that was marked as part of the Financial-Tenant-ETL-Workflow workflow that ran on February 1, 2016:
SET unravel.workflow.name=Financial-Tenant-ETL-Workflow;
SET unravel.workflow.utctimestamp=20160201T000000Z;
SELECT foo FROM table WHERE … Your Hive Query text goes here
Export the workflow name and UTC timestamp from your top-level script that schedules each run of the workflow.
Here, we use the date command in bash to generate the timestamp.
export WORKFLOW_NAME=Financial-Tenant-ETL-Workflow
export UTC_TIME_STAMP=$(date -u '+%Y%m%dT%H%M%SZ')
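Because the timestamp represents the logical time of a run, a scheduler may want to derive it from the scheduled time rather than from the moment the script starts. The sketch below assumes a date implementation that accepts the -d option (GNU date does; BSD/macOS date uses different flags), and SCHEDULED_AT is a hypothetical variable that your scheduler would supply.

```shell
# Hypothetical input: the scheduled (logical) time of this run.
# It is interpreted as UTC because date is called with -u.
SCHEDULED_AT='2016-02-01 00:00:00'

export WORKFLOW_NAME=Financial-Tenant-ETL-Workflow
# Assumes a date command with -d support (for example GNU date).
export UTC_TIME_STAMP=$(date -u -d "$SCHEDULED_AT" '+%Y%m%dT%H%M%SZ')

echo "$WORKFLOW_NAME $UTC_TIME_STAMP"
# With GNU date this prints: Financial-Tenant-ETL-Workflow 20160201T000000Z
```

Reruns of the same scheduled run then carry the same timestamp, which keeps them distinct from other runs of the workflow.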
Follow the instructions for your job type.
hive -f hive/simple_wf.hql
In hive/simple_wf.hql:
SET unravel.workflow.name=Financial-Tenant-ETL-Workflow;
SET unravel.workflow.utctimestamp=20160201T000000Z;
SELECT foo FROM table WHERE … Your Hive Query text goes here
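The script above hard-codes both tag values. One possible way to keep them in step with the variables exported by the top-level script is to generate the two SET lines at run time. This is only a sketch under assumptions: the temporary file stands in for your own script path, and the default values exist only so the snippet runs on its own.

```shell
# Use the values exported by the top-level script; the defaults below are
# only there to make this sketch runnable in isolation.
WORKFLOW_NAME=${WORKFLOW_NAME:-Financial-Tenant-ETL-Workflow}
UTC_TIME_STAMP=${UTC_TIME_STAMP:-$(date -u '+%Y%m%dT%H%M%SZ')}

# Hypothetical location for the generated script.
HQL_FILE=$(mktemp)

# Write the two SET statements with no quotes or spaces around the values.
cat > "$HQL_FILE" <<EOF
SET unravel.workflow.name=$WORKFLOW_NAME;
SET unravel.workflow.utctimestamp=$UTC_TIME_STAMP;
EOF

# Append your query text to the file, then run: hive -f "$HQL_FILE"
cat "$HQL_FILE"
```

Generating the header keeps each run's timestamp unique without editing the .hql file by hand.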
sqoop export \
  -D"unravel.workflow.name=$WORKFLOW_NAME" -D"unravel.workflow.utctimestamp=$UTC_TIME_STAMP" \
  --connect jdbc:mysql://127.0.0.1:3316/unravel_mysql_prod --table settings -m 1 \
  --export-dir /tmp/sqoop_test --username unravel --verbose --password foobar
Note
Sqoop has bugs related to quotes.
Substitute your own input and output paths for /tmp/data/small and /tmp/outsmoke.
hadoop jar libs/ooziemr-1.0.jar com.unraveldata.mr.apps.Driver \
  -D"unravel.workflow.name=$WORKFLOW_NAME" -D"unravel.workflow.utctimestamp=$UTC_TIME_STAMP" \
  -p /wordcount.properties -input /tmp/data/small -output /tmp/outsmoke
Note
For Spark jobs, you must prefix the Unravel tags with "spark.". For example, unravel.workflow.name becomes spark.unravel.workflow.name.
spark-submit \
  --conf "spark.unravel.workflow.name=$WORKFLOW_NAME" \
  --conf "spark.unravel.workflow.utctimestamp=$UTC_TIME_STAMP" \
  --conf "spark.eventLog.enabled=true" \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --deploy-mode cluster
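The prefix rule for Spark can be expressed as a small helper. The sketch below is illustrative only: spark_tag_key is a hypothetical function name, and the loop prints the --conf arguments instead of submitting a job.

```shell
# Hypothetical helper: turns an Unravel tag (key or key=value) into its
# Spark form by adding the "spark." prefix.
spark_tag_key() {
  printf 'spark.%s\n' "$1"
}

# Defaults only make the sketch runnable without the top-level script.
WORKFLOW_NAME=${WORKFLOW_NAME:-Financial-Tenant-ETL-Workflow}
UTC_TIME_STAMP=${UTC_TIME_STAMP:-20160201T000000Z}

# Print the --conf arguments that carry the two tags; nothing is submitted.
for pair in "unravel.workflow.name=$WORKFLOW_NAME" \
            "unravel.workflow.utctimestamp=$UTC_TIME_STAMP"; do
  printf '%s "%s"\n' --conf "$(spark_tag_key "$pair")"
done
```

Keeping the prefix in one place makes it harder to forget it on one of the two tags.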
pig \
  -param WORKFLOW_NAME=$WORKFLOW_NAME -param UTC_TIME_STAMP=$UTC_TIME_STAMP \
  -x mapreduce -f pig/simple.pig
In pig/simple.pig:
SET unravel.workflow.name $WORKFLOW_NAME;
SET unravel.workflow.utctimestamp $UTC_TIME_STAMP;
lines = LOAD '/tmp/data/small' using PigStorage('|') AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
impala-shell -i <impald_host:port> \
  -f simpleImpala.sql \
  --var=workflowname='ourImpalaWorkflow' \
  --var=utctimestamp=$(date -u '+%Y%m%dT%H%M%SZ')
In simpleImpala.sql:
SET DEBUG_ACTION="::::unravel.workflow.name::${var:workflowname}::::unravel.workflow.utctimestamp::${var:utctimestamp}::::";
select * from usstates;
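To see what the DEBUG_ACTION value looks like once the two variables are substituted, the string can be assembled in the shell. This is a sketch for inspection only: impala_debug_action is a hypothetical helper, and the layout is copied from the SET statement above rather than from any Impala or Unravel specification.

```shell
# Hypothetical helper: builds the DEBUG_ACTION value from a workflow name
# and a UTC timestamp, using the layout shown in simpleImpala.sql.
impala_debug_action() {
  printf '::::unravel.workflow.name::%s::::unravel.workflow.utctimestamp::%s::::\n' "$1" "$2"
}

impala_debug_action ourImpalaWorkflow 20160201T000000Z
# prints: ::::unravel.workflow.name::ourImpalaWorkflow::::unravel.workflow.utctimestamp::20160201T000000Z::::
```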
Once your tagged workflows have run, log in to the Unravel Web UI and select Jobs > Pipeline to start exploring Unravel's Workflow Management features.