How to run parallel tasks in Oozie

I have a wrapper script in HDFS, and I have scheduled it in Oozie with the following workflow.

Workflow:

<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
<start to="shell-8f63"/>
<kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-8f63">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>
    <ok to="End"/>
    <error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>


job.properties:

nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx 
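
The ${input_file} parameter referenced by the workflow has to be supplied at submission time, for example by adding a line like input_file=tableA to job.properties; a typical submission then looks like this (the Oozie server URL is a placeholder):

oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run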


args file (one table name per line):

tableA
tableB
tablec
tableD


The shell script currently works for a single job name from the args file. How can I schedule this shell script to run in parallel?

I want the script to run for 10 jobs at the same time.

What are the steps required to do this, and what changes should be made to the workflow?

Should I create 10 workflows to run 10 parallel jobs, or is there a better way to solve this problem?

My shell script:

#!/bin/bash

# Expect exactly one argument: the table name, which is also the sqoop job name.
[ $# -ne 1 ] && { echo "Usage : $0 table"; exit 1; }

table=$1
job_name=${table}

# Execute the pre-created sqoop job for this table.
sqoop job --exec ${job_name}


My sqoop job definition:

sqoop job --create ${table} -- import \
    --connect ${domain}:${port}/${database} --username ${username} --password ${password} \
    --query "SELECT * from ${database}.${table} WHERE \$CONDITIONS" -m 1 \
    --hive-import --hive-database ${hivedatabase} --hive-table ${table} --as-parquetfile \
    --incremental append --check-column id --last-value "${last_val}" \
    --target-dir /user/xxxxx/hive/${hivedatabase}.db/${table} \
    --outdir /home/$USER/logs/outdir




2 answers


To run actions in parallel you can create a workflow.xml with a fork in it. Below is an example to help you.

If you look at the XML below, you will see that I call the same script in every action, passing in a different config file each time. In your case you would pass different table names, either from the config file or directly in the workflow.xml.

Taking the sqoop example, your sqoop command should live in a .sh script like the one below:

sqoop job --create ${table} -- import \
    --connect ${domain}:${port}/${database} --username ${username} --password ${password} \
    --query "SELECT * from "${database}"."${table}" WHERE \$CONDITIONS" -m 1 \
    --hive-import --hive-database "${hivedatabase}" --hive-table "${hivetable}" --as-parquetfile \
    --incremental append --check-column id --last-value "${last_val}" \
    --target-dir /user/xxxxx/hive/${hivedatabase}.db/${table} \
    --outdir /home/$USER/logs/outdir


So basically, write your sqoop job as generically as you can: it should pick up the Hive table, Hive database, source table, and source database names from workflow.xml. That way you call the same script in every action, and only the env-var values change between actions. See below for the changes I made to the first action.
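
For illustration, a generic wrapper could pick those values up along these lines (a minimal sketch; the variable names database, table, hivedatabase and hivetable are assumptions that must match the env-var entries declared in the workflow below):

#!/bin/bash
# Minimal sketch of a generic wrapper driven entirely by the env-var entries
# of the Oozie shell action. The names below are assumptions and must match
# whatever you declare in workflow.xml.

: "${database:?env-var database not set}"
: "${table:?env-var table not set}"
: "${hivedatabase:?env-var hivedatabase not set}"
: "${hivetable:?env-var hivetable not set}"

# Run the pre-created sqoop job named after the source table
# (or run the full "sqoop job --create ... -- import ..." command shown above).
sqoop job --exec "${table}"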



 <workflow-app xmlns='uri:oozie:workflow:0.5' name='Workflow_Name'>
    <start to="forking"/>
     
     <fork name="forking">
      <path start="shell-8f63"/>
      <path start="shell-8f64"/>
      <path start="SCRIPT3CONFIG3"/>
      <path start="SCRIPT4CONFIG4"/>
      <path start="SCRIPT5CONFIG5"/>
      <path start="script6config6"/>
    </fork>

    <action name="shell-8f63">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>database=sourcedatabase</env-var>
        <env-var>table=sourcetablename</env-var>
        <env-var>hivedatabase=yourhivedatabasename</env-var>
        <env-var>hivetable=yourhivetablename</env-var>
        <!-- You can pass as many variables as you want via env-var elements. -->
        <!-- Wrap the values in double quotes inside the shell script so they work through shell actions. -->
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="shell-8f64">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>database=sourcedatabase1</env-var>
        <env-var>table=sourcetablename1</env-var>
        <env-var>hivedatabase=yourhivedatabasename1</env-var>
        <env-var>hivetable=yourhivetablename1</env-var>
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT3CONFIG3">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName=${wf:user()}</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/THIRD_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT4CONFIG4">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName=${wf:user()}</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/FOURTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT5CONFIG5">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName=${wf:user()}</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/FIFTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="script6config6">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName=${wf:user()}</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/SIXTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <join name="joining" to="end"/>

    <action name="sendEmail">
    <email xmlns="uri:oozie:email-action:0.1">
    <to>youremail.com</to>
    <subject>your subject</subject>
    <body>your email body</body>
    </email>
    <ok to="kill"/>
    <error to="kill"/>
    </action>
     
    <kill name="kill">
    <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
    </workflow-app>



The example above runs 6 jobs in parallel. If you want to run more parallel actions, add more paths to the fork at the beginning and define the corresponding actions in the workflow.

This is how the forked workflow looks in HUE.




From what I understand, you need to run "x" number of jobs in parallel in Oozie, where "x" can change every time. Here is what you can do:

Build a two-step workflow:

  • Shell action
  • Sub-workflow action



  • Shell action - this launches a shell script which, based on your "x", dynamically decides which tables need to be selected and generates the workflow.xml that the sub-workflow action will execute next. That generated workflow contains a fork of shell actions so they can run in parallel. Note that you also need to put this generated XML in HDFS so that it is available to the sub-workflow.

  • Sub-workflow action - this simply executes the workflow generated in the previous step. A sketch of such a generator script follows below.
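
As a rough sketch of that first step (not a drop-in solution), a generator script along these lines could read the table list and emit a fork/join workflow with one shell action per table. The file names tables.txt and generated-workflow.xml, the HDFS target path /user/xxxx/generated_wf/, and the simplified kill node are all assumptions to adapt to your environment:

#!/bin/bash
# Sketch: build a fork/join workflow.xml with one shell action per table.
# Assumptions: tables.txt holds one table name per line, and the sub-workflow
# action points at /user/xxxx/generated_wf/ in HDFS.

OUT=generated-workflow.xml
TABLES=$(cat tables.txt)

{
  echo '<workflow-app name="generated_parallel_sqoop" xmlns="uri:oozie:workflow:0.5">'
  echo '  <start to="forking"/>'
  echo '  <fork name="forking">'
  for t in ${TABLES}; do
    echo "    <path start=\"sqoop_${t}\"/>"
  done
  echo '  </fork>'
  for t in ${TABLES}; do
    cat <<EOF
  <action name="sqoop_${t}">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>\${jobTracker}</job-tracker>
      <name-node>\${nameNode}</name-node>
      <exec>shell.sh</exec>
      <argument>${t}</argument>
      <env-var>HADOOP_USER_NAME=\${wf:user()}</env-var>
      <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
    </shell>
    <ok to="joining"/>
    <error to="Kill"/>
  </action>
EOF
  done
  echo '  <join name="joining" to="End"/>'
  echo '  <kill name="Kill"><message>Action failed</message></kill>'
  echo '  <end name="End"/>'
  echo '</workflow-app>'
} > "${OUT}"

# Upload the generated workflow so the sub-workflow action can pick it up.
hdfs dfs -put -f "${OUT}" /user/xxxx/generated_wf/workflow.xml

The sub-workflow action would then point its app-path at the HDFS directory the generated workflow.xml was uploaded to.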


