Friday, November 22, 2013

Apache Oozie workflow scheduling system - Introduction and Tips

Oozie is a workflow scheduling system for managing hadoop jobs and other executable components such as: Java, shell script, Hive, Sqoop, Pig, SSH, MapReduce, Hadoop FS, and provides kinds of EL Function.

Using oozie cooradinator + oozie workflow we can schedule our data processing tasks, it could be also used as a monitoring and error handling system.


To introduce the Oozie framework in detail we need to know some backgrounds of oozie.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. The workflows and scheduler (in oozie it is a coordinator) will be defined with XML files. Oozie provides a management CLI and a web interface to show its running workflows. And you can also communicate with oozie from a Java program using Oozie Java Client library.

XML Schema
There is two types of XML, workflow, which defines the payload of a oozie job, and the other is coordinator, which defines the scheduling information of a oozie job.

You can use following command to validate if a xml file is valid workflow or coordinator definition in oozie framework.

$ oozie validate workflow.xml

Then we take a look at the detail of these two XML schema.

Workflow schema defined, a workflow must have extractly one start element and end element. And between them, 0 to unbounded actions or decisions and so on. Some of actions and decision element will be explained in Example chapter.

Coordinator schema shows, that one coordinator can only involve extractly one workflow in action element.
That means, one coordinator can only control one workflow.

But, one workflow can have any number of sub-workflows.

Management Interface
Oozie provides three kinds of management interface, all you want do with oozie can be done in command line interface (CLI), there is also a Web interface which is using a so called Oozie URL, but from the web interface you can only take a look the information, you can not manage your oozie server or oozie workflows from browser. And third choice is also powerful, you can do everything in your Java program with Oozie Java Client library.

Some useful command:

Start a oozie workflow (or coordinator):
$ oozie job -oozie -config /some/where/ -run

Get the information of a oozie job with its ID (such as 0000001-130104191423486-oozie-oozi-W):
$ oozie job -oozie http:/ -info 0000001-130104191423486-oozie-oozi-W

Get the task log of a oozie job:
$ oozie job -oozie -log 0000001-130104191423486-oozie-oozi-W

Stop a oozie job:
$ oozie job -oozie -kill 0000003-130104191423486-oozie-oozi-W

The "-oozie" refers to a URL that called Oozie URL, by each command you have to point this URL explicit.

The web interface is just the same as the "Oozie URL". For example, in this case, it is:

Using this URL you can get all of informations about running jobs and configurations by your browser.

Java Client 
There is also a java client library of oozie.

=== Tips ===

1) Deploy Oozie ShareLib in HDFS

$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar zxf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share

2) Oozie Sqoop Action arguments from properties

Oozie Sqoop Action does not support multi-lines sqoop command from property file very well, we should use <arg> tag to set sqoop command line by line as sqoop job parameters.

3) ZooKeeper connection problem by importing to HBase using Sqoop-action

Problem: The mapper of a sqoop action tries to access zookeeper on localhost and not the one of the cluster.
  1. Go to cloudera manager of the corresponding cluster
  2. Go to zookeeper service and get the hbase-site.xml
  3. Copy hbase-site.xml into hdfs under /tmp/ooziesharelib/share/lib/sqoop/

4) ZooKeeper connection problem by HBase Java client

Just like the similar ZooKeeper problem by Sqoop-action, we can put the hbase-site.xml into oozie common sharelib, or if we want to manually load HBase ZooKeeper configuration in Java, we can put the hbase-site.xml in jar, and then:

Configuration conf = new Configuration();

5) Hive action throws NestedThrowables: JDOFatalInternalException and InvocationTargetException

Put MySQL Java Connector into share lib or in Hive Workflow Root

If it still doesn't work, the take a look at

In short form:
hadoop fs  -chmod g+w  /tmp
hadoop fs  -chmod 777  /tmp
hadoop fs  -chmod g+w  /user/hive/warehouse
hadoop fs  -chmod 777  /user/hive/warehouse

6) Fork-Join action errorTo same transitions

It is fixed in Oozie version 3.3.2 (

A temporarily solution for old Oozie version is shown here:

In short:

In oozie-site.xml , set oozie.validate.ForkJoin to false and restart Oozie.

7) Default maximum output data size is only 2 KB

sometime you will get this error:

Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]
Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
 at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(
yep, it will happen sooner or later, because the default maximum output data size is only 2KB -_-

if you want to change this setting, you need set the property to a larger one in oozie-site.xml , such as:
will set the max output size to 1024 KB .

8) SSH-Tunnel to bypass the firewall to get the web interface

The port 11000 may be blocked by default in some firewall, so if you want to use the web interface of oozie, you may need to set a ssh tunnel, to redirect the traffic with localhost:11000 to port 11000 on oozie server.

Then you can get the web interface using URL: http://localhost:11000/

9) sendmail-action after a decision-action

Java program set a status property and message property. by decision-action check if the status equals 0.

Sponsors: mobilcom-debitel Online-Shop

No comments:

Post a Comment

© Chutium / Teng Qiu @ ABC Netz Group