Woody Allen memorial box set, 20 DVDs! Woody Allen Collection (20 DVDs), until 21:00: 48.97 EUR instead of 69.97 EUR
Friday, November 29, 2013
Amazon Cyber Monday Woche / Woody Allen memorial box set, 20 DVDs! Woody Allen Collection (20 DVDs)
Amazon Cyber Monday Woche: another nice find
Amazon Cyber Monday Woche / wine package / AKG mini headphones / Severin microwave / Sony 32" LED / WIKO budget smartphone
Discounts of up to 50%, from 9:00 to 23:00!
An extra highlight of Cyber Monday Woche 2013 is the Kindle Fire HD 8 GB for 99 EUR, available until 2 December (10:00), compared to the regular price of 129 €.
16:30 Highlights
Here are the highlights of the Cyber Monday lightning deals at 16:30 on 29.11.2013:
Amazon Cyber Monday Woche 2013
Today's Highlights
Here are the highlights of the Cyber Monday lightning deals of 29.11.2013:
Wednesday, November 27, 2013
iPhone 5S Gold special offer: 689 € until 1 December!
The gold iPhone 5S is suddenly 10 euros cheaper. For once I actually bought an Apple product, and damn it, the price dropped right after I got it! :(
I just have to complain about that...
But for anyone who is planning to buy one right now, it's definitely good news~
iPhone 5S Gold 16GB special offer: 689 €! Only until 1 December 2013
The special price runs until 1 December; click here to buy.

Tuesday, November 26, 2013
Data Mining: Practical Machine Learning Tools and Techniques
I want to recommend a good book about the open-source machine learning toolkit WEKA. It is an introduction to data mining and to various machine learning algorithms and methods, such as decision trees, association rules, classification, and clustering.
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English]
Get the [Paperback] on Amazon.de
Get the [Kindle Edition] in the Amazon.de Kindle Shop
In our current projects we use several of WEKA's implementations of data mining and machine learning algorithms. They work fine, but are a bit slow at our data scale.
Furthermore, maybe we can build something based on WEKA and UIMA.
More info:
Book homepage of the WEKA Machine Learning Group at the University of Waikato
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English] [Paperback]
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English] [Kindle Edition]
Labels:
Data Mining,
Data Science,
Machine Learning
Friday, November 22, 2013
Apache Oozie workflow scheduling system - Introduction and Tips
Oozie is a workflow scheduling system for managing Hadoop jobs and other executable components, such as Java programs, shell scripts, Hive, Sqoop, Pig, SSH, MapReduce, and Hadoop FS actions, and it provides various kinds of EL functions.
Using an Oozie coordinator together with an Oozie workflow, we can schedule our data processing tasks; the setup can also be used as a monitoring and error-handling system.
To introduce the Oozie framework in detail, we first need some background on Oozie.
Background
Oozie (http://oozie.apache.org/) is a workflow scheduler system for managing Apache Hadoop jobs. Workflows and schedules (a schedule is called a coordinator in Oozie) are defined in XML files. Oozie provides a management CLI and a web interface that shows the running workflows, and you can also talk to Oozie from a Java program using the Oozie Java client library.
XML Schema
There are two types of XML definitions: a workflow, which defines the payload of an Oozie job, and a coordinator, which defines the scheduling information of an Oozie job. You can use the following command to check whether an XML file is a valid workflow or coordinator definition for the Oozie framework:
$ oozie validate workflow.xml
Now let's look at these two XML schemas in detail.
Workflow
The workflow schema defines that a workflow must have exactly one start element and one end element. Between them there can be zero to unbounded action or decision elements (and other node types). Some of the action and decision elements appear in the examples and tips below.
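As a rough sketch (the workflow name, the fs action, and the node names are made up for illustration), a minimal workflow could look like this:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="prepare-dir"/>
    <!-- a single trivial action: create a working directory in HDFS -->
    <action name="prepare-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/demo-wf"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- reached via the error transition of the action -->
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>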
Coordinator
The coordinator schema shows that one coordinator can reference exactly one workflow in its action element. That means one coordinator can only control one workflow. However, one workflow can have any number of sub-workflows.
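A minimal coordinator that triggers the workflow above once a day could look roughly like this (the frequency, time range, and application path are made-up values):

<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2013-11-22T09:00Z" end="2014-11-22T09:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <!-- exactly one workflow per coordinator -->
        <workflow>
            <app-path>${nameNode}/user/oozie/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>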
Management Interface
Oozie provides three kinds of management interfaces. Everything you want to do with Oozie can be done through the command line interface (CLI). There is also a web interface, served at the so-called Oozie URL, but from the web interface you can only view information; you cannot manage your Oozie server or your workflows from the browser. The third choice is also powerful: you can do everything from your Java program with the Oozie Java client library.
CLI
Some useful commands:
Start an Oozie workflow (or coordinator):
$ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run
Get the information about an Oozie job by its ID (such as 0000001-130104191423486-oozie-oozi-W):
$ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W
Get the task log of an Oozie job:
$ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W
Stop an Oozie job:
$ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000003-130104191423486-oozie-oozi-W
The "-oozie" refers to a
URL that called Oozie URL, by each command you have to point
this URL explicit.
Web
The web interface is served at the same "Oozie URL". In this case, for example, it is http://fxlive.de:11000/oozie. Using this URL you can view all the information about running jobs and configurations in your browser.
Java Client
There is also a Java client library for Oozie.
=== Tips ===
1) Deploy Oozie ShareLib in HDFS
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/
https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-InstallingtheOozieShareLibinHadoopHDFS
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar zxf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share
2) Oozie Sqoop Action arguments from properties
The Oozie Sqoop action does not handle a multi-line Sqoop command from a property file very well; instead, we should use <arg> tags to pass the Sqoop command argument by argument as Sqoop job parameters, as in the sketch below.
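For example, a Sqoop import action split into <arg> elements might look roughly like this (the connection string, credentials, and table names are placeholders, not from the original setup):

<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- one <arg> per token instead of a single multi-line <command> -->
        <arg>import</arg>
        <arg>--connect</arg>
        <arg>jdbc:mysql://db.example.com/shop</arg>
        <arg>--username</arg>
        <arg>${dbUser}</arg>
        <arg>--password</arg>
        <arg>${dbPassword}</arg>
        <arg>--table</arg>
        <arg>orders</arg>
        <arg>--hbase-table</arg>
        <arg>orders</arg>
        <arg>--column-family</arg>
        <arg>cf</arg>
    </sqoop>
    <ok to="next-action"/>
    <error to="fail"/>
</action>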
3) ZooKeeper connection problem when importing into HBase with the Sqoop action
Problem: The mappers of a Sqoop action try to access ZooKeeper on localhost instead of the cluster's ZooKeeper.
Solution:
- Go to the Cloudera Manager of the corresponding cluster
- Go to the ZooKeeper service and get the hbase-site.xml
- Copy hbase-site.xml into HDFS under /tmp/ooziesharelib/share/lib/sqoop/
4) ZooKeeper connection problem with the HBase Java client
Just like the similar ZooKeeper problem with the Sqoop action, we can put the hbase-site.xml into the Oozie common sharelib. Alternatively, if we want to load the HBase ZooKeeper configuration manually in Java, we can put hbase-site.xml into the jar and then:
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.addResource("hbase-site.xml");   // loads the hbase-site.xml packaged in the jar from the classpath
conf.reloadConfiguration();
5) Hive action throws NestedThrowables: JDOFatalInternalException and InvocationTargetException
Put the MySQL Java Connector into the sharelib or into the Hive workflow root.
If it still doesn't work, then take a look at http://cloudfront.blogspot.de/2012/06/failed-error-in-metadata.html
In short form:
hadoop fs -chmod g+w /tmp
hadoop fs -chmod 777 /tmp
hadoop fs -chmod g+w /user/hive/warehouse
hadoop fs -chmod 777 /user/hive/warehouse
6) Fork-Join validation error when transitions lead to the same node
It is fixed in Oozie version 3.3.2 (https://issues.apache.org/jira/browse/OOZIE-1035)
A temporary solution for older Oozie versions is shown here: https://issues.apache.org/jira/browse/OOZIE-1142
In short:
In oozie-site.xml , set oozie.validate.ForkJoin to false and restart Oozie.
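The corresponding entry in oozie-site.xml would look roughly like this:

<property>
    <name>oozie.validate.ForkJoin</name>
    <value>false</value>
</property>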
7) Default maximum output data size is only 2 KB
Sometimes you will get this error:
Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]
Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
org.apache.oozie.action.hadoop.LauncherException
at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)
Yep, it will happen sooner or later, because the default maximum output data size is only 2 KB. -_- If you want to change this setting, you need to set the property oozie.action.max.output.data to a larger value in oozie-site.xml, such as:
<property>
    <name>oozie.action.max.output.data</name>
    <value>1048576</value>
</property>
This will set the maximum output size to 1024 KB.
8) SSH tunnel to bypass the firewall and reach the web interface
Port 11000 may be blocked by default in some firewalls, so if you want to use the Oozie web interface, you may need to set up an SSH tunnel that forwards localhost:11000 to port 11000 on the Oozie server.
Then you can reach the web interface at the URL http://localhost:11000/
9) sendmail-action after a decision-action
The Java program sets a status property and a message property via <capture-output/>; a decision action then checks whether the status equals 0 and, if it does not, routes to an email action, as sketched below.
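A rough sketch of this pattern (node names, the recipient address, and the exact property names are assumptions for illustration; only the InitAction class name is reused from the error message in tip 7):

<action name="init">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.myactions.action.InitAction</main-class>
        <!-- the Java main() writes the status/message properties into the file
             named by the oozie.action.output.properties system property -->
        <capture-output/>
    </java>
    <ok to="check-status"/>
    <error to="fail"/>
</action>

<decision name="check-status">
    <switch>
        <!-- send a mail if the captured status is not 0 -->
        <case to="send-error-mail">${wf:actionData('init')['status'] ne '0'}</case>
        <default to="end"/>
    </switch>
</decision>

<action name="send-error-mail">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to>
        <subject>Workflow ${wf:id()} reported an error</subject>
        <body>${wf:actionData('init')['message']}</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>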
Tuesday, November 19, 2013
Integration of Hive and HBase: Hive MAP column mapped to an entire column family with binary value (byte value)
I plan to gradually write up some of the problems I have run into at work and post them here.
Let's start with the most troublesome one~
In our earlier work we ran into a situation where we needed to map an HBase column family into a Hive table, but this column family stores only byte values, and the column qualifiers are random, timestamp-like values that cannot be listed explicitly in the column mapping, so the entire column family has to be mapped into Hive.
In other words:
Hive MAP column mapped to an entire column family with binary value
Solution
Use the following mapping type specification:
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:#s:b")
cf:#s:b means: cf is the column family name, and #s:b is the type specification for this entire HBase column family. The column names (column qualifiers) become the keys of the Hive MAP with datatype string (s), and the column values become the Hive MAP values with datatype binary (b, byte values).
I searched all kinds of documentation and none of it mentions this case. Even the official documentation
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Examplewithbinarycolumns
only covers the case where one specific column is a binary column; hbase.table.default.storage.type does not apply when an entire column family has to be mapped.
Following the format the documentation gives for mapping one specific column,
column-family-name:[column-name][#(binary|string)]
we can write the mapping of an entire column family with binary values in the following general form:
column-family-name:[#(binary|string):(binary|string)]
An example
Customer journey analytics is a good use case for HBase + Hive in big data analytics. Assume that all customer events are tracked into one HBase table. The row key is the customer ID, a long value; to keep the key fixed-length, the long values are stored as byte values via Bytes.toBytes(). One column family stores all events of a customer, such as view, click, and buy; the column qualifier is the event's timestamp, and the value is the event ID, again a long value stored as bytes.
The table is then created as:
create 'customer_journey', 'events'
Entries look like:
hbase(main):011:0> scan 'customer_journey'
ROW                               COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824339000, timestamp=1354824339000, value=\x00\x00\x00\x00\x00\x00 \x08
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824340000, timestamp=1354824340000, value=\x00\x00\x00\x00\x00\x00'\x9E
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\x00\x0F
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\xF0\x08
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824359000, timestamp=1354824359000, value=\x00\x00\x00\x00\x00\x00\x10\xD8
The Hive external table should be created like:
hive> CREATE EXTERNAL TABLE customer_journey (customer_id bigint, events map<string, bigint>)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b, events:#s:b")
    > TBLPROPERTIES ("hbase.table.name" = "customer_journey");
Then you will get:
hive> select * from customer_journey;
OK
1    {"1354824339000":8200,"1354824340000":10142,"1354824350000":15}
16   {"1354824350000":61448,"1354824359000":4312}
Time taken: ...
Labels:
Big Data,
Binary Value,
Byte Value,
Column Family,
Data Science,
Hadoop,
HBase,
Hive,
Schema,
Tips