Woody Allen memorial box set, 20 DVDs! Woody Allen Collection (20 DVDs), until 21:00: 48.97 EUR instead of 69.97 EUR
Friday, November 29, 2013
Amazon Cyber Monday Woche / Woody Allen memorial box set, 20 DVDs! Woody Allen Collection (20 DVDs)
Amazon Cyber Monday Woche: another nice find
Amazon Cyber Monday Woche / wine package / AKG mini headphones / Severin microwave / Sony 32" LED / WIKO budget smartphone
Discounts of up to 50%, from 9:00 to 23:00!
An extra highlight of Cyber Monday Woche 2013 is the Kindle Fire HD 8 GB for 99 EUR, available until 2 December (10:00), compared to the regular price of 129 €.
16:30 Highlights
Here are the highlights of the Cyber Monday lightning deals at 16:30 on 29.11.2013:
Amazon Cyber Monday Woche 2013
Today's Highlights
Here are the highlights of the Cyber Monday lightning deals of 29.11.2013:
Wednesday, November 27, 2013
iPhone 5S Gold special offer: 689 € until 1 December!
The gold iPhone 5S is suddenly 10 euros cheaper. For once I actually bought an Apple product, and damn it, the price dropped right after I got it! :(
I just have to complain about that...
But for anyone who is planning to buy one right now, it's definitely good news~
iPhone 5S Gold 16GB special offer: 689 €! Only until 1 December 2013
The special price runs until 1 December; click here to buy.

Tuesday, November 26, 2013
Data Mining: Practical Machine Learning Tools and Techniques
I want to recommend a good book about the open-source machine learning toolkit WEKA. It is an introduction to data mining and to various machine learning algorithms and methods, such as decision trees, association rules, classification, and clustering.
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English]
Get the [Paperback] on Amazon.de
Get the [Kindle Edition] in the Amazon.de Kindle Shop
In our current projects we use several of WEKA's implementations of data mining and machine learning algorithms. They work fine, but are a bit slow at our data scale.
Furthermore, maybe we can build something based on WEKA and UIMA.
More info:
Book homepage of the WEKA Machine Learning Group at the University of Waikato
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English] [Paperback]
Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English] [Kindle Edition]
Labels:
Data Mining,
Data Science,
Machine Learning
Friday, November 22, 2013
Apache Oozie workflow scheduling system - Introduction and Tips
Oozie is a workflow scheduling system for managing Hadoop jobs and other executable components, such as Java programs, shell scripts, Hive, Sqoop, Pig, SSH, MapReduce, and Hadoop FS actions, and it provides various kinds of EL functions.
Using an Oozie coordinator together with an Oozie workflow, we can schedule our data processing tasks; the setup can also be used as a monitoring and error-handling system.
To introduce the Oozie framework in detail, we first need some background on Oozie.
Background
Oozie (http://oozie.apache.org/) is a workflow scheduler system for managing Apache Hadoop jobs. Workflows and schedules (a schedule is called a coordinator in Oozie) are defined in XML files. Oozie provides a management CLI and a web interface that shows the running workflows, and you can also talk to Oozie from a Java program using the Oozie Java client library.
XML Schema
There are two types of XML definitions: a workflow, which defines the payload of an Oozie job, and a coordinator, which defines the scheduling information of an Oozie job. You can use the following command to check whether an XML file is a valid workflow or coordinator definition for the Oozie framework:
$ oozie validate workflow.xml
Now let's look at these two XML schemas in detail.
Workflow
The workflow schema defines that a workflow must have exactly one start element and one end element. Between them there can be zero to unbounded action or decision elements (and other node types). Some of the action and decision elements appear in the examples and tips below.
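As a rough sketch (the workflow name, the fs action, and the node names are made up for illustration), a minimal workflow could look like this:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="prepare-dir"/>
    <!-- a single trivial action: create a working directory in HDFS -->
    <action name="prepare-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/demo-wf"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <!-- reached via the error transition of the action -->
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>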
Coordinator
The coordinator schema shows that one coordinator can reference exactly one workflow in its action element. That means one coordinator can only control one workflow. However, one workflow can have any number of sub-workflows.
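A minimal coordinator that triggers the workflow above once a day could look roughly like this (the frequency, time range, and application path are made-up values):

<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2013-11-22T09:00Z" end="2014-11-22T09:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <!-- exactly one workflow per coordinator -->
        <workflow>
            <app-path>${nameNode}/user/oozie/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>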
Management Interface
Oozie provides three kinds of management interfaces. Everything you want to do with Oozie can be done through the command line interface (CLI). There is also a web interface, served at the so-called Oozie URL, but from the web interface you can only view information; you cannot manage your Oozie server or your workflows from the browser. The third choice is also powerful: you can do everything from your Java program with the Oozie Java client library.
CLI
Some useful commands:
Start an Oozie workflow (or coordinator):
$ oozie job -oozie http://fxlive.de:11000/oozie -config /some/where/job.properties -run
Get the information about an Oozie job by its ID (such as 0000001-130104191423486-oozie-oozi-W):
$ oozie job -oozie http://fxlive.de:11000/oozie -info 0000001-130104191423486-oozie-oozi-W
Get the task log of an Oozie job:
$ oozie job -oozie http://fxlive.de:11000/oozie -log 0000001-130104191423486-oozie-oozi-W
Stop an Oozie job:
$ oozie job -oozie http://fxlive.de:11000/oozie -kill 0000003-130104191423486-oozie-oozi-W
The "-oozie" refers to a
URL that called Oozie URL, by each command you have to point
this URL explicit.
Web
The web interface is served at the same "Oozie URL". In this case, for example, it is http://fxlive.de:11000/oozie. Using this URL you can view all the information about running jobs and configurations in your browser.
Java Client
There is also a Java client library for Oozie.
=== Tips ===
1) Deploy Oozie ShareLib in HDFS
http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/
https://ccp.cloudera.com/display/CDH4DOC/Oozie+Installation#OozieInstallation-InstallingtheOozieShareLibinHadoopHDFS
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar zxf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share
2) Oozie Sqoop Action arguments from properties
The Oozie Sqoop action does not handle a multi-line Sqoop command from a property file very well; instead, we should use <arg> tags to pass the Sqoop command argument by argument as Sqoop job parameters, as in the sketch below.
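For example, a Sqoop import action split into <arg> elements might look roughly like this (the connection string, credentials, and table names are placeholders, not from the original setup):

<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- one <arg> per token instead of a single multi-line <command> -->
        <arg>import</arg>
        <arg>--connect</arg>
        <arg>jdbc:mysql://db.example.com/shop</arg>
        <arg>--username</arg>
        <arg>${dbUser}</arg>
        <arg>--password</arg>
        <arg>${dbPassword}</arg>
        <arg>--table</arg>
        <arg>orders</arg>
        <arg>--hbase-table</arg>
        <arg>orders</arg>
        <arg>--column-family</arg>
        <arg>cf</arg>
    </sqoop>
    <ok to="next-action"/>
    <error to="fail"/>
</action>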
3) ZooKeeper connection problem when importing into HBase with the Sqoop action
Problem: The mappers of a Sqoop action try to access ZooKeeper on localhost instead of the cluster's ZooKeeper.
Solution:
- Go to the Cloudera Manager of the corresponding cluster
- Go to the ZooKeeper service and get the hbase-site.xml
- Copy hbase-site.xml into HDFS under /tmp/ooziesharelib/share/lib/sqoop/
4) ZooKeeper connection problem with the HBase Java client
Just like the similar ZooKeeper problem with the Sqoop action, we can put the hbase-site.xml into the Oozie common sharelib. Alternatively, if we want to load the HBase ZooKeeper configuration manually in Java, we can put hbase-site.xml into the jar and then:
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.addResource("hbase-site.xml");   // loads the hbase-site.xml packaged in the jar from the classpath
conf.reloadConfiguration();
5) Hive action throws NestedThrowables: JDOFatalInternalException and InvocationTargetException
Put the MySQL Java Connector into the sharelib or into the Hive workflow root.
If it still doesn't work, then take a look at http://cloudfront.blogspot.de/2012/06/failed-error-in-metadata.html
In short form:
hadoop fs -chmod g+w /tmp
hadoop fs -chmod 777 /tmp
hadoop fs -chmod g+w /user/hive/warehouse
hadoop fs -chmod 777 /user/hive/warehouse
6) Fork-Join validation error when transitions lead to the same node
It is fixed in Oozie version 3.3.2 (https://issues.apache.org/jira/browse/OOZIE-1035)
A temporary solution for older Oozie versions is shown here: https://issues.apache.org/jira/browse/OOZIE-1142
In short:
In oozie-site.xml , set oozie.validate.ForkJoin to false and restart Oozie.
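The corresponding entry in oozie-site.xml would look roughly like this:

<property>
    <name>oozie.validate.ForkJoin</name>
    <value>false</value>
</property>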
7) Default maximum output data size is only 2 KB
Sometimes you will get this error:
Failing Oozie Launcher, Output data size [4 321] exceeds maximum [2 048]
Failing Oozie Launcher, Main class [com.myactions.action.InitAction], exception invoking main(), null
org.apache.oozie.action.hadoop.LauncherException
at org.apache.oozie.action.hadoop.LauncherMapper.failLauncher(LauncherMapper.java:571)
Yep, it will happen sooner or later, because the default maximum output data size is only 2 KB. -_- If you want to change this setting, you need to set the property oozie.action.max.output.data to a larger value in oozie-site.xml, such as:
<property>
    <name>oozie.action.max.output.data</name>
    <value>1048576</value>
</property>
This will set the maximum output size to 1024 KB.
8) SSH tunnel to bypass the firewall and reach the web interface
Port 11000 may be blocked by default in some firewalls, so if you want to use the Oozie web interface, you may need to set up an SSH tunnel that forwards localhost:11000 to port 11000 on the Oozie server.
Then you can reach the web interface at the URL http://localhost:11000/
9) sendmail-action after a decision-action
The Java program sets a status property and a message property via <capture-output/>; a decision action then checks whether the status equals 0 and, if it does not, routes to an email action, as sketched below.
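A rough sketch of this pattern (node names, the recipient address, and the exact property names are assumptions for illustration; only the InitAction class name is reused from the error message in tip 7):

<action name="init">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.myactions.action.InitAction</main-class>
        <!-- the Java main() writes the status/message properties into the file
             named by the oozie.action.output.properties system property -->
        <capture-output/>
    </java>
    <ok to="check-status"/>
    <error to="fail"/>
</action>

<decision name="check-status">
    <switch>
        <!-- send a mail if the captured status is not 0 -->
        <case to="send-error-mail">${wf:actionData('init')['status'] ne '0'}</case>
        <default to="end"/>
    </switch>
</decision>

<action name="send-error-mail">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>admin@example.com</to>
        <subject>Workflow ${wf:id()} reported an error</subject>
        <body>${wf:actionData('init')['message']}</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>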
Tuesday, November 19, 2013
Integration of Hive and HBase: Hive MAP column mapped to an entire column family with binary value (byte value)
I plan to gradually write up some of the problems I have run into at work and post them here.
Let's start with the most troublesome one~
In our earlier work we ran into a situation where we needed to map an HBase column family into a Hive table, but this column family stores only byte values, and the column qualifiers are random, timestamp-like values that cannot be listed explicitly in the column mapping, so the entire column family has to be mapped into Hive.
In other words:
Hive MAP column mapped to an entire column family with binary value
Solution
Use the following mapping type specification:
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:#s:b")
cf:#s:b means: cf is the column family name, and #s:b is the type specification for this entire HBase column family. The column names (column qualifiers) become the keys of the Hive MAP with datatype string (s), and the column values become the Hive MAP values with datatype binary (b, byte values).
I searched all kinds of documentation and none of it mentions this case. Even the official documentation
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Examplewithbinarycolumns
only covers the case where one specific column is a binary column; hbase.table.default.storage.type does not apply when an entire column family has to be mapped.
Following the format the documentation gives for mapping one specific column,
column-family-name:[column-name][#(binary|string)]
we can write the mapping of an entire column family with binary values in the following general form:
column-family-name:[#(binary|string):(binary|string)]
An example
Customer journey analytics is a good use case for HBase + Hive in big data analytics. Assume that all customer events are tracked into one HBase table. The row key is the customer ID, a long value; to keep the key fixed-length, the long values are stored as byte values via Bytes.toBytes(). One column family stores all events of a customer, such as view, click, and buy; the column qualifier is the event's timestamp, and the value is the event ID, again a long value stored as bytes.
The table is then created as:
create 'customer_journey', 'events'
Entries look like:
hbase(main):011:0> scan 'customer_journey'
ROW                               COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824339000, timestamp=1354824339000, value=\x00\x00\x00\x00\x00\x00 \x08
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824340000, timestamp=1354824340000, value=\x00\x00\x00\x00\x00\x00'\x9E
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\x00\x0F
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\xF0\x08
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824359000, timestamp=1354824359000, value=\x00\x00\x00\x00\x00\x00\x10\xD8
The Hive external table should be created like:
hive> CREATE EXTERNAL TABLE customer_journey (customer_id bigint, events map<string, bigint>)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b, events:#s:b")
    > TBLPROPERTIES ("hbase.table.name" = "customer_journey");
Then you will get:
hive> select * from customer_journey;
OK
1    {"1354824339000":8200,"1354824340000":10142,"1354824350000":15}
16   {"1354824350000":61448,"1354824359000":4312}
Time taken: ...
Labels:
Big Data,
Binary Value,
Byte Value,
Column Family,
Data Science,
Hadoop,
HBase,
Hive,
Schema,
Tips