ABC Networks Blog: Data Mining

Wednesday, July 23, 2014

Spark play with HBase's Result object: handling HBase KeyValue and ByteArray in Scala with Spark -- Real World Examples

This is the second part of "Lighting a Spark With HBase Full Edition".

You should first read the previous part, which covers the HBase dependencies and Spark classpaths: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

For some background on combining HBase and Spark, this post is also worth reading: http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

This post adds some more complicated real-world examples to the post above.

First, put your hbase-site.xml into Spark's conf folder; otherwise you have to specify the absolute path of hbase-site.xml in your code.

ln -s /etc/hbase/conf/hbase-site.xml $SPARK_HOME/conf/

Now, to warm up, we use a very simple HBase table with string rowkeys and string values.

Table contents:

hbase(main):001:0> scan 'tmp'
ROW                   COLUMN+CELL
 abc                  column=cf:test, timestamp=1401466636075, value=789
 abc                  column=cf:val, timestamp=1401466435722, value=789
 bar                  column=cf:val, timestamp=1396648974135, value=bb
 sku_2                column=cf:val, timestamp=1401464467396, value=999
 test                 column=cf:val, timestamp=1396649021478, value=bb
 tmp                  column=cf:val, timestamp=1401466616160, value=test

The post from vidyasource.com shows how to get the values out of the (key, Result) tuples, but not the row keys.

The following code shows how to create an RDD of key-value pairs, RDD[(key, value)], from HBase Results:

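// needed for .asScala on the java.util.List returned by Result.getColumn
import scala.collection.JavaConverters._

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.spark.rdd.NewHadoopRDD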

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tmp")

var hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("cf".getBytes(), "val".getBytes()))).map(row => {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue
)
}).take(10)

you will get

Array[(String, Array[Byte])] = Array((abc,Array(55, 56, 57)), (bar,Array(98, 98)), (sku_2,Array(57, 57, 57)), (test,Array(98, 98)), (tmp,Array(116, 101, 115, 116)))

In Scala, we can use map(_.toChar).mkString to convert an Array[Byte] to a string (this works here because, in this warm-up example, the HBase table holds only plain string values).

hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("cf".getBytes(), "val".getBytes()))).map(row => {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
)
}).take(10)

then we get

Array[(String, String)] = Array((abc,789), (bar,bb), (sku_2,999), (test,bb), (tmp,test))
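
The map(_.toChar).mkString conversion used above treats every byte as one character, which is fine for ASCII values like the ones in table "tmp". As a minimal sketch (plain Scala, no HBase needed; the helper names asciiString and utf8String are made up for illustration), this is how it compares with charset-aware decoding. HBase's own Bytes.toString should be another option for UTF-8 values.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Per-byte conversion as used in the post: correct for ASCII content only.
def asciiString(bytes: Array[Byte]): String = bytes.map(_.toChar).mkString

// Charset-aware decoding: also handles multi-byte UTF-8 sequences.
def utf8String(bytes: Array[Byte]): String = new String(bytes, UTF_8)

// The bytes of "789", as returned for rowkey abc above.
val ascii = Array[Byte](55, 56, 57)

// A non-ASCII value: U+00FC is encoded as two bytes in UTF-8.
val umlaut = "\u00fc".getBytes(UTF_8)
```

For ascii both helpers return "789". For umlaut only utf8String gives back the original character, so the per-byte variant is best kept for tables that are known to hold ASCII strings.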

=======================================================================

After the warm-up, let us take a more complicated HBase table as an example.

This table stores the UUID, cookie, or similar identifier of a user's different devices. You can imagine it as part of some kind of platform used for cross-device user tracking and/or for analyzing user behavior on different devices.

the rowkey is the userid, a string (such as some kind of hashed value)
the column family is lf (the device family)
the column qualifiers are the names or ids of the devices (such as some internal id of User Agent Strings; in this example we use simple strings like app1, app2 for mobile apps, and pc1, ios2 for different browsers on different devices)
each cell value is an 8-byte long (a byte array of length 8)

it looks like this:

hbase(main):001:0> scan 'test1'
ROW                   COLUMN+CELL
 user1                column=lf:app1, timestamp=1401645690042, value=\x00\x00\x00\x00\x00\x00\x00\x0F
 user1                column=lf:app2, timestamp=1401645690093, value=\x00\x00\x00\x00\x00\x00\x00\x10
 user2                column=lf:app1, timestamp=1401645690142, value=\x00\x00\x00\x00\x00\x00\x00\x11
 user2                column=lf:pc1,  timestamp=1401645690170, value=\x00\x00\x00\x00\x00\x00\x00\x12
 user3                column=lf:ios2, timestamp=1401645690180, value=\x00\x00\x00\x00\x00\x00\x00\x02

To create such a table, run puts like these in the HBase shell (the table must exist first, for example via create 'test1', 'lf'):

put 'test1', 'user1', 'lf:app1', "\x00\x00\x00\x00\x00\x00\x00\x0F"
put 'test1', 'user1', 'lf:app2', "\x00\x00\x00\x00\x00\x00\x00\x10"
put 'test1', 'user2', 'lf:app1', "\x00\x00\x00\x00\x00\x00\x00\x11"
put 'test1', 'user2', 'lf:pc1',  "\x00\x00\x00\x00\x00\x00\x00\x12"
put 'test1', 'user3', 'lf:ios2', "\x00\x00\x00\x00\x00\x00\x00\x02"
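
The escaped values in these puts are just the numbers 15, 16, 17, 18 and 2 written as 8 big-endian bytes. A small sketch of how such strings could be generated (the helper names longToBytes, shellEscape and putCommand are hypothetical, not part of HBase):

```scala
import java.nio.ByteBuffer

// Encode a Long as 8 big-endian bytes, the layout of the values above.
def longToBytes(value: Long): Array[Byte] =
  ByteBuffer.allocate(8).putLong(value).array()

// Render bytes in the \xNN form used inside the double-quoted shell strings.
def shellEscape(bytes: Array[Byte]): String =
  bytes.map(b => "\\x" + "%02X".format(b & 0xff)).mkString

// Hypothetical helper: builds the put command for a single cell.
def putCommand(table: String, row: String, column: String, value: Long): String =
  "put '" + table + "', '" + row + "', '" + column + "', \"" +
    shellEscape(longToBytes(value)) + "\""
```

For example, putCommand("test1", "user1", "lf:app1", 15L) produces the first put line shown above.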

OK, so how can we read (scan) this table from Spark?

Let us look at this code:

conf.set(TableInputFormat.INPUT_TABLE, "test1")

var hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("lf".getBytes(), "app1".getBytes()))).map(row => if (row._2.size > 0) {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toInt).mkString
)
}).take(10)

Why map(_.toInt) this time? Because the bytes in this Array[Byte] are numbers, not chars.

But we get

Array((user1,000000015), (user2,000000017), ())

What, 000000015? Yes, because _.toInt converts each element of the Array[Byte] to an Int separately, and mkString then concatenates the eight results. To avoid this, we can use java.nio.ByteBuffer, which reads all eight bytes as a single long.
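
The difference is easy to reproduce on the raw bytes alone, without Spark or HBase. A minimal sketch using the value stored for user1 in lf:app1 (HBase's Bytes.toLong should give the same result as ByteBuffer here, but only ByteBuffer is used below):

```scala
import java.nio.ByteBuffer

// The 8-byte value stored for user1 / lf:app1 in the example table.
val raw = Array[Byte](0, 0, 0, 0, 0, 0, 0, 0x0F)

// Converting byte by byte and concatenating gives a string of digits, not a number.
val perByte: String = raw.map(_.toInt).mkString

// Reading all 8 bytes as one big-endian long gives the intended value.
val asLong: Long = ByteBuffer.wrap(raw).getLong
```

Here perByte is the string "000000015", while asLong is the number 15.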

So the code should be changed to:

import java.nio.ByteBuffer
hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("lf".getBytes(), "app1".getBytes()))).map(row => if (row._2.size > 0) {
(
  row._1.map(_.toChar).mkString,
  ByteBuffer.wrap(row._2.asScala.reduceLeft {
    (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue).getLong
)
}).take(10)

Then we get

Array((user1,15), (user2,17), ())

This finally looks better, but what is the last ()?

It appears because rowkey user3 has no value in column lf:app1, so the if without an else returns () (Unit) for that row. Again, we can do better: in the HBaseConfiguration object we can set TableInputFormat.SCAN_COLUMNS to a particular column (family:qualifier), so that only rows with that column are returned. So we change the code to the FINAL EDITION:

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.rdd.NewHadoopRDD

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "test1")
conf.set(TableInputFormat.SCAN_COLUMNS, "lf:app1")

var hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

import java.nio.ByteBuffer
hBaseRDD.map(tuple => tuple._2).map(result => {
  ( result.getRow.map(_.toChar).mkString,
    ByteBuffer.wrap(result.value).getLong
  )
}).take(10)

and now, finally we get:

Array[(String, Long)] = Array((user1,15), (user2,17))
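
Restricting the scan is the better fix because less data leaves HBase, but the stray () seen earlier can also be avoided on the Scala side. The sketch below uses a plain Seq as a stand-in for the RDD (an assumption for illustration; RDDs offer collect with a partial function as well) and an Option as a stand-in for "this row has a value in lf:app1".

```scala
import java.nio.ByteBuffer

// Stand-in for (rowkey, optional lf:app1 value); user3 has no lf:app1 cell.
val rows: Seq[(String, Option[Array[Byte]])] = Seq(
  "user1" -> Some(Array[Byte](0, 0, 0, 0, 0, 0, 0, 0x0F)),
  "user2" -> Some(Array[Byte](0, 0, 0, 0, 0, 0, 0, 0x11)),
  "user3" -> None
)

// The earlier code used `if` without `else`; in Scala 2 that is shorthand for
// `else ()`, spelled out here. The missing case yields (), and the type widens to Any.
val withUnit: Seq[Any] = rows.map { case (key, value) =>
  if (value.nonEmpty) (key, ByteBuffer.wrap(value.get).getLong) else ()
}

// `collect` keeps only the rows the partial function matches; the type stays precise.
val onlyMatches: Seq[(String, Long)] = rows.collect {
  case (key, Some(bytes)) => (key, ByteBuffer.wrap(bytes).getLong)
}
```

withUnit still has three elements, the last one being (), while onlyMatches contains just (user1,15) and (user2,17).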

=======================================================================

FINAL FULL EDITION

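Now, if you want to get all key-value pairs of an HBase table (all versions of the values from all column qualifiers), keep in mind that a scan returns only the newest version of each cell by default; to actually receive older versions, more versions have to be requested, for example through TableInputFormat.SCAN_MAXVERSIONS.

You can try this code (for the string-valued table "tmp"):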

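// needed for .asScala on the java.util.NavigableMap returned by Result.getMap
import scala.collection.JavaConverters._

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

import org.apache.spark.rdd.NewHadoopRDD

import java.nio.ByteBuffer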

type HBaseRow = java.util.NavigableMap[Array[Byte],
  java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]

type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]

def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
  navMap.asScala.toMap.map(cf =>
    (cf._1, cf._2.asScala.toMap.map(col =>
      (col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))

type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, String]]]

def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (cf._1.map(_.toChar).mkString, cf._2.map(col =>
      (col._1.map(_.toChar).mkString, col._2.map(elem => (elem._1, elem._2.map(_.toChar).mkString))))))

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tmp")

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap))).map(kv => (kv._1.map(_.toChar).mkString, rowToStrMap(kv._2))).take(10)

For the long values of column family "lf" in table "test1", you can define CFTimeseriesRowStr and rowToStrMap as follows instead:

type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, Long]]]

def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (cf._1.map(_.toChar).mkString, cf._2.map(col =>
      (col._1.map(_.toChar).mkString, col._2.map(elem => (elem._1, ByteBuffer.wrap(elem._2).getLong))))))
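
One reason to convert the Array[Byte] keys to strings at all: a Scala Map keyed by Array[Byte] is awkward to query, because arrays compare by reference and not by content. (As far as I know, the NavigableMap returned by HBase avoids this with a byte-array comparator, but that is lost after toMap.) A minimal sketch with made-up values:

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Two distinct arrays with identical contents, as two separate getBytes calls produce.
val stored = "val".getBytes(UTF_8)
val lookup = "val".getBytes(UTF_8)

// Keyed by Array[Byte]: a lookup with an equal-looking array misses.
val byBytes: Map[Array[Byte], String] = Map(stored -> "789")

// Keyed by String (what rowToStrMap produces): the lookup behaves as expected.
val byString: Map[String, String] =
  byBytes.map { case (k, v) => (new String(k, UTF_8), v) }
```

byBytes.get(lookup) returns None even though the qualifier is "val" in both cases, while byString.get("val") returns Some("789").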

=======================================================================

Beyond all of this code, there are more details to think about when you query an HBase table, such as scan caching, whether to enable the block cache, and whether to use bloom filters.

Most important of all: Spark still uses org.apache.hadoop.hbase.mapreduce.TableInputFormat to read from HBase, the same as a MapReduce program or a Hive HBase table mapping. This brings a big problem: your job will fail when one of the regions of the target HBase table is splitting, because the original region goes offline during the split.

So if your HBase regions must be splittable, be careful when using Spark or Hive to read from an HBase table. Maybe you should write a coprocessor instead of using the hbase.mapreduce API.

If they do not have to be splittable, you should disable automatic region splitting. The following slide summarizes all HBase config properties that control region splits.

Thursday, May 22, 2014

Install and config Graphite on Debian/Ubuntu

Install graphite server using python-pip

sudo apt-get install gcc python-dev python-pip

sudo pip install https://github.com/graphite-project/ceres/tarball/master
sudo pip install whisper
sudo pip install carbon
sudo pip install graphite-web

cd /opt/graphite/conf

sudo cp carbon.conf.example carbon.conf

sudo cp storage-schemas.conf.example storage-schemas.conf

sudo cp graphite.wsgi.example graphite.wsgi

===========================================

Run carbon service and test

===========================================

sudo /opt/graphite/bin/carbon-cache.py start

If you get an error like "ImportError: cannot import name daemonize" (see the thread "Can't Start Carbon - 12.04 - Python Error - ImportError: cannot import name daemonize"), try:
sudo pip install 'Twisted<12.0'

Check that the port of the carbon service is open:

netstat -naep | grep 2003

test:

perl -e '$ts = time(); for (1..1000) { printf "foo.bar %d %d\n", int(rand(10000)), $ts - 90 * $_ }' | nc -c localhost 2003
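
What the one-liner sends is carbon's plaintext protocol: one "path value timestamp" line per data point. The same test could be sketched in Scala roughly as follows; the function names carbonLine, samplePoints and send are made up for illustration, and host and port are the ones assumed above.

```scala
// One line of carbon's plaintext protocol: "<metric path> <value> <unix timestamp>\n".
def carbonLine(path: String, value: Long, timestamp: Long): String =
  s"$path $value $timestamp\n"

// Mirrors the perl one-liner: `count` random points, 90 seconds apart, going back in time.
def samplePoints(path: String, now: Long, count: Int, rnd: scala.util.Random): Seq[String] =
  (1 to count).map(i => carbonLine(path, rnd.nextInt(10000).toLong, now - 90L * i))

// Writes the lines to a carbon-cache listener (2003 is the port checked above).
def send(lines: Seq[String], host: String = "localhost", port: Int = 2003): Unit = {
  val socket = new java.net.Socket(host, port)
  try {
    val out = socket.getOutputStream
    lines.foreach(line => out.write(line.getBytes("UTF-8")))
    out.flush()
  } finally socket.close()
}
```

A call such as send(samplePoints("foo.bar", System.currentTimeMillis / 1000, 1000, new scala.util.Random)) should then have roughly the same effect as the perl and nc pipeline.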

===========================================

Run Graphite web UI (apache2 with mod_python, mod_wsgi)

===========================================

sudo apt-get install apache2 libapache2-mod-python libapache2-mod-wsgi

sudo chown -R www-data:www-data /opt/graphite/storage/

sudo mv /etc/apache2/sites-available/default /etc/apache2/sites-available/default.bak

sudo cp /opt/graphite/examples/example-graphite-vhost.conf /etc/apache2/sites-available/default

sudo mkdir /etc/httpd

sudo mkdir /etc/httpd/wsgi

cd /opt/graphite/webapp/graphite/

sudo cp local_settings.py.example local_settings.py

restart apache2
sudo service apache2 restart

===========================================

Initialize the database after installing apache2

===========================================

sudo apt-get install python-django

sudo pip install django-tagging==0.3.1

cd /opt/graphite/webapp/graphite/

sudo python manage.py syncdb

Username: root

Password: xxx

E-mail: info@fxlive.de

Edit

/opt/graphite/webapp/graphite/local_settings.py

add:
SECRET_KEY

TIME_ZONE = 'Europe/Berlin'

LOG_RENDERING_PERFORMANCE = True

LOG_CACHE_PERFORMANCE = True

LOG_METRIC_ACCESS = True

===========================================

Troubleshooting

===========================================

1) Edit the following file:

/etc/apache2/sites-available/default

Make sure that the configuration for WSGISocketPrefix is set as follows:

WSGISocketPrefix /var/run/apache2/wsgi

Otherwise, you will get the following error:

[Tue Jun 19 13:21:28 2012] [error] [client 192.168.xxx.xxx] (2)No such file or directory: mod_wsgi (pid=19506): Unable to connect to WSGI daemon process 'graphite' on '/etc/apache2/run/wsgi.19365.1.1.sock' after multiple attempts.

2) If you changed the VirtualHost port, do not forget to add this port in
/etc/apache2/ports.conf
by adding something like this:
NameVirtualHost *:8000
Listen 8000

3) If you cannot see the images and get something like this:

ViewDoesNotExist: Could not import graphite.render.views. Error was: No module named cairo

you need to install:
sudo apt-get install python-cairo-dev

4) Watch the logs if something is still not working right:
cd /opt/graphite/storage/log/webapp
find . -name '*.log' | xargs tail -F

5) Location of Graphite's whisper database (RRD-style storage):
/opt/graphite/storage/whisper

===========================================

example: storage-schemas.conf

===========================================

[stats_1day]
pattern = ^stats_1day\.
retentions = 1d:1y

[stats_1hour]
pattern = ^stats_1hour\.
retentions = 1h:90d
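
Each retentions entry has the form precision:duration, so 1d:1y keeps one point per day for a year and 1h:90d keeps one point per hour for 90 days. A small sketch of the arithmetic (the unit table follows whisper's s, m, h, d, w, y suffixes; the helper names are made up, and the plain seconds:points form is not handled):

```scala
// Seconds per unit suffix in whisper's retention syntax.
val unitSeconds: Map[Char, Long] = Map(
  's' -> 1L, 'm' -> 60L, 'h' -> 3600L, 'd' -> 86400L, 'w' -> 604800L, 'y' -> 31536000L
)

// "1h" -> 3600, "90d" -> 7776000
def toSeconds(spec: String): Long =
  spec.init.toLong * unitSeconds(spec.last)

// "precision:duration" -> (seconds per point, number of points kept)
def retention(spec: String): (Long, Long) = {
  val parts = spec.split(":")
  val step = toSeconds(parts(0))
  (step, toSeconds(parts(1)) / step)
}
```

With this, retention("1d:1y") gives 365 points at 86400-second resolution, and retention("1h:90d") gives 2160 points at 3600-second resolution.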

===========================================

example: /etc/apache2/sites-available/default

===========================================

<IfModule !wsgi_module.c>
    LoadModule wsgi_module modules/mod_wsgi.so
</IfModule>

WSGISocketPrefix /var/run/apache2/wsgi

<VirtualHost *:80>
    ServerName graphite
    DocumentRoot "/opt/graphite/webapp"
    ErrorLog /opt/graphite/storage/log/webapp/error.log
    CustomLog /opt/graphite/storage/log/webapp/access.log common

    # enable CORS (Cross-origin resource sharing), see below
    Header set Access-Control-Allow-Origin "*"

    # I've found that an equal number of processes & threads tends
    # to show the best performance for Graphite (ymmv).
    WSGIDaemonProcess graphite processes=5 threads=5 display-name='%{GROUP}' inactivity-timeout=120
    WSGIProcessGroup graphite
    WSGIApplicationGroup %{GLOBAL}
    WSGIImportScript /opt/graphite/conf/graphite.wsgi process-group=graphite application-group=%{GLOBAL}

    # XXX You will need to create this file! There is a graphite.wsgi.example
    # file in this directory that you can safely use, just copy it to graphite.wsgi
    WSGIScriptAlias / /opt/graphite/conf/graphite.wsgi

    Alias /content/ /opt/graphite/webapp/content/
    <Location "/content/">
        SetHandler None
    </Location>

    # XXX In order for the django admin site media to work you
    # must change @DJANGO_ROOT@ to be the path to your django
    # installation, which is probably something like:
    # /usr/lib/python2.6/site-packages/django
    Alias /media/ "@DJANGO_ROOT@/contrib/admin/media/"
    <Location "/media/">
        SetHandler None
    </Location>

    # The graphite.wsgi file has to be accessible by apache. It won't
    # be visible to clients because of the DocumentRoot though.
    <Directory /opt/graphite/conf/>
        Order deny,allow
        Allow from all
    </Directory>
</VirtualHost>

You may want to enable CORS (cross-origin resource sharing), for example to load graphs from pages served by another host.

ref.: http://enable-cors.org/server_apache.html

To do so, add this line

Header set Access-Control-Allow-Origin "*"

to

/etc/apache2/sites-available/default

and then restart apache2. The Header directive needs mod_headers, which on Debian/Ubuntu can be enabled with sudo a2enmod headers.

===========================================

bugfix

===========================================

There is a small bug in Graphite's Dashboard JS; take a look at this post:
http://www.abcn.net/2014/01/graphites-dashboard-set-auto-hide-navbar.html

Tuesday, December 3, 2013

FX Live officially launched: real-time forex trading signals based on big data analysis, with Bitcoin content to be added over time

FX Live China

FX Live Europa

FX Live Deutschland

Big Data Analytics -> Real-Time Trading Signals -> Smart Trading

The decision flow of our real-time big data analytics trading platform can be abstracted into the following steps:

Rich Real-time Data Sources -> historical Big Data Processing -> Trading Strategies -> Real-time Trading Signals

including Complex Event Processing (CEP), signal generation methods, trading algorithms, and market risk models.

Feedback and discussion are welcome. Weibo: 邱腾邱导导

Tuesday, November 26, 2013

Data Mining: Practical Machine Learning Tools and Techniques

I want to recommend a good book about the open source machine learning toolkit WEKA. It is an introduction to data mining and to different machine learning algorithms and methods, such as decision trees, association rules, classification, and clustering.

Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann Series in Data Management Systems) [English]

Get the paperback (Taschenbuch) on Amazon.de

Get the Kindle Edition in the Amazon.de Kindle Shop

In our current projects we have used several of WEKA's implementations of data mining and machine learning algorithms. They work fine, but are a little slow for our data scale.

Furthermore, maybe we can build something based on WEKA and UIMA.

more info:

The book's homepage at the WEKA Machine Learning Group, University of Waikato
