Tuesday, July 22, 2014

Lighting a Spark With HBase, Full Edition, with real-world examples ~ dependencies, classpath, and handling byte arrays in HBase KeyValue objects

First of all, there are many resources on the internet about integrating HBase and Spark,

such as

Spark has its own example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala

MapR also has a nice sample: http://www.mapr.com/developercentral/code/loading-hbase-tables-spark

and here is a more detailed code snippet: http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase

but none of them covers:
  • which jar libraries are needed, i.e. the dependency problem
  • how to set the classpath when starting a Spark job/application that connects to HBase
  • sc.newAPIHadoopRDD returns values of type org.apache.hadoop.hbase.client.Result, and the objects inside a Result are org.apache.hadoop.hbase.KeyValue. These are core client-side Java APIs of HBase, and it is often not enough to use them only via getColumn("columnFamily".getBytes(), "columnQualifier".getBytes()); more importantly, working with the KeyValue object from Scala is even more complicated (see the baseline sketch below).
Therefore, this post aims to be a "Full" version...

Assuming you have already read the samples above, I will go ahead and tackle these three problems directly.

If you only want to see some code, jump to the next part of this post: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
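For reference, the basic read pattern that all three samples above share looks roughly like this in spark-shell (where sc already exists). It is only a sketch: the table name "test_table" is a placeholder, and any scan configuration is left out.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// standard HBase client configuration; "test_table" is a placeholder table name
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "test_table")

// each record is a (row key, Result) pair; the Result wraps the row's KeyValue objects
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

hBaseRDD.count()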

1. Dependency problem

The dependencies are similar to those of an HBase client program.

For Maven:

<dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.0.1</version>
</dependency>

<dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase</artifactId>
        <version>0.98.2-hadoop2</version>
</dependency>

<dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>0.98.2-hadoop2</version>
</dependency>

<dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-common</artifactId>
        <version>0.98.2-hadoop2</version>
</dependency>

<dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>0.98.2-hadoop2</version>
</dependency>

For sbt:

libraryDependencies ++= Seq(
        "org.apache.spark" % "spark-core_2.10" % "1.0.1",
        "org.apache.hbase" % "hbase" % "0.98.2-hadoop2",
        "org.apache.hbase" % "hbase-client" % "0.98.2-hadoop2",
        "org.apache.hbase" % "hbase-common" % "0.98.2-hadoop2",
        "org.apache.hbase" % "hbase-server" % "0.98.2-hadoop2"
)

Change the Spark and HBase versions to match your setup.

2. Classpath

In the days of Spark 0.9.x, you just needed to set the SPARK_CLASSPATH environment variable to HBase's jars. For example, to start spark-shell in local mode on a CDH5 Hadoop distribution:
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
and then
./bin/spark-shell --master local[2]
or just
SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar ./bin/spark-shell --master local[2]

On your cluster, change these jar paths to wherever your HBase is installed; in other Hadoop distributions this will be something like /usr/lib/xxx (Hortonworks HDP) or /opt/mapr/hbase-xxx (MapR).

but, but... this lovely SPARK_CLASSPATH is deprecated in the new era of Spark 1.x  !!! -_-

So, in Spark 1.x

there is one conf property and one command-line argument for this:
spark.executor.extraClassPath
and
--driver-class-path

WTF... but yes, you must give the whole list of jar paths twice! And spark.executor.extraClassPath must be set in a conf file; it cannot be set via the command line...

So, you need to do this:

Edit conf/spark-defaults.conf

and add this:
spark.executor.extraClassPath  /opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
Then start spark-shell or submit your Spark job with the --driver-class-path command-line argument for the driver:
./bin/spark-shell --master local[2]  --driver-class-path  /opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
Unbelievable, but that is how it is in Spark 1.x...
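The same flags apply when submitting a packaged application with spark-submit. A sketch, where the class name com.example.MyHBaseJob and the jar my-hbase-job.jar are placeholders for your own application and the jar paths are the CDH5 ones from above:

./bin/spark-submit --master local[2] --driver-class-path /opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar --class com.example.MyHBaseJob my-hbase-job.jar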

3. How to use org.apache.hadoop.hbase.KeyValue in Scala with Spark

This post is already long enough, so let's take a break here. To see the code with real-world examples, go to the next part of this post: http://www.abcn.net/2014/07/spark-hbase-result-keyvalue-bytearray.html
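As a small teaser before part two, here is a minimal sketch of pulling values out of the Result/KeyValue objects in Scala. It assumes an hBaseRDD built with sc.newAPIHadoopRDD as in the baseline sketch earlier in this post; the column family "cf" and qualifier "col1" are placeholders.

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.util.Bytes

// assumes hBaseRDD is an RDD[(ImmutableBytesWritable, Result)] as built earlier;
// column family "cf" and qualifier "col1" are placeholders
val rows = hBaseRDD.map { case (_, result) =>
  // simplest case: read a single cell's value as a byte array and decode it
  val single: Array[Byte] = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))

  // or walk all KeyValue objects of the row and decode qualifier -> value
  val allColumns = result.raw().map { kv: KeyValue =>
    Bytes.toString(kv.getQualifier) -> Bytes.toString(kv.getValue)
  }.toMap

  (Bytes.toString(result.getRow), Option(single).map(b => Bytes.toString(b)), allColumns)
}

rows.take(10).foreach(println)

The next part of this post goes through this in much more detail with real-world examples.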

© Chutium / Teng Qiu @ ABC Netz Group