Thursday, April 10, 2014

Install Shark 0.9.1 on CDH 5.0.0 GA (Hadoop 2.3.0) + Spark Configuration on CDH 5


How to install Shark in CDH 5.0.0 GA

Requirements for Shark

1. CDH5
2. Spark
Spark should already be installed in CDH 5, under /opt/cloudera/parcels/CDH/lib/spark

Following these steps, you will install Shark 0.9.1 in /var/lib/spark on CDH 5.0.0 GA with Hadoop 2.3.0.

/var/lib/spark is the default home directory of the spark user in CDH 5.0.0 GA.

You need to run these scripts as root or as the spark user (note that you must change the shell of the spark user to /bin/bash; by default it is nologin).
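As a minimal sketch of that preparation step (it assumes a spark user already exists and that you are root; the guard makes it a no-op otherwise):

```shell
# Give the spark user a login shell so you can su to it.
# Guarded so it does nothing on systems without a spark user.
if id spark >/dev/null 2>&1; then
    usermod -s /bin/bash spark || echo "usermod failed; are you root?"
fi
```

After that, `su - spark` drops you into a usable shell.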

1. Download Shark source code

export SPARK_USER_HOME=/var/lib/spark

cd $SPARK_USER_HOME

wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz

tar zxf scala-2.10.3.tgz

wget https://github.com/amplab/shark/archive/v0.9.1.tar.gz

tar zxf v0.9.1.tar.gz

Or you can download my shark-0.9.1 build, which is compiled against the CDH 5.0.0 packages:

http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz

2. Configure Shark

We can use the Hive 0.12 that ships with CDH 5, so we do not need to download the Spark/Shark build of the Hive 0.11 binaries.

Set the following configuration in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh :
export SPARK_USER_HOME=/var/lib/spark
export SPARK_MEM=2g
export SHARK_MASTER_MEM=1g
export SCALA_HOME="$SPARK_USER_HOME/scala-2.10.3"
export HIVE_HOME="/opt/cloudera/parcels/CDH/lib/hive"
export HIVE_CONF_DIR="$HIVE_HOME/conf"
export HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop"
export SPARK_HOME="/opt/cloudera/parcels/CDH/lib/spark"
export MASTER="spark://test01:7077"

SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS

(change the hostname test01 in MASTER="spark://test01:7077" to your master's hostname)
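Before moving on, a quick sanity check can save a confusing build failure later. This sketch just verifies that the directories referenced in shark-env.sh above actually exist on this host:

```shell
# Report which of the configured directories are present.
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}
for d in "$SPARK_USER_HOME/scala-2.10.3" \
         /opt/cloudera/parcels/CDH/lib/hive \
         /opt/cloudera/parcels/CDH/lib/hive/conf \
         /opt/cloudera/parcels/CDH/lib/hadoop \
         /opt/cloudera/parcels/CDH/lib/spark; do
    if [ -d "$d" ]; then echo "ok:      $d"; else echo "MISSING: $d"; fi
done
```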

3. Build Shark with Hadoop 2.3.0-cdh5.0.0

If you downloaded my shark-0.9.1 build above (http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz), you do not need to build it and can jump to Step 5. Otherwise, you need to compile shark-0.9.1 against Hadoop 2.3.0-cdh5.0.0:

cd $SPARK_USER_HOME/shark-0.9.1/
SHARK_HADOOP_VERSION=2.3.0-cdh5.0.0 ./sbt/sbt package

It takes a long time, depending on your network... normally it will be very slow... -_-

So maybe now you want to download the pre-built shark-0.9.1 package for CDH 5.0.0 GA...

Again, it is here:
http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz

4. Parquet support

wget http://repo1.maven.org/maven2/com/twitter/parquet-hive/1.2.8/parquet-hive-1.2.8.jar -O $SPARK_USER_HOME/shark-0.9.1/lib/parquet-hive-1.2.8.jar

ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-common.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-encoding.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-format.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-avro.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-column.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-thrift.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-generator.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-cascading.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop-bundle.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-scrooge.jar $SPARK_USER_HOME/shark-0.9.1/lib/
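The symlinks above can also be created in one loop (a sketch over the same parcel path; the `-e` guard skips jars that are not present in your CDH build):

```shell
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}
for j in /opt/cloudera/parcels/CDH/lib/hadoop/parquet-*.jar; do
    [ -e "$j" ] || continue          # skip if the glob did not match
    ln -sf "$j" "$SPARK_USER_HOME/shark-0.9.1/lib/"
done
```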

I am not sure whether all of these jars are needed, but it works with this set of Parquet jars.

If you enable this Parquet support, you need to set SPARK_MEM in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh to at least 2 GB.

5. Deploy shark to all the worker nodes

#MASTER

cd $SPARK_USER_HOME
tar zcf shark.tgz shark-0.9.1

Push this file to each worker with scp, or pull it from the master on each worker:

#WORKER

sudo ln -s /usr/bin/java /bin/java   (so that java is also available at /bin/java)

export SPARK_USER_HOME=/var/lib/spark
cd $SPARK_USER_HOME
scp shark@test01:$SPARK_USER_HOME/shark.tgz $SPARK_USER_HOME/
tar zxf shark.tgz
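Alternatively, the whole push-and-unpack can be scripted from #MASTER. A minimal sketch (the hostnames test02/test03 are assumptions for your cluster; it relies on passwordless SSH for the spark user):

```shell
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}

# Copy the tarball to each worker given as an argument and unpack it there.
deploy_shark() {
    for w in "$@"; do
        scp "$SPARK_USER_HOME/shark.tgz" "spark@$w:$SPARK_USER_HOME/"
        ssh "spark@$w" "cd '$SPARK_USER_HOME' && tar zxf shark.tgz"
    done
}

# e.g.: deploy_shark test02 test03
```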

6. Configure Spark

If your Spark service cannot be started in CM5 (Cloudera Manager 5), you may need to remove the "noexec" option from the /var or /var/run mount point, using the command:
mount -o remount,exec /var/run

For a permanent solution, change the mount options on the /var or /var/run line in /lib/init/fstab.

You may also need to go back to #MASTER and add the workers to /etc/spark/conf/slaves .

For example, with two worker nodes:
echo "test02" >> /etc/spark/conf/slaves
echo "test03" >> /etc/spark/conf/slaves

Also, in /etc/spark/conf/spark-env.sh you may need to change
export STANDALONE_SPARK_MASTER_HOST=`hostname`
to
export STANDALONE_SPARK_MASTER_HOST=`hostname -f`
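That change can be made with a one-line sed. This sketch demonstrates the substitution on a sample line rather than editing /etc/spark/conf/spark-env.sh directly (add -i and the real path to apply it in place):

```shell
# Replace the plain `hostname` call with `hostname -f` (fully-qualified name).
echo 'export STANDALONE_SPARK_MASTER_HOST=`hostname`' |
    sed 's/`hostname`/`hostname -f`/'
# prints: export STANDALONE_SPARK_MASTER_HOST=`hostname -f`
```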

7. Run it!

Finally, I believe you can run the Shark shell now!

Go back to #MASTER:
$SPARK_USER_HOME/shark-0.9.1/bin/shark-withinfo -skipRddReload

The -skipRddReload flag is only needed when you have tables with Hive/HBase mappings, because of some issues in PassthroughOutputFormat from the Hive HBase handler.

The error message is something like:
"Property value must not be null"
or
"java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat"

8. Issues

Ref.: http://bigdataanalyze.blogspot.de/2014/03/installing-shark-in-cdh5-beta2.html

It is a good guide for installing Shark in CDH5 beta2.

The author has also collected some common issues about Shark in CDH5 beta2: http://bigdataanalyze.blogspot.de/2014/03/issues-on-shark-with-cdh5-beta2-1.html

15 comments:

  1. This comment has been removed by the author.

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. I've been trying to get spark/shark working on cdh5 for several days now, but with no success. Apparently, cdh5 now uses a newer version of the jets3t library (0.9.0) instead of the version that ships with spark/shark (0.7.1).

    Consequently, installing shark on cdh5 results in a class not found error. (See https://issues.apache.org/jira/browse/SPARK-1556).

    But when I try either a) to replace the jets3t jars in the binary releases with 0.9.0, or b) when I try to use a version of spark and/or shark that I've compiled myself with 0.9.0 I get a verify error. (See https://github.com/apache/spark/pull/468#issuecomment-42027309)

    Any ideas on how to fix this and work around this dilemma?

    Thanks!

    Replies
    1. For the record, the following solved the issue for me:

      cd /usr/lib/shark/lib
      ln -s /usr/lib/hadoop/lib/jets3t-0.9.0.jar

  5. I attempted to follow the instructions here, but am stuck on this error. Any ideas?

    [root@cdh-head spark]# ./shark-0.9.1/bin/shark-withinfo
    -hiveconf hive.root.logger=INFO,console
    Starting the Shark Command Line Client
    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
    at org.apache.hadoop.mapreduce.util.ConfigUtil.addDeprecatedKeys(ConfigUtil.java:54)
    at org.apache.hadoop.mapreduce.util.ConfigUtil.loadResources(ConfigUtil.java:42)
at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:118)
    at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:1077)
at org.apache.hadoop.hive.conf.HiveConf.&lt;init&gt;(HiveConf.java:1039)
    at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:74)
    at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:58)
    at shark.SharkCliDriver$.main(SharkCliDriver.scala:94)
    at shark.SharkCliDriver.main(SharkCliDriver.scala)

    Replies
    1. Same exception here. Hadoop version 2.3.0-cdh5.0.1

    2. Same here, did anyone ever solve this?

    3. Fails for me too... I guess no one solved it yet? Thanks!

    4. I had a similar class problem that I managed to solve by not skipping Step 3 and actually recompiling shark against the cloudera that I had installed (5.0.2).

    5. to solve this remove 2 obsolete files:

      mv /root/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core-1.0.4.jar{,.backup}
      mv /root/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-test/hadoop-test-0.20.2.jar{,.backup}

    6. First we want to say thanks for the link seems to be working well the steps provided however we have version compatible issue any help appreciated.
      We have hadoop CDH 5.0.0 and and followed above procedure and we were not able to select * from table; as we saw the jars were used /var/lib/spark/shark-0.9.1/lib_managed/jars/edu.berkeley.cs.shark/ directory we have replaced with hive 0.12 jars now we are getting the below errors when we run $SPARK_USER_HOME/shark-0.9.1/bin/shark-withinfo -skipRddReload
      Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:190)
      at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:120)
      at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:115)
      at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:80)
      at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51)
      at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
      at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365)
      at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:285)
      at shark.SharkCliDriver$.main(SharkCliDriver.scala:128)
      at shark.SharkCliDriver.main(SharkCliDriver.scala)
      Thanks in advance.

    7. Adding to above one initially when we followed the original steps we were able to start the shark service and able to see the tables from hive metastore but not able to do anything further then we replaced with hive 0.12 jars then the service is not starting we are getting the above error. Thanks once again.

    8. oh, are you still using shark? maybe you can try the new release of spark 1.1, and its hive thrift server, it should be a good replacement for shark

  6. Hey guys, I had another go at this tonight. Seems you need to reference the Hadoop jars in the lib folder too. I did it with:


    cd $SPARK_USER_HOME/shark-0.9.1/lib/
    for a in `ls /opt/cloudera/parcels/CDH/lib/hadoop/hadoop*jar`; do ln -s $a `echo $a | cut -d"/" -f8`; done

    but you get the idea. I also did all the parquet jars with the below but i don't think you need to:

    for a in `ls /opt/cloudera/parcels/CDH/lib/parquet/parquet*jar`; do ln -s $a `echo $a | cut -d"/" -f8`; done


© Chutium / Teng Qiu @ ABC Netz Group