How to install Shark in CDH 5.0.0 GA
Requirements for Shark
1. CDH5
2. Spark
Spark should already be installed in CDH 5, under /opt/cloudera/parcels/CDH/lib/spark
Following these steps, you will install Shark 0.9.1 in /var/lib/spark on CDH 5.0.0 GA with Hadoop version 2.3.0
/var/lib/spark is the default home directory of the spark user in CDH 5.0.0 GA
You need to run these scripts as root or as the spark user (you need to change the spark user's shell to /bin/bash; by default it is nologin)
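The shell change mentioned above can be done with usermod; a minimal sketch, assuming a standard Linux layout and that you run it as root:

```shell
# Give the spark user a login shell (CDH 5 sets it to nologin by default).
usermod -s /bin/bash spark

# Then switch to the spark user for the remaining steps:
su - spark
```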
1. Download Shark source code
export SPARK_USER_HOME=/var/lib/spark
cd $SPARK_USER_HOME
wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
tar zxf scala-2.10.3.tgz
wget https://github.com/amplab/shark/archive/v0.9.1.tar.gz
tar zxf v0.9.1.tar.gz
Or you can download my shark-0.9.1 build, which is compiled against the CDH 5.0.0 packages:
http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz
2. Configure Shark
We can use the Hive 0.12 that ships with CDH 5, so we do not need to download the Spark/Shark build of the Hive 0.11 binaries.
Set the following configuration in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh :
export SPARK_USER_HOME=/var/lib/spark
export SPARK_MEM=2g
export SHARK_MASTER_MEM=1g
export SCALA_HOME="$SPARK_USER_HOME/scala-2.10.3"
export HIVE_HOME="/opt/cloudera/parcels/CDH/lib/hive"
export HIVE_CONF_DIR="$HIVE_HOME/conf"
export HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop"
export SPARK_HOME="/opt/cloudera/parcels/CDH/lib/spark"
export MASTER="spark://test01:7077"
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
(change the hostname test01 in MASTER="spark://test01:7077" to your own master's hostname)
3. Build Shark with Hadoop 2.3.0-cdh5.0.0
If you downloaded my shark-0.9.1 build above (http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz), you do not need to build it; you can jump to Step 5. Otherwise, you need to compile your shark-0.9.1 against Hadoop 2.3.0-cdh5:
cd $SPARK_USER_HOME/shark-0.9.1/
SHARK_HADOOP_VERSION=2.3.0-cdh5.0.0 ./sbt/sbt package
This takes a long time and depends on your network; normally it will be very slow.
So maybe now you want to download the pre-built shark-0.9.1 package for CDH 5.0.0 GA after all...
again, it is here:
http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz
4. Parquet support
wget http://repo1.maven.org/maven2/com/twitter/parquet-hive/1.2.8/parquet-hive-1.2.8.jar -O $SPARK_USER_HOME/shark-0.9.1/lib/parquet-hive-1.2.8.jar
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-common.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-encoding.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-format.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-avro.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-column.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-thrift.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-generator.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-cascading.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop-bundle.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-scrooge.jar $SPARK_USER_HOME/shark-0.9.1/lib/
I am not sure whether all of these jars are needed, but it works with this set of Parquet jars.
If you enable this Parquet support, you need to set SPARK_MEM in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh to at least 2 GB.
5. Deploy shark to all the worker nodes
#MASTER
cd $SPARK_USER_HOME
tar zcf shark.tgz shark-0.9.1
scp this file to each worker, or pull it from the master on each worker:
#WORKER
sudo ln -s /usr/bin/java /bin/java
export SPARK_USER_HOME=/var/lib/spark
cd $SPARK_USER_HOME
scp shark@test01:$SPARK_USER_HOME/shark.tgz $SPARK_USER_HOME/
tar zxf shark.tgz
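With several workers, the push from the master can also be scripted; a sketch, assuming the example worker hostnames test02 and test03 used later in this guide and passwordless SSH for the spark user:

```shell
# Push the Shark tarball from the master to every worker and unpack it.
# Replace the hostnames with your own worker list.
for host in test02 test03; do
  scp $SPARK_USER_HOME/shark.tgz $host:$SPARK_USER_HOME/
  ssh $host "cd $SPARK_USER_HOME && tar zxf shark.tgz"
done
```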
6. Configure Spark
If your Spark service cannot be started in CM5 (Cloudera Manager 5), you may need to remove the "noexec" option from the /var or /var/run mount point, using the command:
mount -o remount,exec /var/run
For a permanent solution, also change the mount options in the /var or /var/run line of /lib/init/fstab.
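The permanent edit can be done with sed; a sketch of the substitution, demonstrated here on a sample fstab line so you can inspect the result before touching the real /lib/init/fstab (the sample line is illustrative, not copied from your system):

```shell
# Write a sample fstab line like the one described above.
printf 'tmpfs /var/run tmpfs nosuid,noexec,mode=0755 0 0\n' > fstab.sample

# Replace the noexec option with exec on the /var/run line only.
# On the real system, run this against /lib/init/fstab (after a backup).
sed -i '/\/var\/run/ s/\bnoexec\b/exec/' fstab.sample

cat fstab.sample
```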
You may need to go back to #MASTER and add the workers to /etc/spark/conf/slaves .
For example, if you have 2 worker nodes:
echo "test02" >> /etc/spark/conf/slaves
echo "test03" >> /etc/spark/conf/slaves
Also, in /etc/spark/conf/spark-env.sh you may need to change
export STANDALONE_SPARK_MASTER_HOST=`hostname`
to
export STANDALONE_SPARK_MASTER_HOST=`hostname -f`
7. Run it!
Finally, I believe you can run the Shark shell now!
Go back to #MASTER:
$SPARK_USER_HOME/shark-0.9.1/bin/shark-withinfo -skipRddReload
The -skipRddReload flag is only needed when you have tables with Hive/HBase mappings, because of some issues with PassthroughOutputFormat in the Hive HBase handler.
The error message looks something like:
"Property value must not be null"
or
"java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat"
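Once the shell comes up, a quick smoke test is to run a trivial HiveQL statement. A sketch; Shark's CLI is built on the Hive CLI, so the -e flag for one-off statements is assumed to behave the same way here:

```shell
# Run a one-off statement against the running cluster to verify the install.
# Requires the configured master and Hive metastore to be reachable.
$SPARK_USER_HOME/shark-0.9.1/bin/shark -e "SHOW TABLES;"
```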
8. Issues
Ref.: http://bigdataanalyze.blogspot.de/2014/03/installing-shark-in-cdh5-beta2.html
It is a good guide for installing Shark in CDH5 beta2.
The author has also collected some common issues with Shark on CDH5 beta2: http://bigdataanalyze.blogspot.de/2014/03/issues-on-shark-with-cdh5-beta2-1.html