Thursday, April 10, 2014

Install Shark 0.9.1 on CDH 5.0.0 GA (Hadoop 2.3.0) + Spark Configuration on CDH 5


How to install Shark in CDH 5.0.0 GA

Requirements for Shark

1. CDH5
2. Spark
Spark should already be installed in CDH 5, under /opt/cloudera/parcels/CDH/lib/spark

Following these steps, you will install Shark 0.9.1 in /var/lib/spark on CDH 5.0.0 GA with Hadoop 2.3.0.

/var/lib/spark is the default home directory of the spark user in CDH 5.0.0 GA.

You need to run these scripts as root or as the spark user (note that you must change the shell of the spark user to /bin/bash; by default it is nologin).
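As a minimal sketch of that preparation step (it assumes a spark user already exists and that you are root; the guard makes it a no-op otherwise):

```shell
# Give the spark user a login shell so you can su to it.
# Guarded so it does nothing on systems without a spark user.
if id spark >/dev/null 2>&1; then
    usermod -s /bin/bash spark || echo "usermod failed; are you root?"
fi
```

After that, `su - spark` drops you into a usable shell.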

1. Download Shark source code

export SPARK_USER_HOME=/var/lib/spark

cd $SPARK_USER_HOME

wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz

tar zxf scala-2.10.3.tgz

wget https://github.com/amplab/shark/archive/v0.9.1.tar.gz

tar zxf v0.9.1.tar.gz

Or you can download my shark-0.9.1 build, which is compiled against the CDH 5.0.0 packages:

http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz

2. Configure Shark

We can use the Hive 0.12 that ships with CDH 5, so we do not need to download the Spark/Shark build of the Hive 0.11 binaries.

Set the following configuration in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh :
export SPARK_USER_HOME=/var/lib/spark
export SPARK_MEM=2g
export SHARK_MASTER_MEM=1g
export SCALA_HOME="$SPARK_USER_HOME/scala-2.10.3"
export HIVE_HOME="/opt/cloudera/parcels/CDH/lib/hive"
export HIVE_CONF_DIR="$HIVE_HOME/conf"
export HADOOP_HOME="/opt/cloudera/parcels/CDH/lib/hadoop"
export SPARK_HOME="/opt/cloudera/parcels/CDH/lib/spark"
export MASTER="spark://test01:7077"

SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS

(change the hostname test01 in MASTER="spark://test01:7077" to your master's hostname)
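Before moving on, a quick sanity check can save a confusing build failure later. This sketch just verifies that the directories referenced in shark-env.sh above actually exist on this host:

```shell
# Report which of the configured directories are present.
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}
for d in "$SPARK_USER_HOME/scala-2.10.3" \
         /opt/cloudera/parcels/CDH/lib/hive \
         /opt/cloudera/parcels/CDH/lib/hive/conf \
         /opt/cloudera/parcels/CDH/lib/hadoop \
         /opt/cloudera/parcels/CDH/lib/spark; do
    if [ -d "$d" ]; then echo "ok:      $d"; else echo "MISSING: $d"; fi
done
```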

3. Build Shark with Hadoop 2.3.0-cdh5.0.0

If you downloaded my shark-0.9.1 build above (http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz), you do not need to build it and can jump to Step 5. Otherwise, you need to compile shark-0.9.1 against Hadoop 2.3.0-cdh5.0.0:

cd $SPARK_USER_HOME/shark-0.9.1/
SHARK_HADOOP_VERSION=2.3.0-cdh5.0.0 ./sbt/sbt package

It takes a long time, depending on your network... normally it will be very slow... -_-

So maybe now you want to download the pre-built shark-0.9.1 package for CDH 5.0.0 GA...

Again, it is here:
http://user.cs.tu-berlin.de/~tqiu/fxlive/dataset/shark-0.9.1-cdh-5.0.0.tar.gz

4. Parquet support

wget http://repo1.maven.org/maven2/com/twitter/parquet-hive/1.2.8/parquet-hive-1.2.8.jar -O $SPARK_USER_HOME/shark-0.9.1/lib/parquet-hive-1.2.8.jar

ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-common.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-encoding.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-format.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-avro.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-column.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-thrift.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-generator.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-cascading.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-hadoop-bundle.jar $SPARK_USER_HOME/shark-0.9.1/lib/
ln -s /opt/cloudera/parcels/CDH/lib/hadoop/parquet-scrooge.jar $SPARK_USER_HOME/shark-0.9.1/lib/
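The symlinks above can also be created in one loop (a sketch over the same parcel path; the `-e` guard skips jars that are not present in your CDH build):

```shell
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}
for j in /opt/cloudera/parcels/CDH/lib/hadoop/parquet-*.jar; do
    [ -e "$j" ] || continue          # skip if the glob did not match
    ln -sf "$j" "$SPARK_USER_HOME/shark-0.9.1/lib/"
done
```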

I am not sure whether all of these jars are needed, but it works with this set of Parquet jars.

If you enable this Parquet support, you need to set SPARK_MEM in $SPARK_USER_HOME/shark-0.9.1/conf/shark-env.sh to at least 2 GB.

5. Deploy shark to all the worker nodes

#MASTER

cd $SPARK_USER_HOME
tar zcf shark.tgz shark-0.9.1

Push this file to each worker with scp, or pull it from the master on each worker:

#WORKER

sudo ln -s /usr/bin/java /bin/java   (so that java is also available at /bin/java)

export SPARK_USER_HOME=/var/lib/spark
cd $SPARK_USER_HOME
scp shark@test01:$SPARK_USER_HOME/shark.tgz $SPARK_USER_HOME/
tar zxf shark.tgz
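Alternatively, the whole push-and-unpack can be scripted from #MASTER. A minimal sketch (the hostnames test02/test03 are assumptions for your cluster; it relies on passwordless SSH for the spark user):

```shell
SPARK_USER_HOME=${SPARK_USER_HOME:-/var/lib/spark}

# Copy the tarball to each worker given as an argument and unpack it there.
deploy_shark() {
    for w in "$@"; do
        scp "$SPARK_USER_HOME/shark.tgz" "spark@$w:$SPARK_USER_HOME/"
        ssh "spark@$w" "cd '$SPARK_USER_HOME' && tar zxf shark.tgz"
    done
}

# e.g.: deploy_shark test02 test03
```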

6. Configure Spark

If your Spark service cannot be started in CM5 (Cloudera Manager 5), you may need to remove the "noexec" option from the /var or /var/run mount point, using the command:
mount -o remount,exec /var/run

For a permanent solution, change the mount options on the /var or /var/run line in /lib/init/fstab.

You may also need to go back to #MASTER and add the workers to /etc/spark/conf/slaves .

For example, with two worker nodes:
echo "test02" >> /etc/spark/conf/slaves
echo "test03" >> /etc/spark/conf/slaves

Also, in /etc/spark/conf/spark-env.sh you may need to change
export STANDALONE_SPARK_MASTER_HOST=`hostname`
to
export STANDALONE_SPARK_MASTER_HOST=`hostname -f`
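That change can be made with a one-line sed. This sketch demonstrates the substitution on a sample line rather than editing /etc/spark/conf/spark-env.sh directly (add -i and the real path to apply it in place):

```shell
# Replace the plain `hostname` call with `hostname -f` (fully-qualified name).
echo 'export STANDALONE_SPARK_MASTER_HOST=`hostname`' |
    sed 's/`hostname`/`hostname -f`/'
# prints: export STANDALONE_SPARK_MASTER_HOST=`hostname -f`
```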

7. Run it!

Finally, I believe you can run the Shark shell now!

Go back to #MASTER:
$SPARK_USER_HOME/shark-0.9.1/bin/shark-withinfo -skipRddReload

The -skipRddReload flag is only needed when you have tables with Hive/HBase mappings, because of some issues in PassthroughOutputFormat from the Hive HBase handler.

The error message is something like:
"Property value must not be null"
or
"java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat"

8. Issues

Ref.: http://bigdataanalyze.blogspot.de/2014/03/installing-shark-in-cdh5-beta2.html

It is a good guide for installing Shark in CDH5 beta2.

The author has also collected some common issues about Shark in CDH5 beta2: http://bigdataanalyze.blogspot.de/2014/03/issues-on-shark-with-cdh5-beta2-1.html

15 comments:

  1. This comment has been removed by the author.

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. I've been trying to get spark/shark working on cdh5 for several days now, but with no success. Apparently, cdh5 now uses a newer version of the jets3t library (0.9.0) instead of the version that ships with spark/shark (0.7.1).

    Consequently, installing shark on cdh5 results in a class not found error. (See https://issues.apache.org/jira/browse/SPARK-1556).

    But when I try either a) to replace the jets3t jars in the binary releases with 0.9.0, or b) when I try to use a version of spark and/or shark that I've compiled myself with 0.9.0 I get a verify error. (See https://github.com/apache/spark/pull/468#issuecomment-42027309)

    Any ideas on how to fix this and work around this dilemma?

    Thanks!

    Replies
    1. For the record, the following solved the issue for me:

      cd /usr/lib/shark/lib
      ln -s /usr/lib/hadoop/lib/jets3t-0.9.0.jar

  5. I attempted to follow the instructions here, but am stuck on this error. Any ideas?

    [root@cdh-head spark]# ./shark-0.9.1/bin/shark-withinfo
    -hiveconf hive.root.logger=INFO,console
    Starting the Shark Command Line Client
    Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V
    at org.apache.hadoop.mapreduce.util.ConfigUtil.addDeprecatedKeys(ConfigUtil.java:54)
    at org.apache.hadoop.mapreduce.util.ConfigUtil.loadResources(ConfigUtil.java:42)
at org.apache.hadoop.mapred.JobConf.&lt;init&gt;(JobConf.java:118)
    at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:1077)
at org.apache.hadoop.hive.conf.HiveConf.&lt;init&gt;(HiveConf.java:1039)
    at org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:74)
    at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:58)
    at shark.SharkCliDriver$.main(SharkCliDriver.scala:94)
    at shark.SharkCliDriver.main(SharkCliDriver.scala)

    Replies
    1. Same exception here. Hadoop version 2.3.0-cdh5.0.1

    2. Same here, did anyone ever solve this?

    3. Fails for me too... I guess no one solved it yet? Thanks!

    4. I had a similar class problem that I managed to solve by not skipping Step 3 and actually recompiling shark against the cloudera that I had installed (5.0.2).

    5. to solve this remove 2 obsolete files:

      mv /root/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-core/hadoop-core-1.0.4.jar{,.backup}
      mv /root/shark-0.9.1/lib_managed/jars/org.apache.hadoop/hadoop-test/hadoop-test-0.20.2.jar{,.backup}

    6. First we want to say thanks for the link seems to be working well the steps provided however we have version compatible issue any help appreciated.
      We have hadoop CDH 5.0.0 and and followed above procedure and we were not able to select * from table; as we saw the jars were used /var/lib/spark/shark-0.9.1/lib_managed/jars/edu.berkeley.cs.shark/ directory we have replaced with hive 0.12 jars now we are getting the below errors when we run $SPARK_USER_HOME/shark-0.9.1/bin/shark-withinfo -skipRddReload
      Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
      at java.lang.Class.forName0(Native Method)
      at java.lang.Class.forName(Class.java:190)
      at org.apache.hadoop.hive.shims.ShimLoader.createShim(ShimLoader.java:120)
      at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:115)
      at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:80)
      at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51)
      at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
      at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365)
      at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:285)
      at shark.SharkCliDriver$.main(SharkCliDriver.scala:128)
      at shark.SharkCliDriver.main(SharkCliDriver.scala)
      Thanks in advance.

    7. Adding to above one initially when we followed the original steps we were able to start the shark service and able to see the tables from hive metastore but not able to do anything further then we replaced with hive 0.12 jars then the service is not starting we are getting the above error. Thanks once again.

    8. oh, are you still using shark? maybe you can try the new release of spark 1.1, and its hive thrift server, it should be a good replacement for shark

  6. Hey guys, I had another go at this tonight. Seems you need to reference the Hadoop jars in the lib folder too. I did it with:


    cd $SPARK_USER_HOME/shark-0.9.1/lib/
    for a in `ls /opt/cloudera/parcels/CDH/lib/hadoop/hadoop*jar`; do ln -s $a `echo $a | cut -d"/" -f8`; done

    but you get the idea. I also did all the parquet jars with the below but i don't think you need to:

    for a in `ls /opt/cloudera/parcels/CDH/lib/parquet/parquet*jar`; do ln -s $a `echo $a | cut -d"/" -f8`; done


© Chutium / Teng Qiu @ ABC Netz Group