you should read the previous part about HBase dependencies and spark classpaths first: http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
and you'd better read this post for some background knowledge about combining HBase and Spark: http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
this post aims to provide some additional, more complicated real-world examples on top of the posts above.
first, you should put your hbase-site.xml into spark's conf folder; otherwise you have to specify the full (absolute) path of hbase-site.xml in your code:
ln -s /etc/hbase/conf/hbase-site.xml $SPARK_HOME/conf/
now, we use a very simple HBase table with string rowkeys and string values to warm up.
table contents:
hbase(main):001:0> scan 'tmp'
ROW      COLUMN+CELL
 abc     column=cf:test, timestamp=1401466636075, value=789
 abc     column=cf:val, timestamp=1401466435722, value=789
 bar     column=cf:val, timestamp=1396648974135, value=bb
 sku_2   column=cf:val, timestamp=1401464467396, value=999
 test    column=cf:val, timestamp=1396649021478, value=bb
 tmp     column=cf:val, timestamp=1401466616160, value=test
in the post from vidyasource.com we can find how to get the values out of an HBase Result tuple, but not the keys.
the following code shows how to create an RDD of key-value pairs, RDD[(key, value)], from HBase Results:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.JavaConverters._

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tmp")

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(tuple => tuple._2)
  .map(result => (result.getRow, result.getColumn("cf".getBytes(), "val".getBytes())))
  .map(row => (
    row._1.map(_.toChar).mkString,
    row._2.asScala.reduceLeft { (a, b) =>
      if (a.getTimestamp > b.getTimestamp) a else b  // keep the newest version
    }.getValue
  ))
  .take(10)

you will get
Array[(String, Array[Byte])] = Array((abc,Array(55, 56, 57)), (bar,Array(98, 98)), (sku_2,Array(57, 57, 57)), (test,Array(98, 98)), (tmp,Array(116, 101, 115, 116)))
in scala, we can use map(_.toChar).mkString to convert an Array[Byte] to a String (because, as we said, in this warm-up example the HBase table has only string values)
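as an aside, the HBase client ships a small utility class for exactly this kind of conversion; a minimal sketch (org.apache.hadoop.hbase.util.Bytes is part of the standard HBase API):

import org.apache.hadoop.hbase.util.Bytes

// equivalent to "789".getBytes().map(_.toChar).mkString
val s: String = Bytes.toString("789".getBytes())  // "789"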
hBaseRDD.map(tuple => tuple._2)
  .map(result => (result.getRow, result.getColumn("cf".getBytes(), "val".getBytes())))
  .map(row => (
    row._1.map(_.toChar).mkString,
    row._2.asScala.reduceLeft { (a, b) =>
      if (a.getTimestamp > b.getTimestamp) a else b
    }.getValue.map(_.toChar).mkString
  ))
  .take(10)

then we get
Array[(String, String)] = Array((abc,789), (bar,bb), (sku_2,999), (test,bb), (tmp,test))
=======================================================================
after the warm-up, let us take a more complicated HBase table example:
this table stores the UUIDs/cookies (or whatever identifiers) of a user's different devices; you can imagine this table as part of some kind of platform for cross-device user tracking and/or analyzing user behavior across devices.
userid, the rowkey, is a string (such as some kind of hashed value)
the column family is lf (the device family)
the column qualifiers are the names or ids of devices (such as internal ids of User-Agent strings; in this example we use simple strings like app1, app2 for mobile apps and pc1, ios2 for browsers on different devices)
each cell value is a Long (an Array[Byte] of length 8)
it looks like this:
hbase(main):001:0> scan 'test1'
ROW      COLUMN+CELL
 user1   column=lf:app1, timestamp=1401645690042, value=\x00\x00\x00\x00\x00\x00\x00\x0F
 user1   column=lf:app2, timestamp=1401645690093, value=\x00\x00\x00\x00\x00\x00\x00\x10
 user2   column=lf:app1, timestamp=1401645690142, value=\x00\x00\x00\x00\x00\x00\x00\x11
 user2   column=lf:pc1, timestamp=1401645690170, value=\x00\x00\x00\x00\x00\x00\x00\x12
 user3   column=lf:ios2, timestamp=1401645690180, value=\x00\x00\x00\x00\x00\x00\x00\x02
to create such a table, you can run puts like this in the hbase shell:
put 'test1', 'user1', 'lf:app1', "\x00\x00\x00\x00\x00\x00\x00\x0F"
put 'test1', 'user1', 'lf:app2', "\x00\x00\x00\x00\x00\x00\x00\x10"
put 'test1', 'user2', 'lf:app1', "\x00\x00\x00\x00\x00\x00\x00\x11"
put 'test1', 'user2', 'lf:pc1', "\x00\x00\x00\x00\x00\x00\x00\x12"
put 'test1', 'user3', 'lf:ios2', "\x00\x00\x00\x00\x00\x00\x00\x02"
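by the way, those \x00...\x0F values are just the big-endian 8-byte encodings HBase uses for Longs; a minimal sketch to convince yourself (Bytes is again org.apache.hadoop.hbase.util.Bytes):

import org.apache.hadoop.hbase.util.Bytes

val encoded: Array[Byte] = Bytes.toBytes(15L)
// Array(0, 0, 0, 0, 0, 0, 0, 15) -- the same bytes as \x00\x00\x00\x00\x00\x00\x00\x0F
val decoded: Long = Bytes.toLong(encoded)  // 15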
ok, then, how can we read/scan this table from spark?
let us see this code:
conf.set(TableInputFormat.INPUT_TABLE, "test1") var hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("lf".getBytes(), "app1".getBytes()))).map(row => if (row._2.size > 0) { ( row._1.map(_.toChar).mkString, row._2.asScala.reduceLeft { (a, b) => if (a.getTimestamp > b.getTimestamp) a else b }.getValue.map(_.toInt).mkString ) }).take(10)
why map(_.toInt) this time? because the bytes in this Array[Byte] are numbers, not chars.
but we get
Array((user1,000000015), (user2,000000017), ())

what? 000000015?... yes, because _.toInt converts each element of the Array[Byte] to an Int and mkString just concatenates the digits; to decode the 8 bytes as one number we can use java.nio.ByteBuffer
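to see the difference in isolation, here is a minimal spark-shell/REPL sketch with a hand-built byte array:

import java.nio.ByteBuffer

val bytes = Array[Byte](0, 0, 0, 0, 0, 0, 0, 15)

// converts each byte to an Int and concatenates the digits
bytes.map(_.toInt).mkString           // "000000015"

// interprets all 8 bytes as one big-endian Long
ByteBuffer.wrap(bytes).getLong        // 15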
this code should be changed to
import java.nio.ByteBuffer

hBaseRDD.map(tuple => tuple._2)
  .map(result => (result.getRow, result.getColumn("lf".getBytes(), "app1".getBytes())))
  .map(row => if (row._2.size > 0) {
    (
      row._1.map(_.toChar).mkString,
      ByteBuffer.wrap(row._2.asScala.reduceLeft { (a, b) =>
        if (a.getTimestamp > b.getTimestamp) a else b
      }.getValue).getLong
    )
  })
  .take(10)

then we get
Array((user1,15), (user2,17), ())

finally it looks better, but what is the last ()?!...
it is because rowkey user3 has no value in column lf:app1. so, again, we can do better! on the HBaseConfiguration object we can set TableInputFormat.SCAN_COLUMNS to restrict the scan to a particular column qualifier, so we change the code to the FINAL EDITION...
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import java.nio.ByteBuffer

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "test1")
conf.set(TableInputFormat.SCAN_COLUMNS, "lf:app1")

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(tuple => tuple._2)
  .map(result => (
    result.getRow.map(_.toChar).mkString,
    ByteBuffer.wrap(result.value).getLong
  ))
  .take(10)
and now, finally we get:
Array[(String, Long)] = Array((user1,15), (user2,17))
=======================================================================
FINAL FULL EDITION
now, if you want to get all key-value pairs of an HBase table (all versions of values from all column qualifiers),
you can try this code (for the string-valued table "tmp"):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.JavaConverters._
import java.nio.ByteBuffer

type HBaseRow = java.util.NavigableMap[Array[Byte],
  java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]

type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]

// unwrap the nested java NavigableMaps of Result.getMap into scala Maps
def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
  navMap.asScala.toMap.map(cf =>
    (cf._1, cf._2.asScala.toMap.map(col =>
      (col._1, col._2.asScala.toMap.map(elem =>
        (elem._1.toLong, elem._2))))))

type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, String]]]

// render the byte-array column families, qualifiers and values as strings
def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (cf._1.map(_.toChar).mkString, cf._2.map(col =>
      (col._1.map(_.toChar).mkString, col._2.map(elem =>
        (elem._1, elem._2.map(_.toChar).mkString))))))

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tmp")
// note: by default the scan returns only the newest version of each cell;
// to really get all versions you may also need to set
// TableInputFormat.SCAN_MAXVERSIONS on this conf

val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

hBaseRDD.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
  .map(kv => (kv._1.map(_.toChar).mkString, rowToStrMap(kv._2)))
  .take(10)
for the long-valued column family "lf" in table "test1", you can redefine CFTimeseriesRowStr and rowToStrMap as follows:
type CFTimeseriesRowStr = Map[String, Map[String, Map[Long, Long]]]

def rowToStrMap(navMap: CFTimeseriesRow): CFTimeseriesRowStr =
  navMap.map(cf =>
    (cf._1.map(_.toChar).mkString, cf._2.map(col =>
      (col._1.map(_.toChar).mkString, col._2.map(elem =>
        (elem._1, ByteBuffer.wrap(elem._2).getLong))))))
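then, reusing navMapToMap and the rest of the previous example unchanged, you can point it at "test1" (a minimal sketch, assuming the same spark-shell session as above):

conf.set(TableInputFormat.INPUT_TABLE, "test1")

val hBaseRDD2 = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

// same pipeline as before, now decoding the values as Longs
// via the redefined rowToStrMap
hBaseRDD2.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
  .map(kv => (kv._1.map(_.toChar).mkString, rowToStrMap(kv._2)))
  .take(10)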
=======================================================================
beyond all of this code, there are more details you should think about when querying an HBase table, such as scan caching, whether to enable the block cache, and whether to use bloom filters.
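for example, scan caching and block-cache usage can be set on the same conf before creating the RDD; a minimal sketch (the values here are just examples to tune for your cluster, not recommendations):

import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// rows fetched per RPC: larger means fewer round trips but more memory per scanner
conf.set(TableInputFormat.SCAN_CACHEDROWS, "100")

// disable the block cache for full-table scans so they don't evict
// hot data that serves online reads
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false")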
and most important: spark still uses org.apache.hadoop.hbase.mapreduce.TableInputFormat to read from HBase, the same as a MapReduce program or a hive HBase table mapping. so there is a big problem: your job will fail when one of the regions of the target HBase table is splitting, because the original region goes offline during the split!
so if your HBase regions must stay splittable, you should be careful about using spark or hive to read from an HBase table; maybe you should write a coprocessor instead of using the hbase.mapreduce API.
if not, you should disable automatic region splitting. the key HBase config properties here are the region split policy and the maximum region size.
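a minimal hbase-site.xml sketch of that idea (assumption: ConstantSizeRegionSplitPolicy with a huge hbase.hregion.max.filesize effectively stops automatic splits; check the defaults for your HBase version, and pre-split your tables manually instead):

<!-- hbase-site.xml -->
<property>
  <name>hbase.regionserver.region.split.policy</name>
  <value>org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy</value>
</property>
<property>
  <!-- 100 GB, an example value large enough that regions never reach it -->
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value>
</property>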