Tuesday, November 19, 2013

Integration of Hive and HBase: Hive MAP column mapped to an entire column family with binary value (byte value)

I plan to write up some of the problems I have run into at work and post them here over time.

Let's start with the nastiest one.

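In earlier work we ran into the following situation: we needed to map an HBase column family into a Hive table, but the values stored in this column family were all byte values, and the column qualifiers were unpredictable, timestamp-like numbers. They could not be listed explicitly in the column mapping, so the entire column family had to be mapped into Hive.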

In other words:
Hive MAP column mapped to an entire column family with binary value

Solution
Use the following mapping type specification:
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:#s:b")
In cf:#s:b, cf is the column family name and #s:b is the type specification for the entire HBase column family. The column names (column qualifiers) become the keys of the Hive MAP and are stored as strings (s); the column values become the Hive MAP values and are stored as binary (b, byte values).

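I searched all kinds of documentation and found nothing that covers this case. The official documentation
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-Examplewithbinarycolumns
only discusses the case where one specific column is a binary column, and hbase.table.default.storage.type does not apply when an entire column family has to be mapped.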

Following the format the documentation gives for mapping one specific column,
column-family-name:[column-name][#(binary|string)]
we can write the mapping of an entire column family with binary values in the following form:
column-family-name:[#(binary|string):(binary|string)]
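To make the pattern concrete, here is a minimal Python sketch that splits an entry such as cf:#s:b into its three parts. It is not Hive's own parser. The function names are hypothetical, and it assumes that any prefix of binary or string is accepted, as the official documentation describes for single columns.

```python
# Illustrative only: NOT Hive's parser, just a sketch of the pattern above.
STORAGE_TYPES = ("binary", "string")


def expand_storage(spec):
    """Expand a storage-type spec such as 'b' or 'str' to its full name.

    Assumption: any non-empty prefix of 'binary' or 'string' is accepted.
    """
    matches = [t for t in STORAGE_TYPES if spec and t.startswith(spec)]
    if len(matches) != 1:
        raise ValueError(f"unknown storage type: {spec!r}")
    return matches[0]


def parse_family_mapping(entry):
    """Split 'cf:#s:b' into (family, key storage, value storage)."""
    family, sep, types = entry.partition(":#")
    if not sep or not family:
        raise ValueError(f"not a whole-family mapping: {entry!r}")
    key_spec, sep, value_spec = types.partition(":")
    if not sep:
        raise ValueError(f"expected key and value types in: {entry!r}")
    return family, expand_storage(key_spec), expand_storage(value_spec)


print(parse_family_mapping("cf:#s:b"))
print(parse_family_mapping("events:#string:binary"))
```

Both calls describe the same kind of mapping: qualifiers read as strings, cell values read as binary.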

An example

Customer journey analytics is a good HBase + Hive use case in big data analytics. Assume all customer events are tracked in one HBase table. The key is the customer id, a long value. To keep the key at a fixed length in HBase, long values are stored as byte values via Bytes.toBytes(). One column family holds all events of a customer, such as view, click and buy. The column qualifier is the timestamp of the event, and the value is the event id, which is also a long value stored as bytes.
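As a rough illustration of why the byte encoding keeps keys at a fixed length, the sketch below packs long values into 8 big-endian bytes, which is the layout HBase's Bytes.toBytes(long) produces. It uses Python's struct module as a stand-in for the HBase client, and long_to_bytes is a hypothetical helper name.

```python
import struct


def long_to_bytes(value):
    """Encode a signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", value)


# The decimal string grows with the number, the packed form is always 8 bytes.
for customer_id in (1, 16, 1354824339000):
    encoded = long_to_bytes(customer_id)
    print(customer_id, len(str(customer_id)), len(encoded), encoded)

# For non-negative ids, sorting the encoded keys byte-wise gives numeric order.
ids = [16, 1, 300, 2]
decoded_in_byte_order = [
    struct.unpack(">q", raw)[0] for raw in sorted(long_to_bytes(i) for i in ids)
]
print(decoded_in_byte_order == sorted(ids))
```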

So the table is:
create 'customer_journey', 'events'
Entries look like:
hbase(main):011:0> scan 'customer_journey'
ROW                               COLUMN+CELL
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824339000, timestamp=1354824339000, value=\x00\x00\x00\x00\x00\x00 \x08
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824340000, timestamp=1354824340000, value=\x00\x00\x00\x00\x00\x00'\x9E
 \x00\x00\x00\x00\x00\x00\x00\x01 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\x00\x0F
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824350000, timestamp=1354824350000, value=\x00\x00\x00\x00\x00\x00\xF0\x08
 \x00\x00\x00\x00\x00\x00\x00\x10 column=events:1354824359000, timestamp=1354824359000, value=\x00\x00\x00\x00\x00\x00\x10\xD8
The Hive external table should be created like this:
hive> CREATE EXTERNAL TABLE customer_journey (customer_id bigint, events map<string, bigint>)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,events:#s:b")
    > TBLPROPERTIES ("hbase.table.name" = "customer_journey");
Then you will get:
hive> select * from customer_journey;
OK
1 {"1354824339000":8200,"1354824340000":10142,"1354824350000":15}
16 {"1354824350000":61448,"1354824359000":4312}
Time taken: ...
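The numbers in the Hive output can be checked by hand against the scan output. The following sketch is plain Python, not Hive or the HBase client. It decodes the row keys and cell values copied from the scan as 8-byte big-endian longs and groups them per customer. Note that the HBase shell prints printable bytes as characters, so the space and the single quote in the values are the bytes 0x20 and 0x27.

```python
import struct

# (row key, column qualifier, cell value) copied from the scan output above.
cells = [
    (b"\x00\x00\x00\x00\x00\x00\x00\x01", "1354824339000", b"\x00\x00\x00\x00\x00\x00 \x08"),
    (b"\x00\x00\x00\x00\x00\x00\x00\x01", "1354824340000", b"\x00\x00\x00\x00\x00\x00'\x9E"),
    (b"\x00\x00\x00\x00\x00\x00\x00\x01", "1354824350000", b"\x00\x00\x00\x00\x00\x00\x00\x0F"),
    (b"\x00\x00\x00\x00\x00\x00\x00\x10", "1354824350000", b"\x00\x00\x00\x00\x00\x00\xF0\x08"),
    (b"\x00\x00\x00\x00\x00\x00\x00\x10", "1354824359000", b"\x00\x00\x00\x00\x00\x00\x10\xD8"),
]


def bytes_to_long(raw):
    """Decode 8 big-endian bytes into a signed 64-bit integer."""
    return struct.unpack(">q", raw)[0]


# Group cells by decoded row key: qualifier stays a string, value becomes a long.
rows = {}
for key, qualifier, value in cells:
    rows.setdefault(bytes_to_long(key), {})[qualifier] = bytes_to_long(value)

for customer_id, events in rows.items():
    print(customer_id, events)
```

The resulting dictionary has the same shape as the map<string, bigint> column: string qualifiers as keys, decoded longs as values.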


© Chutium / Teng Qiu @ ABC Netz Group