Posted to user@hbase.apache.org by Demai Ni <ni...@gmail.com> on 2016/10/21 20:48:08 UTC

ETL HBase HFile+HLog to ORC(or Parquet) file?

hi,

I am wondering whether there are existing methods to ETL HBase data into
ORC (or other open source columnar) files?

I understand that in Hive an "INSERT INTO Hive_ORC_Table SELECT * FROM
HIVE_HBase_Table" can probably get the job done. Is this the common way to
do it? Is the performance acceptable, and can it handle delta updates when
the HBase table changes?

I did a bit of googling and found this:
https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html

which goes the other way around (loading HBase from Hive ORC tables).

Would it perform better (compared to the Hive statement above) to use either
the replication logic or snapshot backups to generate ORC files from HBase
tables, with the ability to update incrementally?
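
For the incremental piece, something along these lines is what I have in
mind, a rough sketch only (the snapshot name, restore dir, and
high-water-mark handling below are all made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.TableSnapshotScanner;

public class SnapshotDeltaScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    long lastRunTs = Long.parseLong(args[0]); // high-water mark saved by the previous run
    Scan scan = new Scan();
    scan.setTimeRange(lastRunTs, System.currentTimeMillis()); // only cells written since then
    // Scans the HFiles of a previously taken snapshot directly, so it puts
    // no load on the region servers; snapshot name and restore dir are made up.
    try (TableSnapshotScanner scanner = new TableSnapshotScanner(
        conf, new Path("/tmp/snapshot-restore"), "marketdata-snapshot", scan)) {
      for (Result r : scanner) {
        // feed each row into the ORC writer (see the fuller sketch below)
      }
    }
  }
}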

I hope to have as few dependencies as possible. In the example of ORC, that
would mean depending only on Apache ORC's API, not on Hive.
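
Ideally the Hive-free path itself would look roughly like the sketch below,
depending only on hbase-client and orc-core (which brings in the storage-api
for VectorizedRowBatch); the table name, column family, and output path are
hypothetical, and null checks are elided:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class HBaseToOrc {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical schema: row key plus three columns from one family.
    TypeDescription schema = TypeDescription.fromString(
        "struct<key:string,ticker:string,timecreated:string,price:string>");
    byte[] cf = Bytes.toBytes("price_info");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("marketDataHbase"));
         ResultScanner scanner = table.getScanner(new Scan())) {
      Writer writer = OrcFile.createWriter(new Path("/tmp/marketdata.orc"),
          OrcFile.writerOptions(conf).setSchema(schema));
      VectorizedRowBatch batch = schema.createRowBatch();
      for (Result r : scanner) {
        int row = batch.size++;
        ((BytesColumnVector) batch.cols[0]).setVal(row, r.getRow());
        ((BytesColumnVector) batch.cols[1]).setVal(row, r.getValue(cf, Bytes.toBytes("ticker")));
        ((BytesColumnVector) batch.cols[2]).setVal(row, r.getValue(cf, Bytes.toBytes("timecreated")));
        ((BytesColumnVector) batch.cols[3]).setVal(row, r.getValue(cf, Bytes.toBytes("price")));
        if (batch.size == batch.getMaxSize()) { // flush a full batch to the file
          writer.addRowBatch(batch);
          batch.reset();
        }
      }
      if (batch.size > 0) writer.addRowBatch(batch); // flush the tail
      writer.close();
    }
  }
}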

Demai

Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

Posted by Demai Ni <ni...@gmail.com>.
Jerry and Mich,

thanks. I will look into this a bit more. It is probably an interesting and
useful feature to have.

Demai


Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

Posted by Jerry He <je...@gmail.com>.
Hi, Demai

If you think something helpful can be done within HBase, feel free to
propose on the JIRA.

Jerry


Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Demai,

As I understand it, you want to use HBase as the real-time layer and the
Hive data warehouse as the batch layer for analytics.

In other words, ingest data in real time from the source into HBase and push
that data into Hive on a recurring basis.

If you partition your target ORC table by DtStamp and INSERT OVERWRITE into
this table using Spark as the execution engine for Hive (as opposed to
map-reduce), it should be pretty fast.
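
As an illustration only, the recurring load could be driven through
HiveServer2 JDBC like this (host, credentials, and the engine setting are
assumptions; the table names follow the DDL further down this thread):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.time.LocalDate;

public class RecurringOrcLoad {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection c = DriverManager.getConnection(
             "jdbc:hive2://hs2-host:10000/default", "etl", "");
         Statement s = c.createStatement()) {
      s.execute("set hive.execution.engine=spark"); // Hive on Spark instead of map-reduce
      String today = LocalDate.now().toString();    // becomes the DateStamp partition
      // Full refresh of today's partition; a delta variant could INSERT INTO
      // with a predicate such as: WHERE timecreated > '<last high-water mark>'
      s.execute("INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = '" + today + "') "
          + "SELECT key, ticker, timecreated, price FROM marketDataHbase");
    }
  }
}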

Hive is going to get an in-memory database in the next release or so, which
makes it a perfect choice.


HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com






Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

Posted by Demai Ni <ni...@gmail.com>.
Mich,

thanks for the detailed instructions.

While I am aware of the Hive method, I have a few questions/concerns:
1) the Hive method is an "INSERT ... SELECT", which usually does not perform
as well as a bulk load, though I am not familiar with the actual
implementation
2) I have another SQL-on-Hadoop engine working well with ORC files, so if
possible I'd like to avoid a system dependency on Hive (one fewer component
to maintain)
3) HBase already has well-established back-end processes for Replication
(HBASE-1295) and Backup (HBASE-7912), so I am wondering whether anything can
be piggy-backed on them to handle the day-to-day work (see the sketch below)
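
To make 3) concrete, I imagine something like a custom replication endpoint.
A very rough sketch (the class below is hypothetical; the ORC wiring and
most of the service lifecycle are elided):

import java.util.UUID;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;

// Turns replicated WAL entries into ORC rows instead of shipping them to a
// peer cluster. Only shows where the edits arrive.
public class OrcReplicationEndpoint extends BaseReplicationEndpoint {
  private final UUID id = UUID.randomUUID();

  @Override
  public UUID getPeerUUID() { return id; }

  @Override
  public boolean replicate(ReplicateContext context) {
    for (WAL.Entry entry : context.getEntries()) {
      for (Cell cell : entry.getEdit().getCells()) {
        // append CellUtil.cloneRow(cell) / CellUtil.cloneValue(cell)
        // to an ORC row batch here
      }
    }
    return true; // acknowledge so the WAL position advances
  }

  @Override
  protected void doStart() { notifyStarted(); }

  @Override
  protected void doStop() { notifyStopped(); }
}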

The goal is to have HBase as the OLTP front end (to receive data) and the
ORC files (with a SQL engine) as the OLAP end for reporting/analytics. The
ORC files would also serve as my backup in the DR case.

Demai



Re: ETL HBase HFile+HLog to ORC(or Parquet) file?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Create an external table in Hive on the HBase table. Pretty straightforward.

hive> create external table marketDataHbase (key STRING, ticker STRING,
timecreated STRING, price STRING)

    -- map the Hive columns onto the HBase row key and the price_info family
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" =
    ":key,price_info:ticker,price_info:timecreated,price_info:price")

    TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");



then create a normal table in hive as ORC


CREATE TABLE IF NOT EXISTS marketData (
     KEY string
   , TICKER string
   , TIMECREATED string
   , PRICE float
)
PARTITIONED BY (DateStamp string)
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="KEY",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;
--show create table marketData;
--Populate target table
INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
SELECT
      KEY
    , TICKER
    , TIMECREATED
    , PRICE
FROM marketDataHbase;


Run this job from cron as often as needed; ${TODAY} is a variable supplied
at run time (e.g. via the Hive CLI's --hivevar option).


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




