Posted to user@hbase.apache.org by Chetan Khatri <ch...@gmail.com> on 2016/12/21 10:28:59 UTC

Approach: Incremental data load from HBASE

Hello Guys,

I would like to understand the different approaches for distributed incremental
load from HBase. Is there any *tool / incubator tool* which satisfies this
requirement?

*Approach 1:*

Write a Kafka producer, manually maintain a flag column for events, and
ingest them with LinkedIn Gobblin to HDFS / S3.

*Approach 2:*

Run a scheduled Spark job - read from HBase, do the transformations, and
maintain the flag column at the HBase level.

In both of the above approaches, I need to maintain column-level flags, such
as 0 - default, 1 - sent, 2 - sent and acknowledged. So next time the producer
will take another batch of 1000 rows where the flag is 0 or 1.
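
For illustration, a rough sketch of what such a flag-driven batch read could
look like with the plain HBase client API (the table name "events", column
family "cf", qualifier "flag", batch size, and string-encoded flag values are
only placeholder assumptions, not an actual implementation):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, PageFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object FlagBatchReader {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events"))

    // Only rows whose flag column is still "0" (not yet sent); PageFilter
    // limits the batch to roughly 1000 rows per region server.
    val notSent = new SingleColumnValueFilter(
      Bytes.toBytes("cf"), Bytes.toBytes("flag"),
      CompareFilter.CompareOp.EQUAL, Bytes.toBytes("0"))
    val scan = new Scan()
    scan.setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
      notSent, new PageFilter(1000)))
    scan.setCaching(500)

    val scanner = table.getScanner(scan)
    try {
      for (result <- scanner.asScala) {
        // ... publish the row to Kafka / Gobblin here ...
        // then mark it as sent so the next batch skips it
        val markSent = new Put(result.getRow)
        markSent.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("flag"), Bytes.toBytes("1"))
        table.put(markSent)
      }
    } finally {
      scanner.close()
      table.close()
      connection.close()
    }
  }
}

Writing the flag back one Put at a time is simple but chatty; in practice the
Puts would usually be buffered and written in batches.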

I am looking for a best-practice approach with any distributed tool.

Thanks.

- Chetan Khatri

Re: Approach: Incremental data load from HBASE

Posted by Jerry He <je...@gmail.com>.
There is no magic in the Sqoop incremental import. You need a key column or
a timestamp column to let Sqoop know where to start each incremental run.

HBase has built-in timestamps. Please look at the MR tool bundled with HBase,
Export: https://hbase.apache.org/book.html#tools
There are options that let you specify a starttime and an endtime.
You can also write your own MR or Spark job to do an incremental export of
HBase data by providing timestamps to the Scan or providing a filter.
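
For example, the bundled Export job accepts optional starttime and endtime
arguments, roughly: hbase org.apache.hadoop.hbase.mapreduce.Export <tablename>
<outputdir> [<versions> [<starttime> [<endtime>]]]. And here is a minimal
sketch of the time-bounded Scan a custom job could use (the table name
"events" and the window handling are just placeholder assumptions):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object IncrementalTimeRangeScan {
  def main(args: Array[String]): Unit = {
    // Cell-timestamp window for this run; a scheduled job would persist the
    // endTime of the last successful run and reuse it as the next startTime.
    val startTime = args(0).toLong  // e.g. epoch millis of the previous run
    val endTime   = args(1).toLong  // e.g. epoch millis of now

    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events"))

    val scan = new Scan()
    scan.setTimeRange(startTime, endTime)  // only cells written inside the window
    scan.setCaching(500)

    val scanner = table.getScanner(scan)
    try {
      for (result <- scanner.asScala) {
        // ... write the row out to HDFS / S3 in whatever format is needed ...
        println(Bytes.toString(result.getRow))
      }
    } finally {
      scanner.close()
      table.close()
      connection.close()
    }
  }
}

In an actual Spark or MR export the same time-bounded Scan would typically be
fed to TableInputFormat so the job runs in parallel across regions rather than
through a single client scanner.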

Jerry

On Sat, Dec 24, 2016 at 7:24 PM, Chetan Khatri <ch...@gmail.com>
wrote:

> Hello HBase Community,
>
> What is the suggested approach for incremental import from HBase to HDFS? For
> RDBMS to HDFS, Sqoop provides support with the below script:
>
> sqoop job --create myssb1 -- import --connect
> jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin
> --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address,
> district, city_id, postal_code, alast_update, cityid, city, country_id,
> clast_update FROM(SELECT a.address_id as address_id, a.address as address,
> a.district as district, a.city_id as city_id, a.postal_code as postal_code,
> a.last_update as alast_update, c.city_id as cityid, c.city as city,
> c.country_id as country_id, c.last_update as clast_update FROM
> sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub
> WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update
> --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import
> --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java
> address=String
>
>
> Thanks.
>
> On Wed, Dec 21, 2016 at 3:58 PM, Chetan Khatri <
> chetan.opensource@gmail.com>
> wrote:
>
> > Hello Guys,
> >
> > I would like to understand the different approaches for distributed incremental
> > load from HBase. Is there any *tool / incubator tool* which satisfies this
> > requirement?
> >
> > *Approach 1:*
> >
> > Write a Kafka producer, manually maintain a flag column for events, and
> > ingest them with LinkedIn Gobblin to HDFS / S3.
> >
> > *Approach 2:*
> >
> > Run a scheduled Spark job - read from HBase, do the transformations, and
> > maintain the flag column at the HBase level.
> >
> > In both of the above approaches, I need to maintain column-level flags, such
> > as 0 - default, 1 - sent, 2 - sent and acknowledged. So next time the producer
> > will take another batch of 1000 rows where the flag is 0 or 1.
> >
> > I am looking for a best-practice approach with any distributed tool.
> >
> > Thanks.
> >
> > - Chetan Khatri
> >
>

Re: Approach: Incremental data load from HBASE

Posted by Chetan Khatri <ch...@gmail.com>.
Hello HBase Community,

What is the suggested approach for incremental import from HBase to HDFS? For
RDBMS to HDFS, Sqoop provides support with the below script:

sqoop job --create myssb1 -- import --connect
jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin
--driver=com.mysql.jdbc.Driver --query "SELECT address_id, address,
district, city_id, postal_code, alast_update, cityid, city, country_id,
clast_update FROM(SELECT a.address_id as address_id, a.address as address,
a.district as district, a.city_id as city_id, a.postal_code as postal_code,
a.last_update as alast_update, c.city_id as cityid, c.city as city,
c.country_id as country_id, c.last_update as clast_update FROM
sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub
WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update
--last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import
--hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java
address=String


Thanks.

On Wed, Dec 21, 2016 at 3:58 PM, Chetan Khatri <ch...@gmail.com>
wrote:

> Hello Guys,
>
> I would like to understand the different approaches for distributed incremental
> load from HBase. Is there any *tool / incubator tool* which satisfies this
> requirement?
>
> *Approach 1:*
>
> Write a Kafka producer, manually maintain a flag column for events, and
> ingest them with LinkedIn Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run a scheduled Spark job - read from HBase, do the transformations, and
> maintain the flag column at the HBase level.
>
> In both of the above approaches, I need to maintain column-level flags, such
> as 0 - default, 1 - sent, 2 - sent and acknowledged. So next time the producer
> will take another batch of 1000 rows where the flag is 0 or 1.
>
> I am looking for a best-practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
>
