Posted to dev@spark.apache.org by Chetan Khatri <ch...@gmail.com> on 2017/01/04 11:37:45 UTC

Re: Approach: Incremental data load from HBASE

Ted Yu,

You misunderstood: I meant incremental load from HBase to Hive; in other
words, you could call it an incremental import from HBase.

On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yu...@gmail.com> wrote:

> Incremental load traditionally means generating hfiles and
> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or
> 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
> chetan.opensource@gmail.com> wrote:
>
>> Ok, Sure will ask.
>>
>> But what would be generic best practice solution for Incremental load
>> from HBASE.
>>
>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> I haven't used Gobblin.
>>> You can consider asking Gobblin mailing list of the first option.
>>>
>>> The second option would work.
>>>
>>>
>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>> chetan.opensource@gmail.com> wrote:
>>>
>>>> Hello Guys,
>>>>
>>>> I would like to understand different approach for Distributed
>>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>>> satisfy requirement ?
>>>>
>>>> *Approach 1:*
>>>>
>>>> Write Kafka Producer and maintain manually column flag for events and
>>>> ingest it with Linkedin Gobblin to HDFS / S3.
>>>>
>>>> *Approach 2:*
>>>>
>>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>>> maintain flag column at HBase Level.
>>>>
>>>> In above both approach, I need to maintain column level flags. such as
>>>> 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer will
>>>> take another 1000 rows of batch where flag is 0 or 1.
>>>>
>>>> I am looking for best practice approach with any distributed tool.
>>>>
>>>> Thanks.
>>>>
>>>> - Chetan Khatri
>>>>
>>>
>>>
>>
>

Re: Approach: Incremental data load from HBASE

Posted by Chetan Khatri <ch...@gmail.com>.
Ayan, thanks.
Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!


On Fri, Jan 6, 2017 at 3:23 PM, ayan guha <gu...@gmail.com> wrote:

> IMHO you should not "think" HBase in RDMBS terms, but you can use
> ColumnFilters to filter out new records
>
> On Fri, Jan 6, 2017 at 7:22 PM, Chetan Khatri <chetan.opensource@gmail.com
> > wrote:
>
>> Hi Ayan,
>>
>> I mean by Incremental load from HBase, weekly running batch jobs takes
>> rows from HBase table and dump it out to Hive. Now when next i run Job it
>> only takes newly arrived jobs.
>>
>> Same as if we use Sqoop for incremental load from RDBMS to Hive with
>> below command,
>>
>> sqoop job --create myssb1 -- import --connect
>> jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin
>> --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address,
>> district, city_id, postal_code, alast_update, cityid, city, country_id,
>> clast_update FROM(SELECT a.address_id as address_id, a.address as address,
>> a.district as district, a.city_id as city_id, a.postal_code as postal_code,
>> a.last_update as alast_update, c.city_id as cityid, c.city as city,
>> c.country_id as country_id, c.last_update as clast_update FROM
>> sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub
>> WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update
>> --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import
>> --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java
>> address=String
>>
>> Probably i am looking for any tool from HBase incubator family which does
>> the job for me, or other alternative approaches can be done through reading
>> Hbase tables in RDD and saving RDD to Hive.
>>
>> Thanks.
>>
>>
>> On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <gu...@gmail.com> wrote:
>>
>>> Hi Chetan
>>>
>>> What do you mean by incremental load from HBase? There is a timestamp
>>> marker for each cell, but not at Row level.
>>>
>>> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
>>> chetan.opensource@gmail.com> wrote:
>>>
>>>> Ted Yu,
>>>>
>>>> You understood wrong, i said Incremental load from HBase to Hive,
>>>> individually you can say Incremental Import from HBase.
>>>>
>>>> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> Incremental load traditionally means generating hfiles and
>>>>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>>>>> the data into hbase.
>>>>>
>>>>> For your use case, the producer needs to find rows where the flag is 0
>>>>> or 1.
>>>>> After such rows are obtained, it is up to you how the result of
>>>>> processing is delivered to hbase.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>>>>> chetan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Ok, Sure will ask.
>>>>>>
>>>>>> But what would be generic best practice solution for Incremental load
>>>>>> from HBASE.
>>>>>>
>>>>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>>
>>>>>>> I haven't used Gobblin.
>>>>>>> You can consider asking Gobblin mailing list of the first option.
>>>>>>>
>>>>>>> The second option would work.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>>>>> chetan.opensource@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello Guys,
>>>>>>>>
>>>>>>>> I would like to understand different approach for Distributed
>>>>>>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>>>>>>> satisfy requirement ?
>>>>>>>>
>>>>>>>> *Approach 1:*
>>>>>>>>
>>>>>>>> Write Kafka Producer and maintain manually column flag for events
>>>>>>>> and ingest it with Linkedin Gobblin to HDFS / S3.
>>>>>>>>
>>>>>>>> *Approach 2:*
>>>>>>>>
>>>>>>>> Run Scheduled Spark Job - Read from HBase and do transformations
>>>>>>>> and maintain flag column at HBase Level.
>>>>>>>>
>>>>>>>> In above both approach, I need to maintain column level flags. such
>>>>>>>> as 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer
>>>>>>>> will take another 1000 rows of batch where flag is 0 or 1.
>>>>>>>>
>>>>>>>> I am looking for best practice approach with any distributed tool.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> - Chetan Khatri
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: Approach: Incremental data load from HBASE

Posted by ayan guha <gu...@gmail.com>.
IMHO you should not "think" of HBase in RDBMS terms, but you can use
ColumnFilters to filter out new records.
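
For example, a rough, untested sketch of a scan that returns only the
not-yet-exported rows with a SingleColumnValueFilter, using the HBase 1.x
client API; the table name "events", column family "cf", qualifier "flag"
and the string-encoded flag value are all placeholder assumptions, not
something from this thread:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("events"))   // placeholder table name

// Return only rows whose cf:flag cell equals "0", i.e. not yet exported.
val onlyNewRows = new SingleColumnValueFilter(
  Bytes.toBytes("cf"), Bytes.toBytes("flag"),
  CompareFilter.CompareOp.EQUAL, Bytes.toBytes("0"))
onlyNewRows.setFilterIfMissing(true)            // also skip rows that have no flag cell

val scan = new Scan()
scan.setFilter(onlyNewRows)

val scanner = table.getScanner(scan)
try {
  for (result <- scanner.asScala) {
    println(Bytes.toString(result.getRow))      // hand each matching row to the export step
  }
} finally {
  scanner.close()
  table.close()
  connection.close()
}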

On Fri, Jan 6, 2017 at 7:22 PM, Chetan Khatri <ch...@gmail.com>
wrote:

> Hi Ayan,
>
> I mean by Incremental load from HBase, weekly running batch jobs takes
> rows from HBase table and dump it out to Hive. Now when next i run Job it
> only takes newly arrived jobs.
>
> Same as if we use Sqoop for incremental load from RDBMS to Hive with below
> command,
>
> sqoop job --create myssb1 -- import --connect
> jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin
> --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address,
> district, city_id, postal_code, alast_update, cityid, city, country_id,
> clast_update FROM(SELECT a.address_id as address_id, a.address as address,
> a.district as district, a.city_id as city_id, a.postal_code as postal_code,
> a.last_update as alast_update, c.city_id as cityid, c.city as city,
> c.country_id as country_id, c.last_update as clast_update FROM
> sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub
> WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update
> --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import
> --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java
> address=String
>
> Probably i am looking for any tool from HBase incubator family which does
> the job for me, or other alternative approaches can be done through reading
> Hbase tables in RDD and saving RDD to Hive.
>
> Thanks.
>
>
> On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <gu...@gmail.com> wrote:
>
>> Hi Chetan
>>
>> What do you mean by incremental load from HBase? There is a timestamp
>> marker for each cell, but not at Row level.
>>
>> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
>> chetan.opensource@gmail.com> wrote:
>>
>>> Ted Yu,
>>>
>>> You understood wrong, i said Incremental load from HBase to Hive,
>>> individually you can say Incremental Import from HBase.
>>>
>>> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> Incremental load traditionally means generating hfiles and
>>>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>>>> the data into hbase.
>>>>
>>>> For your use case, the producer needs to find rows where the flag is 0
>>>> or 1.
>>>> After such rows are obtained, it is up to you how the result of
>>>> processing is delivered to hbase.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>>>> chetan.opensource@gmail.com> wrote:
>>>>
>>>>> Ok, Sure will ask.
>>>>>
>>>>> But what would be generic best practice solution for Incremental load
>>>>> from HBASE.
>>>>>
>>>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>
>>>>>> I haven't used Gobblin.
>>>>>> You can consider asking Gobblin mailing list of the first option.
>>>>>>
>>>>>> The second option would work.
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>>>> chetan.opensource@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Guys,
>>>>>>>
>>>>>>> I would like to understand different approach for Distributed
>>>>>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>>>>>> satisfy requirement ?
>>>>>>>
>>>>>>> *Approach 1:*
>>>>>>>
>>>>>>> Write Kafka Producer and maintain manually column flag for events
>>>>>>> and ingest it with Linkedin Gobblin to HDFS / S3.
>>>>>>>
>>>>>>> *Approach 2:*
>>>>>>>
>>>>>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>>>>>> maintain flag column at HBase Level.
>>>>>>>
>>>>>>> In above both approach, I need to maintain column level flags. such
>>>>>>> as 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer
>>>>>>> will take another 1000 rows of batch where flag is 0 or 1.
>>>>>>>
>>>>>>> I am looking for best practice approach with any distributed tool.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> - Chetan Khatri
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: Approach: Incremental data load from HBASE

Posted by Chetan Khatri <ch...@gmail.com>.
Hi Ayan,

By incremental load from HBase I mean a weekly batch job that takes rows
from an HBase table and dumps them into Hive. The next time the job runs, it
should pick up only the newly arrived rows.

It is the same idea as using Sqoop for an incremental load from an RDBMS to
Hive with a command like the one below:

sqoop job --create myssb1 -- import --connect
jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin
--driver=com.mysql.jdbc.Driver --query "SELECT address_id, address,
district, city_id, postal_code, alast_update, cityid, city, country_id,
clast_update FROM(SELECT a.address_id as address_id, a.address as address,
a.district as district, a.city_id as city_id, a.postal_code as postal_code,
a.last_update as alast_update, c.city_id as cityid, c.city as city,
c.country_id as country_id, c.last_update as clast_update FROM
sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub
WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update
--last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import
--hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java
address=String

Ideally I am looking for a tool from the HBase / incubator family that does
this job for me; the alternative is to do it through Spark, reading the HBase
table into an RDD and saving the RDD to Hive.
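
For example, a rough, untested sketch of that RDD route (Spark 2.x with Hive
support; the HBase table "events", column "cf:payload", target Hive table
"test.events_incremental" and the watermark handling are all placeholders,
not my actual setup):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-to-hive-incremental")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Watermark from the previous run (epoch millis); persist the new value
// somewhere durable (a small Hive table, a file on HDFS, ...) after each run.
val lastRunTimestamp: Long = 0L

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")                        // placeholder
hbaseConf.set(TableInputFormat.SCAN_TIMERANGE_START, lastRunTimestamp.toString)
hbaseConf.set(TableInputFormat.SCAN_TIMERANGE_END, System.currentTimeMillis.toString)

// Read only the cells written since the last run as an RDD of (rowkey, Result).
val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Project the row key and one example column, then append to a Hive table.
val rows = hbaseRdd.map { case (_, result) =>
  val payload = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload")))
    .map(Bytes.toString).getOrElse("")
  (Bytes.toString(result.getRow), payload)
}.toDF("row_key", "payload")

rows.write.mode("append").saveAsTable("test.events_incremental")             // placeholder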

Thanks.


On Thu, Jan 5, 2017 at 2:02 AM, ayan guha <gu...@gmail.com> wrote:

> Hi Chetan
>
> What do you mean by incremental load from HBase? There is a timestamp
> marker for each cell, but not at Row level.
>
> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
> chetan.opensource@gmail.com> wrote:
>
>> Ted Yu,
>>
>> You understood wrong, i said Incremental load from HBase to Hive,
>> individually you can say Incremental Import from HBase.
>>
>> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> Incremental load traditionally means generating hfiles and
>>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>>> the data into hbase.
>>>
>>> For your use case, the producer needs to find rows where the flag is 0
>>> or 1.
>>> After such rows are obtained, it is up to you how the result of
>>> processing is delivered to hbase.
>>>
>>> Cheers
>>>
>>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>>> chetan.opensource@gmail.com> wrote:
>>>
>>>> Ok, Sure will ask.
>>>>
>>>> But what would be generic best practice solution for Incremental load
>>>> from HBASE.
>>>>
>>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>
>>>>> I haven't used Gobblin.
>>>>> You can consider asking Gobblin mailing list of the first option.
>>>>>
>>>>> The second option would work.
>>>>>
>>>>>
>>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>>> chetan.opensource@gmail.com> wrote:
>>>>>
>>>>>> Hello Guys,
>>>>>>
>>>>>> I would like to understand different approach for Distributed
>>>>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>>>>> satisfy requirement ?
>>>>>>
>>>>>> *Approach 1:*
>>>>>>
>>>>>> Write Kafka Producer and maintain manually column flag for events and
>>>>>> ingest it with Linkedin Gobblin to HDFS / S3.
>>>>>>
>>>>>> *Approach 2:*
>>>>>>
>>>>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>>>>> maintain flag column at HBase Level.
>>>>>>
>>>>>> In above both approach, I need to maintain column level flags. such
>>>>>> as 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer
>>>>>> will take another 1000 rows of batch where flag is 0 or 1.
>>>>>>
>>>>>> I am looking for best practice approach with any distributed tool.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> - Chetan Khatri
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: Approach: Incremental data load from HBASE

Posted by ayan guha <gu...@gmail.com>.
Hi Chetan,

What do you mean by incremental load from HBase? There is a timestamp
marker on each cell, but not at the row level.
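
For what it is worth, the cell timestamps can still be used to narrow a scan;
a rough sketch with placeholder names, keeping in mind that the range applies
per cell, not per row:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val lastRunTimestamp = 0L   // watermark recorded by the previous batch (epoch millis)
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("events"))   // placeholder name

// setTimeRange filters individual cells, so a returned row contains only the
// cells written inside the range; untouched cells of that row are not included.
val scan = new Scan()
scan.setTimeRange(lastRunTimestamp, Long.MaxValue)

val scanner = table.getScanner(scan)
try {
  for (result <- scanner.asScala) {
    println(Bytes.toString(result.getRow))
  }
} finally {
  scanner.close(); table.close(); connection.close()
}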

On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <ch...@gmail.com>
wrote:

> Ted Yu,
>
> You understood wrong, i said Incremental load from HBase to Hive,
> individually you can say Incremental Import from HBase.
>
> On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Incremental load traditionally means generating hfiles and
>> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load
>> the data into hbase.
>>
>> For your use case, the producer needs to find rows where the flag is 0 or
>> 1.
>> After such rows are obtained, it is up to you how the result of
>> processing is delivered to hbase.
>>
>> Cheers
>>
>> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
>> chetan.opensource@gmail.com> wrote:
>>
>>> Ok, Sure will ask.
>>>
>>> But what would be generic best practice solution for Incremental load
>>> from HBASE.
>>>
>>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> I haven't used Gobblin.
>>>> You can consider asking Gobblin mailing list of the first option.
>>>>
>>>> The second option would work.
>>>>
>>>>
>>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>>> chetan.opensource@gmail.com> wrote:
>>>>
>>>>> Hello Guys,
>>>>>
>>>>> I would like to understand different approach for Distributed
>>>>> Incremental load from HBase, Is there any *tool / incubactor tool* which
>>>>> satisfy requirement ?
>>>>>
>>>>> *Approach 1:*
>>>>>
>>>>> Write Kafka Producer and maintain manually column flag for events and
>>>>> ingest it with Linkedin Gobblin to HDFS / S3.
>>>>>
>>>>> *Approach 2:*
>>>>>
>>>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>>>> maintain flag column at HBase Level.
>>>>>
>>>>> In above both approach, I need to maintain column level flags. such as
>>>>> 0 - by default, 1-sent,2-sent and acknowledged. So next time Producer will
>>>>> take another 1000 rows of batch where flag is 0 or 1.
>>>>>
>>>>> I am looking for best practice approach with any distributed tool.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> - Chetan Khatri
>>>>>
>>>>
>>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha
