Posted to user@hbase.apache.org by Krishna Kalyan <kr...@gmail.com> on 2014/09/27 05:32:25 UTC

Pig HBase integration

Hi,
We have a use case that involves ETL, using Pig, on data coming from several
different sources.
We plan to store the final output table in HBase.
What would be the performance impact of doing a join with an external CSV
table using Pig?
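
For reference, writing a relation into HBase from Pig (through the normal
write path, as opposed to a bulk load) might look roughly like this; the
table name, column family, and schema below are hypothetical:

```pig
-- 'result' is assumed to hold (rowkey, col1, col2): HBaseStorage uses the
-- first field as the HBase rowkey and maps the remaining fields onto the
-- listed columns, in order.
STORE result INTO 'hbase://output_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');
```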

Regards,
Krishna

Re: Pig HBase integration

Posted by Krishna Kalyan <kr...@gmail.com>.
Thank you so much Serega.

Regards,
Krishna

On Sun, Sep 28, 2014 at 11:01 PM, Serega Sheypak <se...@gmail.com>
wrote:


Re: Pig HBase integration

Posted by Serega Sheypak <se...@gmail.com>.
https://pig.apache.org/docs/r0.11.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
I'm not sure exactly how Pig's HBaseStorage works. I suppose it reads all
the data and then joins it as a regular dataset, so you should expect
serious HBase performance degradation during the read: a key-by-key read of
the whole table.
1. So do the join in Pig.
2. You first load the data from the HBase table and then operate on it; I
don't see a case where you can use an HBase table directly in a join.
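
A rough sketch of what that load-then-join looks like in Pig Latin (the
table, column-family, and path names here are made up for illustration):

```pig
-- Load the location table from HBase; '-loadKey true' exposes the rowkey
-- (locationId) as the first field of each tuple.
location = LOAD 'hbase://location'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:city info:country', '-loadKey true')
    AS (locationId:chararray, city:chararray, country:chararray);

-- Load the weblog data set from HDFS.
weblog = LOAD '/data/weblog' USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);

-- HBaseStorage scans the whole table for the LOAD; the join then runs as
-- an ordinary MapReduce join, not as per-key GETs against HBase.
joined = JOIN weblog BY locationId, location BY locationId;
```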


2014-09-28 17:02 GMT+04:00 Krishna Kalyan <kr...@gmail.com>:


Re: Pig HBase integration

Posted by Krishna Kalyan <kr...@gmail.com>.
We actually have 2 data sets in HDFS: location (3-5 GB, approx 10 columns
per record) and weblog (2-3 TB, approx 50 columns per record). We need to
join the data sets on locationId, which is present in both.

We have 2 options:
1. Keep both data sets in HDFS only and JOIN them on locationId, maybe
using Pig.
2. Since the JOIN will be on locationId, which is the primary key of the
location data set, store the location data set in HBase with locationId as
the rowkey, and then use a Pig query to join the weblog data set with the
location table (using HBaseStorage).

The reason to consider this idea is that reading data by key is fast in
HBase; however, we are not sure whether, when joining the 2 data sets, Pig
will internally fetch individual location records by key or will read
through the entire (or most of the) location table and then do the join.
Based on this we can make our choice.

We are free to use HDFS or HBase for any input or output data set; please
advise which option will give us better performance. Also, if possible,
please point us to a good article on this.
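
For option 1, a small data set joined against a much larger one is often
handled in Pig with a fragment-replicate join; a sketch, with hypothetical
paths and schemas:

```pig
location = LOAD '/data/location' USING PigStorage(',')
    AS (locationId:chararray, city:chararray);
weblog   = LOAD '/data/weblog'  USING PigStorage('\t')
    AS (locationId:chararray, url:chararray, ts:long);

-- 'replicated' loads the relation listed last into memory on every map
-- task, avoiding the reduce phase entirely. The replicated relation must
-- fit in task memory, so whether 3-5 GB of location data qualifies depends
-- on the heap available to each task.
joined = JOIN weblog BY locationId, location BY locationId USING 'replicated';
```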


On Sun, Sep 28, 2014 at 5:51 PM, Serega Sheypak <se...@gmail.com>
wrote:


Re: Pig HBase integration

Posted by Serega Sheypak <se...@gmail.com>.
Store location to HDFS,
store weblog to HDFS,
join them,
and use the HBase bulk load tool to load the join result into HBase.

What's the reason to keep the location data set in HBase and the weblogs in HDFS?

You can expect a data-load performance improvement. For me it takes a few
minutes to bulk load 500,000,000 records into a 10-node HBase cluster with a
pre-split table.
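
The Pig half of that pipeline is just a join plus a STORE to HDFS; the bulk
load itself then happens outside Pig (paths and schemas below are
hypothetical):

```pig
location = LOAD '/data/location' USING PigStorage(',')
    AS (locationId:chararray, city:chararray);
weblog = LOAD '/data/weblog' USING PigStorage('\t')
    AS (locationId:chararray, url:chararray);
joined = JOIN weblog BY locationId, location BY locationId;

-- Write the join result to HDFS; a separate HBase bulk load step (e.g.
-- ImportTsv with -Dimporttsv.bulk.output to produce HFiles, then
-- LoadIncrementalHFiles / completebulkload) moves the data into the
-- pre-split table without going through the HBase write path.
STORE joined INTO '/data/joined' USING PigStorage('\t');
```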

2014-09-28 16:04 GMT+04:00 Krishna Kalyan <kr...@gmail.com>:


Re: Pig HBase integration

Posted by Krishna Kalyan <kr...@gmail.com>.
Thanks Serega,

Our use case details:
We have a location table which will be stored in HBase with locationID as
the rowkey / join key.
We intend to join this table with a transactional weblog file in HDFS
(expected size around 2 TB).
The join query will be issued from Pig.
Can we expect a performance improvement compared with a plain MapReduce
approach?

Regards,
Krishna

On Sat, Sep 27, 2014 at 9:13 PM, Serega Sheypak <se...@gmail.com>
wrote:


Re: Pig HBase integration

Posted by Serega Sheypak <se...@gmail.com>.
It depends on the data set sizes and the HBase workload. The best way is to
do the join in Pig, store the result, and then use the HBase bulk load tool.
That's a general recommendation; I have no idea about your task details.

2014-09-27 7:32 GMT+04:00 Krishna Kalyan <kr...@gmail.com>:
