
Posted to user@pig.apache.org by byambaa <by...@gmail.com> on 2011/08/22 12:42:36 UTC

What is implemented behind the PIG Joins

Hello,
I have a cluster of 11 nodes, each with 16 GB RAM, a 6-core CPU, and a
1 TB HDD, and I am using the Cloudera distribution CHD4b with Pig. I have two Pig
join queries, a parallel and a replicated version of the Pig join, as well as MapReduce reduce-side and map-side joins.

In theory the replicated join should be faster than the parallel join, but in
my case the parallel join is faster.
I have two questions:

1. Why is the replicated join so slow? How does it work, and what runs behind it?
2. The MR reduce-side join was faster than the parallel Pig join. What does Pig implement behind the parallel join? I assume Pig also implements an MR reduce-side join.

Could you explain how the Pig joins work and what runs behind the Pig scripts?


Dataset (size)                 Replicated Join in HDFS   Replicated Join in HBase   MR Reduce side join   MR Joins (Singleton pattern)
obr_wp_annotation (1786 MB)    29 sec                    50 sec                     36 sec                19
obr_ct_annotation (5916 MB)    799 sec                   523 sec                    108 sec               69
obr_pm_annotation (16983 MB)   1794 sec                  707 sec                    248 sec               138

The relation file is 659 MB.

Thank you very much,

Byambajargal

Re: What is implemented behind the PIG Joins

Posted by byambajav byambajargal <by...@gmail.com>.

Pig 0.8.1.

On Mon, Aug 22, 2011 at 10:58 PM, Thejas Nair <th...@hortonworks.com> wrote:

> Hi Byambajargal,
> What version of Pig does your distribution use?
> -Thejas

Re: What is implemented behind the PIG Joins

Posted by Thejas Nair <th...@hortonworks.com>.

Hi Byambajargal,
What version of Pig does your distribution use?
-Thejas


Re: What is implemented behind the PIG Joins

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Let's say you are joining tables A, B, and C (listed in that order). The
default join just does a regular Hadoop MR join: read in all relations, tag
each row with source relation, emit with the key being the join key, collect
on the reducers.
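
The default join described above can be sketched in a few lines of Python. This is only an illustration of the tag-and-shuffle idea, not Pig's actual implementation: the function names (map_phase, shuffle, reduce_phase, reduce_side_join) are made up for this example, the join key is assumed to be the first field of each row, and the shuffle is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import product


def map_phase(relations):
    """Tag each row with the index of its source relation and emit
    (join_key, (tag, row)). Assumption: the join key is row[0]."""
    for tag, rows in enumerate(relations):
        for row in rows:
            yield row[0], (tag, row)


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, n_relations):
    """For each key, split the tagged rows back into one bucket per
    relation and emit the cross product of the buckets."""
    for tagged_rows in groups.values():
        buckets = [[] for _ in range(n_relations)]
        for tag, row in tagged_rows:
            buckets[tag].append(row)
        # An empty bucket makes the product empty (inner-join semantics).
        for combo in product(*buckets):
            yield combo


def reduce_side_join(*relations):
    groups = shuffle(map_phase(relations))
    return sorted(reduce_phase(groups, len(relations)))


if __name__ == "__main__":
    A = [(1, "a1"), (2, "a2")]
    B = [(1, "b1"), (1, "b2"), (3, "b3")]
    for joined in reduce_side_join(A, B):
        print(joined)
```

Note that every row of every relation crosses the network in the shuffle step, which is the main cost of this strategy.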

Replicated join is intended for small relations that fit in the memory of a
single map task. It works as follows: put all but the leftmost relation
into the distributed cache; read relation A in the mappers; in each mapper,
during initialization, load B and C from the distributed cache into memory; stream
through the chunk of A allocated to each mapper, and join it with the
in-memory B and C.
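
The replicated join steps above can be sketched the same way. Again, this is a simplified illustration under assumptions, not Pig's code: build_index and replicated_join_mapper are hypothetical names, the join key is assumed to be row[0], and the small relations are passed in as plain lists instead of being read from the distributed cache.

```python
from collections import defaultdict


def build_index(rows):
    """Load one small relation into an in-memory hash table keyed on
    the join key (assumed to be row[0])."""
    index = defaultdict(list)
    for row in rows:
        index[row[0]].append(row)
    return index


def replicated_join_mapper(split_of_a, small_relations):
    """One map task: index B, C, ... once during initialization, then
    stream through this task's chunk of A and probe the indexes."""
    indexes = [build_index(rows) for rows in small_relations]
    for a_row in split_of_a:
        partial = [(a_row,)]
        for index in indexes:
            matches = index.get(a_row[0], [])
            # No match in any small relation drops the row (inner join).
            partial = [p + (m,) for p in partial for m in matches]
        yield from partial


if __name__ == "__main__":
    A = [(1, "a1"), (2, "a2"), (3, "a3")]
    B = [(1, "b1"), (3, "b3")]
    C = [(1, "c1"), (1, "c2")]
    # Two "mappers", each handling its own chunk of A.
    for chunk in (A[:2], A[2:]):
        for joined in replicated_join_mapper(chunk, [B, C]):
            print(joined)
```

There is no shuffle and no reduce phase here, but every map task pays the cost of loading and indexing all of B and C before it processes a single row of A, and holds them in memory for its whole lifetime.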

If B and C are bigger than your available memory, this clearly doesn't work
very well and you need to do a regular join.

D
