Posted to user@pig.apache.org by Mehmet Tepedelenlioglu <me...@yahoo.com> on 2013/05/19 22:26:33 UTC

Cross product bug pig 0.10?

Hi,

Recently I was taking the cross product of 2 bags of tuples, one of which
has only one tuple, in order to append that single tuple to every tuple in
the other bag (I know this is not the best way to do this; it was done as a
prototype). There seems to be a bug in the cross product where not all the
tuples of the larger bag are replicated. All but one of the part files are
empty, and everything works just fine in local mode (probably because it
uses only one reducer). Is anybody else aware of this issue?

The version is:

Apache Pig version 0.10.0-cdh4.1.2 (rexported)
compiled Nov 01 2012, 18:38:33

Thanks,

Mehmet



Re: Cross product bug pig 0.10?

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.
Hi,

So I found a fairly easy way to reproduce this error with the script below,
running on a cluster (distributed mode). The settings at the top are
artificial, chosen to trigger the problem with only a few input lines:

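-- Artificially small limits so that Pig picks more than one reducer.
set pig.exec.reducers.bytes.per.reducer 32
set pig.exec.reducers.max 20
X = LOAD '$INPUT' USING PigStorage('$SEPARATOR');
-- Number of fields (tokens) in each input line.
Y = FOREACH X GENERATE COUNT_STAR(TOBAG($0 ..)) as count;
GROUPED = GROUP Y BY count;
-- How many lines have each token count.
MAX = FOREACH GROUPED GENERATE group as tokennum, COUNT(Y) as count;
MAXG = GROUP MAX ALL;
-- Keep the single most frequent token count.
MAXX = foreach MAXG generate FLATTEN(TOP(1,1,MAX));
MAXX = foreach MAXX generate $0 as tokennum;
-- Attach that one-tuple relation to every input line.
Z = CROSS MAXX, X;
STORE Z INTO '$OUT' USING PigStorage('$SEPARATOR');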





As input I took the line:
1	1
Repeated 13 times. 
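
For anyone who wants to reproduce this, the input can be generated with a
few lines of Python. This is only a sketch: the file name input2.tsv and the
helper write_input are made up for illustration, and the file still has to
be copied to HDFS before running the script.

```python
from pathlib import Path


def write_input(path, copies=13, line="1\t1"):
    """Write `copies` identical tab-separated lines to `path` (hypothetical helper)."""
    target = Path(path)
    target.write_text((line + "\n") * copies)
    return target


if __name__ == "__main__":
    # "input2.tsv" is an assumed local file name, not taken from the thread.
    write_input("input2.tsv")
```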

I think the only thing that matters is that Pig decides to use more than 1
reducer. In my case these settings were enough for Pig to use 20 reducers.
This yields:

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 2 records (12 bytes) in: "/tmp/mehmet/out"



But it should be creating 13 lines as it just appends the MAXX to each
input line.
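
As a sanity check on the expected result, here is a rough emulation of the
script's logic in plain Python (not Pig). It assumes a tab separator, and
the function name expected_cross is made up for illustration. For the
13-line input it gives 13 records, and the byte total matches the 78 bytes
Pig reports when the run is correct.

```python
from collections import Counter


def expected_cross(lines, sep="\t"):
    """Emulate what the Pig script is meant to produce (illustration only)."""
    rows = [line.split(sep) for line in lines]        # X
    token_counts = Counter(len(row) for row in rows)  # GROUPED / MAX
    # TOP(1, 1, MAX): keep the token count shared by the most lines.
    tokennum, _ = max(token_counts.items(), key=lambda kv: kv[1])
    # CROSS MAXX, X: the single MAXX value is put in front of every row.
    return [sep.join([str(tokennum)] + row) for row in rows]


out = expected_cross(["1\t1"] * 13)
print(len(out))                       # 13
print(sum(len(r) + 1 for r in out))   # 78 (bytes, counting one newline per record)
```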

2 odd facts:

1. If you replace 
	Z = CROSS MAXX, X
 by
	Z = CROSS MAXX, X parallel 20

the problem goes away. (Perhaps CROSS is not picking up the correct number
of reducers when that number is calculated automatically):

Input(s):
Successfully read 13 records (413 bytes) from: "/user/mehmet/input2"

Output(s):
Successfully stored 13 records (78 bytes) in: "/tmp/mehmet/out"



2. If you skip all the steps that yield MAXX and just load MAXX from a
file, the problem also goes away, which is strange: why should it matter
where MAXX came from?


I am using Hadoop 2.0.0-cdh4.2.0, Pig version 0.10.0-cdh4.1.2
 

Mehmet




On 5/21/13 1:41 AM, "Jonathan Coveney" <jc...@gmail.com> wrote:

>Any chance you could replicate this for us? Ideally some dummy data and a
>script?



Re: Cross product bug pig 0.10?

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.
I'll try to do that with as simple an example as possible. I ran into this
problem in 2 independent scripts.

On 5/21/13 1:41 AM, "Jonathan Coveney" <jc...@gmail.com> wrote:

>Any chance you could replicate this for us? Ideally some dummy data and a
>script?



Re: Cross product bug pig 0.10?

Posted by Jonathan Coveney <jc...@gmail.com>.
Any chance you could replicate this for us? Ideally some dummy data and a
script?

