Posted to user@pig.apache.org by Prashanth Pappu <pr...@conviva.com> on 2008/05/30 20:44:07 UTC

Another Possible PIG bug (with SPLIT)

There seems to be a problem with SPLIT when the conditions include
'IsEmpty'.

Here's an example -

grunt> a = load '/test' using PigStorage(' ') as (x,y,z);
grunt> dump a;
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)

grunt> b = load '/test' using PigStorage(' ') as (x,y,z);
grunt> dump b;
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)

grunt> c = cogroup a by x, b by z;
grunt> dump c;
(1, {(1, 2, 3)}, {})
(2, {(2, 3, 4)}, {})
(3, {(3, 4, 5)}, {(1, 2, 3)})
(4, {}, {(2, 3, 4)})
(5, {}, {(3, 4, 5)})

grunt> d = filter c by IsEmpty(b);
grunt> dump d;
(1, {(1, 2, 3)}, {})
(2, {(2, 3, 4)}, {})

[This is correct!]

grunt> split c into c1 if IsEmpty(b), c2 if NOT(IsEmpty(b));
grunt> dump c1;
(1, {(1, 2, 3)}, {})

[BUG: Returns only the first tuple!]

I'm not sure whether I missed something here, but I have a script that runs
over a lot of data and the behavior is the same: SPLIT always seems to return
only the first tuple when used with IsEmpty.
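
In the meantime, a possible workaround sketch (based only on the FILTER
example above, and assuming NOT(IsEmpty(...)) behaves the same way) is to
replace the SPLIT with two FILTER statements over the same relation:

grunt> c1 = filter c by IsEmpty(b);
grunt> c2 = filter c by NOT(IsEmpty(b));

Here c1 is the same expression as d above, so it should contain both tuples
rather than just the first one.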

Prashanth

On Thu, May 29, 2008 at 4:12 PM, pi song <pi...@gmail.com> wrote:

> It seems like a bug in the execution engine.
> Please file a bug in Jira.
>
> Thanks for reporting!!
>
> Pi
>
>
> On 5/30/08, Prashanth Pappu <pr...@conviva.com> wrote:
> >
> > Shouldn't applying SPLIT to an empty bag return empty bags?
> > Instead, PIG/GRUNT throws an exception in such cases.
> >
> >
> >
> > -----------------------------------------------------------------------------------------
> > For example,
> >
> > grunt> a = load '/test' using PigStorage(' ') as (x,y,z);
> > grunt> dump a;
> > (1, 2, 3)
> > (2, 3, 4)
> > (3, 4, 5)
> >
> > grunt> split a into b if x>3, c if x<=3;
> >
> > grunt> describe b;
> > a: (x, y, z )
> >
> > grunt>  dump b;
> >
> > [This is ok!]
> >
> > grunt> split b into b1 if x > 5, b2 if x <= 5;
> > grunt> dump b1;
> >
> > 2008-05-29 15:57:20,691 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job -----
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: [/tmp/temp1967712747/tmp-1452064591:org.apache.pig.builtin.BinStorage]
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: /tmp/temp1967712747/tmp1362492691:org.apache.pig.builtin.BinStorage
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: ([PROJECT $0] > ['5']);/tmp/temp1967712747/tmp-844891734;([PROJECT $0] <= ['5']);/tmp/temp1967712747/tmp-748499407
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
> > 2008-05-29 15:57:20,692 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: -1
> > 2008-05-29 15:57:22,974 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 33%
> > 2008-05-29 15:57:23,975 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Pig progress = 66%
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - ----- MapReduce Job -----
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Input: [/tmp/temp1967712747/tmp-844891734:org.apache.pig.builtin.BinStorage]
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map: [[*]]
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Group: null
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Combine: null
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce: null
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Output: /tmp/temp1967712747/tmp189359268:org.apache.pig.builtin.BinStorage
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Split: null
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Map parallelism: -1
> > 2008-05-29 15:57:23,989 [main] INFO org.apache.pig.backend.hadoop.executionengine.POMapreduce - Reduce parallelism: -1
> > java.io.IOException: /tmp/temp1967712747/tmp-844891734 does not exist
> >        at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:100)
> >        at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:42)
> >        at org.apache.pig.impl.io.ValidatingInputFileSpec.<init>(ValidatingInputFileSpec.java:25)
> >        at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigInputFormat.getSplits(PigInputFormat.java:96)
> >        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:544)
> >        at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher.launchPig(MapReduceLauncher.java:244)
> >        at org.apache.pig.backend.hadoop.executionengine.POMapreduce.open(POMapreduce.java:177)
> >        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:251)
> >        at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:413)
> >        at org.apache.pig.PigServer.openIterator(PigServer.java:332)
> >        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:258)
> >        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:162)
> >        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:62)
> >        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
> >        at org.apache.pig.Main.main(Main.java:296)
> >
> > -------------------------------------------------------------------------------------------------------
> >
> > Is this expected? It seems undesirable to me, as it breaks scripts that
> > otherwise execute successfully on a different dataset. If b1 were simply
> > treated as an empty relation, the rest of the script would have executed
> > correctly.
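> >
> > A possible workaround sketch (untested, and assuming FILTER does not hit
> > the same missing temp-file problem): avoid SPLIT on a relation that may be
> > empty and use two FILTER statements instead, since dumping the empty
> > relation b itself worked fine above:
> >
> > grunt> b1 = filter b by x > 5;
> > grunt> b2 = filter b by x <= 5;
> >
> > If FILTER materializes its output differently from SPLIT, this would at
> > least let the rest of the script treat b1 as an empty relation.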
> >
> > Prashanth
> >
>