Posted to user@pig.apache.org by Michael Moss <mi...@gmail.com> on 2010/12/08 22:50:23 UTC

Custom UDF + Grouping - Unexpected Output

Hello,

I'm having an issue with a script that uses an EvalFunc I wrote. The issue
is that the final output contains characters I am not expecting (commas,
followed by what I'm guessing are null fields that I do not see).

Snippet:
C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
grunt> DUMP C;
(2,3)
(2,4)
(2,5)
(3,4)
(3,5)
(4,5)
(2,3)
(2,4)
(2,5)
(3,4)
(3,5)
(4,5)

D = GROUP C by (f1,f2);
grunt> describe D;
D: {group: (f1: int,f2: int),C: {f1: int,f2: int}}

grunt> DUMP D;
((2,3,),{(2,3,),(2,3,)})
((2,4,),{(2,4,),(2,4,)})
((2,5,),{(2,5,),(2,5,)})
((3,4,),{(3,4,),(3,4,)})
((3,5,),{(3,5,),(3,5,)})
((4,5,),{(4,5,),(4,5,)})

My question is, what are these extra comma/null fields in each tuple? I
expected the first row to read as:
((2,3),{(2,3),(2,3)})

It seems related, but when I run 'ILLUSTRATE C', I get an exception:
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
at org.apache.pig.pen.util.ExampleTuple.get(ExampleTuple.java:80)
at org.apache.pig.pen.util.DisplayExamples.MakeArray(DisplayExamples.java:190)
at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:86)
at org.apache.pig.pen.util.DisplayExamples.printTabular(DisplayExamples.java:69)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:143)
at org.apache.pig.PigServer.getExamples(PigServer.java:785)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:555)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:246)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)

Excruciating detail below:

My script:
REGISTER udf.jar
A = LOAD '/pig_input/co.txt' as (line:chararray);
B = FOREACH A GENERATE com.thumbplay.pig.NormalizeListUDF(line) as B;
C = FOREACH B GENERATE FLATTEN(B) as (f1:int,f2:int);
D = GROUP C by (f1,f2);
E = FOREACH D GENERATE group, COUNT(C);
STORE E INTO 'output' USING PigStorage(',');

Here's what I'm trying to do:
For input:
A,1,2,3
B,1,2,3

Produce combinations for each row (My UDF does this):
(1,2),(1,3),(2,3)
(1,2),(1,3),(2,3)

Flatten them:
(1,2),
(1,3),
(2,3),
(1,2),
(1,3),
(2,3)

Group and count them:
(1,2),2
(1,3),2
(2,3),2
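
The "group and count" step described above can be checked outside Pig with a few lines of plain Java. This is only a sketch of the intended result, not how Pig executes GROUP and COUNT: the class name PairCountSketch and the use of a List<Integer> per pair are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for "GROUP C BY (f1,f2)" followed by COUNT(C).
final class PairCountSketch {
    /** Counts how often each (f1, f2) pair occurs, keeping first-seen order. */
    static Map<List<Integer>, Integer> countPairs(List<List<Integer>> flattened) {
        Map<List<Integer>, Integer> counts = new LinkedHashMap<>();
        for (List<Integer> pair : flattened) {
            // Lists with equal elements are equal keys, so (1,2) and (1,2) land together.
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<Integer>> flattened = new ArrayList<>();
        // Two input rows, each contributing (1,2), (1,3), (2,3) after flattening.
        for (int row = 0; row < 2; row++) {
            flattened.add(Arrays.asList(1, 2));
            flattened.add(Arrays.asList(1, 3));
            flattened.add(Arrays.asList(2, 3));
        }
        System.out.println(countPairs(flattened));
    }
}
```

Each pair only groups correctly here because it is stored as two separate integer fields; a single string such as "1,2" would be a different kind of key.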

Re: Custom UDF + Grouping - Unexpected Output

Posted by Daniel Dai <ji...@yahoo-inc.com>.
In your udf:
if (num1 < num2)
{
    t.set(0, num1 + "," + num2);
}
else if (num2 < num1)
{
    t.set(0, num2 + "," + num1);
}

You actually put only one item into the tuple, so your UDF generates a
bag of one-field tuples, not two-field tuples.
I think what you mean is:
if (num1 < num2)
{
    t.set(0, num1);
    t.set(1, num2);
}
else if (num2 < num1)
{
    t.set(0, num2);
    t.set(1, num1);
}

Daniel
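
The one-field versus two-field difference is easy to miss because both shapes print the same way. The sketch below uses a plain List<Object> as a stand-in for a Pig tuple (an assumption; it is not Pig's Tuple class, and render() is a hypothetical helper). It shows that a single string field "2,3" renders exactly like two integer fields, that padding the one-field version with a null gives the "(2,3,)" form, and that asking a one-field list for index 1 fails in the same way as the ILLUSTRATE stack trace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: a List<Object> plays the role of a tuple.
final class TupleShapeSketch {
    /** Renders fields comma-joined inside parentheses; null prints as empty. */
    static String render(List<Object> fields) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            Object f = fields.get(i);
            sb.append(f == null ? "" : f.toString());
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        List<Object> oneField = new ArrayList<>(Arrays.<Object>asList("2,3"));
        List<Object> twoFields = new ArrayList<>(Arrays.<Object>asList(2, 3));

        System.out.println(render(oneField));   // one string field
        System.out.println(render(twoFields));  // two integer fields, same text

        // Assumption: a declared two-field schema leaves the second field null.
        List<Object> padded = new ArrayList<>(oneField);
        padded.add(null);
        System.out.println(render(padded));

        try {
            oneField.get(1); // there is no second field
        } catch (IndexOutOfBoundsException e) {
            System.out.println("no field at index 1");
        }
    }
}
```

Whether Pig pads the missing field in exactly this way is an assumption; the sketch only shows that the observed output is consistent with a one-field tuple.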


Re: Custom UDF + Grouping - Unexpected Output

Posted by Michael Moss <mi...@gmail.com>.
Thanks, Daniel.

My UDF with schema (which I suspect is the culprit) is below. I've tried
excluding the "outputSchema()" method entirely and several variations:

(Full source here: http://pastie.org/1362084)

public class NormalizeListUDF extends EvalFunc<DataBag>
{
    public DataBag exec(Tuple input) throws IOException
    {
        if (input == null || input.size() == 0)
            return null;
        try
        {
            DataBag output = DefaultBagFactory.getInstance().newDefaultBag();

            List<Object> tuples = input.getAll();
            String line = (String) tuples.remove(0);
            line = line.trim();
            String[] items = line.split(",");

            for (int i = 1; i < items.length - 1; i++)
            {
                for (int j = i + 1; j < items.length; j++)
                {
                    int num1 = Integer.parseInt(items[i]);
                    int num2 = Integer.parseInt(items[j]);

                    Tuple t = TupleFactory.getInstance().newTuple(1);

                    if (num1 < num2)
                    {
                        t.set(0, num1 + "," + num2);
                    }
                    else if (num2 < num1)
                    {
                        t.set(0, num2 + "," + num1);
                    }
                    output.add(t);
                }
            }
            return output;
        }
        catch (Exception e)
        {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }

    public Schema outputSchema(Schema input)
    {
        try
        {
            List<Schema.FieldSchema> fields = new ArrayList<Schema.FieldSchema>();
            Schema.FieldSchema f1 = new Schema.FieldSchema("f1", DataType.INTEGER);
            Schema.FieldSchema f2 = new Schema.FieldSchema("f2", DataType.INTEGER);
            fields.add(f1);
            fields.add(f2);
            Schema tupleInner = new Schema(fields);
            Schema.FieldSchema tupleSchema = new Schema.FieldSchema("t1", tupleInner,
                    DataType.TUPLE);

            Schema bagInner = new Schema(tupleSchema);
            Schema.FieldSchema bagSchema = new Schema.FieldSchema("bag", bagInner,
                    DataType.BAG);
            return new Schema(bagSchema);
        }
        catch (Exception e)
        {
            return null;
        }
    }
}
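
For comparison, here is a Pig-free sketch of the pair-generating loop that keeps the two numbers as separate fields. The class name NormalizedPairsSketch and the int[] representation are assumptions for illustration only. In the UDF itself, the equivalent change would presumably be to create the tuple with newTuple(2) instead of newTuple(1) and to set fields 0 and 1 separately. Using Math.min and Math.max also covers the case where both numbers are equal, which the if / else if chain above skips while still adding the tuple.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the loop inside exec(); no Pig classes involved.
final class NormalizedPairsSketch {
    /** Returns every pair of numeric items on the line, smaller value first. */
    static List<int[]> pairs(String line) {
        String[] items = line.trim().split(",");
        List<int[]> out = new ArrayList<>();
        // items[0] is the row label (for example "A"), so numbers start at index 1.
        for (int i = 1; i < items.length - 1; i++) {
            for (int j = i + 1; j < items.length; j++) {
                int a = Integer.parseInt(items[i].trim());
                int b = Integer.parseInt(items[j].trim());
                // Two separate fields rather than one "a,b" string.
                out.add(new int[] {Math.min(a, b), Math.max(a, b)});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] p : pairs("A,1,2,3")) {
            System.out.println("(" + p[0] + "," + p[1] + ")");
        }
    }
}
```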

On Wed, Dec 8, 2010 at 7:04 PM, Daniel Dai <ji...@yahoo-inc.com> wrote:

> That is not expected. I would think something is wrong inside NormalizeListUDF.
> Make sure you produce a bag of tuples that has the schema (int, int) inside your
> UDF. If you can post your UDF, I can tell more.
>
> Daniel

Re: Custom UDF + Grouping - Unexpected Output

Posted by Daniel Dai <ji...@yahoo-inc.com>.
That is not expected. I would think something is wrong inside
NormalizeListUDF. Make sure you produce a bag of tuples that has the schema
(int, int) inside your UDF. If you can post your UDF, I can tell more.

Daniel
