
Posted to user@pig.apache.org by Daan Gerits <da...@gmail.com> on 2011/11/29 10:47:50 UTC

Pig FLATTEN statement

Hello everyone,

I have some issues with the Pig FLATTEN statement: I get exceptions when trying to flatten a bag.

I read in Jira and on the mailing lists that other people had issues with flattening a bag embedded within a tuple, but that should have been fixed in versions 0.8.1 and 0.9.0. I've tried 0.9.0 and 0.9.1, and both give me the same results.

Any help is greatly appreciated,

Daan Gerits

==================== Snippet Start ====================

parserResults =
    FOREACH fetchResultsFlattened {
        parsed = parse('GoogleSearchItems.xml', content);
        GENERATE queryString, FLATTEN(parsed);
    }

DESCRIBE parserResults;
parserResults: {null::queryString: chararray,fields::id: chararray,fields::path: chararray,fields::selector: chararray,fields::type: chararray,fields::values: {(value: chararray)}}

parserValues =
    FOREACH parserResults
    GENERATE queryString, id, path, selector, type, FLATTEN(values);

DESCRIBE parserValues;
parserValues: {null::queryString: chararray,fields::id: chararray,fields::path: chararray,fields::selector: chararray,fields::type: chararray,fields::values::value: chararray}

DUMP parserValues;
java.lang.IndexOutOfBoundsException: Index: 5, Size: 2
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:158)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:575)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)

==================== Snippet End ====================

-- 
kind regards,

Gerits Daan
BigData Consultant 
Foundation.be

Hoekskensweg 12, 9290 Berlare
btw: BE 0872.859.349
gsm: +32 477 759533
web1: http://www.foundation.be
web2: http://www.relacss.com

Re: Pig FLATTEN statement

Posted by Thejas Nair <th...@hortonworks.com>.

I think the parse UDF declares a schema with 5 columns, but not all tuples
returned by the UDF actually have those 5 columns.
Can you check whether the UDF always returns a tuple with 5 columns?
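
To illustrate the kind of mismatch suspected here, below is a rough sketch in plain Python rather than Pig. It is only a model: Python lists stand in for Pig tuples, the field names are taken from the DESCRIBE output in the original post, and the helper names (project, short_rows) and the sample rows are made up for illustration. It shows why reading position 5 from a row with only 2 fields fails, and how one might list the rows that do not match the declared width.

```python
# Hypothetical model of the failure: plain Python lists stand in for Pig tuples.
# Field names are copied from the DESCRIBE output in the original post.
SCHEMA = ["queryString", "id", "path", "selector", "type", "values"]


def project(row, index):
    """Read one field by position, the way a projection does at runtime."""
    return row[index]  # raises IndexError when the row is shorter than the schema


def short_rows(rows, width=len(SCHEMA)):
    """Return the rows whose field count differs from the declared schema width."""
    return [row for row in rows if len(row) != width]


# Made-up sample data: one well-formed row and one row with only two fields.
rows = [
    ["pig", "1", "/a", "div", "text", [("x",), ("y",)]],
    ["pig", "1"],
]

# The two-field row is the one that would break a projection of "values".
print(short_rows(rows))
```

In this model, project(rows[1], 5) fails in the same way as the "Index: 5, Size: 2" exception in the trace: the schema promises a field at position 5, but the row only has 2 fields.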

Thanks,
Thejas

