Posted to user@pig.apache.org by deepak kumar v <de...@gmail.com> on 2011/04/01 06:38:56 UTC

Re: How to group on a group id that is present inside a complex hierarchy

any response?

On Tue, Mar 29, 2011 at 3:32 PM, deepak kumar v <de...@gmail.com> wrote:

> Hi,
> Below is a list of tuples generated by a UDF.
>
> ( ( [stdout#{ (day, age, name, address, ['k1#v1','k2#v2'] ) } ] ) )
> ( ( [stdout#{ (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] ) } ] ) )
> ( ( [stdout#{ (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] ) } ] ) )
> group a -- ( v1 , { (day, age, name, address, ['k1#v1','k2#v2']
> ), (12/3,22,deepak,newjersy,  ['k1#v1','k2#v2']) } )
> group b -- ( v2 , { (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'])} )
>
> I need to group by k1 so that I have two groups.
>
> *Approach #1*
> grped = group inputTuples by $0.$0.#'stdout'.$0.$0.$5#'k1'
>
> Error:
> 2011-03-29 15:16:44,589 [main] WARN  org.apache.pig.PigServer - Encountered
> Warning IMPLICIT_CAST_TO_MAP 1 time(s).
> 2011-03-29 15:16:44,589 [main] INFO
> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script:
> GROUP_BY
> 2011-03-29 15:16:44,589 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> pig.usenewlogicalplan is set to true. New logical plan will be used.
> 2011-03-29 15:16:44,593 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
> Details at logfile:
> /home/deepakkv/pigtemp/testworkflow/pig_1301391996435.log
>
> *Approach #2*
> As a result I flattened inputTuples as follows:
> flat = foreach inputTuples generate flatten($0.$0#'stdout');
>
> (day, age, name, address, ['k1#v1','k2#v2'] )
> (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] )
> (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] )
>
> Since k1 is in a map that is the 5th item (index 4) of each tuple, I
> tried:
> grped = group flat by $4#'k1';
>
> Error
> 2011-03-29 15:25:28,459 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
> 2011-03-29 15:25:28,554 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
> to hadoop file system at: file:///
> 2011-03-29 15:25:28,750 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Out of bound access. Trying to access
> non-existent column: 4. Schema {bytearray} has 1 column(s).
> Details at logfile:
> /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
>
> *Approach #3*
> As a result I tried:
> grped = group flat by $0.$4#'k1';
>
> Error:
> 2011-03-29 15:27:18,081 [Thread-13] WARN
> org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
> java.lang.ClassCastException: java.lang.String cannot be cast to
> org.apache.pig.data.Tuple
>         at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>         at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>         at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
>         at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:276)
>         at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:916)
>
>
>
> How can I group tuples on a group id that is present inside a Tuple -> Bag
> -> Map -> Tuple (given key) -> 4thItem (a Map again) -> Key?
>
> Regards,
> Deepak
>

Re: How to group on a group id that is present inside a complex hierarchy

Posted by Alan Gates <ga...@yahoo-inc.com>.
Approach 2 should work except for a bug in the way flatten schemas are  
handled (the bug will be fixed in 0.9 fwiw).  If you specify the  
schema after the flatten I think it will work.

Change

flat = foreach inputTuples generate flatten($0.$0#'stdout');

to

flat = foreach inputTuples generate flatten($0.$0#'stdout') as (day,
age, name, address, m);

The issue is that when a flatten doesn't have a schema it assigns a  
schema of bytearray instead of setting the schema to null so that it  
can figure it out at runtime.
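
Putting the two steps together, a minimal, untested sketch might look like the following. The relation names inputTuples, flat, and grped come from the original question, and the field name m comes from the schema suggested above. Grouping on m#'k1' is an assumption about what the original $4#'k1' key was meant to reach. Because m is declared without a type, Pig may still report the IMPLICIT_CAST_TO_MAP warning.

```pig
-- Assumes inputTuples is the relation produced by the UDF in the question.
-- Name the flattened fields explicitly so later statements can refer to them.
flat = foreach inputTuples generate flatten($0.$0#'stdout')
       as (day, age, name, address, m);

-- Group on the value stored under key 'k1' in the map field m.
grped = group flat by m#'k1';

dump grped;
```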

Alan.


On Apr 5, 2011, at 9:41 PM, deepak kumar v wrote:

> gentle reminder
>
> On Fri, Apr 1, 2011 at 10:08 AM, deepak kumar v  
> <de...@gmail.com> wrote:
>
>> any response?
>>
>>
>> On Tue, Mar 29, 2011 at 3:32 PM, deepak kumar v  
>> <de...@gmail.com>wrote:
>>
>>> Hi,
>>> Below are list of tuples generated by a UDF.
>>>
>>> ( ( [stdout#{ (day, age, name, address, ['k1#v1','k2#v2'] ) } ] ) )
>>> ( ( [stdout#{ (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] ) } ] ) )
>>> ( ( [stdout#{ (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] ) } ] ) )
>>> group a -- ( v1 , { (day, age, name, address, ['k1#v1','k2#v2']
>>> ), (12/3,22,deepak,newjersy,  ['k1#v1','k2#v2']) } )
>>> group b -- ( v2 , { (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'])} )
>>>
>>> I need to run group by on k1 so that i have two groups.
>>>
>>> *Approach #1*
>>> grped = group inputTuples by $0.$0.#'stdout'.$0.$0.$5#'k1'
>>>
>>> Error:
>>> 2011-03-29 15:16:44,589 [main] WARN  org.apache.pig.PigServer -
>>> Encountered Warning IMPLICIT_CAST_TO_MAP 1 time(s).
>>> 2011-03-29 15:16:44,589 [main] INFO
>>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script:
>>> GROUP_BY
>>> 2011-03-29 15:16:44,589 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>>> 2011-03-29 15:16:44,593 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
>>> Details at logfile:
>>> /home/deepakkv/pigtemp/testworkflow/pig_1301391996435.log
>>>
>>> *Approach #2*
>>> As a result i flattened inputTulpes as follows
>>> flat = foreach inputTuples generate flatten($0.$0#'stdout');
>>>
>>> (day, age, name, address, ['k1#v1','k2#v2'] )
>>> (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] )
>>> (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] )
>>>
>>> So now as i need to group on k1 which is present in a map that is the 5th
>>> item (4 index) i
>>> grped = group flat by $4#'k1';
>>>
>>> Error
>>> 2011-03-29 15:25:28,459 [main] INFO  org.apache.pig.Main - Logging error
>>> messages to: /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
>>> 2011-03-29 15:25:28,554 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>>> to hadoop file system at: file:///
>>> 2011-03-29 15:25:28,750 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> ERROR 1000: Error during parsing. Out of bound access. Trying to access
>>> non-existent column: 4. Schema {bytearray} has 1 column(s).
>>> Details at logfile:
>>> /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
>>>
>>> *Approach #3*
>>> As i result i tried
>>> grped = group flat by $0.$4#'k1';
>>>
>>> Error:
>>> 2011-03-29 15:27:18,081 [Thread-13] WARN
>>> org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
>>> java.lang.ClassCastException: java.lang.String cannot be cast to
>>> org.apache.pig.data.Tuple
>>>         at
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>>>         at
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>>>         at
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
>>>         at
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:276)
>>>         at
>>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:916)
>>>
>>>
>>>
>>> How can i group tuples on group id which is present inside a Tuple -> Bag
>>> -> Map -> Tuple (Given key) -> 4thItem (Is a Map again) -> Key
>>>
>>> Regards,
>>> Deepak
>>>
>>
>>


Re: How to group on a group id that is present inside a complex hierarchy

Posted by deepak kumar v <de...@gmail.com>.
gentle reminder

On Fri, Apr 1, 2011 at 10:08 AM, deepak kumar v <de...@gmail.com> wrote:

> any response?
>
>
> On Tue, Mar 29, 2011 at 3:32 PM, deepak kumar v <de...@gmail.com>wrote:
>
>> Hi,
>> Below are list of tuples generated by a UDF.
>>
>> ( ( [stdout#{ (day, age, name, address, ['k1#v1','k2#v2'] ) } ] ) )
>> ( ( [stdout#{ (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] ) } ] ) )
>> ( ( [stdout#{ (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] ) } ] ) )
>> group a -- ( v1 , { (day, age, name, address, ['k1#v1','k2#v2']
>> ), (12/3,22,deepak,newjersy,  ['k1#v1','k2#v2']) } )
>> group b -- ( v2 , { (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'])} )
>>
>> I need to run group by on k1 so that i have two groups.
>>
>> *Approach #1*
>> grped = group inputTuples by $0.$0.#'stdout'.$0.$0.$5#'k1'
>>
>> Error:
>> 2011-03-29 15:16:44,589 [main] WARN  org.apache.pig.PigServer -
>> Encountered Warning IMPLICIT_CAST_TO_MAP 1 time(s).
>> 2011-03-29 15:16:44,589 [main] INFO
>> org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script:
>> GROUP_BY
>> 2011-03-29 15:16:44,589 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> pig.usenewlogicalplan is set to true. New logical plan will be used.
>> 2011-03-29 15:16:44,593 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
>> Details at logfile:
>> /home/deepakkv/pigtemp/testworkflow/pig_1301391996435.log
>>
>> *Approach #2*
>> As a result i flattened inputTulpes as follows
>> flat = foreach inputTuples generate flatten($0.$0#'stdout');
>>
>> (day, age, name, address, ['k1#v1','k2#v2'] )
>> (12/2,22,deepak,newyork,  ['k1#v2','k2#v2'] )
>> (12/3,22,deepak,newyork,  ['k1#v1','k2#v2'] )
>>
>> So now as i need to group on k1 which is present in a map that is the 5th
>> item (4 index) i
>> grped = group flat by $4#'k1';
>>
>> Error
>> 2011-03-29 15:25:28,459 [main] INFO  org.apache.pig.Main - Logging error
>> messages to: /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
>> 2011-03-29 15:25:28,554 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting
>> to hadoop file system at: file:///
>> 2011-03-29 15:25:28,750 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>> ERROR 1000: Error during parsing. Out of bound access. Trying to access
>> non-existent column: 4. Schema {bytearray} has 1 column(s).
>> Details at logfile:
>> /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
>>
>> *Approach #3*
>> As i result i tried
>> grped = group flat by $0.$4#'k1';
>>
>> Error:
>> 2011-03-29 15:27:18,081 [Thread-13] WARN
>> org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
>> java.lang.ClassCastException: java.lang.String cannot be cast to
>> org.apache.pig.data.Tuple
>>         at
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
>>         at
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
>>         at
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
>>         at
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:276)
>>         at
>> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:916)
>>
>>
>>
>> How can i group tuples on group id which is present inside a Tuple -> Bag
>> -> Map -> Tuple (Given key) -> 4thItem (Is a Map again) -> Key
>>
>> Regards,
>> Deepak
>>
>
>
