Posted to user@pig.apache.org by praveenesh kumar <pr...@gmail.com> on 2012/02/06 19:48:45 UTC
Question on how GroupBy and Join works in Pig
Hi everyone,
I have a question about how GROUP BY and JOIN behave in Pig.
Suppose I have two data files:
1. cust_info
2. premium_data
cust_info:
ID name region
2321 Austin Pondicherry
2375 Martin California
4286 Lisa Chennai
premium_data:
ID premium start_year end_year
2321 345 2009 2010
2375 845 2009 2011
4286 286 2010 2012
2321 213 2001 2004
3041 452 2010 2013
3041 423 2006 2009
================================
Load premium_data, group it by ID, and sum the total premium for each ID.
grunt> premium_data = load 'premium_data';
grunt> illustrate premium_data;
---------------------------------------------------------------------------------
| premium_data | ID:int | premium:float | start_year:int | end_year:int |
---------------------------------------------------------------------------------
|              | 4286   | 286           | 2010           | 2012         |
---------------------------------------------------------------------------------
grunt> cust_info = load 'cust_info';
grunt> illustrate cust_info;
------------------------------------------------------------------------
| cust_info | ID:int | name:chararray | region:chararray |
------------------------------------------------------------------------
| | 2375 | Martin | California |
------------------------------------------------------------------------
grunt> grouped_ID = group premium_data by ID;
When I give a schema inside my LOAD statement, I get errors as soon as I use GROUP BY or JOIN.
But if I don't give a schema, the fields are treated as bytearrays and everything works fine.
I don't think this is the usual behavior. Am I doing something wrong in the way I
use JOIN and GROUP BY?
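For reference, the LOAD with a schema that I mean looks roughly like this (field names and
types taken from the illustrate output above, so this is an approximation of my actual
statements rather than a verbatim copy):

grunt> premium_data = load 'premium_data' as (ID:int, premium:float, start_year:int, end_year:int);
grunt> cust_info = load 'cust_info' as (ID:int, name:chararray, region:chararray);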
grunt> illustrate grouped_ID;    -- throws the error below
2012-02-06 22:47:31,452 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-02-06 22:47:31,651 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,680 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,698 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,719 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,850 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-06 22:47:31,851 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-02-06 22:47:31,867 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-02-06 22:47:31,869 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-02-06 22:47:31,870 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-02-06 22:47:31,884 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=292
2012-02-06 22:47:31,885 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
        at org.apache.pig.backend.hadoop.HDataType.getWritableComparableTypes(HDataType.java:81)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Map.collect(PigGenericMapReduce.java:117)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:273)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:266)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:205)
        at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
        at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
        at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
        at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
        at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
        at org.apache.pig.PigServer.getExamples(PigServer.java:1202)
        at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:700)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:597)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:308)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:190)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:523)
        at org.apache.pig.Main.main(Main.java:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
2012-02-06 22:47:31,936 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. Exception : org.apache.pig.data.DataByteArray cannot be cast to java.lang.Integer
Details at logfile: /usr/local/hadoop/pig/trunk/learning/insurance/pig_1328548575504.log
Thanks,
Praveenesh
Re: Question on how GroupBy and Join works in Pig
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
What are you using to load the data? It sounds like your loader is reporting a desired schema but does not actually convert the data to that schema, so it tells Pig to expect ints but hands it bytearrays.
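If the input is plain tab-delimited text, a minimal sketch of the usual approach (relation and field names here just mirror the ones in your mail, so adjust for your actual loader) is to declare the schema on the LOAD itself so PigStorage casts the fields, then group, sum, and join on the typed key:

premium_data = LOAD 'premium_data' USING PigStorage('\t')
               AS (ID:int, premium:float, start_year:int, end_year:int);
grouped_ID   = GROUP premium_data BY ID;
total_prem   = FOREACH grouped_ID GENERATE group AS ID,
               SUM(premium_data.premium) AS total_premium;

cust_info    = LOAD 'cust_info' USING PigStorage('\t')
               AS (ID:int, name:chararray, region:chararray);
joined       = JOIN cust_info BY ID, total_prem BY ID;

If you keep the untyped LOAD instead, casting the key explicitly (for example, GROUP premium_data BY (int)ID) also avoids handing a bytearray to a spot where an int is expected.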