You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Juan Gentile (JIRA)" <ji...@apache.org> on 2012/07/31 21:36:34 UTC

[jira] [Commented] (PIG-1965) Aggregation not working in conjunction with REGEX_EXTRACT_ALL

    [ https://issues.apache.org/jira/browse/PIG-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426050#comment-13426050 ] 

Juan Gentile commented on PIG-1965:
-----------------------------------

You can also do something like this

c = foreach a generate FLATTEN (REGEX_EXTRACT_ALL (line, '.* f1:(\\S+) .* f2:(\\d+).*')) as (f1:chararray, f2:chararray);
e = foreach (group d by f1) generate group, AVG((bag{tuple(int)}) c.f2);
                
> Aggregation not working in conjunction with REGEX_EXTRACT_ALL
> -------------------------------------------------------------
>
>                 Key: PIG-1965
>                 URL: https://issues.apache.org/jira/browse/PIG-1965
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Sven Krasser
>
> For the script below, the AVG function fails with an exception:
> a = load 'bug.txt' using TextLoader() as (line:chararray);
> c = foreach a generate FLATTEN (REGEX_EXTRACT_ALL (line, '.* f1:(\\S+) .* f2:(\\d+).*')) as (f1:chararray, f2:int);
> describe c; -- this is c: {f1: chararray,f2: int}
> d = group c by f1;
> e = foreach d generate group, AVG(c.f2);
> dump e
> Data for bug.txt:
> a f1:abc d f2:12
> f f1:def d f2:23
> w f1:abc w f2:45
> r f1:abc w f2:24
> e f1:def q f2:34
> If I store the data after parsing it, the schema stays the same, but now the AVG function works. Workaround script:
> a = load 'bug.txt' using TextLoader() as (line:chararray);
> b = foreach a generate FLATTEN (REGEX_EXTRACT_ALL (line, '.* f1:(\\S+) .* f2:(\\d+).*')) as (f1:chararray, f2:int);
> describe b; -- this is b: {f1: chararray,f2: int}
> store b into 'temp';
> c = load 'temp' as (f1:chararray, f2:int);
> describe c; -- this is c: {f1: chararray,f2: int}
> d = group c by f1;
> e = foreach d generate group, AVG(c.f2);
> dump e
> Exception in the error case:
> org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing average in Initial
> 	at org.apache.pig.builtin.IntAvg$Initial.exec(IntAvg.java:100)
> 	at org.apache.pig.builtin.IntAvg$Initial.exec(IntAvg.java:75)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:273)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:343)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> 	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> 	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> 	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
> 	at org.apache.pig.builtin.IntAvg$Initial.exec(IntAvg.java:87)
> 	... 14 more
> [...]
> 2011-04-05 18:18:33,865 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
> 2011-04-05 18:18:33,866 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 2011-04-05 18:18:33,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias e

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira