Posted to dev@pig.apache.org by "Ankur (JIRA)" <ji...@apache.org> on 2010/01/15 08:57:54 UTC

[jira] Commented: (PIG-1191) POCast throws exception for certain sequences of LOAD, FILTER, FOREACH

    [ https://issues.apache.org/jira/browse/PIG-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800609#action_12800609 ] 

Ankur commented on PIG-1191:
----------------------------

Listed below are the identified cases. 

CASE 1: LOAD -> FILTER -> FOREACH -> LIMIT -> STORE
===================================================

SCRIPT
-----------
sds = LOAD '/my/data/location'
      USING my.org.MyMapLoader()
      AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries = FILTER sds BY mapFields#'page_params'#'query' is NOT NULL;
queries_rand = FOREACH queries
               GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries_limit = LIMIT queries_rand 100;
STORE queries_limit INTO 'out'; 

RESULT 
------------
FAILS in the reduce stage with the following exception:

org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)


CASE 2: LOAD -> FOREACH -> FILTER -> LIMIT -> STORE
===================================================
Note that the order of FILTER and FOREACH is reversed compared to CASE 1.

SCRIPT
-----------
sds = LOAD '/my/data/location'
      USING my.org.MyMapLoader()
      AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
queries_rand = FOREACH sds
               GENERATE (CHARARRAY) (mapFields#'page_params'#'query') AS query_string;
queries = FILTER queries_rand BY query_string IS NOT null;
queries_limit = LIMIT queries 100; 
STORE queries_limit INTO 'out';

RESULT
-----------
SUCCESS - Results are correctly stored. So if the projection (and cast) is done before the FILTER, the POCast operator
receives the LoadFunc and the conversion works fine.
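
To make the failure mode concrete, here is a minimal, self-contained Java sketch of the kind of guard that produces
ERROR 1075. This is NOT the actual Pig source; the interface and names (ByteArrayCaster, CastSketch, castToCharArray)
are invented for illustration. The idea is simply that a cast operator with a LoadFunc-style caster attached can decode
the raw bytearray, while one without a caster can only throw.

// Illustrative sketch only -- not the real POCast implementation.
// "ByteArrayCaster" stands in for the LoadFunc/caster that Pig attaches to a
// cast operator when it can trace the field back to the loader.
import java.nio.charset.StandardCharsets;

interface ByteArrayCaster {
    // hypothetical method, mirroring the idea of converting raw bytes to a chararray
    String bytesToCharArray(byte[] b);
}

class CastSketch {
    private final ByteArrayCaster caster;   // null when no loader was handed to the operator

    CastSketch(ByteArrayCaster caster) {
        this.caster = caster;
    }

    String castToCharArray(byte[] raw) {
        if (caster == null) {
            // This is the situation the failing cases end up in: the operator sees a
            // bytearray but has nothing to decode it with.
            throw new RuntimeException(
                "ERROR 1075: Received a bytearray from the UDF. "
                + "Cannot determine how to convert the bytearray to string.");
        }
        return caster.bytesToCharArray(raw);
    }

    public static void main(String[] args) {
        ByteArrayCaster utf8 = bytes -> new String(bytes, StandardCharsets.UTF_8);
        // With a caster attached the cast succeeds (the CASE 2 situation).
        System.out.println(new CastSketch(utf8).castToCharArray("query".getBytes(StandardCharsets.UTF_8)));
        // Without a caster it throws, matching the stack traces in the failing cases.
        new CastSketch(null).castToCharArray("query".getBytes(StandardCharsets.UTF_8));
    }
}

In the failing cases below the pipeline shape apparently leaves the operator in the "no caster" state, which matches
the POCast.getNext frames in the stack traces.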


CASE 3: LOAD -> FOREACH -> FOREACH -> FILTER -> LIMIT -> STORE
==============================================================

SCRIPT
-----------
sds = LOAD '/my/data/location'
      USING my.org.MyMapLoader()
      AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE 
          (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
          GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_filtered = FILTER queries
                   BY query_string IS NOT null;
queries_limit = LIMIT queries_filtered 100;
STORE queries_limit INTO 'out';

RESULT
-----------
FAILS in the map stage. It looks like the second FOREACH did not get the loadFunc and bailed out with the following stack trace:

org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLimit.getNext(POLimit.java:85)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
        ...

CASE 4: LOAD -> FOREACH -> FOREACH -> LIMIT -> STORE
====================================================

SCRIPT
-----------
sds = LOAD '/my/data/location'
      USING my.org.MyMapLoader()
      AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE
          (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
          GENERATE (CHARARRAY) (params#'query') AS query_string;
queries_limit = LIMIT queries 100;
STORE queries_limit INTO 'out';

RESULT
-----------
SUCCESS. Both FOREACH operators seem to be getting the loadFunc.

CASE 5: LOAD -> FOREACH -> FOREACH -> FOREACH -> LIMIT -> STORE
================================================================

SCRIPT
-----------
sds = LOAD '/my/data/location'
      USING my.org.MyMapLoader()
      AS (simpleFields:map[], mapFields:map[], listMapFields:map[]);
params = FOREACH sds GENERATE
          (map[]) (mapFields#'page_params') AS params;
queries = FOREACH params
          GENERATE (CHARARRAY) (params#'query') AS query_string;
rand_queries = FOREACH queries GENERATE query_string as query;
queries_limit = LIMIT rand_queries 100;
STORE queries_limit INTO 'out';

RESULT
-----------
FAILS in the map stage. Again, the second FOREACH bails out with the following stack trace:

org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string.
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:364)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:288)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:237)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 

> POCast throws exception for certain sequences of LOAD, FILTER, FOREACH
> ----------------------------------------------------------------------
>
>                 Key: PIG-1191
>                 URL: https://issues.apache.org/jira/browse/PIG-1191
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>            Reporter: Ankur
>            Priority: Blocker
>         Attachments: PIG-1191-1.patch
>
>
> When using a custom load/store function, one that returns complex data (a map of maps, a list of maps), certain sequences of LOAD, FILTER, and FOREACH cause the Pig script to throw an exception of the form -
>  
> org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to <actual-type>
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:639)
> ...
> Looking through the POCast code, it appears the operator was unable to find the right load function for doing the conversion and consequently bailed out with this exception, failing the entire Pig script.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.