Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2011/01/11 02:44:46 UTC
[jira] Commented: (PIG-496) project of bags from complex data causes failures

    [ https://issues.apache.org/jira/browse/PIG-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979904#action_12979904 ] 

Daniel Dai commented on PIG-496:
--------------------------------

We need to decide how to load a bag declared with an empty inner schema, e.g.
{code}
A = load 'data.txt' as (x: bag{});
{code}
Currently, we load x as a bag but do not interpret its contents, so what we load is a bag of bytearrays.

This, however, causes problems when we process the bag further. Assume that in data.txt the bag actually contains three-field tuples:
{code}
B = foreach A generate x.($1, $2); 
{code}
We expect this to project the 2nd and 3rd fields of each tuple. But in the current code, x is a bag of single bytearray fields, so this results in an error. Similarly:
{code}
B = foreach A generate flatten(x);
{code}
We expect this to flatten x into 3 fields. But in the current code we cannot even flatten x, since x does not contain tuples.
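
To make the mismatch concrete, here is a minimal Python sketch, not Pig code. It models a bag as a list of tuples and a bytearray as `bytes`; the helper `project` and the exact contents of `current_bag` are assumptions for illustration only. It shows that projecting positions 1 and 2 works on three-field tuples but cannot work when each element carries a single uninterpreted field. In Pig itself the failure surfaces as the exception quoted in the issue description, not as a Python `IndexError`.

```python
# Illustrative model only: a bag is a list of tuples, a bytearray is bytes.
# `project` is a hypothetical helper, not a Pig API.

def project(bag, positions):
    """Project the given zero-based positions out of every tuple in the bag."""
    return [tuple(t[p] for p in positions) for t in bag]

# What the user expects x to hold: three-field tuples.
expected_bag = [(b"1", b"2", b"3"), (b"4", b"5", b"6")]

# Assumed model of what is loaded today: one uninterpreted field per element.
current_bag = [(b"(1,2,3)",), (b"(4,5,6)",)]

# Equivalent of x.($1, $2) on the expected data.
print(project(expected_bag, [1, 2]))

# The same projection on the single-field data has nothing to select.
try:
    project(current_bag, [1, 2])
except IndexError as err:
    print("projection failed:", err)
```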

The problem stems from two sources:
1. Currently a bag requires tuples in some cases but not in others. This is inconsistent. We should make it a rule that loading a bag always means loading a bag of tuples.

2. When we load a tuple with an unknown number of fields (the tuple's inner schema is unknown), we assume it contains only one bytearray field. However, it is not possible to cast one bytearray field to multiple fields later. Recall what happens when we load a file with an unknown schema:
{code}
A = load 'data.txt';
{code}
We actually load multiple fields separated by the delimiter, each of type bytearray. When we load a bag with an empty inner schema, we can mimic this behavior.

So I propose two changes:
1. Loading a bag implies loading a bag of tuples, even when the bag's inner schema is empty.
2. When we convert a bytearray to a tuple with no inner schema, we no longer assume one field. We will use comma as the delimiter (in the case of UTF8StorageConverter) and produce a tuple of multiple bytearray fields.
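
The second change can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual UTF8StorageConverter code: the function name `bytes_to_tuple` is hypothetical, and the sketch handles only flat tuples with no nested bags, tuples, or maps and no escaped delimiters.

```python
# Hypothetical helper illustrating proposal 2; not part of Pig.

def bytes_to_tuple(raw: bytes, delimiter: bytes = b",") -> tuple:
    """Convert the text of one tuple, e.g. b"(1,2,3)", into a tuple of
    bytes fields, splitting on the delimiter instead of assuming one field.

    Assumes a flat tuple: no nesting and no escaped delimiters.
    """
    text = raw.strip()
    if not (text.startswith(b"(") and text.endswith(b")")):
        raise ValueError(f"not a tuple literal: {raw!r}")
    inner = text[1:-1]
    if not inner:
        return ()  # empty tuple: no fields at all
    return tuple(inner.split(delimiter))


fields = bytes_to_tuple(b"(1,2,3)")
print(len(fields), fields)
```

Each resulting field stays an uninterpreted byte string, which mirrors how a schema-less load leaves every field as a bytearray that can be cast later.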

Assume data.txt is:
{(1,2,3),(4,5,6)}
After this change, 
A = load 'data.txt' as (x: bag{});
describe A:
We get: bag{}
dump A:
We get: {(1,2,3),(4,5,6)}, which is not a bag of bytearrays but a bag of three-field tuples.
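
As a rough check of the intended result, the Python sketch below parses the sample line into a bag of tuples whose fields remain bytearrays. It is only an illustration: `parse_bag` is a hypothetical name, the regular expression handles flat tuples only, and none of this is Pig's real parsing code.

```python
import re

# Hypothetical parser for the sample data; handles flat tuples only.

def parse_bag(raw: str) -> list:
    """Parse text such as "{(1,2,3),(4,5,6)}" into a list of tuples
    whose fields are left as uninterpreted bytes."""
    text = raw.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"not a bag literal: {raw!r}")
    bag = []
    # Each "(...)" group without nested parentheses is one tuple.
    for match in re.finditer(r"\(([^()]*)\)", text[1:-1]):
        inner = match.group(1)
        parts = inner.split(",") if inner else []
        bag.append(tuple(p.encode("utf-8") for p in parts))
    return bag


bag = parse_bag("{(1,2,3),(4,5,6)}")
print(bag)

# With three-field tuples, a projection like x.($1, $2) has fields to select.
print([(t[1], t[2]) for t in bag])
```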

> project of bags from complex data causes failures
> -------------------------------------------------
>
>                 Key: PIG-496
>                 URL: https://issues.apache.org/jira/browse/PIG-496
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> A = load 'complex data' as (x: bag{});
> B = foreach A generate x.($1, $2);
> produces stack trace:
> 2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_000000java.lang.NullPointerException
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
>         at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
> Pradeep suspects that the problem is in src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java; line 374
