Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2010/02/03 03:06:19 UTC

[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828896#action_12828896 ] 

Alan Gates commented on PIG-1188:
---------------------------------

After further thought I want to change my position on this.

There are two cases to consider: when a schema is present and when it isn't.  The problem is that by the time Pig tries to access the missing field (in the backend), it has no idea whether a schema exists or not.  So at runtime, Pig should just return a null if it gets an ArrayOutOfBoundsException.
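
To make the runtime behavior concrete, here is a minimal sketch in Python rather than Pig's own Java.  The helper name get_field and the list-of-strings rows are assumptions made for illustration, not Pig's actual API; the point is only that an out-of-range field access yields a null instead of an error.

```python
from typing import Any, Optional, Sequence


def get_field(fields: Sequence[Any], index: int) -> Optional[Any]:
    """Return fields[index], or None when the tuple has no such field."""
    if index < 0:
        # Python would wrap negative indexes; treat them as missing instead.
        return None
    try:
        return fields[index]
    except IndexError:
        # The row is shorter than the script expected: act as if it were padded.
        return None


# The three input rows from the issue description, already split on tabs.
rows = [["1", "2"], ["1", "2", "3"], ["1"]]
second_fields = [get_field(row, 1) for row in rows]
print(second_fields)  # ['2', '2', None]
```

Note that nothing here consults a schema: the short row simply reads as null for the field it lacks, which is the effective pad-at-the-end behavior.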

How to pad missing data should be left up to the load function.  Perhaps certain load functions do know how to pad missing data, or are OK with the pad-at-the-end scheme proposed here.  If the load function does not check, then Pig would effectively pad at the end, given the proposal above.  If the load function implementer does not want this to happen, s/he can check each tuple being read from the input to ensure it matches the schema, and then decide to pad the tuple with nulls, reject the tuple, or return a tuple full of nulls.
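
The three options open to a load function can be sketched as follows, again in Python rather than Pig's Java.  The names MismatchPolicy and conform are hypothetical, and the schema is reduced to a field count (arity), which is an assumption.  Truncating over-long rows under the PAD policy is also an assumption, taken from the desired result in the issue description rather than from anything stated above.

```python
from enum import Enum
from typing import Any, List, Optional, Sequence


class MismatchPolicy(Enum):
    PAD = "pad"              # pad short rows with nulls at the end
    REJECT = "reject"        # drop rows that do not match the schema
    ALL_NULLS = "all_nulls"  # replace mismatched rows with a tuple of nulls


def conform(fields: Sequence[Any], arity: int,
            policy: MismatchPolicy) -> Optional[List[Any]]:
    """Return a row with exactly `arity` fields, or None if it is rejected."""
    if len(fields) == arity:
        return list(fields)
    if policy is MismatchPolicy.PAD:
        # Append nulls, then cut to length so extra fields are dropped too.
        return (list(fields) + [None] * arity)[:arity]
    if policy is MismatchPolicy.REJECT:
        return None
    return [None] * arity


rows = [["1", "2"], ["1", "2", "3"], ["1"]]
padded = [conform(row, 2, MismatchPolicy.PAD) for row in rows]
kept = [r for r in (conform(row, 2, MismatchPolicy.REJECT) for row in rows)
        if r is not None]
```

The length comparison in conform runs once per tuple, so a loader that adopts any of these policies pays for the check on every row it reads.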

In the case of PigStorage, checking each tuple for a match against the schema is too expensive.  Ideally I would like it to, because I think that when the user gives a schema it's an error if the data doesn't match.  But I don't want to pay the performance penalty in this case.  

> Padding nulls to the input tuple according to input schema
> ----------------------------------------------------------
>
>                 Key: PIG-1188
>                 URL: https://issues.apache.org/jira/browse/PIG-1188
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.6.0
>            Reporter: Daniel Dai
>             Fix For: 0.7.0
>
>
> Currently, the number of fields in the input tuple is determined by the data. When we have a schema, we should generate input data according to the schema, padding with nulls if necessary. Here is one example:
> Pig script:
> {code}
> a = load '1.txt' as (a0, a1);
> dump a;
> {code}
> Input file:
> {code}
> 1       2
> 1       2       3
> 1
> {code}
> Current result:
> {code}
> (1,2)
> (1,2,3)
> (1)
> {code}
> Desired result:
> {code}
> (1,2)
> (1,2)
> (1, null)
> {code}
