Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2010/01/14 20:33:58 UTC

[jira] Created: (PIG-1188) Padding nulls to the input tuple according to input schema

Padding nulls to the input tuple according to input schema
----------------------------------------------------------

                 Key: PIG-1188
                 URL: https://issues.apache.org/jira/browse/PIG-1188
             Project: Pig
          Issue Type: Bug
          Components: impl
    Affects Versions: 0.6.0
            Reporter: Daniel Dai
             Fix For: 0.7.0


Currently, the number of fields in the input tuple is determined by the data. When a schema is given, we should generate input tuples according to the schema, padding with nulls if necessary. Here is one example:

Pig script:
{code}
a = load '1.txt' as (a0, a1);
dump a;
{code}
Input file:
{code}
1       2
1       2       3
1
{code}
Current result:
{code}
(1,2)
(1,2,3)
(1)
{code}

Desired result:
{code}
(1,2)
(1,2)
(1, null)
{code}
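
For illustration only, the proposed behavior can be modeled outside Pig. The following is a minimal Python sketch, not Pig code: conform_to_schema is a hypothetical helper, and it assumes that extra trailing fields are dropped and missing trailing fields become nulls (None here).

```python
from typing import List, Optional


def conform_to_schema(fields: List[str], width: int) -> List[Optional[str]]:
    """Truncate extra trailing fields and pad missing trailing fields with None."""
    trimmed = fields[:width]
    return trimmed + [None] * (width - len(trimmed))


# The three rows of the example input file, already split on the delimiter.
rows = [["1", "2"], ["1", "2", "3"], ["1"]]
result = [conform_to_schema(row, 2) for row in rows]
```

Applied to the three input rows with a two-field schema, this yields the rows listed under "Desired result".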

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-1188:
-------------------------------

    Assignee: Alan Gates  (was: Richard Ding)


[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833274#action_12833274 ] 

Richard Ding commented on PIG-1188:
-----------------------------------

I suggest we don't change Pig's current behavior regarding non-conforming input data. Pig already handles invalid access (projection) of a nonexistent field by returning a null as a substitute. Pig does this optimistically, without checking every tuple up front.

With PIG-1131, the runtime exception the user encountered is also fixed.




[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800342#action_12800342 ] 

Alan Gates commented on PIG-1188:
---------------------------------

I don't think padding is a good idea.  We don't know which field in the record is missing.  We're just guessing that the last field is missing, when in fact it might be the first.  Then we've made the situation worse by inserting invalid data in all the fields.

I think the loader should either throw the record out, or make all fields in the record null.  This guarantees that we are not further propagating the error.  Then a warning can be issued that the record was invalid (I'm assuming even in the above proposal the loader would issue a warning.) 
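
The two alternatives could look roughly like this. This is a minimal Python sketch, not Pig's loader API: load_strict and its null_out flag are hypothetical names, and the warning is modeled as a message appended to a list.

```python
from typing import Iterable, Iterator, List, Optional


def load_strict(
    rows: Iterable[List[str]],
    width: int,
    null_out: bool,
    warnings: List[str],
) -> Iterator[List[Optional[str]]]:
    """Yield only rows that match the schema width.

    A mismatching row is either dropped (null_out=False) or replaced by a
    row of all nulls (null_out=True). A warning is recorded in both cases.
    """
    for row in rows:
        if len(row) == width:
            yield list(row)
            continue
        warnings.append(f"expected {width} fields, got {len(row)}")
        if null_out:
            yield [None] * width


rows = [["1", "2"], ["1", "2", "3"], ["1"]]

drop_warnings: List[str] = []
dropped = list(load_strict(rows, 2, False, drop_warnings))

null_warnings: List[str] = []
nulled = list(load_strict(rows, 2, True, null_warnings))
```

Either way, no partially valid record is passed downstream, and the number of warnings tells the user how many records were affected.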


[jira] Assigned: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1188:
-----------------------------------

    Assignee: Richard Ding


[jira] Updated: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1188:
--------------------------------

    Fix Version/s:     (was: 0.7.0)

Looks like most common cases are already working. Unlinking from 0.7.0 release.


[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828896#action_12828896 ] 

Alan Gates commented on PIG-1188:
---------------------------------

After further thought I want to change my position on this.

There are two cases to consider: when a schema is present and when it isn't.  The problem is that by the time Pig tries to access the missing field (in the backend), it has no idea whether a schema exists or not.  So at runtime, Pig should just return a null if it gets ArrayOutOfBoundsException.

How to pad missing data should be left up to the load function.  Perhaps certain load functions do know how to pad missing data, or are ok with the pad-at-the-end scheme proposed here.  If the load function does not check, then Pig would effectively pad at the end, given the proposal above.  If the load function implementer does not want this to happen, s/he can check each tuple being read from the input to ensure it matches the schema, and then decide to pad the tuple with nulls, reject the tuple, or return a tuple full of nulls.

In the case of PigStorage, checking each tuple for a match against the schema is too expensive.  Ideally I would like it to, because I think that when the user gives a schema it's an error if the data doesn't match.  But I don't want to pay the performance penalty in this case.  
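
The runtime behavior described above (return a null when a field is out of bounds) can be modeled as follows. This is a minimal Python sketch, not Pig's backend code: project is a hypothetical function, and Python's IndexError stands in for the out-of-bounds exception.

```python
from typing import Optional, Sequence


def project(row: Sequence[str], index: int) -> Optional[str]:
    """Return the field at a non-negative index, or None if the row is too short."""
    try:
        return row[index]
    except IndexError:
        # The field does not exist in this row: substitute a null.
        return None


short_row = ["1"]
values = [project(short_row, 0), project(short_row, 1)]
```

The row itself is left untouched, so nothing is checked at load time. The null only appears at the point where a missing field is accessed, which is why this amounts to padding at the end without a per-tuple check.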


[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831776#action_12831776 ] 

Alan Gates commented on PIG-1188:
---------------------------------

A few thoughts:

In a job that is going to process a billion rows and run for 3 hours, 1 bad row should not cause the whole job to fail.

This invalid access should certainly cause a warning.  Users can look at the warnings at the end of the query and decide they do not want to keep the output because of the warnings.  But failure should not be the default case (see previous point).  Perhaps we should have a warnings = error option like compilers do so users who are very worried about the warnings can make sure they fail.  But that's a different proposal for a different JIRA.

bq. Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators.

I don't follow.  Nulls in the pipeline shouldn't cause a problem.  UDFs and operators need to be able to handle null values whether they come from processing or from the data itself.

bq. Second, it can't be assumed that user wants those non-existent field to be treated as null. If he wants it that way, he should implement LoadFunc interface which treats them that way.

One could argue that it can't be assumed the user wants his query to fail when a field is missing.  We have to assume one way or another.  Null is a better assumption than failure, since it is possible for a user who doesn't want that behavior to detect it and deal with it.  As it is now, the user has to modify his data or write a new load function to deal with padding his data.

I agree with you that in the schema case, it would be ideal if not having a field was an error.  However, given the architecture this is difficult.  And stipulating that load functions test every record to ensure it matches the schema is too much of a performance penalty.  But for the non-schema case I don't agree.  Pig's philosophy of "Pigs eat anything" doesn't mean much if Pig gags as soon as it gets a record that doesn't match its expectations.





[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800348#action_12800348 ] 

Daniel Dai commented on PIG-1188:
---------------------------------

I am fine with throwing this record away and issuing a warning to the user. The key issue is not to introduce a tuple with fewer items in it. Follow-up operations depend on the consistency of the tuple size; otherwise we will see strange errors that are very hard to diagnose.


[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835944#action_12835944 ] 

Richard Ding commented on PIG-1188:
-----------------------------------

To summarize where we are:

Right now, Pig's project operator pads with null if the value to be projected doesn't exist. As a consequence, the desired result is achieved if PigStorage is used and a schema with data types is specified, since in this case Pig inserts a project+cast operator for each field in the schema.

In the case where no schema is specified in the load statement, Pig does a good job of adhering to its philosophy and lets the program run without throwing a runtime exception.

That leaves the case where a schema is specified without data types. There are several options:

   * Pig automatically inserts a project operator for each field in the schema to ensure the input data matches the schema. The trade-off is a performance penalty. Is it worthwhile if most user data is well-behaved?

   * Users can explicitly add a foreach statement after the load statement that projects all the fields in the schema. This is similar to the common practice of running a map job first to clean up the data.

   * Pig can also delegate the padding work to the loaders. The problem is that the schema currently isn't passed to the loaders.
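
The project-per-field idea behind the first two options can be modeled in Python. This is only a sketch under stated assumptions, not Pig's implementation: project_all and to_int are hypothetical names, a missing field projects to null (None), and a value that cannot be cast is assumed to become null as well.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Sequence, Tuple

Cast = Callable[[Optional[str]], Optional[int]]


def to_int(value: Optional[str]) -> Optional[int]:
    """Cast a raw field to int; nulls and unparsable values become None (assumption)."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def project_all(
    rows: Iterable[Sequence[str]], casts: List[Cast]
) -> Iterator[Tuple[Optional[int], ...]]:
    """Project exactly one value per schema field, then cast it.

    Fields beyond the schema are never projected, so they are dropped;
    fields missing from the row project to None.
    """
    for row in rows:
        yield tuple(
            cast(row[i] if i < len(row) else None) for i, cast in enumerate(casts)
        )


rows = [["1", "2"], ["1", "2", "3"], ["1"]]
projected = list(project_all(rows, [to_int, to_int]))
```

Every output tuple has exactly as many fields as the schema, at the cost of touching every field of every row, which is the performance trade-off raised in the first option.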






[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831749#action_12831749 ] 

Ashutosh Chauhan commented on PIG-1188:
---------------------------------------

I have a different take on this. Referring to the original description of this JIRA, I would expect Pig's behavior to be the one given in "Current result", not the one given in "Desired result". Pig should not do anything to the data behind the scenes, which is what "Desired result" proposes. In cases where columns are not consistent, there are two scenarios: with or without a schema. If the user did supply a schema, then I would consider that the user is telling Pig that the data is consistent with the schema he is providing, and if that's not the case, it's perfectly fine to throw an exception at runtime. The tricky case is when a schema is not provided and the user tries to access a non-existent field. I think even in such cases it's valid to throw an exception at runtime, instead of returning null. First, if the user is trying to access a non-existent field, that's an error condition in any case. Second, it can't be assumed that user wants those non-existent field to be treated as null. If he wants it that way, he should implement LoadFunc interface which treats them that way. Third, doing further operations on these columns down the pipeline may result in non-predictable results in other operators. Fourth, returning null will obscure bugs in Pig where Pig (rather than the user) tries to access non-existent fields to construct new tuples at runtime, e.g. for joins (see PIG-1131).

In short, I am suggesting that Pig should continue to behave as it does today. That is, it can load a variable number of columns in a tuple. But if the user accesses a non-existent field, Pig should throw an exception and let the user deal with such a scenario himself by implementing his own LoadFunc.

Thoughts?


[jira] Commented: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Richard Ding (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834362#action_12834362 ] 

Richard Ding commented on PIG-1188:
-----------------------------------

Actually, Pig already pads the input tuple with nulls according to the input schema when the schema includes data types:

For example, given Pig script:

{code}
a = load '1.txt' as (a0:int, a1:int);
dump a;
{code}

and input file:

{code}
1       2
1       2       3
1
{code}

The result is

{code}
(1,2)
(1,2)
(1, null)
{code}



[jira] Updated: (PIG-1188) Padding nulls to the input tuple according to input schema

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1188:
--------------------------------

    Fix Version/s: 0.9.0
