Posted to dev@pig.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2017/05/25 20:21:04 UTC

[jira] [Resolved] (PIG-5231) PigStorage with -schema may produce inconsistent outputs with more fields

     [ https://issues.apache.org/jira/browse/PIG-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi resolved PIG-5231.
-------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 0.17.0

Thanks for the review, Daniel! Committed it to trunk.

Another option I was dreaming about would be to check all the schema files, dynamically create a load statement for each unique schema, and then combine all of them with UNION ONSCHEMA. I have no idea whether this is even feasible :)
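
As a rough illustration of that idea only (it is not part of the committed patch), the Python sketch below generates the Pig Latin such an approach might produce. The helper name build_union_script and its input format are hypothetical; it assumes each directory's schema has already been read into a string by some earlier step.

```python
from collections import OrderedDict


def build_union_script(dir_schemas, delimiter=","):
    """Emit one LOAD per unique schema, then combine them with UNION ONSCHEMA.

    dir_schemas is a list of (directory, schema_string) pairs. This input
    format is an assumption made for the sketch; reading the schema files
    themselves is out of scope here.
    """
    # Group directories by schema, keeping first-seen order.
    groups = OrderedDict()
    for directory, schema in dir_schemas:
        groups.setdefault(schema, []).append(directory)

    lines, aliases = [], []
    for i, (schema, dirs) in enumerate(groups.items()):
        alias = f"in{i}"
        aliases.append(alias)
        # Directories sharing a schema go into one comma-separated LOAD path.
        paths = ",".join(dirs)
        lines.append(
            f"{alias} = LOAD '{paths}' USING PigStorage('{delimiter}') AS ({schema});"
        )

    if len(aliases) == 1:
        # Only one schema in play: nothing to union.
        lines.append(f"merged = FOREACH {aliases[0]} GENERATE *;")
    else:
        lines.append(f"merged = UNION ONSCHEMA {', '.join(aliases)};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_union_script([
        ("dir1", "f1:chararray, f2:int"),
        ("dir2", "f1:chararray, f2:int, f3:int"),
    ]))
```

Whether Pig could do this internally at load time is a separate question; the sketch only shows the shape of the generated script.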

For now, I think my quick fix is a good step forward, since it fixes the issue for one common use case: a PigStorage schema that evolves by adding more fields.


> PigStorage with -schema may produce inconsistent outputs with more fields
> -------------------------------------------------------------------------
>
>                 Key: PIG-5231
>                 URL: https://issues.apache.org/jira/browse/PIG-5231
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: pig-5231-v01.patch
>
>
> When multiple directories are passed to PigStorage(',','-schema'), Pig behaves as documented:
> {quote}
> No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.
> {quote}
> For an input of two directories with schemas
> file1: (f1:chararray, f2:int) and
> file2: (f1:chararray, f2:int, f3:int)
> Pig will pick the first schema, from file1, and only allow access to f1 and f2.
> However, the output would still contain 3 fields for tuples from file2. This later leads to completely corrupt output, because the shifted fields cause incorrect references.
> (This may also happen when input itself contains the delimiter.)
> If file2's schema is picked, this is already handled by filling the missing fields with null (PIG-3100).
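
The field shift described above can be reproduced outside Pig with a small Python simulation. This is only a sketch of the symptom and of one possible normalization; it is not the code in pig-5231-v01.patch. The helper fit_to_schema is hypothetical: padding with None mirrors the null filling attributed to PIG-3100, while truncating extra fields is an assumption made for illustration.

```python
def fit_to_schema(fields, width):
    """Return exactly `width` fields: pad short rows with None, drop extras."""
    return (fields + [None] * width)[:width]


# Schema picked from file1: (f1, f2), so two fields are expected.
schema_width = 2

rows = [
    "a,1".split(","),    # row from file1: 2 fields
    "b,2,3".split(","),  # row from file2: 3 fields
]

# Downstream step appends a new column, expecting it to land at index 2.
shifted = [r + ["new"] for r in rows]
# For the file2 row, index 2 still holds f3, so "new" is pushed to index 3.

normalized = [fit_to_schema(r, schema_width) + ["new"] for r in rows]
# After normalization every row has "new" at index 2.

if __name__ == "__main__":
    print(shifted)
    print(normalized)
```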



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)