Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2017/05/25 18:21:04 UTC

[jira] [Commented] (PIG-5231) PigStorage with -schema may produce inconsistent outputs with more fields

    [ https://issues.apache.org/jira/browse/PIG-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025155#comment-16025155 ] 

Daniel Dai commented on PIG-5231:
---------------------------------

I vote for option 3. Every LoadFunc, including OrcStorage and AvroStorage, picks the first schema found across the input directories. I don't think we should make an exception for PigStorage. +1 for the patch.

> PigStorage with -schema may produce inconsistent outputs with more fields
> -------------------------------------------------------------------------
>
>                 Key: PIG-5231
>                 URL: https://issues.apache.org/jira/browse/PIG-5231
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-5231-v01.patch
>
>
> When multiple directories are passed to PigStorage(',','-schema'), Pig follows the documented behavior:
> {quote}
> No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.
> {quote}
> For an input of two directories with the schemas
> file1: (f1:chararray, f2:int) and
> file2: (f1:chararray, f2:int, f3:int),
> Pig picks the first schema, from file1, and only allows access to f1 and f2.
> However, the output still contains 3 fields for tuples from file2. This later leads to completely corrupt output, because the shifted fields cause incorrect references.
> (This may also happen when the input itself contains the delimiter.)
> If the file2 schema is picked instead, the mismatch is already handled by filling the missing fields with null (PIG-3100).
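
To make the mismatch described above concrete, here is a small Python model of it. This is only a sketch and not Pig code: the helper names load_line and conform are hypothetical, and truncating extra fields is just one possible way to force every tuple to the chosen schema width. It is not necessarily what the attached patch does.

```python
def load_line(line, delimiter=","):
    """Split one input line into raw fields, ignoring any schema."""
    return line.split(delimiter)


def conform(fields, schema_width):
    """Return a tuple with exactly schema_width fields.

    Extra fields are dropped and missing fields are filled with None,
    so every tuple matches the width of the chosen schema.
    """
    trimmed = fields[:schema_width]
    return trimmed + [None] * (schema_width - len(trimmed))


# Assumed sample data: file1 has two fields per row, file2 has three.
schema = ["f1", "f2"]  # first schema encountered, taken from file1
file1_rows = ["a,1"]
file2_rows = ["b,2,3"]

# Without any adjustment, tuples from file2 keep their third field
# even though the schema only declares two.
raw = [load_line(row) for row in file1_rows + file2_rows]

# Forcing every tuple to the schema width gives consistent output.
fixed = [conform(fields, len(schema)) for fields in raw]
```

In this model, raw holds tuples of different widths, which is the inconsistency reported in this issue, while fixed holds only two-field tuples. The None padding in conform mirrors the null filling mentioned for PIG-3100.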


