Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2017/05/25 18:21:04 UTC
[jira] [Commented] (PIG-5231) PigStorage with -schema may produce
inconsistent outputs with more fields
[ https://issues.apache.org/jira/browse/PIG-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025155#comment-16025155 ]
Daniel Dai commented on PIG-5231:
---------------------------------
I vote for option 3. Every LoadFunc, such as OrcStorage and AvroStorage, picks the first schema it finds in the directories. I don't think we should make an exception for PigStorage. +1 for the patch.
> PigStorage with -schema may produce inconsistent outputs with more fields
> -------------------------------------------------------------------------
>
> Key: PIG-5231
> URL: https://issues.apache.org/jira/browse/PIG-5231
> Project: Pig
> Issue Type: Bug
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-5231-v01.patch
>
>
> When multiple directories are passed to PigStorage(',','-schema'), Pig follows this documented behavior:
> {quote}
> No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.
> {quote}
> For input from two directories with the schemas
> file1: (f1:chararray, f2:int) and
> file2: (f1:chararray, f2:int, f3:int)
> Pig will pick the first schema, from file1, and only allow access to f1 and f2.
> However, the output would still contain 3 fields for tuples from file2. This later leads to completely corrupt output, because the shifted fields result in incorrect references.
> (This may also happen when the input itself contains the delimiter.)
> If file2's schema is picked instead, the mismatch is already handled by filling the missing fields with null (PIG-3100).
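
The field shift described in the issue can be illustrated with a small simulation. This is only a sketch in plain Python, not Pig's implementation: the helper names (parse_line, fit_to_schema, append_constant) are hypothetical, and dropping the extra fields is an assumed remedy for illustration. Whether the attached patch works this way is not stated in this message. Padding short tuples with nulls mirrors the behavior attributed to PIG-3100 above.

```python
# Illustrative simulation only; this is not Pig's actual implementation.
# Schema picked first during the scan (from file1), as in the issue.
SCHEMA_FILE1 = ["f1", "f2"]


def parse_line(line, delimiter=","):
    """Split a raw text line into fields, as a delimited loader would."""
    return line.split(delimiter)


def fit_to_schema(fields, schema):
    """Hypothetical helper: pad short tuples with None, drop extra fields."""
    width = len(schema)
    return (fields + [None] * width)[:width]


def append_constant(fields, value):
    """Stand-in for a later step that adds a column after the loaded ones."""
    return fields + [value]


file1_rows = ["a,1"]    # matches (f1, f2)
file2_rows = ["b,2,3"]  # carries an extra f3 not in the picked schema

raw = [parse_line(r) for r in file1_rows + file2_rows]

# Without normalization the file2 tuple keeps its undeclared third field,
# so a column appended later lands at a different position per tuple.
unfixed = [append_constant(t, "x") for t in raw]

# With every tuple fitted to the picked schema, positions stay consistent.
fixed = [append_constant(fit_to_schema(t, SCHEMA_FILE1), "x") for t in raw]

print(unfixed)
print(fixed)
```

In the unfixed case the appended value sits at index 2 for the file1 tuple but at index 3 for the file2 tuple, which is the kind of inconsistent positional reference the issue describes.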