Posted to dev@pig.apache.org by "Niels Basjes (JIRA)" <ji...@apache.org> on 2015/10/06 11:48:27 UTC

[jira] [Updated] (PIG-4689) CSV Writes incorrect header if two CSV files are created in one script

     [ https://issues.apache.org/jira/browse/PIG-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Basjes updated PIG-4689:
------------------------------
    Attachment: PIG-4689-2015-10-06.patch

*Found it*
Turns out that CSVExcelStorage implemented the WRONG setter for receiving the unique UDFContextSignature.
Hence no unique value was ever set, and all instances used 'null' as their 'unique value'.
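
To illustrate why a missing signature leads to a shared header, here is a minimal, self-contained sketch. It is not the CSVExcelStorage code and not the attached patch: the class HeaderStorer, the PROPERTIES map and the run() helper are hypothetical stand-ins. The only assumption taken from Pig is the general pattern that a storer keeps per-instance state under its signature. In the real API the load side receives its signature through LoadFunc.setUDFContextSignature and the store side through StoreFuncInterface.setStoreFuncUDFContextSignature, so overriding only the former would presumably leave a storer's signature unset.

```java
import java.util.HashMap;
import java.util.Map;

class SignatureCollisionSketch {
    // Hypothetical stand-in for a per-job property store, keyed by signature.
    static final Map<String, String> PROPERTIES = new HashMap<>();

    /** Hypothetical storer that keeps its header under its own signature. */
    static class HeaderStorer {
        // Stays null when the framework never calls the setter.
        private String signature;

        void setSignature(String signature) {
            this.signature = signature;
        }

        // Front-end step: remember the header for this output.
        void rememberHeader(String header) {
            PROPERTIES.put(signature, header);
        }

        // Back-end step: look up the header that should be written.
        String headerToWrite() {
            return PROPERTIES.get(signature);
        }
    }

    /** Returns the headers that the Foo and Bar storers would write. */
    static String[] run(boolean signaturesSet) {
        PROPERTIES.clear();
        HeaderStorer foo = new HeaderStorer();
        HeaderStorer bar = new HeaderStorer();
        if (signaturesSet) {
            foo.setSignature("Foo");
            bar.setSignature("Bar");
        }
        // Both storers register their header before any output is written.
        foo.rememberHeader("a");
        bar.rememberHeader("b\tc");
        return new String[] { foo.headerToWrite(), bar.headerToWrite() };
    }

    public static void main(String[] args) {
        String[] broken = run(false);
        String[] fixed = run(true);
        System.out.println("No signature, Foo header: " + broken[0]);
        System.out.println("With signature, Foo header: " + fixed[0]);
    }
}
```

Without signatures both storers write to the same 'null' key, so the header registered last (the one for Bar) is also returned for Foo. With distinct signatures each storer reads back its own header.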

{code:title=Foo/part-m-00000}
a
1
{code}

{code:title=Bar/part-m-00000}
b	c
1	a
{code}

> CSV Writes incorrect header if two CSV files are created in one script
> ----------------------------------------------------------------------
>
>                 Key: PIG-4689
>                 URL: https://issues.apache.org/jira/browse/PIG-4689
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.14.0
>            Reporter: Niels Basjes
>         Attachments: PIG-4689-2015-10-06.patch
>
>
> From a single Pig script I write two completely different and unrelated CSV files; both with the flag 'WRITE_OUTPUT_HEADER'.
> The bug is that both files get the SAME header at the top of the output file even though the data is different.
> *Reproduction:*
> {code:title=foo.txt}
> 1
> {code}
> {code:title=bar.txt (Tab separated)}
> 1	a
> {code}
> {code:title=WriteTwoCSV.pig}
> FOO =
>     LOAD 'foo.txt'
>     USING PigStorage('\t')
>     AS (a:chararray);
> BAR =
>     LOAD 'bar.txt'
>     USING PigStorage('\t')
>     AS (b:chararray, c:chararray);
> STORE FOO into 'Foo'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
> STORE BAR into 'Bar'
> USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
> {code}
> *Command:*
> {quote}pig -x local WriteTwoCSV.pig{quote}
> *Result:*
> {quote}cat Bar/part-*{quote}
> {code}
> b	c
> 1	a
> {code}
> {quote}cat Foo/part-*{quote}
> {code}
> b	c
> 1
> {code}
> *The error is that the {{Foo}} output has the two-column header from the {{Bar}} output.*
> *One of the effects is that parsing the {{Foo}} data will probably fail due to the varying number of columns.*


