You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Adam Szita (JIRA)" <ji...@apache.org> on 2016/10/20 10:23:58 UTC

[jira] [Assigned] (PIG-4572) CSVExcelStorage treats newlines within fields as record seperator when input file is split

     [ https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Szita reassigned PIG-4572:
-------------------------------

    Assignee: Adam Szita

> CSVExcelStorage treats newlines within fields as record seperator when input file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
>            Reporter: Le Clue
>            Assignee: Adam Szita
>              Labels: CSVExcelStorage, pig
>             Fix For: 0.17.0
>
>         Attachments: SmallTest.txt, script.pig
>
>
> It seems that when a field enclosed by double-quotes contains a carriage return or linefeed, and the input file is bigger than the dfs blocksize, the input split does not honor CSVExcelStorage's treatment of newlines within fields.
> It seems that the input is split by the linefeed closest to the byte range defined for the split, and causes fields to become skewed.
> For example, 3190 Byte Text file containing 21 identical records such as the below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line termination here is specified by a CRLF
> Run through a pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
> AS(
>   name:chararray,
>   sysid:chararray,
>   message:chararray,
>   messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> Results in 4 output files:
> -rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 /output052/_SUCCESS
> -rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 /output052/part-m-00000
> -rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 /output052/part-m-00001
> -rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 /output052/part-m-00002
> -rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> Skewing occurs on the third part.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)