Posted to dev@pig.apache.org by "Shawn Weeks (JIRA)" <ji...@apache.org> on 2016/10/18 03:27:58 UTC
[jira] [Commented] (PIG-4572) CSVExcelStorage treats newlines within fields as record separator when input file is split

    [ https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584300#comment-15584300 ] 

Shawn Weeks commented on PIG-4572:
----------------------------------

I've loaded several large (10GB+) files with embedded newlines and had them load correctly when split, but I'm starting to think it was blind luck that no split boundary landed on one of the embedded newlines. I'm now hitting this issue with a file where every row has an embedded newline in the same column and, as luck would have it, every split falls on the embedded newline instead of the row-delimiter newline.

> CSVExcelStorage treats newlines within fields as record separator when input file is split
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-4572
>                 URL: https://issues.apache.org/jira/browse/PIG-4572
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
>            Reporter: Le Clue
>              Labels: CSVExcelStorage, pig
>             Fix For: 0.17.0
>
>         Attachments: SmallTest.txt, script.pig
>
>
> It seems that when a field enclosed by double-quotes contains a carriage return or linefeed, and the input file is bigger than the dfs blocksize, the input split does not honor CSVExcelStorage's treatment of newlines within fields.
> The input appears to be split at the linefeed closest to the byte range defined for the split, which causes fields to become skewed.
> For example, take a 3190-byte text file containing 21 identical records such as the one below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line here is terminated by a CRLF.
> The file is run through this Pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
> AS(
>   name:chararray,
>   sysid:chararray,
>   message:chararray,
>   messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> This produces 4 part files (plus _SUCCESS):
> -rw-r--r--   1 hadoop supergroup          0 2015-05-26 07:19 /output052/_SUCCESS
> -rw-r--r--   1 hadoop supergroup         63 2015-05-26 07:19 /output052/part-m-00000
> -rw-r--r--   1 hadoop supergroup         54 2015-05-26 07:19 /output052/part-m-00001
> -rw-r--r--   1 hadoop supergroup        769 2015-05-26 07:19 /output052/part-m-00002
> -rw-r--r--   1 hadoop supergroup         25 2015-05-26 07:19 /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> The skew appears in the third part file (part-m-00002).
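
The behaviour in the report can be reproduced outside Pig with a small model. The Python sketch below is only an illustration under stated assumptions, not the Hadoop or piggybank code: it assumes a line-oriented reader that, for any split not starting at byte 0, skips ahead to the first linefeed and starts there, with no knowledge of quoting. The sample record and the 1024-byte split size come from the report; dropping the final CRLF to reach 3190 bytes is a guess about how the file ends, and the helper names are hypothetical.

```python
# Simplified model of a line-oriented split reader. This is an assumption
# made for illustration; it is not the Hadoop or piggybank implementation.

RECORD = (
    '"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    'This is the second line.\r\n'
    'Thank you for listening."~"2012-08-24 09:16:02"\r\n'
)
# 21 identical records; the last CRLF is dropped (assumed) to get 3190 bytes.
DATA = (RECORD * 21).encode("ascii")[:-2]
SPLIT_SIZE = 1024  # from mapred.min/max.split.size in the report


def first_line_start(data: bytes, split_start: int) -> int:
    """Offset where a quote-unaware reader begins for a split.

    A split that does not start at byte 0 skips to just past the next
    linefeed, on the assumption that the previous split owns that line.
    """
    if split_start == 0:
        return 0
    nl = data.find(b"\n", split_start)
    return len(data) if nl == -1 else nl + 1


def line_at(data: bytes, pos: int) -> bytes:
    """Physical line (including its terminator) starting at pos."""
    end = data.find(b"\n", pos)
    return data[pos:] if end == -1 else data[pos:end + 1]


def inside_quotes(data: bytes, pos: int) -> bool:
    """True if pos falls inside a double-quoted field.

    Counts quotes from the start of the file and assumes no escaped
    quotes, which holds for the sample record.
    """
    return data[:pos].count(b'"') % 2 == 1


for split_start in range(0, len(DATA), SPLIT_SIZE):
    pos = first_line_start(DATA, split_start)
    print(split_start, pos, inside_quotes(DATA, pos), line_at(DATA, pos))
```

Under this model the split at byte 1024 happens to begin on a record boundary, while the splits at 2048 and 3072 both begin at "This is the second line.", inside the quoted message field. That matches the first line of part-m-00002 and part-m-00003 above. Note that the quote-parity check needs to scan from the start of the file, so a reader handed only its own byte range cannot decide this locally.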


