Posted to dev@pig.apache.org by "Adam Szita (JIRA)" <ji...@apache.org> on 2016/11/07 08:35:58 UTC
[jira] [Resolved] (PIG-4572) CSVExcelStorage treats newlines within
fields as record separator when input file is split
[ https://issues.apache.org/jira/browse/PIG-4572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Szita resolved PIG-4572.
-----------------------------
Resolution: Resolved
> CSVExcelStorage treats newlines within fields as record separator when input file is split
> ------------------------------------------------------------------------------------------
>
> Key: PIG-4572
> URL: https://issues.apache.org/jira/browse/PIG-4572
> Project: Pig
> Issue Type: Bug
> Components: piggybank
> Affects Versions: 0.12.0, 0.14.0
> Environment: Amazon ElasticMapReduce AMI 3.6.0
> Apache Pig version 0.14.0 and 0.12.0
> Hadoop 2.4.0
> Reporter: Le Clue
> Assignee: Adam Szita
> Labels: CSVExcelStorage, pig
> Fix For: 0.17.0
>
> Attachments: SmallTest.txt, script.pig
>
>
> When a field enclosed in double quotes contains a carriage return or linefeed and the input file is larger than the DFS block size, the input split does not appear to honor CSVExcelStorage's treatment of newlines within fields.
> The input seems to be split at the linefeed closest to the byte range defined for the split, even when that linefeed falls inside a quoted field, which causes fields to become skewed.
> For example, take a 3190-byte text file containing 21 identical records like the one below:
> "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
> This is the second line.
> Thank you for listening."~"2012-08-24 09:16:02"
> Each line here is terminated by a CRLF.
> Running the file through this Pig script:
> SET mapred.min.split.size 1024;
> SET mapred.max.split.size 1024;
> SET pig.noSplitCombination true;
> SET mapred.max.jobs.per.node 1;
> myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
> AS(
> name:chararray,
> sysid:chararray,
> message:chararray,
> messagedate:chararray
> );
> myinput_tuples = FOREACH myinput_file GENERATE name;
> STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
> Results in 4 part files (plus the empty _SUCCESS marker):
> -rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
> -rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
> -rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
> -rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
> -rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> John Doe
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
> This is the second line.
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> "Thank you for listening.~2012-08-24 09:16:02""
> John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
> [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
> This is the second line.
> The skewing occurs in the third part file (part-m-00002).
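
The split arithmetic in the report can be checked offline with a small Python sketch. It rebuilds the sample file from the record quoted in the description and, for each 1024-byte split, computes where a line-oriented reader would start and whether that offset lies inside a quoted field. Several things here are assumptions rather than facts from the issue: the file is taken to be 21 copies of the record with the final CRLF dropped (which yields the reported 3190 bytes), the "skip the partial first line" rule is a simplified model of a line-based record reader and not Pig's or Hadoop's actual code, and the helper names are hypothetical.

```python
# One sample record, copied from the report, with CRLF line endings.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b'This is the second line.\r\n'
    b'Thank you for listening."~"2012-08-24 09:16:02"\r\n'
)

# Assumption: 21 records, no CRLF after the last one (gives 3190 bytes).
DATA = (RECORD * 21)[:-2]
SPLIT_SIZE = 1024  # matches mapred.min/max.split.size in the script


def first_line_start(data: bytes, split_start: int) -> int:
    """Offset at which a naive line-based reader begins for this split.

    Simplified model: the first split starts at 0; any other split
    discards everything up to and including the next linefeed.
    """
    if split_start == 0:
        return 0
    lf = data.find(b"\n", split_start)
    return len(data) if lf == -1 else lf + 1


def inside_quoted_field(data: bytes, offset: int) -> bool:
    """True if an odd number of double quotes precedes the offset.

    Valid only because the sample data contains no escaped quotes.
    """
    return data[:offset].count(b'"') % 2 == 1


def describe_splits(data: bytes, split_size: int):
    """Return (split_start, reader_start, starts_inside_quotes) per split."""
    rows = []
    for start in range(0, len(data), split_size):
        begin = first_line_start(data, start)
        rows.append((start, begin, inside_quoted_field(data, begin)))
    return rows


if __name__ == "__main__":
    for start, begin, broken in describe_splits(DATA, SPLIT_SIZE):
        status = "inside a quoted field" if broken else "at a record boundary"
        print(f"split at {start:4d}: reader starts at {begin:4d}, {status}")
```

Under these assumptions the splits at 0 and 1024 begin at a record boundary, while the splits at 2048 and 3072 begin on the line "This is the second line." inside the quoted message field. That is consistent with part-m-00002 and part-m-00003 both opening with that line in the output above.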