Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2014/03/28 01:05:15 UTC

[jira] [Commented] (PIG-3828) PigStorage should properly handle \r, \n, \r\n in record itself

    [ https://issues.apache.org/jira/browse/PIG-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950179#comment-13950179 ] 

Daniel Dai commented on PIG-3828:
---------------------------------

This is a duplicate of PIG-836. With MAPREDUCE-2254, you should be able to set textinputformat.record.delimiter if you use Hadoop 0.23+. Are you able to try it?
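For reference, a minimal sketch of where that property lives in the Hadoop job configuration. The class name below is illustrative, and the Pig-side syntax is an assumption to verify against your Pig version; only the property key itself comes from MAPREDUCE-2254:

```java
// Sketch only: setting textinputformat.record.delimiter on a Hadoop
// Configuration (the property is honored since MAPREDUCE-2254 / Hadoop 0.23+).
import org.apache.hadoop.conf.Configuration;

public class RecordDelimiterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pick a delimiter that cannot appear inside a record, so that bare
        // \r or \n characters inside field data no longer end the record.
        conf.set("textinputformat.record.delimiter", "\u0001\n");
    }
}
```

In a Pig script the equivalent would presumably be a `set textinputformat.record.delimiter '...';` statement, since Pig forwards unknown `set` properties to the job configuration.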

> PigStorage should properly handle \r, \n, \r\n in record itself
> ---------------------------------------------------------------
>
>                 Key: PIG-3828
>                 URL: https://issues.apache.org/jira/browse/PIG-3828
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.11.1
>         Environment: Linux
>            Reporter: Yubao Liu
>
> Currently PigStorage uses org.apache.hadoop.mapreduce.lib.input.LineRecordReader to read lines. LineRecordReader treats "\r", "\n", and "\r\n" all as end of line, so if a record unluckily contains one of these sequences, PigStorage reads fewer fields than expected. This produces an IndexOutOfBoundsException when the "-schema" option is given under Pig <= 0.12.0:
> https://svn.apache.org/repos/asf/pig/tags/release-0.12.0/src/org/apache/pig/builtin/PigStorage.java
> {quote}
> private Tuple applySchema(Tuple tup) throws IOException {
>     ....
>     for (int i = 0; i < Math.min(fieldSchemas.length, tup.size()); i++) {
>         if (mRequiredColumns == null || (mRequiredColumns.length > i && mRequiredColumns[i])) {
>             Object val = null;
>             if (tup.get(tupleIdx) != null) {     <--- !!! IndexOutOfBoundsException
> {quote}
> In Pig trunk, null values are silently filled in:
> {quote}
> https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/builtin/PigStorage.java
> for (int i = 0; i < fieldSchemas.length; i++) {
>     if (mRequiredColumns == null || (mRequiredColumns.length > i && mRequiredColumns[i])) {
>         if (tupleIdx >= tup.size()) {
>             tup.append(null);                     <--- !!! silently fill null
>         }
>         Object val = null;
>         if (tup.get(tupleIdx) != null) {
> {quote}
> The behaviour of Pig trunk is still error-prone:
> * null is silently filled in for the current record; the user's Pig script may not expect that field to be null and can hit a NullPointerException
> * the next record is totally garbled because it starts from the middle of the previous record; the data types of the fields in that record are all wrong, so this will likely break the user's Pig script
> Until PigStorage supports a customized record separator, this may be a not-so-bad workaround for this nasty issue: usually there is only a small chance that the very first record contains a record separator, so PigStorage can remember maxFieldsNumber in PigStorage.getNext(). If getNext() then parses a record and finds it has fewer fields, it simply drops that record; the second half of the split record will be dropped too, because it must also have fewer fields. In this way, PigStorage can discard bad records on a best-effort basis.
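The best-effort filtering idea proposed in the report above can be sketched outside Pig as plain Java. The class and method names, and the tab separator, are illustrative assumptions (tab is PigStorage's default delimiter), not PigStorage internals:

```java
import java.util.ArrayList;
import java.util.List;

public class BestEffortFilter {
    // Hypothetical stand-in for the logic proposed for PigStorage.getNext():
    // remember the widest record seen so far and drop any shorter record,
    // on the assumption that the first record is not split by an embedded
    // \r or \n and therefore fixes the expected field count.
    static List<String[]> filterBadRecords(List<String> lines) {
        List<String[]> good = new ArrayList<>();
        int maxFieldsNumber = 0;
        for (String line : lines) {
            String[] fields = line.split("\t", -1);
            if (fields.length > maxFieldsNumber) {
                maxFieldsNumber = fields.length;  // first full record sets the width
            }
            if (fields.length == maxFieldsNumber) {
                good.add(fields);
            }
            // else: drop the record; both halves of a record split by an
            // embedded \r or \n have fewer fields, so both are discarded
        }
        return good;
    }

    public static void main(String[] args) {
        // A record whose middle field contained an embedded newline arrives
        // from LineRecordReader as two short physical lines ("b1" and "b2\tb3"):
        List<String> lines = List.of(
                "a1\ta2\ta3",   // good record, fixes maxFieldsNumber = 3
                "b1",           // first half of a split record -> dropped
                "b2\tb3",       // second half of the split record -> dropped
                "c1\tc2\tc3");  // good record
        System.out.println(filterBadRecords(lines).size()); // prints 2
    }
}
```

Note the trade-off: records that legitimately have fewer fields than the first record would also be discarded, which is why this is only a best-effort heuristic rather than a fix.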



--
This message was sent by Atlassian JIRA
(v6.2#6252)