Posted to user@pig.apache.org by "Bhagwan S. Soni" <bh...@gmail.com> on 2015/11/12 08:59:06 UTC

How to set newline character as \n and avoid others like \r in Pig Job?

Hi,

I have a file arriving in *HDFS* from a source system that contains more
than one kind of *newline character*, both *\n* and *\r*, which creates
extra lines when a MapReduce/Pig job runs.
I'm fine with *\n* as the newline and just want to avoid *\r*.
I'm setting the newline character when running my Pig job using the
property below:



*-D textinputformat.record.delimiter*
I tried many values for the newline character, but none of them makes any
difference: the whole file is read as a single row.
Below are the values I have already tried in order to set \n as the
newline character:

-D textinputformat.record.delimiter=\\n
-D textinputformat.record.delimiter=\\u000a
-D textinputformat.record.delimiter=\u000a
-D textinputformat.record.delimiter=0x0a
-D textinputformat.record.delimiter=0x0A
-D textinputformat.record.delimiter=00001010
-D textinputformat.record.delimiter=\&#xa\;

Is there any possible value which I'm missing?

I was also looking into creating a custom loader for this by extending
the PigStorage class, but I'm not sure whether I would also have to write
my own RecordReader to do that.


*Thanks,*

Re: How to set newline character as \n and avoid others like \r in Pig Job?

Posted by Daniel Dai <da...@hortonworks.com>.
The problem is that the parameter is passed to TextInputFormat without
escape sequences being interpreted, which makes it hard to pass a \n
character.

One alternative is to write a simple LoadFunc and set the parameter from a
Java string, where escape sequences are interpreted, for example:
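
To make the difference concrete, here is a small plain-Java sketch (no
Hadoop or Pig needed). It assumes the shell hands the JVM the two
characters backslash and n for a value such as \\n; the class and method
names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DelimiterBytes {
    // Returns the UTF-8 bytes of a delimiter string, i.e. the bytes a
    // reader would compare against the input.
    static byte[] delimiterBytes(String delimiter) {
        return delimiter.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed command-line value: a backslash followed by 'n'.
        String fromCommandLine = "\\n";
        // Java string literal: the compiler turns \n into one line feed.
        String fromJavaLiteral = "\n";

        System.out.println(Arrays.toString(delimiterBytes(fromCommandLine))); // [92, 110]
        System.out.println(Arrays.toString(delimiterBytes(fromJavaLiteral))); // [10]
    }
}
```

If the file never contains a literal backslash followed by n, a two-byte
delimiter like the first one never matches, which would explain why the
whole file comes back as a single row.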

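import java.io.IOException;

import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.builtin.PigStorage;

// PigStorage variant that uses a line feed as the record delimiter.
public class PigStorageNewLine extends PigStorage {
  @Override
  public void setLocation(String location, Job job) throws IOException {
    // "\n" is a Java literal here, so the delimiter is a real line feed.
    job.getConfiguration().set("textinputformat.record.delimiter", "\n");
    super.setLocation(location, job);
  }
}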
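
As a rough illustration of what the explicit delimiter changes: as far as
I know, without a custom delimiter the Hadoop line reader ends a record at
\n, \r, or \r\n, while with a custom delimiter it ends a record only at
that exact sequence. The sketch below mimics both behaviours with plain
Java string splitting; it does not call Hadoop, and the class and method
names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class RecordSplitDemo {
    // Mimics the assumed default: \r\n, \r and \n each end a record.
    static List<String> splitDefault(String data) {
        return Arrays.asList(data.split("\r\n|\r|\n"));
    }

    // Mimics a custom "\n" delimiter: only a line feed ends a record.
    static List<String> splitOnLineFeed(String data) {
        return Arrays.asList(data.split("\n"));
    }

    public static void main(String[] args) {
        // Two logical rows; the first one has a stray \r inside a field.
        String data = "a\rb,1\nc,2\n";

        System.out.println(splitDefault(data).size());    // 3
        System.out.println(splitOnLineFeed(data).size()); // 2
    }
}
```

Note that with "\n" as the delimiter the stray \r is no longer a record
boundary, but it is still present inside the record as data.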


Thanks,
Daniel


