Posted to mapreduce-user@hadoop.apache.org by Marc Sturm <ma...@nyp.org> on 2012/04/09 21:01:10 UTC

mapreduce line separator question

Hi,
I am new to MapReduce and I have a short question: is it possible for a MapReduce job to split the lines of a file on \n only and not treat \r as a line ending? Basically, in the use case I am looking into, the \r has to be included when reading a line.
I am just "playing" with MapReduce on a standalone Hadoop, not using HDFS, and I am looking into writing my own LineReader, but I am afraid it is much more complicated than this. I could also update each line and replace the \r with a \t, but I would rather leave the file and data as is.
Any insight and/or link to the correct documentation will be appreciated.
Thanks,
Marc


________________________________
This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.


RE: mapreduce line separator question

Posted by Marc Sturm <ma...@nyp.org>.
Yes, we are now trying 1.x, and having this option would be great. I have never filed a JIRA with the ASF, but I will do it.
Thanks,
Marc

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, April 09, 2012 3:46 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: mapreduce line separator question

Marc,

The answer depends on the Hadoop version you are running. The following requires
https://issues.apache.org/jira/browse/MAPREDUCE-2254, which is currently present in 0.23 (and eventually 2.x) and also (last I checked) in CDH3, if you use that:

Simply set "textinputformat.record.delimiter" in your Job's configuration to the exact character string you need, and it will be used as the record/line delimiter in TextInputFormat. The string can also be multi-character, and records will be read based on the provided sequence.

It's presently unavailable in 1.x, but it appears harmless to add, and if you can file a JIRA with a backport I can review and commit it for a future 1.x update.

On Tue, Apr 10, 2012 at 12:31 AM, Marc Sturm <ma...@nyp.org> wrote:
> Hi,
>
> I am new to MapReduce and I have a short question: is it possible for
> a MapReduce job to split the lines of a file on \n only and not treat
> \r as a line ending? Basically, in the use case I am looking into, the
> \r has to be included when reading a line.
>
> I am just "playing" with MapReduce on a standalone Hadoop, not using
> HDFS, and I am looking into writing my own LineReader, but I am afraid
> it is much more complicated than this. I could also update each line and
> replace the \r with a \t, but I would rather leave the file and data as is.
>
> Any insight and/or link to the correct documentation will be appreciated.
>
> Thanks,
>
> Marc



--
Harsh J


Re: mapreduce line separator question

Posted by Harsh J <ha...@cloudera.com>.
Marc,

The answer depends on the Hadoop version you are running. The
following requires
https://issues.apache.org/jira/browse/MAPREDUCE-2254, which is
currently present in 0.23 (and eventually 2.x) and also (last I
checked) in CDH3, if you use that:

Simply set "textinputformat.record.delimiter" in your Job's
configuration to the exact character string you need, and it will
be used as the record/line delimiter in TextInputFormat. The string
can also be multi-character, and records will be read based on the
provided sequence.
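
To make the effect concrete, here is a rough, self-contained sketch of
what splitting on an exact delimiter string means for data containing
\r. It is an illustration only, not Hadoop's record reader: the class
DelimiterSketch and the method splitRecords are hypothetical names, and
the sample data is made up. The comment at the top shows how the
property would be set in a driver, assuming a Hadoop build that
includes MAPREDUCE-2254.

```java
import java.util.ArrayList;
import java.util.List;

/*
 * Illustration only. In a real job driver (assumes MAPREDUCE-2254 is present):
 *
 *   Configuration conf = new Configuration();
 *   conf.set("textinputformat.record.delimiter", "\n");
 *
 * The class below just mimics "split on the exact delimiter string"
 * on an in-memory String so the handling of \r is easy to see.
 */
class DelimiterSketch {

    /** Splits text on the exact delimiter string; the delimiter itself is dropped. */
    static List<String> splitRecords(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (start < text.length()) {
            // Trailing record that is not followed by a delimiter.
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample: a bare \r inside the first record, \r\n after the second.
        String data = "a\rb\nc\r\nd";
        for (String record : splitRecords(data, "\n")) {
            // Show \r visibly instead of letting the terminal interpret it.
            System.out.println(record.replace("\r", "\\r"));
        }
    }
}
```

With "\n" as the only delimiter, every \r stays inside its record. As
far as I know, the default line reading treats \r, \n and \r\n all as
line endings, which is why the explicit delimiter is needed here.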

It's presently unavailable in 1.x, but it appears harmless to add,
and if you can file a JIRA with a backport I can review and commit
it for a future 1.x update.

On Tue, Apr 10, 2012 at 12:31 AM, Marc Sturm <ma...@nyp.org> wrote:
> Hi,
>
> I am new to MapReduce and I have a short question: is it possible for a
> MapReduce job to split the lines of a file on \n only and not treat \r as
> a line ending? Basically, in the use case I am looking into, the \r has to
> be included when reading a line.
>
> I am just “playing” with MapReduce on a standalone Hadoop, not using HDFS,
> and I am looking into writing my own LineReader, but I am afraid it is much
> more complicated than this. I could also update each line and replace the \r
> with a \t, but I would rather leave the file and data as is.
>
> Any insight and/or link to the correct documentation will be appreciated.
>
> Thanks,
>
> Marc



-- 
Harsh J