Posted to common-user@hadoop.apache.org by Rajarshi Guha <rg...@indiana.edu> on 2009/05/05 23:37:28 UTC

multi-line records and file splits

Hi, I have implemented a subclass of RecordReader to handle a plain
text file format where each record is multi-line and of variable length.
Schematically, each record is of the form

some_title
foo
bar
$$$$
another_title
foo
foo
bar
$$$$

where $$$$ is the marker for the end of the record. My code is at
http://blog.rguha.net/?p=293 and it seems to work fine on my input data.

However, I realized that when I run the program, Hadoop will 'chunk'
the input file. As a result, the SDFRecordReader might get a chunk of
input text in which the last record is incomplete (a missing $$$$). Is
this correct?

If so, how would the RecordReader implementation recover from this
situation? Or is there a way to tell Hadoop to split the input file
with end-of-record delimiters in mind?

Thanks

-------------------------------------------------------------------
Rajarshi Guha  <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q:  What's polite and works for the phone company?
A:  A deferential operator.



Re: multi-line records and file splits

Posted by Rajarshi Guha <rg...@indiana.edu>.
On May 6, 2009, at 8:22 AM, Tom White wrote:

> Hi Rajarshi,
>
> FileInputFormat (SDFInputFormat's superclass) will break files into
> splits, typically on HDFS block boundaries (if the defaults are left
> unchanged). This is not a problem for your code however, since it will
> read every record that starts within a split (even if it crosses a
> split boundary). This is just like how TextInputFormat works. So you
> don't need to use MultiFileInputFormat - it should work as is. You
> could demonstrate this to yourself by writing a multi-block file, and
> doing an identity MapReduce on it. You should find that no records are
> lost.

Thanks for the description - once I realized that FileSplit.getStart()
and getLength() give me the file offsets, I was able to modify my
RecordReader subclass to deal with chunks starting and/or ending in
the middle of a record. (For my own understanding I wrote it up at
http://blog.rguha.net/?p=310 - maybe it'll be useful for other newbies.)
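
Roughly, the approach looks like the sketch below. This is not the
exact code from the blog post - the class name is illustrative, and it
assumes the old org.apache.hadoop.mapred API plus Hadoop 0.20's
org.apache.hadoop.util.LineReader:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.util.LineReader;

public class SplitAwareRecordReader implements RecordReader<LongWritable, Text> {

  private final long start;   // first byte of this split within the file
  private final long end;     // first byte past this split
  private long pos;           // current byte offset in the file
  private final LineReader in;

  public SplitAwareRecordReader(JobConf job, FileSplit split) throws IOException {
    start = split.getStart();
    end = start + split.getLength();
    FileSystem fs = split.getPath().getFileSystem(job);
    FSDataInputStream file = fs.open(split.getPath());
    file.seek(start);
    in = new LineReader(file, job);
    pos = start;
    if (start != 0) {
      // We probably landed mid-record: skip lines up to and including the
      // first "$$$$" - that partial record belongs to the previous split.
      // (Edge cases, e.g. seeking into the middle of a "$$$$" line itself,
      // are glossed over in this sketch.)
      Text skipped = new Text();
      int bytes;
      do {
        bytes = in.readLine(skipped);
        pos += bytes;
      } while (bytes > 0 && !skipped.toString().equals("$$$$"));
    }
  }

  public boolean next(LongWritable key, Text value) throws IOException {
    // Only begin a record whose first byte lies inside this split; once
    // begun, read past 'end' if necessary to reach the closing "$$$$".
    // Since every reader follows this rule, each record is read exactly once.
    if (pos >= end) return false;
    key.set(pos);
    StringBuilder record = new StringBuilder();
    Text line = new Text();
    int bytes;
    while ((bytes = in.readLine(line)) > 0) {
      pos += bytes;
      if (line.toString().equals("$$$$")) break;
      record.append(line.toString()).append('\n');
    }
    if (record.length() == 0) return false;  // nothing left in this split
    value.set(record.toString());
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue()       { return new Text(); }
  public long getPos()            { return pos; }
  public void close() throws IOException { in.close(); }
  public float getProgress() {
    return (end == start) ? 0.0f
        : Math.min(1.0f, (pos - start) / (float) (end - start));
  }
}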

> You might be able to use
> org.apache.hadoop.streaming.StreamXmlRecordReader (and
> StreamInputFormat), which does something similar. Despite its name it
> is not only for Streaming applications, and it isn't restricted to
> XML. It can parse records that begin with a certain sequence of
> characters, and end with another sequence.

I did indeed see this, after I wrote my own record reader :)

-------------------------------------------------------------------
Rajarshi Guha  <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q:  What's polite and works for the phone company?
A:  A deferential operator.



Re: multi-line records and file splits

Posted by Tom White <to...@cloudera.com>.
Hi Rajarshi,

FileInputFormat (SDFInputFormat's superclass) will break files into
splits, typically on HDFS block boundaries (if the defaults are left
unchanged). This is not a problem for your code however, since it will
read every record that starts within a split (even if it crosses a
split boundary). This is just like how TextInputFormat works. So you
don't need to use MultiFileInputFormat - it should work as is. You
could demonstrate this to yourself by writing a multi-block file, and
doing an identity MapReduce on it. You should find that no records are
lost.
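
A minimal identity job for that test might look like the following
sketch, where SDFInputFormat stands for your custom format and the old
mapred API is assumed:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SDFIdentityTest {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SDFIdentityTest.class);
    conf.setJobName("sdf-identity-test");
    conf.setInputFormat(SDFInputFormat.class);   // your custom input format
    conf.setOutputKeyClass(LongWritable.class);  // whatever your reader emits
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    // Compare the record count in the output against the number of "$$$$"
    // lines in the input - they should match if no records are lost.
  }
}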

You might be able to use
org.apache.hadoop.streaming.StreamXmlRecordReader (and
StreamInputFormat), which does something similar. Despite its name it
is not only for Streaming applications, and it isn't restricted to
XML. It can parse records that begin with a certain sequence of
characters, and end with another sequence.
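
Programmatically that looks something like the sketch below. The
property names are those of the 0.20-era streaming jar - check them
against your version - and the begin/end markers are hypothetical,
since your records have a fixed end marker but no fixed start:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class StreamReaderSetup {
  public static void configure(JobConf conf) {
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    // Hypothetical markers: this reader wants both a begin and an end
    // string, so it fits formats whose records open with a fixed tag.
    conf.set("stream.recordreader.begin", "<record>");
    conf.set("stream.recordreader.end", "</record>");
  }
}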

Cheers,
Tom

On Wed, May 6, 2009 at 2:06 AM, Nick Cen <ce...@gmail.com> wrote:
> I think your SDFInputFormat should extend MultiFileInputFormat
> instead of TextInputFormat, which will not split the file into chunks.

Re: multi-line records and file splits

Posted by Nick Cen <ce...@gmail.com>.
I think your SDFInputFormat should extend MultiFileInputFormat
instead of TextInputFormat, which will not split the file into chunks.



-- 
http://daily.appspot.com/food/