Posted to common-user@hadoop.apache.org by Kelly Burkhart <ke...@gmail.com> on 2011/03/04 20:30:42 UTC

HDFS file content restrictions

Hello, are there restrictions on the size or "width" of text files
placed in HDFS?  I have a file structure like this:

<text key><tab><text data><nl>

It would be helpful if in some circumstances I could make the text data
really large (many KB up to a few MB).  I may have some rows with a
very small payload and some with a very large payload.  Is this OK?
When HDFS splits the file into chunks to spread across the cluster,
will it ever split a record?  Total file size may be on the order of
20-30GB.
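
For what it's worth, I expect to pull the key and payload apart in a
mapper along these lines (just a rough sketch using the new mapreduce
API; the class name is mine):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With the default TextInputFormat each call to map() gets one whole
// <text key><tab><text data> line as the value, so the mapper only has
// to split on the first tab.
public class TabRecordMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String record = line.toString();
    int tab = record.indexOf('\t');
    if (tab < 0) {
      return;                                   // skip malformed rows
    }
    context.write(new Text(record.substring(0, tab)),       // key
                  new Text(record.substring(tab + 1)));     // payload, possibly MBs
  }
}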

Thanks,

-K

Re: HDFS file content restrictions

Posted by Harsh J <qw...@gmail.com>.
The class responsible for reading records as lines off a file seeks
into the next block in sequence until it finds the newline.  This behavior, and
how it affects the Map tasks, is better documented here (see the
TextInputFormat example doc):
http://wiki.apache.org/hadoop/HadoopMapReduce
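
If a toy helps, the following standalone snippet mimics the rule (it is
not the real LineRecordReader code; the names and the tiny in-memory
"file" are invented): every split except the first skips its partial
first line, and a reader keeps reading past its split's end offset to
finish the line it started, even when those bytes live in the next
block.

import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {

  // Lines the reader responsible for the byte range [start, end) would emit.
  static List<String> readSplit(byte[] file, int start, int end) {
    List<String> lines = new ArrayList<String>();
    int pos = start;
    if (start != 0) {
      // Not the first split: the partial line at the front belongs to the
      // previous split's reader, so skip up to and past the next '\n'.
      while (pos < file.length && file[pos] != '\n') pos++;
      pos++;
    }
    // Read whole lines; a line that starts at or before 'end' is finished
    // here even if its bytes run past the split boundary.
    while (pos < file.length && pos <= end) {
      int eol = pos;
      while (eol < file.length && file[eol] != '\n') eol++;
      lines.add(new String(file, pos, eol - pos));
      pos = eol + 1;
    }
    return lines;
  }

  public static void main(String[] args) {
    byte[] file = "k1\tshort\nk2\ta-much-longer-payload\nk3\tx\n".getBytes();
    int boundary = 15;   // pretend the split/block boundary falls mid-record
    System.out.println("split 0: " + readSplit(file, 0, boundary));
    System.out.println("split 1: " + readSplit(file, boundary, file.length));
  }
}

Run as-is, split 0 prints the first two records, including the one that
straddles the boundary, and split 1 prints only the third, so nothing is
duplicated or lost.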

On Sat, Mar 5, 2011 at 1:54 AM, Kelly Burkhart <ke...@gmail.com> wrote:
> On Fri, Mar 4, 2011 at 1:42 PM, Harsh J <qw...@gmail.com> wrote:
>> HDFS does not operate with records in mind.
>
> So does that mean that HDFS will break a file at exactly <blocksize>
> bytes?  Map/Reduce *does* operate with records in mind, so what
> happens to the split record?  Does HDFS put the fragments back
> together and deliver the reconstructed record to one map?  Or are both
> fragments and consequently the whole record discarded?
>
> Thanks,
>
> -Kelly
>



-- 
Harsh J
www.harshj.com

Re: HDFS file content restrictions

Posted by Brian Bockelman <bb...@cse.unl.edu>.
If, for example, you have a record that contains 20MB in one block and 1MB in another, Map/Reduce will feed you the entire 21MB record.  If you are lucky and the map is executing on the node with the 20MB block, MapReduce will only transfer the remaining 1MB out of HDFS for you.

This is glossing over some details, but the point is that MR will feed you whole records regardless of whether they are stored on one or two blocks.
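
If you are curious where your records fall relative to block
boundaries, a small sketch like the one below (the path is whatever you
pass as the first argument) prints each block's offset, length and the
hosts that store it:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(args[0]));
    // One BlockLocation per HDFS block covering the whole file.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}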

Brian

On Mar 4, 2011, at 2:24 PM, Kelly Burkhart wrote:

> On Fri, Mar 4, 2011 at 1:42 PM, Harsh J <qw...@gmail.com> wrote:
>> HDFS does not operate with records in mind.
> 
> So does that mean that HDFS will break a file at exactly <blocksize>
> bytes?  Map/Reduce *does* operate with records in mind, so what
> happens to the split record?  Does HDFS put the fragments back
> together and deliver the reconstructed record to one map?  Or are both
> fragments and consequently the whole record discarded?
> 
> Thanks,
> 
> -Kelly


Re: HDFS file content restrictions

Posted by Kelly Burkhart <ke...@gmail.com>.
On Fri, Mar 4, 2011 at 1:42 PM, Harsh J <qw...@gmail.com> wrote:
> HDFS does not operate with records in mind.

So does that mean that HDFS will break a file at exactly <blocksize>
bytes?  Map/Reduce *does* operate with records in mind, so what
happens to the split record?  Does HDFS put the fragments back
together and deliver the reconstructed record to one map?  Or are both
fragments and consequently the whole record discarded?

Thanks,

-Kelly

Re: HDFS file content restrictions

Posted by Harsh J <qw...@gmail.com>.
HDFS does not operate with records in mind. There shouldn't be too
much of a problem with having a few MBs per record in text files
(provided 'a few MBs' is a (very) small fraction of the file's block
size).
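
One quick way to sanity-check that (only a sketch; point it at your own
file) is to compare the longest record against the file's block size:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MaxRecordVsBlockSize {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path(args[0]);
    long blockSize = fs.getFileStatus(path).getBlockSize();

    // Scan the file once and remember the largest record, in bytes.
    long maxRecord = 0;
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"));
    for (String line; (line = in.readLine()) != null; ) {
      maxRecord = Math.max(maxRecord, line.getBytes("UTF-8").length);
    }
    in.close();

    System.out.printf("block size: %d bytes, largest record: %d bytes (%.3f%% of a block)%n",
        blockSize, maxRecord, 100.0 * maxRecord / blockSize);
  }
}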

On Sat, Mar 5, 2011 at 1:00 AM, Kelly Burkhart <ke...@gmail.com> wrote:
> Hello, are there restrictions on the size or "width" of text files
> placed in HDFS?  I have a file structure like this:
>
> <text key><tab><text data><nl>
>
> It would be helpful if in some circumstances I could make the text data
> really large (many KB up to a few MB).  I may have some rows with a
> very small payload and some with a very large payload.  Is this OK?
> When HDFS splits the file into chunks to spread across the cluster,
> will it ever split a record?  Total file size may be on the order of
> 20-30GB.
>
> Thanks,
>
> -K
>



-- 
Harsh J
www.harshj.com