Posted to common-user@hadoop.apache.org by Mark question <ma...@gmail.com> on 2011/04/26 20:49:57 UTC

Reading from File

Hi,

   My mapper opens a file and reads records using next(). However, I want to
stop reading if there is no memory available. What confuses me here is that
even though I'm reading record by record with next(), Hadoop actually reads
them in chunks of dfs.block.size. So, I have two questions:

1. Is it true that even if I set dfs.block.size to 512 MB, then at least one
block is loaded into memory for the mapper to process (as part of an
InputSplit)?

2. How can I read multiple records from a SequenceFile at once, and will it
make a difference?

Thanks,
Mark

Re: Reading from File

Posted by Mark question <ma...@gmail.com>.
On Tue, Apr 26, 2011 at 11:49 PM, Harsh J <ha...@cloudera.com> wrote:

> Hello Mark,
>
> On Wed, Apr 27, 2011 at 12:19 AM, Mark question <ma...@gmail.com>
> wrote:
> > Hi,
> >
> >   My mapper opens a file and reads records using next(). However, I want
> > to stop reading if there is no memory available. What confuses me here is
> > that even though I'm reading record by record with next(), Hadoop actually
> > reads them in chunks of dfs.block.size. So, I have two questions:
>
> The dfs.block.size setting is an HDFS property, and does not have a rigid
> relationship with InputSplits in Hadoop MapReduce. It is only used as a
> hint when constructing the offsets and lengths of the splits that tell a
> RecordReader where to seek to and how far to read.
>
> > 1. Is it true that even if I set dfs.block.size to 512 MB, then at least
> > one block is loaded into memory for the mapper to process (as part of an
> > InputSplit)?
>
> Blocks are not pre-loaded into memory; they are merely read off the FS
> record by record (or buffer by buffer, if you please).
>


    I assume the record reader actually has a couple of records read from
disk into a memory buffer, to be handed record by record to the maps. It
cannot be the case that each recordReader.next() reads a single record from
disk. So my question is: how much is read into the buffer from disk at once
by the recordReader? Is there a parameter that controls the amount of memory
used for this buffering?
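
As far as I can tell, the buffering in question is not record-aware:
SequenceFile.Reader reads through a buffered input stream whose size is taken
from the generic io.file.buffer.size property (4 KB by default, if memory
serves), and each next() simply deserializes the following key/value pair out
of that stream. A minimal sketch of raising that buffer when opening a reader
with the 0.20-era API; the path and the Text/Text key and value classes are
illustrative and must match whatever the file was actually written with:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BufferedSeqFileRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // io.file.buffer.size controls how many bytes the underlying stream
        // pulls from the filesystem per read, not how many records.
        conf.setInt("io.file.buffer.size", 128 * 1024);

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/mark/data.seq"); // illustrative path

        // Classic constructor; the reader picks up the buffer size from conf.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
          Text key = new Text();
          Text value = new Text();
          while (reader.next(key, value)) {
            // each next() deserializes exactly one record from the buffer
          }
        } finally {
          reader.close();
        }
      }
    }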




> You shouldn't really have memory issues with any of the
> Hadoop-provided RecordReaders as long as individual records fit well
> into available Task JVM memory grants.
>
> > 2. How can I read multiple records from a SequenceFile at once, and will
> > it make a difference?
>
> Could you clarify what it is you seek here? Do you want to supply
> your mappers with N records per call via a sequence file, or are you
> merely looking to do this to avoid the memory issues described above?
>
> In the case of the former, it would be better if your Sequence Files were
> prepared with batched records, rather than writing a custom N-line
> splitting InputFormat for the SequenceFiles (which would need to
> inspect the file before job submission).
>
> Have I understood your questions right?
>
    My mapper has other SequenceFiles open that are read from inside the map
function. So inside map(), I use a SequenceFile.Reader and call its next() to
grab one record at a time. Now I'm looking for a function like nextNrecords()
on the opened sequence file. I'm thinking of this because, in general,
buffered reading of multiple blocks from disk is better than reading
block-by-block due to syscall overhead. Does that make sense? Unless you are
saying that next() actually buffers multiple records even when the user only
requests one with next().
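
There is no nextNrecords() in the SequenceFile.Reader API; a thin wrapper over
repeated next() calls gives the same effect at the call site, though whether
it saves anything depends on the stream buffering discussed above. A minimal
sketch, again assuming Text keys and values:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BatchReader {
      // Reads up to n key/value pairs from an already-open reader. Fresh Text
      // instances are created per record so the caller can keep the whole
      // batch; fewer than n pairs come back at end of file.
      public static List<Text[]> nextNRecords(SequenceFile.Reader reader, int n)
          throws IOException {
        List<Text[]> batch = new ArrayList<Text[]>();
        for (int i = 0; i < n; i++) {
          Text key = new Text();
          Text value = new Text();
          if (!reader.next(key, value)) {
            break; // end of file reached
          }
          batch.add(new Text[] { key, value });
        }
        return batch;
      }
    }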


> --
> Harsh J
>

Re: Reading from File

Posted by Harsh J <ha...@cloudera.com>.
Hello Mark,

On Wed, Apr 27, 2011 at 12:19 AM, Mark question <ma...@gmail.com> wrote:
> Hi,
>
>   My mapper opens a file and reads records using next(). However, I want to
> stop reading if there is no memory available. What confuses me here is that
> even though I'm reading record by record with next(), Hadoop actually reads
> them in chunks of dfs.block.size. So, I have two questions:

The dfs.block.size setting is an HDFS property, and does not have a rigid
relationship with InputSplits in Hadoop MapReduce. It is only used as a
hint when constructing the offsets and lengths of the splits that tell a
RecordReader where to seek to and how far to read.
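
For a concrete sense of how the block size enters only as a hint: the split
length in the classic FileInputFormat is (roughly) the per-file goal size
clamped between the configured minimum split size and the file's block size.
A simplified sketch of that arithmetic, not the actual Hadoop source:

    // Simplified sketch: dfs.block.size enters split sizing only as the
    // upper clamp, which is why it is a hint rather than a bound on what
    // one mapper reads into memory.
    public class SplitSizeSketch {
      static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
      }

      public static void main(String[] args) {
        long blockSize = 512L * 1024 * 1024;          // dfs.block.size = 512 MB
        long minSize   = 1;                           // mapred.min.split.size default
        long goalSize  = 2L * 1024 * 1024 * 1024 / 4; // total input / hinted map count
        System.out.println(computeSplitSize(goalSize, minSize, blockSize)); // 512 MB
      }
    }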

> 1. Is it true that even if I set dfs.block.size to 512 MB, then at least one
> block is loaded into memory for the mapper to process (as part of an
> InputSplit)?

Blocks are not pre-loaded into memory; they are merely read off the FS
record by record (or buffer by buffer, if you please).

You shouldn't really have memory issues with any of the
Hadoop-provided RecordReaders as long as individual records fit well
into available Task JVM memory grants.
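
The "Task JVM memory grant" here is simply the heap the framework launches
each task JVM with, which in 0.20/1.x-era Hadoop is controlled by
mapred.child.java.opts. A hedged driver sketch that bumps it to 512 MB; the
job setup details are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class HeapSizeDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(HeapSizeDriver.class);
        job.setJobName("heap-size-example");

        // The per-task heap is what "Task JVM memory grants" refers to;
        // records and reader buffers have to fit inside it.
        job.set("mapred.child.java.opts", "-Xmx512m");

        // Identity mapper/reducer defaults; TextInputFormat keys/values.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);
      }
    }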

> 2. How can I read multiple records from a SequenceFile at once, and will it
> make a difference?

Could you clarify what it is you seek here? Do you want to supply
your mappers with N records per call via a sequence file, or are you
merely looking to do this to avoid the memory issues described above?

In the case of the former, it would be better if your Sequence Files were
prepared with batched records, rather than writing a custom N-line
splitting InputFormat for the SequenceFiles (which would need to
inspect the file before job submission).
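
One way to prepare a Sequence File "with batched records", as suggested
above, is simply to pack N logical records into each value at write time,
e.g. newline-joined lines in a single Text. A minimal sketch with the
0.20-era writer API; the path, batch size, and synthetic records are
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BatchedSeqFileWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/mark/batched.seq"); // illustrative path
        int batchSize = 10;                            // N records per value

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class);
        try {
          StringBuilder batch = new StringBuilder();
          long batchId = 0;
          for (long i = 0; i < 100; i++) {             // stand-in for real records
            batch.append("record-").append(i).append('\n');
            if ((i + 1) % batchSize == 0) {
              // One append() per batch: the consuming mapper sees N logical
              // records per key/value pair, no custom InputFormat needed.
              writer.append(new LongWritable(batchId++), new Text(batch.toString()));
              batch.setLength(0);
            }
          }
          if (batch.length() > 0) {                    // flush any partial batch
            writer.append(new LongWritable(batchId), new Text(batch.toString()));
          }
        } finally {
          writer.close();
        }
      }
    }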

Have I understood your questions right?

-- 
Harsh J