Posted to dev@nutch.apache.org by Toby DiPasquale <to...@turntide.com> on 2006/03/06 18:48:13 UTC

record termination and MapReduce

Hi all,

I have a question about the MapReduce and NDFS implementations. When
writing records into an NDFS file, how does one make sure that records
terminate cleanly on block boundaries such that a Map job's input does not
span multiple physical blocks? 

It also appears as if NDFS does not have an explicit "record append"
operation. Is this the case?

-- 
Toby DiPasquale
Senior Software Engineer
Symantec Corporation

Re: record termination and MapReduce

Posted by Doug Cutting <cu...@apache.org>.
Toby DiPasquale wrote:
> I have a question about the MapReduce and NDFS implementations. When
> writing records into an NDFS file, how does one make sure that records
> terminate cleanly on block boundaries such that a Map job's input does not
> span multiple physical blocks? 

We do not currently guarantee that.  A task's input may span multiple 
blocks.  We try to split things into block-sized chunks, but the last 
few records (up to the first sync mark past the split point) may be in 
the next block.  So a small amount of I/O may go over the network, but 
the vast majority will be local.
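
To picture this, here is a minimal sketch of the split-reading idea 
described above.  The interfaces and names below are illustrative 
only, not the actual Nutch/Hadoop classes: a reader assigned the byte 
range [splitStart, splitEnd) skips to the first sync mark at or after 
splitStart and then reads whole records until it passes splitEnd, so 
its last record may reach into the next block.

import java.io.IOException;

public class SplitReaderSketch {

    // Hypothetical record stream; not the real SequenceFile reader.
    interface RecordStream {
        /** Position the stream at the first sync mark at or after pos. */
        void seekToSyncMark(long pos) throws IOException;
        /** Current byte offset in the file. */
        long position() throws IOException;
        /** Read the next record, or return null at end of file. */
        byte[] nextRecord() throws IOException;
    }

    interface RecordHandler {
        void handle(byte[] record);
    }

    /** Process every record that starts inside [splitStart, splitEnd). */
    static void readSplit(RecordStream in, long splitStart, long splitEnd,
                          RecordHandler handler) throws IOException {
        // Skip any partial data at the head of the split; the record that
        // straddles splitStart belongs to the previous split's reader.
        in.seekToSyncMark(splitStart);

        while (in.position() < splitEnd) {
            byte[] record = in.nextRecord();
            if (record == null) {
                break; // end of file
            }
            // This record may spill past splitEnd into the next block
            // (possibly on another datanode), which is the small amount
            // of remote I/O mentioned above.
            handler.handle(record);
        }
    }
}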

> It also appears as if NDFS does not have an explicit "record append"
> operation. Is this the case?

Yes.  DFS currently is write-once.
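
A rough sketch of what write-once means for callers, with a 
hypothetical client interface (these names are illustrative, not the 
actual NDFS/DFS client API): a file is created, written, and closed, 
and there is no way to reopen it for appending, so adding records 
means writing an additional file.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface WriteOnceFileSystem {
    /** Create a new file and return a stream for its one and only write. */
    OutputStream create(String path) throws IOException;

    /** Open an existing file for reading. */
    InputStream open(String path) throws IOException;

    // Note: no append(path).  To add more records, write a new file
    // (e.g. part-00001 alongside part-00000) and have readers take
    // both files as input.
}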

Please note that the MapReduce and DFS code has moved from Nutch to the 
Hadoop project.  Such questions are more appropriately asked there.

Doug