Posted to dev@nutch.apache.org by Toby DiPasquale <to...@turntide.com> on 2006/03/06 18:48:13 UTC
record termination and MapReduce
Hi all,
I have a question about the MapReduce and NDFS implementations. When
writing records into an NDFS file, how does one make sure that records
terminate cleanly on block boundaries such that a Map job's input does not
span multiple physical blocks?
It also appears as if NDFS does not have an explicit "record append"
operation. Is this the case?
--
Toby DiPasquale
Senior Software Engineer
Symantec Corporation
Re: record termination and MapReduce
Posted by Doug Cutting <cu...@apache.org>.
Toby DiPasquale wrote:
> I have a question about the MapReduce and NDFS implementations. When
> writing records into an NDFS file, how does one make sure that records
> terminate cleanly on block boundaries such that a Map job's input does not
> span multiple physical blocks?
We do not currently guarantee that. A task's input may span multiple
blocks. We try to split things into block-sized chunks, but the last
few records (up to the first sync mark past the split point) may be in
the next block. So a bit of i/o will happen over the network, but the
vast majority will be local.
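To make that concrete, here is a rough sketch of the convention (a
hypothetical RecordReader interface, not the actual Nutch/Hadoop source):
each task's reader skips to the first sync mark at or after its split's
start offset and reads until it passes the split's end, so only the tail
records of a split may be fetched from the next block.

public class SplitReaderSketch {

  // Hypothetical stand-in for a sync-marked record reader such as
  // SequenceFile.Reader; the method names here are illustrative only.
  interface RecordReader {
    void sync(long position) throws java.io.IOException;  // seek to next sync mark >= position
    long getPosition() throws java.io.IOException;        // current byte offset in the file
    boolean next(StringBuilder record) throws java.io.IOException; // false at end of file
  }

  static void readSplit(RecordReader reader, long start, long end)
      throws java.io.IOException {
    // Skip forward to the first sync mark at or after the split start,
    // unless this split begins at the head of the file.
    if (start > 0) {
      reader.sync(start);
    }
    StringBuilder record = new StringBuilder();
    // Read records that begin before the split end; the last one may spill
    // past it into the next block. (The real reader continues to the first
    // sync mark past end; this simplified loop stops at the first record
    // that begins at or after end.)
    while (reader.getPosition() < end && reader.next(record)) {
      process(record.toString());
      record.setLength(0);
    }
  }

  static void process(String record) {
    System.out.println(record);
  }
}

Since every split's reader applies the same rule, each record is read by
exactly one task even though record and block boundaries don't line up.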
> It also appears as if NDFS does not have an explicit "record append"
> operation. Is this the case?
Yes. DFS currently is write-once.
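To illustrate what write-once means in practice (using a hypothetical Dfs
interface in place of the real client API): with no append operation, each
new batch of records has to be written as a new, complete file.

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

public class WriteOnceSketch {

  // Hypothetical stand-in for the DFS client; create() is assumed to open
  // a brand-new file and to fail if the path already exists.
  interface Dfs {
    OutputStream create(String path) throws IOException;
  }

  static void writeBatch(Dfs dfs, String dir, int batchId, List<String> records)
      throws IOException {
    // No append: an existing file is never reopened for writing, so each
    // batch goes into its own part file.
    OutputStream out = dfs.create(dir + "/part-" + batchId);
    try {
      for (String r : records) {
        out.write((r + "\n").getBytes("UTF-8"));
      }
    } finally {
      out.close();
    }
  }
}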
Please note that the MapReduce and DFS code has moved from Nutch to the
Hadoop project. Such questions are more appropriately asked there.
Doug