Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/04/13 19:41:32 UTC

MapFile.Reader bug (Re: Optimal segment size?)

Jay Yu wrote:
> I have a similar problem where the segread tool (actually any code
> that needs to read the seg) just hangs forever on a truncated
> segment. I think there are at least two bugs: one in the fetcher,
> which generated the truncated seg without any error message; the second is the

Well, truncated segments are created only in case of a fatal failure, 
like an OutOfMemoryError or a JVM crash. In such cases there is really 
no way to produce anything beyond the usual log messages...

> MapFile/SequenceFile, which generates the deadlock. But looking at the
> code it is not easy to pinpoint the bugs. Maybe someone else (like
> Doug) has a better idea?

The "bug" (or misfeature) of MapFile.Reader is that it silently assumes 
it is ok to deal with a truncated file. In reality, the tradeoff is a 
slowdown of two- or more orders of magnitude for random seeking. If the 
intended use is to process the file sequentially (as many tools do 
this), then it's ok. In other cases, if the file is used for intensive 
random seeking, then the processing performance will drop drastically.

I believe the correct fix is to refuse opening a truncated MapFile, 
unless an "override" flag is provided. This way, it will be easy to 
detect this situation and fix corrupted segments when really needed.

If this sounds like a proper way to address this problem, I'll prepare a 
patch.
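For illustration, here is a minimal sketch of the kind of check I have 
in mind. All names below (the class, the method, the allowTruncated 
flag) are hypothetical, not the actual MapFile.Reader API:

import java.io.File;
import java.io.IOException;

// Illustrative standalone check, not a patch against the real
// MapFile.Reader.
public class TruncationCheck {

  /**
   * Refuse to open a MapFile directory whose "index" file is
   * missing, unless the caller explicitly overrides the check.
   */
  public static void checkMapFileDir(File dir, boolean allowTruncated)
      throws IOException {
    File data = new File(dir, "data");
    File index = new File(dir, "index");
    if (!data.exists()) {
      throw new IOException("Missing data file in " + dir);
    }
    if (!index.exists() && !allowTruncated) {
      throw new IOException("Truncated MapFile (no index) in " + dir
          + "; open with the override flag to read it anyway");
    }
    // Reaching this point without an index means random seeks fall
    // back to sequential scans: acceptable for sequential readers,
    // ruinous for heavy random access.
  }
}

The point is simply that a truncated file should fail fast by default, 
so corrupted segments get noticed and repaired instead of silently 
crippling random access.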

> The worst part is that there is no way to fix that truncated record because
> any tool that intends to fix it needs to read it first!

Erhm. Not true. Currently this involves a bit of a manual procedure, 
but it can be done. First, delete the partial "index" files from the 
affected directories. Then run the segread -fix command, which will 
create new "index" files.
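Assuming a segment layout like segments/<seg>/parse_data (the exact 
subdirectories affected, and the segread arguments, may differ in your 
setup), the procedure looks roughly like:

  rm segments/<seg>/parse_data/index
  bin/nutch segread -fix segments/<seg>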

> As for the parallel indexing on multiple machines, I think you need to copy
> the same web db over in order to do it right and you need to merge the
> segments in the end too.

Indexing doesn't use WebDB at all. However, at this moment there is no 
straightforward way to do this in parallel (unless the new MapReduce 
code can be used for that?).

-- 
Best regards,
Andrzej Bialecki
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com