Posted to user@nutch.apache.org by Enzo Michelangeli <en...@gmail.com> on 2007/08/16 03:26:39 UTC

Any Paul Volcker for score inflation?

Dear All,

Has anybody devised a fix for the "score inflation" problem mentioned at
http://wiki.apache.org/nutch/FixingOpicScoring ? After many
"generate/fetch/updatedb" iteration cycles, the max and average scores
reported by "bin/nutch readdb crawl/crawldb -stats" have grown to pretty
ridiculous values:

min score:      0.0
avg score:      1.07425485E9
max score:      9.2233725E15

...and many seed URLs are ignored as a result, because they lie too low in
the pecking order to have any chance of being selected as "-topN" by
"bin/nutch generate" (and I refuse to inject them setting db.score.injected
to 1E39...). So my questions are:

1. Is the score inflation issue expected to be fixed soon?

2. In the meantime, is there a way to "normalize" a crawldb and/or rebuild
it from the data segments in order to get rid of these scoring aberrations?
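
To make clear what I mean by "normalize", here is a minimal sketch in plain
Python. It does not use any Nutch API: the function name normalize_scores, the
target_max parameter, and the example URLs are all made up for illustration,
and the crawldb is stood in for by a simple dict of URL to score. It only
assumes that a linear rescale of all scores would be acceptable.

```python
def normalize_scores(scores, target_max=1.0):
    """Rescale all scores linearly so that the largest one becomes target_max.

    `scores` is a hypothetical stand-in for the crawldb: a dict of URL -> score.
    """
    current_max = max(scores.values(), default=0.0)
    if current_max <= 0.0:
        # Nothing to rescale (empty db or all scores zero): return a copy.
        return dict(scores)
    factor = target_max / current_max
    return {url: score * factor for url, score in scores.items()}


# Made-up URLs; the largest value is the max score from the stats above.
inflated = {
    "http://example.com/seed": 1.0,
    "http://example.com/hub": 9.2233725e15,
    "http://example.com/orphan": 0.0,
}
normalized = normalize_scores(inflated)
```

A linear rescale like this keeps the relative order of the existing entries,
so by itself it would not move already-buried URLs up the list. What I would
hope to gain is that the stored scores return to a range where URLs injected
afterwards with a modest db.score.injected can compete for "-topN".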

Thanks in advance,

Enzo