Posted to user@nutch.apache.org by Enzo Michelangeli <en...@gmail.com> on 2007/08/16 03:26:39 UTC
Any Paul Volcker for score inflation?
Dear All,
Has anybody devised a fix for the "score inflation" problem mentioned at
http://wiki.apache.org/nutch/FixingOpicScoring ? After many
"generate/fetch/updatedb" iteration cycles, the max and average scores
reported by "bin/nutch readdb crawl/crawldb -stats" have grown to pretty
ridiculous values:
min score: 0.0
avg score: 1.07425485E9
max score: 9.2233725E15
...and many seed URLs are ignored as a result, because they lie too low in
the pecking order to have any chance of being selected within "-topN" by
"bin/nutch generate" (and I refuse to inject them with db.score.injected set
to 1E39...). So my questions are:
1. Is the score inflation issue expected to be fixed soon?
2. In the meantime, is there a way to "normalize" a crawldb and/or rebuild
it from the data segments in order to get rid of these scoring aberrations?
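To make question 2 concrete, here is a minimal sketch of the kind of "normalization" meant, under loud assumptions: it works on a plain in-memory mapping of URL to score, not on a real crawldb, and the function name rescale_scores and the example URLs are hypothetical and not part of Nutch. It rescales all scores linearly so that their average equals a chosen target (for example, the score given to newly injected URLs), which keeps the relative order of existing entries intact.

```python
def rescale_scores(scores, target_avg=1.0):
    """Linearly rescale scores so that their mean equals target_avg.

    `scores` is a plain dict of URL -> score (a stand-in for a crawldb,
    NOT a Nutch data structure). Relative ordering is preserved.
    """
    if not scores:
        return {}
    avg = sum(scores.values()) / len(scores)
    if avg == 0:
        # All scores are zero: nothing to rescale.
        return dict(scores)
    factor = target_avg / avg
    return {url: score * factor for url, score in scores.items()}


# Hypothetical example data, loosely mimicking inflated scores.
inflated = {
    "http://example.org/a": 0.0,
    "http://example.org/b": 2.0e9,
    "http://example.org/c": 4.0e9,
}
normalized = rescale_scores(inflated, target_avg=1.0)
```

After such a rescaling, a URL injected with a score equal to target_avg would sit at the average rather than far below it. Note that this does not change the ranking among URLs already in the db, so it would not by itself rescue seeds that were injected earlier and already rank low.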
Thanks in advance,
Enzo