Posted to user@nutch.apache.org by Alexis Hope <ba...@gmail.com> on 2016/01/08 22:22:51 UTC

Custom Generator or ScoringFilter (or Fetch)

Hi All,
(happy new year!)

I've been curious this year to delve further into Nutch. I have been using
generate/fetch/parse/update but noticed that some pages get re-crawled
before new segments are fetched. From what I understand this is because of
the generator's internal ScoringFilter?

My question is: how would I prioritise certain content? For example, by
domain or content type, or just unfetched segments.

Looking at the docs for fetching I see the segment parameter to point to
the segments dir. I'm unsure how to use this with Mongo as I don't have a
segments dir (I think).
In the docs for ScoringFilter I see it's used with generate, "ScoringFilter is
used within ... which selects ..... a subset of URLs due for fetching".

Should I be looking to solve this with the fetch segments directory or a
custom ScoringFilter?

Advice on either, or reference material, is welcome.

Cheers,
Lex