Posted to dev@nutch.apache.org by vetus <ve...@isac.cat> on 2012/11/22 13:00:05 UTC
GeneratorMapper - how to re-generate webpage?
Hello,
I have a question. I have crawled my whole website using a Java program
that runs the five steps
inject/generate/fetch/parse/update using InjectorJob, GeneratorJob, etc.
Now that the whole website has been indexed, I want to re-crawl some pages
(because they have changed), but as I understand it, the GeneratorMapper
doesn't allow this because each webpage already has a GENERATE_MARK
(GeneratorMapper.java, line 53):
    if (Mark.GENERATE_MARK.checkMark(page) != null) {
      if (GeneratorJob.LOG.isDebugEnabled()) {
        GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
      }
      return;
    }
So, how can I re-crawl (re-generate / re-fetch) these pages without using
the -all parameter?
Thanks
Re: GeneratorMapper - how to re-generate webpage?
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Vetus,
On Thu, Nov 22, 2012 at 12:00 PM, vetus <ve...@isac.cat> wrote:
> When it has indexed all the website, then I want to re-crawl some pages
> again (Because it has changed),
For starters, I would advise you to use the adaptive fetch schedule
[0]. It can be configured through the db.fetch.schedule.class
property in nutch-site.xml.
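As a rough illustration, the override in nutch-site.xml might look like the fragment below. The fully qualified class name org.apache.nutch.crawl.AdaptiveFetchSchedule is an assumption on my part (it is not spelled out in this message), so please check it against the classes shipped with your Nutch version before relying on it.

```xml
<!-- Sketch only: the class name in <value> is an assumption, verify it for your Nutch release -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```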
> but as I understand, the generatorMapper
> don't allow it because each webpage already has a GENERATE_MARK
One suggestion is the following: on the second round of crawling, generate
everything using the -all flag, assuming that this is not too
expensive. The fetch and scoring schedules can then kick in and you
can begin to build a representation of the web graph. If you are
developing, I would set the GeneratorJob logging to TRACE or DEBUG
until you get some decent results. It will take some tweaking.
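To give a feel for what the adaptive schedule does once it kicks in, here is a simplified, standalone Java sketch of the idea: shorten the re-fetch interval when a page was found modified, lengthen it when it was not, and clamp the result. It is not the Nutch implementation. The class AdaptiveIntervalSketch and its method are hypothetical, and the constants are assumed values that I believe mirror the db.fetch.schedule.adaptive.* defaults, so check nutch-default.xml for the real ones.

```java
// Hypothetical, simplified model of an adaptive re-fetch interval.
// Not Nutch code: names and constants are assumptions for illustration only.
public class AdaptiveIntervalSketch {
    // Assumed rates and bounds (seconds); verify against nutch-default.xml.
    public static final double INC_RATE = 0.4;            // growth when page is unchanged
    public static final double DEC_RATE = 0.2;            // shrink when page has changed
    public static final double MIN_INTERVAL = 60.0;       // lower bound
    public static final double MAX_INTERVAL = 31536000.0; // upper bound (365 days)

    /** Returns the next fetch interval given the current one and whether the page changed. */
    public static double nextInterval(double currentInterval, boolean modified) {
        double interval = modified
                ? currentInterval * (1.0 - DEC_RATE)  // changed: come back sooner
                : currentInterval * (1.0 + INC_RATE); // unchanged: come back later
        // Keep the interval within the configured bounds.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
    }

    public static void main(String[] args) {
        double interval = 30 * 24 * 3600.0; // start from 30 days
        boolean[] observations = {true, true, false};
        for (boolean modified : observations) {
            interval = nextInterval(interval, modified);
            System.out.printf("modified=%b -> next interval %.0f s%n", modified, interval);
        }
    }
}
```

Pages that keep changing drift toward the minimum interval and become due for generation sooner, which is why the rates and bounds are the first things to tweak.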
hth
Lewis
[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java