Posted to dev@nutch.apache.org by vetus <ve...@isac.cat> on 2012/11/22 13:00:05 UTC

GeneratorMapper - how to re-generate webpage?

Hello, 

I have a question. I have crawled my whole website using a Java program
that runs the five steps
inject/generate/fetch/parse/update using InjectorJob, GeneratorJob, etc.

Now that the whole website has been indexed, I want to re-crawl some pages
(because they have changed), but as I understand it, the GeneratorMapper
doesn't allow this because each web page already has a GENERATE_MARK.

(GeneratorMapper.java) --> line 53

    if (Mark.GENERATE_MARK.checkMark(page) != null) {
      if (GeneratorJob.LOG.isDebugEnabled()) {
        GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
      }
      return;
    }

So, how can I re-crawl (re-generate / re-fetch) without using the -all
parameter?

Thanks




Re: GeneratorMapper - how to re-generate webpage?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Vetus,

On Thu, Nov 22, 2012 at 12:00 PM, vetus <ve...@isac.cat> wrote:

> Now that the whole website has been indexed, I want to re-crawl some pages
> (because they have changed),

To start with, I would advise using the adaptive fetch schedule
[0]. It can be configured through the db.fetch.schedule.class
property in nutch-site.xml.
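
As a rough sketch, the override in nutch-site.xml might look like the
fragment below. It assumes the AdaptiveFetchSchedule class lives in the
same org.apache.nutch.crawl package as [0]; please check the class name
against your own checkout before relying on it.

```xml
<!-- nutch-site.xml: sketch only, goes inside the <configuration> element -->
<property>
  <name>db.fetch.schedule.class</name>
  <!-- assumed fully qualified class name; verify it in your source tree -->
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```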

> but as I understand it, the GeneratorMapper
> doesn't allow this because each web page already has a GENERATE_MARK

One suggestion is the following. On the second round of crawling, generate
everything using the -all flag, assuming that this is not too
expensive. The fetch and scoring schedules can then kick in and you
can begin to build a representation of the web graph. If you are
developing, I would set the GeneratorJob logging to TRACE or DEBUG
until you get some decent results. It will take some tweaking.
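
For the logging part, something along these lines in conf/log4j.properties
should do it. This assumes your setup logs through log4j 1.x and that the
logger is named after the GeneratorJob class (the snippet you quoted logs
via GeneratorJob.LOG); adjust it if your logging configuration differs.

```properties
# conf/log4j.properties: sketch only
# Raise GeneratorJob to DEBUG so the "Skipping ...; already generated"
# messages become visible; switch back to INFO once things work.
log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG
```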

hth

Lewis

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java