Posted to user@nutch.apache.org by Gora Mohanty <go...@srijan.in> on 2009/10/26 14:36:27 UTC

Deleting stale URLs from Nutch/Solr

Hi,

  We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?

  I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.

Regards,
Gora

Re: Deleting stale URLs from Nutch/Solr

Posted by Gora Mohanty <go...@srijan.in>.
On Tue, 27 Oct 2009 07:29:10 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:
[...]
> I assume you mean that the "generate" step produces no new URL-s
> to fetch? That's expected, because they become eligible for
> re-fetching only after Nutch considers them expired, i.e. after
> the fetchTime + fetchInterval, and the default fetchInterval is
> 30 days.

Yes, it was indeed stopping at the generate step, and your
explanation makes sense.

> You can pretend that the time moved on using the -adddays
> parameter.
[...]

Thanks. This worked exactly as you said: I have tested it, and
the removed page indeed shows up with status db_gone. I can now
script a solution to my stale-URL problem along the lines that
you have suggested.
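
For the archives, here is roughly what the forced recrawl looked
like on my five-page test crawl. The paths are from my local
setup, and the options are from the version I am running, so
double-check them against your own "bin/nutch" usage output:

# Pretend 31 days have passed, so the pages become eligible
# for re-fetching again
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
segment=`ls -d crawl/segments/* | tail -1`

# Re-fetch the new segment and fold the results back into the
# CrawlDb (skip the parse step if your fetcher is configured to
# parse while fetching)
bin/nutch fetch $segment
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment

# The deleted page now shows up as db_gone
bin/nutch readdb crawl/crawldb -stats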

Thank you very much for this quick and thorough response. As
I imagine that this is a common requirement, I will write up
a brief blog entry on this by the weekend, along with a solution.

Regards,
Gora

Re: Deleting stale URLs from Nutch/Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gora Mohanty wrote:
> On Mon, 26 Oct 2009 17:26:23 +0100
> Andrzej Bialecki <ab...@getopt.org> wrote:
> [...]
>> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
>> They are kept in Nutch crawldb to prevent their re-discovery
>> (through stale links pointing to these URL-s from other pages).
>> If you really want to remove them from CrawlDb you can filter
>> them out (using CrawlDbMerger with just one input db, and setting
>> your URLFilters appropriately).
> [...]
> 
> Thank you for your help. Your suggestions look promising, but I
> think that I did not make myself adequately clear. Once we have
> completed a site crawl with Nutch, ideally I would like to be
> able to find stale links without doing a complete recrawl, i.e.,
> only through restarting the crawl from where it last left off. Is
> that possible?
> 
> I tried a simple test on a local webserver with five pages in a
> three-level hierarchy. The crawl completes, and discovers all
> five URLs as expected. Now, I remove a tertiary page. Ideally,
> I would like to be able to run a recrawl, and have Nutch discover
> the now-missing URL. However, when I try that, it finds no new
> links, and exits.

I assume you mean that the "generate" step produces no new URL-s to 
fetch? That's expected, because they become eligible for re-fetching 
only after Nutch considers them expired, i.e. after the fetchTime + 
fetchInterval, and the default fetchInterval is 30 days.

You can pretend that the time moved on using the -adddays parameter. 
Then Nutch will generate a new fetchlist, and when it discovers that 
the page is missing it will mark it as gone. In fact, you could then 
take that information directly from the Nutch segment: instead of 
processing the CrawlDb, you could process the segment to collect a 
partial list of gone pages.
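
A rough, untested sketch of that segment route (the segment name
below is just an example, and the exact field names in the dump
may vary between Nutch versions):

# Dump only the fetch data from the segment
bin/nutch readseg -dump crawl/segments/20091027123456 seg_dump \
  -nocontent -nogenerate -noparse -noparsedata -noparsetext

# Collect the URLs whose fetch status came back as gone
awk '/^URL::/ {url=$2} /fetch_gone/ {print url}' \
  seg_dump/dump > gone_urls.txt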

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Deleting stale URLs from Nutch/Solr

Posted by Gora Mohanty <go...@srijan.in>.
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki <ab...@getopt.org> wrote:
[...]
> Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
> They are kept in Nutch crawldb to prevent their re-discovery
> (through stale links pointing to these URL-s from other pages).
> If you really want to remove them from CrawlDb you can filter
> them out (using CrawlDbMerger with just one input db, and setting
> your URLFilters appropriately).
[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only through restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl, and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links, and exits. "./bin/nutch readdb crawl/crawldb -stats"
shows me:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:	5
retry 0:	5
min score:	0.333
avg score:	0.4664
max score:	1.0
status 2 (db_fetched):	5
CrawlDb statistics: done

Regards,
Gora

Re: Deleting stale URLs from Nutch/Solr

Posted by Andrzej Bialecki <ab...@getopt.org>.
Gora Mohanty wrote:
> Hi,
> 
>   We are using Nutch to crawl an internal site, and index content
> to Solr. The issue is that the site is run through a CMS, and
> occasionally pages are deleted, so that the corresponding URLs
> become invalid. Is there any way that Nutch can discover stale
> URLs during recrawls, or is the only solution a completely fresh
> crawl? Also, is it possible to have Nutch automatically remove
> such stale content from Solr?
> 
>   I am stumped by this problem, and would appreciate any pointers,
> or even thoughts on this.

Hi,

Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are 
kept in Nutch crawldb to prevent their re-discovery (through stale links 
pointing to these URL-s from other pages). If you really want to remove 
them from CrawlDb you can filter them out (using CrawlDbMerger with just 
one input db, and setting your URLFilters appropriately).
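
For example (the URL below is only a placeholder), you would add
an exclusion rule for the dead page to conf/regex-urlfilter.txt
and then rewrite the db through the filters:

# conf/regex-urlfilter.txt: drop the pages that no longer exist
-^http://intranet\.example\.com/old/deleted-page\.html$

# Run CrawlDbMerger over the single input db, applying URLFilters
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
# then replace crawl/crawldb with crawl/crawldb_filtered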

Now when it comes to removing them from Solr ... The simplest (no 
coding) way would be to dump the CrawlDb, use some scripting tools to 
collect just the URL-s with the status GONE, and send them as a <delete> 
command to Solr. A slightly more involved solution would be to implement 
a tool that reads such URLs directly from CrawlDb (using e.g. the 
CrawlDbReader API) and then uses the SolrJ API to send the same delete 
requests, followed by a commit.
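
A rough sketch of the no-coding route, assuming (as in the
schema.xml shipped with Nutch) that the Solr unique key "id"
holds the page URL, and that Solr listens on the default port;
adjust both if your setup differs:

# Dump the CrawlDb as plain text
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Each record starts with the URL; the status follows a few lines
# later, so remember the last URL seen and print it on db_gone
awk '/^http/ {url=$1} /db_gone/ {print url}' \
  crawldb_dump/part-* > gone_urls.txt

# Send one <delete> per URL, then commit (URLs containing & or <
# would need XML-escaping first)
while read url; do
  curl -s http://localhost:8983/solr/update -H "Content-Type: text/xml" \
    --data-binary "<delete><id>$url</id></delete>"
done < gone_urls.txt
curl -s http://localhost:8983/solr/update -H "Content-Type: text/xml" \
  --data-binary "<commit/>"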



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com