Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/04/08 13:21:05 UTC

[jira] [Commented] (NUTCH-963) Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)

    [ https://issues.apache.org/jira/browse/NUTCH-963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017403#comment-13017403 ] 

Julien Nioche commented on NUTCH-963:
-------------------------------------

Shall we create a new issue to track the progress of solrclean on trunk? I'd like to release 1.3 soon, and this issue will remain open until the work is also done on trunk, which might take some time.

> Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (404 urls)
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-963
>                 URL: https://issues.apache.org/jira/browse/NUTCH-963
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 2.0
>            Reporter: Claudio Martella
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-963-command-and-log4j.patch, Solr404Deleter.java, SolrClean.java
>
>
> When recrawling, it can happen that certain URLs have expired (i.e. they no longer exist and return 404).
> This patch adds a new indexer command that scans the CrawlDB for these URLs and issues delete commands to Solr.
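
For illustration only, the scan-and-delete idea in the description above could be sketched roughly as follows. This is not the code from the attached patches: the `Status` enum and the in-memory map are hypothetical stand-ins for the CrawlDB, the class and method names are invented, and it assumes the Solr unique key is the page URL. It only builds a Solr XML delete message and does not send it anywhere.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoneUrlDeleteSketch {

    // Hypothetical stand-in for the per-URL status stored in the CrawlDB.
    enum Status { FETCHED, UNFETCHED, GONE }

    // Collect every URL whose status marks it as gone (e.g. it returned 404).
    static List<String> findGoneUrls(Map<String, Status> crawlDb) {
        List<String> gone = new ArrayList<>();
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (entry.getValue() == Status.GONE) {
                gone.add(entry.getKey());
            }
        }
        return gone;
    }

    // Escape the characters that would break the XML element content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build one Solr XML delete message with an <id> element per URL.
    // Assumption: the document id in the index is the URL itself.
    static String buildDeleteMessage(List<String> urls) {
        StringBuilder sb = new StringBuilder("<delete>");
        for (String url : urls) {
            sb.append("<id>").append(escapeXml(url)).append("</id>");
        }
        return sb.append("</delete>").toString();
    }

    public static void main(String[] args) {
        // Toy data in place of a real CrawlDB scan.
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/", Status.FETCHED);
        crawlDb.put("http://example.com/old?a=1&b=2", Status.GONE);
        crawlDb.put("http://example.com/new", Status.UNFETCHED);

        System.out.println(buildDeleteMessage(findGoneUrls(crawlDb)));
    }
}
```

In a real job the scan would run over the CrawlDB itself and the message would be posted to the Solr update handler, followed by a commit; both steps are left out of this sketch.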

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira