Posted to dev@nutch.apache.org by "Ashish Nerkar (JIRA)" <ji...@apache.org> on 2015/07/12 18:19:05 UTC

[jira] [Commented] (NUTCH-2060) dedup is removing entries with status db_gone

    [ https://issues.apache.org/jira/browse/NUTCH-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623871#comment-14623871 ] 

Ashish Nerkar commented on NUTCH-2060:
--------------------------------------

Hi, I am new to Nutch and want to start contributing to this project. I am interested in working on this issue. Could anyone please share any specific details (such as how to reproduce it) that would help me get started?
Thanks!!!

> dedup is removing entries with status db_gone
> ---------------------------------------------
>
>                 Key: NUTCH-2060
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2060
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.9
>            Reporter: Steven Hayles
>            Priority: Minor
>
> Using the standard bin/crawl script, Solr is never informed when a previously indexed document has been deleted.
> "bin/nutch update" sets db_gone status in the crawl db for requests returning HTTP 404 status.
> "bin/nutch dedup" removes entries with status db_gone from the crawl db.
> As a result "bin/nutch clean" never sees the db_gone status, so does not inform Solr.
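
For anyone trying to picture the sequence in the report, here is a toy model of it. This is only a sketch and not Nutch code: the crawl db is assumed to be a plain dict of URL to status string, and the function names (update, dedup_as_reported, clean, run) and the example.com URLs are made up for illustration. The dedup step simply mimics the behaviour the report describes.

```python
# Toy model of the sequence in the report; NOT Nutch code.
# Assumption: the crawl db is a dict mapping url -> status string.


def update(crawldb, fetch_results):
    """Mimic "bin/nutch update": an HTTP 404 marks the entry db_gone."""
    for url, http_status in fetch_results.items():
        crawldb[url] = "db_gone" if http_status == 404 else "db_fetched"
    return crawldb


def dedup_as_reported(crawldb):
    """Mimic the reported behaviour: db_gone entries are dropped."""
    return {url: s for url, s in crawldb.items() if s != "db_gone"}


def clean(crawldb):
    """Mimic "bin/nutch clean": collect db_gone URLs to delete from the index."""
    return sorted(url for url, s in crawldb.items() if s == "db_gone")


def run(skip_dedup=False):
    """Run update -> (dedup) -> clean and return the URLs clean would delete."""
    fetched = {"http://example.com/a": 200, "http://example.com/b": 404}
    db = update({}, fetched)
    if not skip_dedup:
        db = dedup_as_reported(db)
    return clean(db)


if __name__ == "__main__":
    print(run())                 # dedup runs before clean
    print(run(skip_dedup=True))  # dedup skipped
```

In this model, running dedup before clean leaves clean with nothing to delete, while skipping dedup lets clean see the 404 URL. Whether the real jobs behave this way is exactly what would need to be confirmed when reproducing the issue.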



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)