Posted to dev@nutch.apache.org by "julien nioche (JIRA)" <ji...@apache.org> on 2008/10/09 12:38:44 UTC

[jira] Reopened: (NUTCH-656) DeleteDuplicates based on crawlDB only

     [ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche reopened NUTCH-656:
---------------------------------


I suppose that the SOLR dedup mechanism is only valid on a single instance. If the documents are distributed across a number of SOLR shards (by modifying NUTCH-442), there will be no way of detecting that two documents have the same signature when they are sent to different shards. Assuming that the documents are distributed across the SOLR shards based on their unique ID (i.e. their URL), deduplication based on URLs is already done. What the SOLR dedup could do is use the crawlDB, as described earlier, to find duplicates based on the signature and then send deletion orders to the SOLR shards.

Not an urgent issue for the moment though, as NUTCH-442 supports only one SOLR backend.
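
For illustration, something along these lines could issue the deletion orders with SolrJ once the list of duplicate URLs is known; the shard URLs, the broadcast-to-all-shards choice and the class name are made up for the example, not part of any existing patch:

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class ShardDeleteSketch {
  /** Broadcasts delete-by-id orders (documents keyed by URL) to every shard. */
  public static void deleteDuplicates(List<String> shardUrls, List<String> duplicateUrls)
      throws Exception {
    for (String shardUrl : shardUrls) {
      SolrServer solr = new CommonsHttpSolrServer(shardUrl);
      for (String url : duplicateUrls) {
        // deleting an id that a shard does not hold is a no-op, so we do not
        // need to know which shard a given URL was routed to
        solr.deleteById(url);
      }
      solr.commit();
    }
  }
}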

> DeleteDuplicates based on crawlDB only 
> ---------------------------------------
>
>                 Key: NUTCH-656
>                 URL: https://issues.apache.org/jira/browse/NUTCH-656
>             Project: Nutch
>          Issue Type: Wish
>          Components: indexer
>            Reporter: julien nioche
>
> The existing dedup functionality relies on Lucene indices and can't be used when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead to detect the URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the crawlDB contains all the elements we need for dedup, namely:
> * URL 
> * signature
> * fetch time
> * score
> In map-reduce terms we would have two different jobs:
> * job 1: read the crawlDB and compare on URLs: keep only the most recent element; the older ones are stored in a file and will be deleted later
> * job 2 (map): read the crawlDB with a map function generating signatures as keys and URL + fetch time + score as values
> * job 2 (reduce): the reduce function would depend on which parameter is set (i.e. use signature or score) and would output the list of URLs to delete
> This assumes that we can then use the URLs to identify documents in the indices.
> Any thoughts on this? Am I missing something?
> Julien
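
To make the second job above concrete, here is a rough sketch using Hadoop's mapred API of what the signature-based pass could look like. It assumes the crawlDB is read as <Text url, CrawlDatum> pairs and keeps the highest-scoring entry per signature; the class names, the keep-by-score choice and the tab-separated value encoding are placeholders for discussion, not a patch:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class SignatureDedupSketch {

  /** Map: key each crawlDB entry by its content signature, keep url/fetchTime/score as the value. */
  public static class SigMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, Text> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      byte[] sig = datum.getSignature();
      if (sig == null) return;                       // nothing to compare on
      StringBuilder hex = new StringBuilder();
      for (byte b : sig) hex.append(String.format("%02x", b & 0xff));
      out.collect(new Text(hex.toString()),
          new Text(url + "\t" + datum.getFetchTime() + "\t" + datum.getScore()));
    }
  }

  /** Reduce: keep one entry per signature (here the highest score), emit the other URLs for deletion. */
  public static class SigReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, NullWritable> {
    public void reduce(Text signature, Iterator<Text> values,
        OutputCollector<Text, NullWritable> out, Reporter reporter) throws IOException {
      String keep = null;
      float bestScore = Float.NEGATIVE_INFINITY;
      List<String> urls = new ArrayList<String>();
      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\t");
        urls.add(parts[0]);
        float score = Float.parseFloat(parts[2]);
        if (score > bestScore) { bestScore = score; keep = parts[0]; }
      }
      for (String u : urls) {
        if (!u.equals(keep)) out.collect(new Text(u), NullWritable.get());
      }
    }
  }
}

The output (one URL per line) could then be handed to an indexer-specific deletion step, for instance a SolrJ loop like the one sketched earlier in this message.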

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.