Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2009/01/30 17:36:59 UTC

[jira] Updated: (NUTCH-684) Dedup support for Solr

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-684:
--------------------------------

    Attachment: solrdedup.patch

First version of a Solr dedup feature. I haven't tested this patch much yet, so if you use it, it may blow up your computer.

I first thought about making duplicate deletion a generic class with Solr and Lucene backends. However, Lucene and Solr are so different in this regard that it was much easier to just write a new Solr dedup class.

Since URLs are assumed to be unique in Solr, SolrDeleteDuplicates only deals with documents that share the same digest, and it decides which of them to delete based on score. If two URLs have the same digest and the same score, then the one with the later timestamp stays.
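
As a rough illustration of that selection rule (not the code in the patch, which is Java), here is a minimal Python sketch. The Doc type, its field names, and the urls_to_delete helper are hypothetical, and the sketch assumes that the higher-scoring document is the one that stays, which the description above implies but does not spell out.

```python
from collections import defaultdict
from typing import NamedTuple


class Doc(NamedTuple):
    # Hypothetical field names; the real index schema may differ.
    url: str       # assumed unique per document
    digest: str    # content hash used to detect duplicates
    score: float
    tstamp: int    # later value means more recently fetched


def urls_to_delete(docs):
    """Return the URLs to delete so that one document per digest remains."""
    by_digest = defaultdict(list)
    for doc in docs:
        by_digest[doc.digest].append(doc)

    doomed = []
    for group in by_digest.values():
        # Assumption: highest score wins; ties go to the later timestamp.
        keeper = max(group, key=lambda d: (d.score, d.tstamp))
        doomed.extend(d.url for d in group if d.url != keeper.url)
    return sorted(doomed)


docs = [
    Doc("http://a/", "d1", 1.0, 100),
    Doc("http://b/", "d1", 2.0, 50),   # kept: higher score
    Doc("http://c/", "d2", 1.0, 10),
    Doc("http://d/", "d2", 1.0, 20),   # kept: same score, later timestamp
    Doc("http://e/", "d3", 0.5, 5),    # kept: no duplicate
]
print(urls_to_delete(docs))  # ['http://a/', 'http://c/']
```

Documents with a unique digest are never touched, so only groups with two or more members contribute deletions.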

> Dedup support for Solr
> ----------------------
>
>                 Key: NUTCH-684
>                 URL: https://issues.apache.org/jira/browse/NUTCH-684
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>         Attachments: solrdedup.patch
>
>
> After NUTCH-442, Nutch can now index to both Solr and Lucene. However, the duplicate deletion feature (based on digests) is only available for Lucene. It should also be available for Solr.
