Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2009/01/30 17:35:02 UTC

[jira] Created: (NUTCH-684) Dedup support for Solr

Dedup support for Solr
----------------------

                 Key: NUTCH-684
                 URL: https://issues.apache.org/jira/browse/NUTCH-684
             Project: Nutch
          Issue Type: New Feature
          Components: indexer
            Reporter: Doğacan Güney
            Assignee: Doğacan Güney


After NUTCH-442, Nutch can index to both Solr and Lucene. However, the duplicate deletion feature (based on digests) is only available for Lucene. It should also be available for Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lihachev updated NUTCH-684:
----------------------------------

    Attachment: NUTCH-684_solrdedup_v2.patch


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675316#action_12675316 ] 

Doğacan Güney commented on NUTCH-684:
-------------------------------------

Oh, about javadocs. I agree with you on class-level javadocs, but do we really need javadocs for public methods? They are rather straightforward: map, reduce, etc.


[jira] Issue Comment Edited: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675260#action_12675260 ] 

dmitry.lihachev edited comment on NUTCH-684 at 2/19/09 10:40 PM:
-----------------------------------------------------------------

Produce a little more log output for SolrDeleteDuplicates

      was (Author: dmitry.lihachev):
    Produce a little more log output
  

[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675232#action_12675232 ] 

Dmitry Lihachev commented on NUTCH-684:
---------------------------------------

This patch works for me too.


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675309#action_12675309 ] 

Andrzej Bialecki  commented on NUTCH-684:
-----------------------------------------

A few comments on this patch (and on other closely related classes in o.a.n.i.solr):

* we need javadocs in this patch - both class-level and for public methods. The class-level javadoc should contain pseudo-code to illustrate the selection process (see o.a.n.i.DeleteDuplicates for an example).

* there is a silent assumption that Solr schema uses "id" field as unique key, and that this field contains the URL of the document. First, shouldn't this be "url" field? Because as far as I can see the field name "id" is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed something. At least this assumption should be spelled out in javadocs, both on the indexing side and on the dedup side. (Actually, we should have added an example of the minimum required Solr schema when the original Nutch/Solr integration was committed)

* field names should be constants and not magic literals, they should come either from o.a.n.metadata.Nutch or be defined in SolrConstants.

* SolrServer.deleteById() creates and sends UpdateRequest containing just this single id. This is inefficient, especially in our case where the number of deletes may be significant. Perhaps this patch works sufficiently well for now, but it should be improved (either here or in a separate issue) by using a single UpdateRequest per reduce task, and calling SolrServer.request(UpdateRequest) with the accumulated id-s.
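
As a minimal sketch of the "constants, not magic literals" point above: the class name SolrFieldNames and the values "boost" and "tstamp" are assumptions made for illustration only; "id", "url" and the digest are the fields named in this thread. This is not the contents of the real SolrConstants.

```java
/**
 * Hypothetical holder for Solr field names, so callers never repeat
 * string literals. Names and values are illustrative assumptions.
 */
final class SolrFieldNames {
    static final String ID_FIELD = "id";            // unique key (assumed to hold the URL)
    static final String URL_FIELD = "url";          // document URL
    static final String DIGEST_FIELD = "digest";    // content digest used for dedup
    static final String BOOST_FIELD = "boost";      // assumed name of the score field
    static final String TIMESTAMP_FIELD = "tstamp"; // assumed name of the timestamp field

    private SolrFieldNames() {
        // constants only, no instances
    }

    /** Comma-separated list of the fields a dedup job would need to read. */
    static String dedupFieldList() {
        return String.join(",", ID_FIELD, DIGEST_FIELD, BOOST_FIELD, TIMESTAMP_FIELD);
    }
}
```

With the names in one place, renaming a field in the schema means touching a single constant rather than every query and reader.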


[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-684:
--------------------------------

    Attachment: solrdedup_v2.patch

New version:

* Added field names to SolrConstants, although only SolrDeleteDuplicates uses them.

* Also, I didn't implement a conf option for the unique key. It is a very good idea, but it also requires changes to SolrIndexer and other classes, and I didn't want to do that this late in the release cycle.

* #reduce now uses an UpdateRequest so that deletes are queued and sent to the server in batches.

* Added javadoc

* Updated bin/nutch with new command solrdedup
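
The batching idea in the #reduce item above can be sketched roughly as follows. This is not the patch's code: DeleteSink and BatchedDeleter are hypothetical names, and the sink stands in for whatever call actually ships one batch of ids to the Solr server, so the example stays free of any SolrJ dependency. The batch size is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the call that sends one batch of deletes to the server. */
interface DeleteSink {
    void sendDeletes(List<String> ids);
}

/** Queues ids and hands them to the sink in batches instead of one request per id. */
final class BatchedDeleter {
    private final DeleteSink sink;
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();

    BatchedDeleter(DeleteSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    /** Queue one id; send the batch once it is full. */
    void delete(String id) {
        pending.add(id);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Send whatever is queued; call this when the reduce task finishes. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.sendDeletes(new ArrayList<>(pending)); // copy so the sink owns its list
        pending.clear();
    }
}
```

The final flush() matters: without it, a last partial batch would never reach the server.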


[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-684:
--------------------------------

    Attachment: solrdedup.patch

First version of a Solr dedup feature. I haven't tested this patch much yet, so if you use it, it may blow up your computer.

I first thought about making duplicate deletion a generic class with Solr and Lucene backends. However, Lucene and Solr are so different in this regard that it was much easier to just write a new Solr dedup class.

Since URLs are assumed to be unique in Solr, SolrDeleteDuplicates only deletes URLs that share a digest, choosing which to delete based on score. If two URLs have the same digest and the same score, the one with the later timestamp stays.
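
The selection rule described above might look roughly like this. It is a sketch under stated assumptions, not the code in the patch: DocRecord and DedupSketch are hypothetical names, and it assumes that within a digest group the document with the higher score is the one kept, with the later timestamp breaking ties.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical view of the fields the dedup step needs from one indexed document. */
final class DocRecord {
    final String id;     // unique key, assumed to be the URL
    final String digest; // content digest
    final float score;   // document score
    final long tstamp;   // timestamp in milliseconds

    DocRecord(String id, String digest, float score, long tstamp) {
        this.id = id;
        this.digest = digest;
        this.score = score;
        this.tstamp = tstamp;
    }
}

final class DedupSketch {
    /** True if candidate should replace current as the document to keep. */
    static boolean beats(DocRecord candidate, DocRecord current) {
        if (candidate.score != current.score) {
            return candidate.score > current.score; // assumption: higher score wins
        }
        return candidate.tstamp > current.tstamp;   // tie: later timestamp stays
    }

    /** Ids of every document that is not the keeper of its digest group. */
    static List<String> idsToDelete(List<DocRecord> docs) {
        Map<String, DocRecord> keepers = new HashMap<>();
        List<String> deletes = new ArrayList<>();
        for (DocRecord doc : docs) {
            DocRecord keeper = keepers.get(doc.digest);
            if (keeper == null) {
                keepers.put(doc.digest, doc);   // first document seen for this digest
            } else if (beats(doc, keeper)) {
                deletes.add(keeper.id);         // old keeper loses
                keepers.put(doc.digest, doc);
            } else {
                deletes.add(doc.id);            // newcomer loses
            }
        }
        return deletes;
    }
}
```

Documents with a digest that nobody else shares are never touched, which matches the premise that URLs themselves are already unique in the index.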


[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lihachev updated NUTCH-684:
----------------------------------

    Attachment: NUTCH-684_bin_nutch.patch

patch for bin/nutch

so we can write
{{bin/nutch solrdedup <solrurl>}}


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680194#action_12680194 ] 

Andrzej Bialecki  commented on NUTCH-684:
-----------------------------------------

Yes, I'm aware of this functionality. At this point, however, I thought that it would only complicate things, because users would have to install Nutch classes on Solr in order to use the Signature implementations that we use. This is of course an open issue that we should investigate after the 1.0 release.


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680374#action_12680374 ] 

Hudson commented on NUTCH-684:
------------------------------

Integrated in Nutch-trunk #748 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/748/])
     - Dedup support for Solr



[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675323#action_12675323 ] 

Andrzej Bialecki  commented on NUTCH-684:
-----------------------------------------

IMHO it would be good to have this functionality in 1.0, and the patch is very close.

Ok, how about the following:

* we make the name of the unique field configurable, and provide a default value in nutch-default.xml, which is consistent with the one provided in the example schema.xml (yes, we should add an example schema, and the one in NUTCH-442 looks good enough).

* the UpdateRequest improvement: it's up to you whether to do it here or separately. It would certainly be nice to have.

* javadocs: yeah, map/reduce/configure are obvious, and good javadocs exist in superclasses. Same for bean-like getters/setters. Other public methods should be documented, so that in half a year we still know what they are for and we understand the arguments they expect.
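
A configurable unique field with a default could be read along these lines. This is only a sketch: the property name "solr.unique.key" and the class UniqueKeyConfig are hypothetical, and java.util.Properties stands in for the real configuration object so the example is self-contained. The default "id" follows the schema discussed in this thread.

```java
import java.util.Properties;

/** Hypothetical helper that resolves the name of the unique key field. */
final class UniqueKeyConfig {
    // Hypothetical property name; not an existing Nutch setting.
    static final String UNIQUE_KEY_PROPERTY = "solr.unique.key";
    // Default chosen to match the "id" field of the example schema.
    static final String DEFAULT_UNIQUE_KEY = "id";

    private UniqueKeyConfig() {
    }

    /** Returns the configured unique key field, or the default if unset. */
    static String uniqueKey(Properties conf) {
        return conf.getProperty(UNIQUE_KEY_PROPERTY, DEFAULT_UNIQUE_KEY);
    }
}
```

Both the indexing side and the dedup side would need to call the same lookup, otherwise they could disagree about which field identifies a document.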


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675311#action_12675311 ] 

Dmitry Lihachev commented on NUTCH-684:
---------------------------------------

bq. there is a silent assumption that Solr schema uses "id" field as unique key, and that this field contains the URL of the document. First, shouldn't this be "url" field? Because as far as I can see the field name "id" is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed something. At least this assumption should be spelled out in javadocs, both on the indexing side and on the dedup side. (Actually, we should have added an example of the minimum required Solr schema when the original Nutch/Solr integration was committed)

"id" field defined in schema.xml (NUTCH-422)


[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lihachev updated NUTCH-684:
----------------------------------

    Attachment:     (was: NUTCH-684_solrdedup_v2.patch)


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680173#action_12680173 ] 

Shalin Shekhar Mangar commented on NUTCH-684:
---------------------------------------------

Just found this issue from Sami's post on Lucid blog. Are you guys aware of the Deduplication feature in Solr trunk?

http://wiki.apache.org/solr/Deduplication and SOLR-799


[jira] Commented: (NUTCH-684) Dedup support for Solr

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675315#action_12675315 ] 

Doğacan Güney commented on NUTCH-684:
-------------------------------------

I wasn't thinking of putting this in for 1.0, but if people want this feature I will ready it for 1.0

bq.    *  there is a silent assumption that Solr schema uses "id" field as unique key, and that this field contains the URL of the document. First, shouldn't this be "url" field? Because as far as I can see the field name "id" is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed something. At least this assumption should be spelled out in javadocs, both on the indexing side and on the dedup side. (Actually, we should have added an example of the minimum required Solr schema when the original Nutch/Solr integration was committed)

    * field names should be constants and not magic literals, they should come either from o.a.n.metadata.Nutch or be defined in SolrConstants.

This is something I have been thinking about for a while. My assumption was that you didn't have to use the "url" field in your Solr server as the unique field, so I added an extra "id" field (which in NUTCH-442's schema.xml is copied from "url"). But I am no longer sure the extra cost of a field is worth the flexibility.

I agree with you that we should have a Solr schema XML somewhere in our codebase that is officially blessed. I guess NUTCH-442's schema is a good starting point for that, but I am open to suggestions. I will create a new issue for it.

bq.  SolrServer.deleteById() creates and sends UpdateRequest containing just this single id. This is inefficient, especially in our case where the number of deletes may be significant. Perhaps this patch works sufficiently well for now, but it should be improved (either here or in a separate issue) by using a single UpdateRequest per reduce task, and calling SolrServer.request(UpdateRequest) with the accumulated id-s.

Good point. I will send an improved patch.



[jira] Issue Comment Edited: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675311#action_12675311 ] 

dmitry.lihachev edited comment on NUTCH-684 at 2/20/09 2:10 AM:
----------------------------------------------------------------

bq. there is a silent assumption that Solr schema uses "id" field as unique key, and that this field contains the URL of the document. First, shouldn't this be "url" field? Because as far as I can see the field name "id" is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed something. At least this assumption should be spelled out in javadocs, both on the indexing side and on the dedup side. (Actually, we should have added an example of the minimum required Solr schema when the original Nutch/Solr integration was committed)

"id" field defined in schema.xml (NUTCH-442)

      was (Author: dmitry.lihachev):
    bq. there is a silent assumption that Solr schema uses "id" field as unique key, and that this field contains the URL of the document. First, shouldn't this be "url" field? Because as far as I can see the field name "id" is not used anywhere in SolrIndexer/SolrWriter - please correct me if I missed something. At least this assumption should be spelled out in javadocs, both on the indexing side and on the dedup side. (Actually, we should have added an example of the minimum required Solr schema when the original Nutch/Solr integration was committed)

"id" field defined in schema.xml (NUTCH-422)
  

[jira] Closed: (NUTCH-684) Dedup support for Solr

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-684.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Fixed as of rev. 751774.


[jira] Updated: (NUTCH-684) Dedup support for Solr

Posted by "Dmitry Lihachev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lihachev updated NUTCH-684:
----------------------------------

    Attachment: NUTCH-684_solrdedup_v2.patch

Produce a little more log output
