Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2013/10/19 15:36:42 UTC

[jira] [Updated] (NUTCH-656) DeleteDuplicates based on crawlDB only

     [ https://issues.apache.org/jira/browse/NUTCH-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-656:
--------------------------------

    Attachment: NUTCH-656.v2.patch

Attached is a new patch which creates a new db status and a deduplication job that sets the status of a crawldatum to duplicate based on the heuristics described above. The actual deletion of the documents is then done by the CleaningJob task.
This addresses the comments made by Sebastian and should produce results similar to the SOLR indexer's dedup, except that it would be more efficient on large crawls and usable by other indexing backends.
Can you please have a look and let me know what you think? Thanks!
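
To make the review easier, here is a minimal sketch of the selection/marking step. It is illustrative only, not the patch itself; in particular the status constant name STATUS_DB_DUPLICATE and the tie-breaking order are assumptions:

    // Sketch only: given all crawldb entries that share the same signature,
    // keep the "best" one and flag the rest with the new duplicate status.
    import java.util.List;
    import org.apache.nutch.crawl.CrawlDatum;

    public class DedupSketch {
      static void markDuplicates(List<CrawlDatum> sameSignature) {
        CrawlDatum best = null;
        for (CrawlDatum d : sameSignature) {
          if (best == null
              || d.getFetchTime() > best.getFetchTime()    // prefer most recent fetch
              || (d.getFetchTime() == best.getFetchTime()
                  && d.getScore() > best.getScore())) {    // break ties on score
            best = d;
          }
        }
        for (CrawlDatum d : sameSignature) {
          if (d != best) {
            d.setStatus(CrawlDatum.STATUS_DB_DUPLICATE);   // assumed constant name
          }
        }
      }
    }

The CleaningJob then picks up the entries flagged with that status and issues the corresponding deletions against whichever indexing backend is configured.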

> DeleteDuplicates based on crawlDB only 
> ---------------------------------------
>
>                 Key: NUTCH-656
>                 URL: https://issues.apache.org/jira/browse/NUTCH-656
>             Project: Nutch
>          Issue Type: Wish
>          Components: indexer
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-656.patch, NUTCH-656.v2.patch
>
>
> The existing dedup functionality relies on Lucene indices and can't be used when the indexing is delegated to SOLR.
> I was wondering whether we could use the information from the crawlDB instead to detect the URLs to delete, then do the deletions in an indexer-neutral way. As far as I understand, the crawlDB contains all the elements we need for dedup, namely (see the short sketch after the list):
> * URL 
> * signature
> * fetch time
> * score
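>
> A quick illustration of where these four items live in the API (sketch only; a crawldb entry is a Text key holding the URL plus a CrawlDatum value):
>
>     import org.apache.hadoop.io.Text;
>     import org.apache.nutch.crawl.CrawlDatum;
>
>     class CrawlDbFields {
>       // Accessor names are from the existing CrawlDatum API.
>       static void inspect(Text key, CrawlDatum datum) {
>         String url = key.toString();               // URL
>         byte[] signature = datum.getSignature();   // signature (e.g. MD5 of the content)
>         long fetchTime = datum.getFetchTime();     // fetch time
>         float score = datum.getScore();            // score
>       }
>     }
>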
> In map-reduce terms we would have two different jobs:
> * Job 1: read the crawlDB and compare on URLs: keep only the most recent element; the older ones are stored in a file and will be deleted later
> * Job 2: read the crawlDB with a map function generating signatures as keys and URL + fetch time + score as values; the reduce function would depend on which parameter is set (i.e. signature or score) and would output the list of URLs to delete (see the sketch after this list)
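>
> A rough sketch of the second job's map side, using the old mapred API Nutch currently builds on. The tab-separated Text value is a simplification; a real implementation would use a dedicated Writable carrying URL + fetch time + score:
>
>     import java.io.IOException;
>     import org.apache.hadoop.io.BytesWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapred.MapReduceBase;
>     import org.apache.hadoop.mapred.Mapper;
>     import org.apache.hadoop.mapred.OutputCollector;
>     import org.apache.hadoop.mapred.Reporter;
>     import org.apache.nutch.crawl.CrawlDatum;
>
>     // Keys each crawldb entry by its signature so that all candidate
>     // duplicates meet in the same reduce call.
>     public class DedupMapperSketch extends MapReduceBase
>         implements Mapper<Text, CrawlDatum, BytesWritable, Text> {
>       public void map(Text url, CrawlDatum datum,
>                       OutputCollector<BytesWritable, Text> output, Reporter reporter)
>           throws IOException {
>         if (datum.getSignature() == null) return;  // nothing to compare on
>         BytesWritable signature = new BytesWritable(datum.getSignature());
>         Text value = new Text(url + "\t" + datum.getFetchTime() + "\t" + datum.getScore());
>         output.collect(signature, value);
>       }
>     }
>
> The reduce side would then keep one entry per signature (most recent fetch time or highest score, depending on configuration) and emit the URLs of all the others as the deletion list.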
> This assumes that we can then use the URLs to identify documents in the indices.
> Any thoughts on this? Am I missing something?
> Julien



--
This message was sent by Atlassian JIRA
(v6.1#6144)