You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2009/01/21 20:43:59 UTC

[jira] Closed: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

     [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-579.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Patch committed.

Still, I would strongly suggest that people use TextProfileSignature.

To use TextProfileSignature add this to your conf/nutch-site.xml:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>


> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-579.patch
>
>
> When parsing an rss feed, only one post will be indexed per feed.  The reason for this is that the digest, which is calculated for based on the content (or the url if the content is null) is always the same for each post in a feed.
> I noticed this when I was examining my lucene indexes using Luke.  All of the individual feed entries were being indexed properly but then when the dedup step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in the FeedIndexingFilter.java, by adding the following code to the filter function:
> byte[] signature = MD5Hash.digest(url.toString()).getDigest();
> doc.removeField("digest");
> doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));
> This seems to fix the issue as the index now contains the proper number of documents.
> Anyone have any comments on whether this is a good solution or if there is a better solution?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.