Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/11/21 09:50:43 UTC

[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544383 ] 

Doğacan Güney commented on NUTCH-579:
-------------------------------------

Joseph, good point. The parse-feed plugin is meant to be used with TextProfileSignature, or any other signature implementation that uses the parse text and ignores the content. All posts in a feed share the same content (the feed itself) but have different text, so they all get different signatures.

A possible fix may be to change MD5Signature to hash content together with parse-text. This way, posts in a feed will have different signatures but MD5Signature's behaviour will stay approximately the same.
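
To make the idea concrete, here is a minimal, self-contained sketch of hashing content together with parse text. It is not Nutch's MD5Signature and does not use any Nutch classes; the class name CombinedSignatureSketch and the methods signature and toHex are hypothetical, and it assumes the parse text is encoded as UTF-8 before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only: plain JDK code, not the Nutch Signature API.
class CombinedSignatureSketch {

    // MD5 over the raw content bytes followed by the parse text
    // (UTF-8 encoding of the text is an assumption of this sketch).
    static byte[] signature(byte[] content, String parseText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            if (content != null) {
                md5.update(content);
            }
            if (parseText != null) {
                md5.update(parseText.getBytes(StandardCharsets.UTF_8));
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Lower-case hex rendering of a digest.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two posts from one feed: same content, different parse text.
        byte[] feed = "<rss>...</rss>".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(signature(feed, "First post")));
        System.out.println(toHex(signature(feed, "Second post")));
    }
}
```

With this scheme, two pages still get the same signature only when both their content and their parse text match, while posts that share a feed's content but differ in text are told apart.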

Anyway, for now, you can just change the db.signature.class option to a signature implementation such as TextProfileSignature.

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason for this is that the digest, which is calculated based on the content (or the URL if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the individual feed entries were being indexed properly, but when the dedup step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by adding the following code to the filter function:
> byte[] signature = MD5Hash.digest(url.toString()).getDigest();
> doc.removeField("digest");
> doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));
> This seems to fix the issue as the index now contains the proper number of documents.
> Anyone have any comments on whether this is a good solution or if there is a better solution?
