Posted to dev@nutch.apache.org by "Yury (JIRA)" <ji...@apache.org> on 2009/01/09 20:35:59 UTC

[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662490#action_12662490 ] 

Yury commented on NUTCH-579:
----------------------------

Hi!

I have the same problem with the feed parser. I crawl a LiveJournal feed and FeedParser parses it. The ParseResult contains all items of the channel, but the index contains only the channel header. Joseph's solution unfortunately doesn't work: FeedIndexingFilter processes only the channel header.

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason for this is that the digest, which is calculated based on the content (or the URL if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the individual feed entries were being indexed properly, but when the dedup step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by adding the following code to the filter function:
>
>     byte[] signature = MD5Hash.digest(url.toString()).getDigest();
>     doc.removeField("digest");
>     doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));
>
> This seems to fix the issue as the index now contains the proper number of documents.
> Anyone have any comments on whether this is a good solution or if there is a better solution?
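
To make the reported behaviour concrete, here is a minimal standalone sketch of why a content-based digest collapses all entries of a feed while a URL-based digest keeps them apart. It does not use any Nutch or Hadoop classes: the class DigestCollisionSketch, the helpers md5Hex and survivorsAfterDedup, and the example URLs are all hypothetical, and the "dedup" here is simplified to keeping one document per distinct digest, which is an assumption about the real dedup step.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class DigestCollisionSketch {

    // Hex-encoded MD5 of a string. Stands in for the signature calculation;
    // this is plain JDK code, not the Nutch/Hadoop MD5Hash class.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Simplified dedup (assumption): one surviving document per distinct digest.
    static int survivorsAfterDedup(List<String> digests) {
        return new HashSet<>(digests).size();
    }

    public static void main(String[] args) {
        // Hypothetical feed: every entry is parsed out of the same fetched content.
        String feedContent = "<rss><channel>three posts</channel></rss>";
        List<String> entryUrls = Arrays.asList(
                "http://example.org/post/1",
                "http://example.org/post/2",
                "http://example.org/post/3");

        List<String> contentDigests = new ArrayList<>();
        List<String> urlDigests = new ArrayList<>();
        for (String url : entryUrls) {
            contentDigests.add(md5Hex(feedContent)); // same input for every entry
            urlDigests.add(md5Hex(url));             // distinct input per entry
        }

        System.out.println("content-based survivors: " + survivorsAfterDedup(contentDigests));
        System.out.println("url-based survivors:     " + survivorsAfterDedup(urlDigests));
    }
}
```

With the content-based digest all three entries share one value, so only one survives the simplified dedup; with the URL-based digest all three survive. Note that a URL-based digest only separates documents by address, so it no longer detects identical content served under different URLs.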
