Posted to dev@nutch.apache.org by "Joseph Chen (JIRA)" <ji...@apache.org> on 2007/12/19 00:34:43 UTC

[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935 ] 

Joseph Chen commented on NUTCH-579:
-----------------------------------

I changed the db.signature.class setting, and this seems to solve the problem for an initial crawl.

Now I'm seeing a similar problem when I try to merge the results of two crawls.  I performed two separate crawls using the crawl tool and then wanted to merge their results.  Here are the steps I followed:

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduped the index
6) Merged the indexes

I noticed a problem after running the dedup.  My original index had about 8000 documents (corresponding to feed posts) and after merging I ended up with about half that number (4000 documents).

Examining the index via Luke shows that I'm back down to one post per feed: each remaining document has a unique digest value.
When I skip the dedup step (step 5), the number of documents is around 17000, and examining this index shows multiple posts from a feed.
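
To illustrate the suspected effect, here is a minimal standalone sketch using only the JDK. It is not Nutch code: the class DigestDedupSketch, its methods, and the example URLs are hypothetical, and the dedup logic is simplified to "keep one document per digest". It assumes every post from a feed carries the same content, so a content-based digest is identical across posts while a URL-based digest is not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Nutch.
public class DigestDedupSketch {

    // Hex-encoded MD5 of a string, standing in for whatever signature
    // class is configured for the crawl.
    public static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(input.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified dedup: keep the first document seen for each digest.
    // Each doc is {url, content}; useUrl picks the digest input.
    public static int countAfterDedup(String[][] docs, boolean useUrl) {
        Map<String, String> survivors = new LinkedHashMap<>();
        for (String[] doc : docs) {
            String digest = md5Hex(useUrl ? doc[0] : doc[1]);
            survivors.putIfAbsent(digest, doc[0]);
        }
        return survivors.size();
    }

    public static void main(String[] args) {
        // Assumption: all posts of one feed share the same content.
        String feedBody = "<rss>whole feed body</rss>";
        String[][] posts = {
            {"http://example.com/feed/post1", feedBody},
            {"http://example.com/feed/post2", feedBody},
            {"http://example.com/feed/post3", feedBody},
        };
        System.out.println("content-based digest: " + countAfterDedup(posts, false));
        System.out.println("url-based digest: " + countAfterDedup(posts, true));
    }
}
```

With the three example posts, the content-based digest leaves one document after dedup and the URL-based digest leaves all three, which matches the "one post per feed" pattern seen in Luke.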

I searched for the db.signature.class value in the DeleteDuplicates.java class, which is the class that gets called when running bin/nutch dedup, but I didn't see any references to this value.

Any ideas about this issue?

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason for this is that the digest, which is calculated based on the content (or the URL if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the individual feed entries were being indexed properly, but when the dedup step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by adding the following code to the filter function:
> byte[] signature = MD5Hash.digest(url.toString()).getDigest();
> doc.removeField("digest");
> doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));
> This seems to fix the issue as the index now contains the proper number of documents.
> Anyone have any comments on whether this is a good solution or if there is a better solution?
