Posted to dev@nutch.apache.org by "Joseph Chen (JIRA)" <ji...@apache.org> on 2007/11/21 08:41:43 UTC

[jira] Created: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Feed plugin only indexes one post per feed due to identical digest
------------------------------------------------------------------

                 Key: NUTCH-579
                 URL: https://issues.apache.org/jira/browse/NUTCH-579
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.0.0
            Reporter: Joseph Chen


When parsing an RSS feed, only one post is indexed per feed.  The reason is that the digest, which is calculated from the content (or from the URL if the content is null), is always the same for every post in a feed.

I noticed this when I was examining my Lucene indexes using Luke.  All of the individual feed entries were being indexed properly, but when the dedup step ran, my merged index ended up with only one document.

As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by adding the following code to the filter function:

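// Hash the post's URL instead of the shared feed content, so each post gets its own digest.
byte[] signature = MD5Hash.digest(url.toString()).getDigest();
// Replace the digest field that was computed from the feed content.
doc.removeField("digest");
doc.add(new Field("digest", StringUtil.toHexString(signature), Field.Store.YES, Field.Index.NO));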

This seems to fix the issue as the index now contains the proper number of documents.
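
The collision described above can be reproduced outside Nutch. The following is a minimal sketch, assuming plain JDK classes (java.security.MessageDigest) as a stand-in for Nutch's MD5Hash and StringUtil; the class name FeedDigestDemo, the sample feed, and the example.com URLs are hypothetical and used only for illustration. It shows that hashing the shared feed content yields one digest for every post, while hashing each post's URL yields distinct digests.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: plain JDK classes stand in for Nutch's MD5Hash and StringUtil.
class FeedDigestDemo {

    // Returns the MD5 digest of the given text as a lowercase hex string.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical feed: every post parsed out of it shares this raw content.
        String feedContent = "<rss><item>post 1</item><item>post 2</item></rss>";
        String url1 = "http://example.com/post/1";
        String url2 = "http://example.com/post/2";

        // Content-based digest: the same input, so the same digest for both posts.
        System.out.println("content digests equal: "
                + md5Hex(feedContent).equals(md5Hex(feedContent)));
        // URL-based digest: different input per post, so different digests.
        System.out.println("url digests equal: "
                + md5Hex(url1).equals(md5Hex(url2)));
    }
}
```

Note that a URL-based digest only keeps posts apart; it no longer detects two different URLs that carry the same text.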

Does anyone have any comments on whether this is a good solution, or is there a better one?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-579:
--------------------------------

    Attachment: NUTCH-579.patch

This simple patch should fix it.

Instead of looking only at the content, MD5Signature now calculates the hash of content + parse text.

I think this also keeps the intended behavior of MD5Signature, since a tiny change in a page
will still produce a different hash, as before.
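
The idea of hashing content together with parse text can be sketched as follows. This is not the attached patch, only an illustration using plain JDK classes; the class name ContentPlusTextSignature, the method signature, and the sample strings are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustration only: this is not the NUTCH-579 patch, just the idea it describes.
class ContentPlusTextSignature {

    // Hashes the raw content followed by the parse text in a single MD5 pass.
    static byte[] signature(byte[] content, String parseText) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(content);
            md.update(parseText.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical feed content, shared by every post extracted from it.
        byte[] feed = "<rss>whole feed</rss>".getBytes(StandardCharsets.UTF_8);
        byte[] first = signature(feed, "Text of the first post");
        byte[] second = signature(feed, "Text of the second post");
        // The content is identical, but the parse text differs per post.
        System.out.println("posts share a signature: " + Arrays.equals(first, second));
    }
}
```

Because the content still feeds into the hash, any change to the raw page changes the signature, which matches the behavior described above.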



[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666045#action_12666045 ] 

Hudson commented on NUTCH-579:
------------------------------

Integrated in Nutch-trunk #701 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/701/])
     - Feed plugin only indexes one post per feed due to identical digest




[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Yury (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662490#action_12662490 ] 

Yury commented on NUTCH-579:
----------------------------

Hi!

I have the same problem with the feed parser. I crawl a LiveJournal feed and FeedParser parses it. ParseResult contains all items of the channel, but the index contains only the channel header. Joseph's solution unfortunately doesn't work; FeedIndexingFilter processes only the channel's header.



[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544383 ] 

Doğacan Güney commented on NUTCH-579:
-------------------------------------

Joseph, good point. The parse-feed plugin is meant to be used with TextProfileSignature or any other signature implementation that uses parse text and ignores content. Since all posts have the same content (the feed) but different text, they will all have different signatures.

A possible fix may be to change MD5Signature to hash content together with parse-text. This way, posts in a feed will have different signatures but MD5Signature's behaviour will stay approximately the same.

Anyway, for now, you can just change the db.signature.class option.
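
To illustrate what a signature that uses parse text and ignores content looks like, here is a simplified sketch. It is not Nutch's TextProfileSignature code, which is more elaborate; the class name SimpleTextProfile, the token-length threshold, and the profile format are all assumptions chosen for the example. It builds a small term-frequency profile of the text and hashes that profile with MD5.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a simplified text-based signature, not Nutch's implementation.
class SimpleTextProfile {

    // Builds "token count" lines, most frequent token first, ties in alphabetical order.
    static String profile(String parseText) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : parseText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            // Assumed threshold: ignore empty and very short tokens.
            if (token.length() > 2) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the TreeMap's alphabetical order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : entries) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Returns the MD5 of the profile as a lowercase hex string.
    static String signature(String parseText) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(profile(parseText).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two posts from the same feed: same raw content, different parse text.
        System.out.println(signature("Release notes for the first post"));
        System.out.println(signature("Travel diary from the second post"));
    }
}
```

Because only the text profile is hashed, posts from the same feed get different signatures as long as their text differs, while formatting-only differences such as case or punctuation do not change the result.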



[jira] Closed: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-579.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Patch committed.

Still, I would strongly suggest that people use TextProfileSignature.

To use TextProfileSignature, add this to your conf/nutch-site.xml:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>




[jira] Commented: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Joseph Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935 ] 

Joseph Chen commented on NUTCH-579:
-----------------------------------

I changed the db.signature.class and this seems to solve the problem when I first do a crawl.

Now I'm seeing a similar problem when I try to merge the results of two crawls.  I performed two separate crawls using the crawl tool.  I wanted to merge the results of the two crawls.  Here are the steps I did:

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduped the index
6) Merged the indexes

I noticed a problem after running the dedup.  My original index had about 8000 documents (corresponding to feed posts) and after merging I ended up with about half that number (4000 documents).

Examining the index via Luke shows that I'm back down to one post per feed: each document has a unique digest value.
When I skip the dedup step (step 5), the number of documents is around 17000, and examining this index shows multiple posts from a feed.

I searched for the db.signature.class value in the DeleteDuplicates.java class, which is the class that gets called when running bin/nutch dedup, but I didn't see any references to this value.
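
The effect observed here is consistent with a dedup step that keeps one document per distinct digest value stored in the index. The following is a simplified model of that behavior, not the DeleteDuplicates implementation, which may apply additional criteria; the class DigestDedupDemo, its Doc type, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified model of digest-based dedup, not Nutch's DeleteDuplicates.
class DigestDedupDemo {

    // Hypothetical minimal index document: just a URL and its stored digest.
    static final class Doc {
        final String url;
        final String digest;

        Doc(String url, String digest) {
            this.url = url;
            this.digest = digest;
        }
    }

    // Keeps the first document seen for each digest and drops the rest.
    static List<Doc> dedupByDigest(List<Doc> docs) {
        Map<String, Doc> firstByDigest = new LinkedHashMap<>();
        for (Doc d : docs) {
            firstByDigest.putIfAbsent(d.digest, d);
        }
        return new ArrayList<>(firstByDigest.values());
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        // Three posts from one feed that were all indexed with the same digest.
        index.add(new Doc("http://example.com/post/1", "aaaa"));
        index.add(new Doc("http://example.com/post/2", "aaaa"));
        index.add(new Doc("http://example.com/post/3", "aaaa"));
        System.out.println("documents after dedup: " + dedupByDigest(index).size());
    }
}
```

In this model the dedup step never computes a signature itself; it only compares the digest values already written at indexing time, so whichever signature class was active when those values were produced determines the outcome.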

Any ideas about this issue?



[jira] Issue Comment Edited: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Posted by "Yury (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662490#action_12662490 ] 

yury edited comment on NUTCH-579 at 1/9/09 11:36 AM:
-----------------------------------------------------

Hi!

I have the same problem with the feed parser. I crawl a LiveJournal feed and FeedParser parses it. ParseResult contains all items of the channel, but the index contains only the channel header. Joseph's solution unfortunately doesn't work; FeedIndexingFilter processes only the channel's header.
