Posted to user@nutch.apache.org by Andre Pautz <a-...@gmx.de> on 2010/08/23 18:11:45 UTC

obvious duplicates with different hash-values

Dear list,

I have a problem with removing duplicates from my Nutch index. If I understood it correctly, the dedup option should do the work for me, i.e. remove entries with the same URL or the same content (MD5 hash). But unfortunately it doesn't.

The strange thing is that if I check the index with Luke, the pages in question do in fact have different hash sums and different URLs. This of course explains why the dedup option "fails". But if I take two of these URLs, which obviously lead to the same content, store the pages with all their content locally and calculate the hash with md5sum, the result is that they have the same hash value and are binary identical.

Do you have any hints as to why these pages are indexed with different hash values? What point am I missing here?

Example URLs:
1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true

What I've read so far about the TextProfileSignature class suggests it would not help me that much, since many of the pages I am trying to index are not that text heavy. Since the indexing took quite some time and the number of duplicates is large, I would be thankful for any idea on how to remove these duplicates.
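
For reference, the same check can be reproduced programmatically. A minimal Java sketch, done entirely outside of Nutch and assuming the raw bytes are fetched over plain HTTP, could look like this:

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;

// Minimal sketch (outside of Nutch): fetch two of the URLs above and compare
// the MD5 digests of the raw bytes, mirroring the md5sum check described above.
public class CompareMd5 {

  static String md5Of(String url) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = new URL(url).openStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(md5Of("http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"));
    System.out.println(md5Of("http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true"));
  }
}

Both calls print the same digest, which matches what md5sum reports for the locally stored copies.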

Thanks for any suggestions,
André
-- 
GMX DSL SOMMER-SPECIAL: Surf & Phone Flat 16.000 für nur 19,99 €/mtl.!*
http://portal.gmx.net/de/go/dsl

Re: obvious duplicates with different hash-values

Posted by Andre Pautz <a-...@gmx.de>.
Hello,

I was running the indexing with Nutch 1.0, so most probably the mentioned bug NUTCH-835 is the problem. I will give version 1.2 a try.
If that doesn't help, I will try the TextProfileSignature as Reinhard Schwab suggested.

Again, thanks to everyone for their help.

Regards, André




-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Monday, 23 August 2010 22:38
To: user@nutch.apache.org
Subject: Re: obvious duplicates with different hash-values

On 2010-08-23 18:11, Andre Pautz wrote:
> Dear list,
>
> i have a problem with removing duplicates from my nutch index. If i understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't.
>
> The strange thing is, that if i check the index with luke, the pages in doubt do have in fact different hash sums and different URLs. This of course explains why the dedup option "fails". But if i take two of these URLs, which lead obviously to the same content, store the pages with all their content locally and calculate the hash with md5sum, the result is that they have the same hash value and are binary identical.
>
> Do you have any hints why these pages are indexed with different hash values? What point am i missing here?
>
> Example URLs:
> 1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> What i've read so far about the TextProfileSignature class is that it would not help me that much, since many of the pages i am trying to index are not that text heavy. Since the indexing took quite some time and the amount of duplicates is large i would be thankful for any idea on how to remove these duplicates.

You didn't say what version of Nutch you are using, but take a look at this issue:

https://issues.apache.org/jira/browse/NUTCH-835

This has been fixed in 1.2 and 2.0.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: obvious duplicates with different hash-values

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-08-23 18:11, Andre Pautz wrote:
> Dear list,
>
> i have a problem with removing duplicates from my nutch index. If i understood it right, then the dedup option should do the work for me, i.e. remove entries with the same URL or same content (MD5 hash). But unfortunately it doesn't.
>
> The strange thing is, that if i check the index with luke, the pages in doubt do have in fact different hash sums and different URLs. This of course explains why the dedup option "fails". But if i take two of these URLs, which lead obviously to the same content, store the pages with all their content locally and calculate the hash with md5sum, the result is that they have the same hash value and are binary identical.
>
> Do you have any hints why these pages are indexed with different hash values? What point am i missing here?
>
> Example URLs:
> 1) http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2) http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3) http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> What i've read so far about the TextProfileSignature class is that it would not help me that much, since many of the pages i am trying to index are not that text heavy. Since the indexing took quite some time and the amount of duplicates is large i would be thankful for any idea on how to remove these duplicates.

You didn't say what version of Nutch you are using, but take a look at 
this issue:

https://issues.apache.org/jira/browse/NUTCH-835

This has been fixed in 1.2 and 2.0.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: obvious duplicates with different hash-values

Posted by reinhard schwab <re...@aon.at>.
Use another signature implementation. It is tolerant of small changes:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>
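
For intuition, here is a rough Java sketch of the idea behind a profile-based signature. This is a simplified illustration only, not the actual org.apache.nutch.crawl.TextProfileSignature code: tokenize the extracted text, quantize the term frequencies so that rare tokens (and therefore small edits, such as embedded session identifiers) drop out of the profile, then hash the sorted profile. Two pages whose text differs only in such details end up with the same signature.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of a profile-based signature; not the actual
// org.apache.nutch.crawl.TextProfileSignature implementation.
public class ProfileSignatureSketch {

  public static byte[] signature(String text) throws Exception {
    // Tokenize: lowercase alphanumeric runs, ignoring very short tokens.
    Map<String, Integer> freq = new HashMap<>();
    for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
      if (tok.length() >= 2) {
        freq.merge(tok, 1, Integer::sum);
      }
    }
    // Quantize frequencies relative to the most frequent token so that
    // rare tokens (and therefore small edits) drop out of the profile.
    int max = 0;
    for (int f : freq.values()) {
      max = Math.max(max, f);
    }
    int quant = Math.max(1, max / 100);
    List<String> profile = new ArrayList<>();
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      int quantized = (e.getValue() / quant) * quant;
      if (quantized > 0) {
        profile.add(e.getKey() + ":" + quantized);
      }
    }
    // Sort for a stable representation, then hash it.
    Collections.sort(profile);
    return MessageDigest.getInstance("MD5")
        .digest(String.join(" ", profile).getBytes(StandardCharsets.UTF_8));
  }
}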

Scott Gonyea wrote:
> Were I to guess, the md5 hash isn't a hash of the content but, rather, of
> the CrawlDatum object that Nutch stores.
>
> Scott
>
> On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <a-...@gmx.de> wrote:
>
>   
>> Dear list,
>>
>> i have a problem with removing duplicates from my nutch index. If i
>> understood it right, then the dedup option should do the work for me, i.e.
>> remove entries with the same URL or same content (MD5 hash). But
>> unfortunately it doesn't.
>>
>> The strange thing is, that if i check the index with luke, the pages in
>> doubt do have in fact different hash sums and different URLs. This of course
>> explains why the dedup option "fails". But if i take two of these URLs,
>> which lead obviously to the same content, store the pages with all their
>> content locally and calculate the hash with md5sum, the result is that they
>> have the same hash value and are binary identical.
>>
>> Do you have any hints why these pages are indexed with different hash
>> values? What point am i missing here?
>>
>> Example URLs:
>> 1)
>> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 2)
>> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>> 3)
>> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>>
>> What i've read so far about the TextProfileSignature class is that it would
>> not help me that much, since many of the pages i am trying to index are not
>> that text heavy. Since the indexing took quite some time and the amount of
>> duplicates is large i would be thankful for any idea on how to remove these
>> duplicates.
>>
>> Thanks for any suggestions,
>> André
>> --
>> GMX DSL SOMMER-SPECIAL: Surf & Phone Flat 16.000 für nur 19,99 €/mtl.!*
>> http://portal.gmx.net/de/go/dsl
>>
>>     
>
>   


Re: obvious duplicates with different hash-values

Posted by Scott Gonyea <me...@sgonyea.com>.
Were I to guess, the md5 hash isn't a hash of the content but, rather, of
the CrawlDatum object that Nutch stores.
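
If that guess is right, the stored signature would be computed over Nutch's own record for the page (which carries the URL and fetch metadata) rather than over the fetched bytes, so byte-identical pages reached under different URLs could still get different signatures. A hypothetical illustration of that difference (the "record" format below is made up for the example, not the actual CrawlDatum serialization):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Hypothetical illustration only, not Nutch code: hashing the raw page bytes
// gives identical digests for byte-identical pages, while hashing a record
// that also carries the URL does not.
public class SignatureGuess {

  static byte[] md5(String data) throws Exception {
    return MessageDigest.getInstance("MD5").digest(data.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws Exception {
    String page = "<html>identical content</html>";
    // Made-up "record" format that mixes the URL into the hashed data.
    String recordA = "http://example.org/nn_1/page.html" + "|" + page;
    String recordB = "http://example.org/nn_2/page.html" + "|" + page;

    System.out.println(Arrays.equals(md5(page), md5(page)));        // true: content-only hash
    System.out.println(Arrays.equals(md5(recordA), md5(recordB)));  // false: URL mixed into the hash
  }
}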

Scott

On Mon, Aug 23, 2010 at 9:11 AM, Andre Pautz <a-...@gmx.de> wrote:

> Dear list,
>
> i have a problem with removing duplicates from my nutch index. If i
> understood it right, then the dedup option should do the work for me, i.e.
> remove entries with the same URL or same content (MD5 hash). But
> unfortunately it doesn't.
>
> The strange thing is, that if i check the index with luke, the pages in
> doubt do have in fact different hash sums and different URLs. This of course
> explains why the dedup option "fails". But if i take two of these URLs,
> which lead obviously to the same content, store the pages with all their
> content locally and calculate the hash with md5sum, the result is that they
> have the same hash value and are binary identical.
>
> Do you have any hints why these pages are indexed with different hash
> values? What point am i missing here?
>
> Example URLs:
> 1)
> http://www.bbr.bund.de/cln_015/nn_343756/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 2)
> http://www.bbr.bund.de/cln_015/nn_21196/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
> 3)
> http://www.bbr.bund.de/cln_015/nn_21210/sid_A75D796CCCFFEBE7CDDD46DC26BEC98E/DE/BaufachlicherService/baufachlicherService__node.html?__nnn=true
>
> What i've read so far about the TextProfileSignature class is that it would
> not help me that much, since many of the pages i am trying to index are not
> that text heavy. Since the indexing took quite some time and the amount of
> duplicates is large i would be thankful for any idea on how to remove these
> duplicates.
>
> Thanks for any suggestions,
> André
> --
> GMX DSL SOMMER-SPECIAL: Surf & Phone Flat 16.000 für nur 19,99 €/mtl.!*
> http://portal.gmx.net/de/go/dsl
>