Posted to user@nutch.apache.org by J S <ve...@hotmail.com> on 2005/06/12 10:31:44 UTC

unable to remove duplicates

Hi,

Some of my searches return duplicate pages, so I wanted to remove these. I'm 
not exactly sure how to do this, but I tried the command below and restarted 
Tomcat, and I still got the same results.

I'm using a Nutch nightly build from about two weeks ago. Just wondered if I'm 
doing something wrong here?

Thanks.

$ nutch dedup -local -workingdir /www/nutch/planetbp 
/www/nutch/planetbp/segments

run java in /usr/j2sdk1.4.2_03
050612 092326 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index)
050612 092326 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index)
050612 092326 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index)
050612 092326 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index)
050612 092326 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index)
050612 092327 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index)
050612 092327 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index)
050612 092327 Clearing old deletions in 
/www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index)
050612 092327 Reading url hashes...
050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
050612 092330 Sorting url hashes...
050612 092331 Deleting url duplicates...
050612 092331 Deleted 0 url duplicates.
050612 092331 Reading content hashes...
050612 092331 Sorting content hashes...
050612 092331 Deleting content duplicates...
050612 092331 Deleted 309 content duplicates.
050612 092332 Duplicate deletion complete locally.  Now returning to NFS...
050612 092332 DeleteDuplicates complete
$
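
To make the two passes in that log concrete: dedup first groups documents by a
hash of the URL, then by a hash of the raw page content, and deletes every
document in a group except one. The class below is a minimal, self-contained
illustration of that idea only - it is not Nutch's actual DeleteDuplicates
code, and the URLs and page bodies in it are made up.

// Illustration only, not Nutch's DeleteDuplicates implementation.
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class TwoPassDedupSketch {

    // Hex MD5 of a string, used as the grouping key in both passes.
    static String md5(String s) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < d.length; i++) {
            String h = Integer.toHexString(d[i] & 0xff);
            if (h.length() == 1) sb.append('0');
            sb.append(h);
        }
        return sb.toString();
    }

    // Counts how many entries share a hash with an earlier entry,
    // i.e. how many would be deleted as duplicates.
    static int countDuplicates(String[] keys) throws Exception {
        Set seen = new HashSet();
        int deleted = 0;
        for (int i = 0; i < keys.length; i++) {
            if (!seen.add(md5(keys[i]))) deleted++;
        }
        return deleted;
    }

    public static void main(String[] args) throws Exception {
        String[] urls     = { "http://host/a.html", "http://host/A.html" };  // differ only in case
        String[] contents = { "<html>same</html>",  "<html>same</html>" };   // byte-identical
        System.out.println("url duplicates deleted:     " + countDuplicates(urls));     // 0
        System.out.println("content duplicates deleted: " + countDuplicates(contents)); // 1
    }
}

The pair of case-variant URLs yields 0 URL duplicates and 1 content duplicate,
which is the same pattern as the "Deleted 0 url duplicates" / "Deleted 309
content duplicates" lines above.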



RE: unable to remove duplicates

Posted by Chirag Chaman <de...@filangy.com>.
They may in fact be two different URLs -- Unix/Linux servers would treat them as
separate paths. For example, we use capitalization as a hashing mechanism, so
90% of the time I don't think it will be a problem.

That being said, the content check should have flagged it for removal during
the merge, so you won't end up with dups even if the URLs are not the same.

CC-

--------------------------------------------
Filangy, Inc.
We're Improving Search!
http://filangy.com/
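
A quick plain-JDK check of that point, using made-up host names: two URLs that
differ only in the case of one path character are distinct identifiers, even
though the host part compares case-insensitively.

// Host comparison is effectively case-insensitive, but the path is not, which
// is why the url-hash pass reports "Deleted 0 url duplicates" for such pairs.
import java.net.URL;

public class UrlCaseCheck {
    public static void main(String[] args) throws Exception {
        URL a = new URL("http://wiki.example.com/Docs/index.html");
        URL b = new URL("http://wiki.example.com/docs/index.html");
        System.out.println("same file? " + a.sameFile(b));                             // false: paths differ
        System.out.println("same host? " + a.getHost().equalsIgnoreCase(b.getHost())); // true
    }
}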
 


-----Original Message-----
From: J S [mailto:vervoom@hotmail.com] 
Sent: Monday, June 13, 2005 11:22 AM
To: nutch-user@incubator.apache.org
Subject: Re: unable to remove duplicates

Thanks Piotr,

I finally found the problem. The two URLs were the same except for one letter
in the path, which was capitalised!

I didn't think URLs were case-sensitive, although maybe the paths are? Can
Nutch be configured to ignore the case in a URL?

JS.

>Hello,
>I have no idea why it is not working properly.
>You can look at the contents of the merged Lucene index with Luke
>(http://www.getopt.org/luke/) to verify its content. (Judging from the logs, I 
>assume you are using the intranet crawling method from the tutorial.)
>You can also print all URLs in the fetchlists of your segments using the "nutch 
>fetchlist" command to see if they are present there.
>
>It might be good to look at the segment creation process as some 
>numbers in your log look strange:
>050613 123307 DONE indexing segment 20050613123024: total 3 records
>050613 123307 DONE indexing segment 20050613123053: total 0 records
>
>So it looks like your segments are really small - I do not think it is 
>a problem for deduplication, but it looks suspicious.
>How do you create your segments? What parameters do you use?
>
>Regards
>Piotr
>
>J S wrote:
>>Hi Piotr,
>>
>>Thanks for replying. I understand the terminology better now! I was 
>>referring to url duplicates. I've rerun the crawl and I'm still 
>>getting
>>them:
>>
>>050613 123307 * Optimizing index...
>>050613 123307 * Moving index to NFS if needed...
>>050613 123307 DONE indexing segment 20050613123024: total 3 records in
>>0.242 s (Infinity rec/s).
>>050613 123307 done indexing
>>050613 123307 indexing segment: 
>>/www/nutch-nightly/planetbp.tmp/segments/20050613123053
>>050613 123307 * Opening segment 20050613123053
>>050613 123307 * Indexing segment 20050613123053
>>050613 123307 * Optimizing index...
>>050613 123307 * Moving index to NFS if needed...
>>050613 123307 DONE indexing segment 20050613123053: total 0 records in
>>0.021 s (NaN rec/s).
>>050613 123307 done indexing
>>050613 123307 Reading url hashes...
>>050613 123308 Sorting url hashes...
>>050613 123308 Deleting url duplicates...
>>050613 123308 Deleted 0 url duplicates.
>>050613 123308 Reading content hashes...
>>050613 123309 Sorting content hashes...
>>050613 123309 Deleting content duplicates...
>>050613 123309 Deleted 267 content duplicates.
>>050613 123310 Duplicate deletion complete locally.  Now returning to 
>>NFS...
>>050613 123310 DeleteDuplicates complete
>>050613 123310 Merging segment indexes...
>>
>>
>>>
>>>Hello,
>>>It looks like the deduplication process removed some (309) duplicates. 
>>>They were content duplicates - different URLs but identical page content.
>>>There were no URL duplicates (every URL was different). So what do you 
>>>really mean by "duplicate pages" that are returned by your search?
>>>Do they have identical URLs or identical content?
>>>One more thing to remember is that Nutch deduplication currently 
>>>removes pages that have identical content - even the smallest difference 
>>>in the page source (including the URL, comments, etc.) will be treated as a
>>>difference.
>>>So please verify that the pages you see as duplicates are really identical.
>>>Regards
>>>Piotr
>>>
>>>
>>>
>>>J S wrote:
>>>
>>>>Hi,
>>>>
>>>>Some of my searches return duplicate pages, so I wanted to remove these.

>>>>I'm not exactly sure how to do this but tried the command below, and 
>>>>restarted Tomcat, but still got the same results.
>>>>
>>>>I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if 
>>>>I'm doing something wrong here?
>>>>
>>>>Thanks.
>>>>
>>>>$ nutch dedup -local -workingdir /www/nutch/planetbp 
>>>>/www/nutch/planetbp/segments
>>>>
>>>>run java in /usr/j2sdk1.4.2_03
>>>>050612 092326 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index)
>>>>050612 092326 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index)
>>>>050612 092326 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index)
>>>>050612 092326 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index)
>>>>050612 092326 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index)
>>>>050612 092327 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index)
>>>>050612 092327 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index)
>>>>050612 092327 Clearing old deletions in
>>>>/www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index)
>>>>050612 092327 Reading url hashes...
>>>>050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
>>>>050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
>>>>050612 092330 Sorting url hashes...
>>>>050612 092331 Deleting url duplicates...
>>>>050612 092331 Deleted 0 url duplicates.
>>>>050612 092331 Reading content hashes...
>>>>050612 092331 Sorting content hashes...
>>>>050612 092331 Deleting content duplicates...
>>>>050612 092331 Deleted 309 content duplicates.
>>>>050612 092332 Duplicate deletion complete locally.  Now returning to NFS...
>>>>050612 092332 DeleteDuplicates complete
>>>>$
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>





Re: unable to remove duplicates

Posted by J S <ve...@hotmail.com>.
Thanks Piotr,

I finally found the problem. The two URLs were the same except for one letter 
in the path, which was capitalised!

I didn't think URLs were case-sensitive, although maybe the paths are? Can 
Nutch be configured to ignore the case in a URL?

JS.
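
On the question of ignoring case: one approach is to normalize URLs to a
canonical case before they are fetched and indexed. Whether this nightly build
exposes a pluggable URL normalizer, and under which property or interface, is
something to check in conf/nutch-default.xml; the class below is only a
standalone sketch of the transformation itself (with a made-up example URL),
not a drop-in Nutch plugin.

// Standalone sketch only: lower-case the scheme, host and path of a URL while
// leaving the query string untouched. Not a Nutch plugin; it just shows the
// transformation you would hook into whatever URL-normalization point your
// build exposes. Only safe if the server treats /Docs and /docs identically.
import java.net.URL;

public class LowerCasePathSketch {

    public static String normalize(String url) throws Exception {
        URL u = new URL(url);
        String file  = u.getFile();                        // path plus optional "?query"
        int q        = file.indexOf('?');
        String path  = (q < 0) ? file : file.substring(0, q);
        String query = (q < 0) ? ""   : file.substring(q); // keep query case as-is
        String port  = (u.getPort() == -1) ? "" : ":" + u.getPort();
        return u.getProtocol().toLowerCase() + "://" + u.getHost().toLowerCase()
               + port + path.toLowerCase() + query;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(normalize("http://wiki.example.com/Docs/Index.HTML?Id=3"));
        // prints: http://wiki.example.com/docs/index.html?Id=3
    }
}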

>Hello,
>I have no idea why it is not working properly.
>You can look at the contents of the merged Lucene index with Luke 
>(http://www.getopt.org/luke/) to verify its content. (Judging from the logs, I 
>assume you are using the intranet crawling method from the tutorial.)
>You can also print all URLs in the fetchlists of your segments using the
>"nutch fetchlist" command to see if they are present there.
>
>It might be good to look at the segment creation process as
>some numbers in your log look strange:
>050613 123307 DONE indexing segment 20050613123024: total 3 records
>050613 123307 DONE indexing segment 20050613123053: total 0 records
>
>So it looks like your segments are really small - I do not think it is a 
>problem for deduplication, but it looks suspicious.
>How do you create your segments? What parameters do you use?
>
>Regards
>Piotr
>
>J S wrote:
>>Hi Piotr,
>>
>>Thanks for replying. I understand the terminology better now! I was 
>>referring to url duplicates. I've rerun the crawl and I'm still getting 
>>them:
>>
>>050613 123307 * Optimizing index...
>>050613 123307 * Moving index to NFS if needed...
>>050613 123307 DONE indexing segment 20050613123024: total 3 records in 
>>0.242 s (Infinity rec/s).
>>050613 123307 done indexing
>>050613 123307 indexing segment: 
>>/www/nutch-nightly/planetbp.tmp/segments/20050613123053
>>050613 123307 * Opening segment 20050613123053
>>050613 123307 * Indexing segment 20050613123053
>>050613 123307 * Optimizing index...
>>050613 123307 * Moving index to NFS if needed...
>>050613 123307 DONE indexing segment 20050613123053: total 0 records in 
>>0.021 s (NaN rec/s).
>>050613 123307 done indexing
>>050613 123307 Reading url hashes...
>>050613 123308 Sorting url hashes...
>>050613 123308 Deleting url duplicates...
>>050613 123308 Deleted 0 url duplicates.
>>050613 123308 Reading content hashes...
>>050613 123309 Sorting content hashes...
>>050613 123309 Deleting content duplicates...
>>050613 123309 Deleted 267 content duplicates.
>>050613 123310 Duplicate deletion complete locally.  Now returning to 
>>NFS...
>>050613 123310 DeleteDuplicates complete
>>050613 123310 Merging segment indexes...
>>
>>
>>>
>>>Hello,
>>>It looks like the deduplication process removed some (309) duplicates. They 
>>>were content duplicates - different URLs but identical page content. 
>>>There were no URL duplicates (every URL was different). So what do you really 
>>>mean by "duplicate pages" that are returned by your search?
>>>Do they have identical URLs or identical content?
>>>One more thing to remember is that Nutch deduplication currently removes 
>>>pages that have identical content - even the smallest difference in the page 
>>>source (including the URL, comments, etc.) will be treated as a difference.
>>>So please verify that the pages you see as duplicates are really identical.
>>>Regards
>>>Piotr
>>>
>>>
>>>
>>>J S wrote:
>>>
>>>>Hi,
>>>>
>>>>Some of my searches return duplicate pages, so I wanted to remove these. 
>>>>I'm not exactly sure how to do this but tried the command below, and 
>>>>restarted Tomcat, but still got the same results.
>>>>
>>>>I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if I'm 
>>>>doing something wrong here?
>>>>
>>>>Thanks.
>>>>
>>>>$ nutch dedup -local -workingdir /www/nutch/planetbp 
>>>>/www/nutch/planetbp/segments
>>>>
>>>>run java in /usr/j2sdk1.4.2_03
>>>>050612 092326 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index)
>>>>
>>>>
>>>>050612 092326 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index)
>>>>
>>>>
>>>>050612 092326 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index)
>>>>
>>>>
>>>>050612 092326 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index)
>>>>
>>>>
>>>>050612 092326 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index)
>>>>
>>>>
>>>>050612 092327 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index)
>>>>
>>>>
>>>>050612 092327 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index)
>>>>
>>>>
>>>>050612 092327 Clearing old deletions in 
>>>>/www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index)
>>>>
>>>>
>>>>050612 092327 Reading url hashes...
>>>>050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
>>>>050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
>>>>050612 092330 Sorting url hashes...
>>>>050612 092331 Deleting url duplicates...
>>>>050612 092331 Deleted 0 url duplicates.
>>>>050612 092331 Reading content hashes...
>>>>050612 092331 Sorting content hashes...
>>>>050612 092331 Deleting content duplicates...
>>>>050612 092331 Deleted 309 content duplicates.
>>>>050612 092332 Duplicate deletion complete locally.  Now returning to 
>>>>NFS...
>>>>050612 092332 DeleteDuplicates complete
>>>>$
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



Re: unable to remove duplicates

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
I have no idea why it is not working properly.
You can look at the contents of the merged Lucene index with Luke 
(http://www.getopt.org/luke/) to verify its content. (Judging from the logs, I 
assume you are using the intranet crawling method from the tutorial.)
You can also print all URLs in the fetchlists of your segments using the
"nutch fetchlist" command to see if they are present there.

It might be good to look at the segment creation process as
some numbers in your log look strange:
050613 123307 DONE indexing segment 20050613123024: total 3 records
050613 123307 DONE indexing segment 20050613123053: total 0 records

So it looks like your segments are really small - I do not think it is a 
problem for deduplication, but it looks suspicious.
How do you create your segments? What parameters do you use?

Regards
Piotr
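
If Luke is inconvenient, the same check can be scripted: dump the stored URL of
every live document in the merged index and look for near-identical entries.
This is a sketch against the Lucene 1.4-era API bundled with Nutch nightlies of
that period, and it assumes the stored field is named "url" - verify the field
name in Luke first.

// Prints the stored "url" field of every non-deleted document in a Lucene
// index, e.g. the merged index under your -workingdir. The field name "url" is
// an assumption to verify for your build.
import org.apache.lucene.index.IndexReader;

public class DumpIndexedUrls {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);   // path to the index directory
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;        // skip docs removed by dedup
                System.out.println(reader.document(i).get("url"));
            }
        } finally {
            reader.close();
        }
    }
}

The output can then be piped through, for example, sort | uniq -d (or
sort -f | uniq -di to also fold case) to spot suspicious pairs.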

J S wrote:
> Hi Piotr,
> 
> Thanks for replying. I understand the terminology better now! I was 
> referring to url duplicates. I've rerun the crawl and I'm still getting 
> them:
> 
> 050613 123307 * Optimizing index...
> 050613 123307 * Moving index to NFS if needed...
> 050613 123307 DONE indexing segment 20050613123024: total 3 records in 
> 0.242 s (Infinity rec/s).
> 050613 123307 done indexing
> 050613 123307 indexing segment: 
> /www/nutch-nightly/planetbp.tmp/segments/20050613123053
> 050613 123307 * Opening segment 20050613123053
> 050613 123307 * Indexing segment 20050613123053
> 050613 123307 * Optimizing index...
> 050613 123307 * Moving index to NFS if needed...
> 050613 123307 DONE indexing segment 20050613123053: total 0 records in 
> 0.021 s (NaN rec/s).
> 050613 123307 done indexing
> 050613 123307 Reading url hashes...
> 050613 123308 Sorting url hashes...
> 050613 123308 Deleting url duplicates...
> 050613 123308 Deleted 0 url duplicates.
> 050613 123308 Reading content hashes...
> 050613 123309 Sorting content hashes...
> 050613 123309 Deleting content duplicates...
> 050613 123309 Deleted 267 content duplicates.
> 050613 123310 Duplicate deletion complete locally.  Now returning to NFS...
> 050613 123310 DeleteDuplicates complete
> 050613 123310 Merging segment indexes...
> 
> 
>>
>> Hello,
>> It looks like the deduplication process removed some (309) duplicates. 
>> They were content duplicates - different URLs but identical page 
>> content. There were no URL duplicates (every URL was different). So 
>> what do you really mean by "duplicate pages" that are returned by your search?
>> Do they have identical URLs or identical content?
>> One more thing to remember is that Nutch deduplication currently 
>> removes pages that have identical content - even the smallest difference 
>> in the page source (including the URL, comments, etc.) will be treated as a 
>> difference.
>> So please verify that the pages you see as duplicates are really identical.
>> Regards
>> Piotr
>>
>>
>>
>> J S wrote:
>>
>>> Hi,
>>>
>>> Some of my searches return duplicate pages, so I wanted to remove 
>>> these. I'm not exactly sure how to do this but tried the command 
>>> below, and restarted Tomcat, but still got the same results.
>>>
>>> I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if 
>>> I'm doing something wrong here?
>>>
>>> Thanks.
>>>
>>> $ nutch dedup -local -workingdir /www/nutch/planetbp 
>>> /www/nutch/planetbp/segments
>>>
>>> run java in /usr/j2sdk1.4.2_03
>>> 050612 092326 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index) 
>>>
>>>
>>> 050612 092326 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index) 
>>>
>>>
>>> 050612 092326 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index) 
>>>
>>>
>>> 050612 092326 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index) 
>>>
>>>
>>> 050612 092326 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index) 
>>>
>>>
>>> 050612 092327 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index) 
>>>
>>>
>>> 050612 092327 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index) 
>>>
>>>
>>> 050612 092327 Clearing old deletions in 
>>> /www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index) 
>>>
>>>
>>> 050612 092327 Reading url hashes...
>>> 050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
>>> 050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
>>> 050612 092330 Sorting url hashes...
>>> 050612 092331 Deleting url duplicates...
>>> 050612 092331 Deleted 0 url duplicates.
>>> 050612 092331 Reading content hashes...
>>> 050612 092331 Sorting content hashes...
>>> 050612 092331 Deleting content duplicates...
>>> 050612 092331 Deleted 309 content duplicates.
>>> 050612 092332 Duplicate deletion complete locally.  Now returning to 
>>> NFS...
>>> 050612 092332 DeleteDuplicates complete
>>> $
>>>
>>>
>>>
>>
> 
> 
> 


Re: unable to remove duplicates

Posted by J S <ve...@hotmail.com>.
Hi Piotr,

Thanks for replying. I understand the terminology better now! I was 
referring to url duplicates. I've rerun the crawl and I'm still getting 
them:

050613 123307 * Optimizing index...
050613 123307 * Moving index to NFS if needed...
050613 123307 DONE indexing segment 20050613123024: total 3 records in 0.242 
s (Infinity rec/s).
050613 123307 done indexing
050613 123307 indexing segment: 
/www/nutch-nightly/planetbp.tmp/segments/20050613123053
050613 123307 * Opening segment 20050613123053
050613 123307 * Indexing segment 20050613123053
050613 123307 * Optimizing index...
050613 123307 * Moving index to NFS if needed...
050613 123307 DONE indexing segment 20050613123053: total 0 records in 0.021 
s (NaN rec/s).
050613 123307 done indexing
050613 123307 Reading url hashes...
050613 123308 Sorting url hashes...
050613 123308 Deleting url duplicates...
050613 123308 Deleted 0 url duplicates.
050613 123308 Reading content hashes...
050613 123309 Sorting content hashes...
050613 123309 Deleting content duplicates...
050613 123309 Deleted 267 content duplicates.
050613 123310 Duplicate deletion complete locally.  Now returning to NFS...
050613 123310 DeleteDuplicates complete
050613 123310 Merging segment indexes...


>
>Hello,
>It looks like the deduplication process removed some (309) duplicates. They 
>were content duplicates - different URLs but identical page content. 
>There were no URL duplicates (every URL was different). So what do you really 
>mean by "duplicate pages" that are returned by your search?
>Do they have identical URLs or identical content?
>One more thing to remember is that Nutch deduplication currently removes 
>pages that have identical content - even the smallest difference in the page 
>source (including the URL, comments, etc.) will be treated as a difference.
>So please verify that the pages you see as duplicates are really identical.
>Regards
>Piotr
>
>
>
>J S wrote:
>>Hi,
>>
>>Some of my searches return duplicate pages, so I wanted to remove these. 
>>I'm not exactly sure how to do this but tried the command below, and 
>>restarted Tomcat, but still got the same results.
>>
>>I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if I'm 
>>doing something wrong here?
>>
>>Thanks.
>>
>>$ nutch dedup -local -workingdir /www/nutch/planetbp 
>>/www/nutch/planetbp/segments
>>
>>run java in /usr/j2sdk1.4.2_03
>>050612 092326 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index)
>>
>>050612 092326 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index)
>>
>>050612 092326 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index)
>>
>>050612 092326 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index)
>>
>>050612 092326 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index)
>>
>>050612 092327 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index)
>>
>>050612 092327 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index)
>>
>>050612 092327 Clearing old deletions in 
>>/www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index)
>>
>>050612 092327 Reading url hashes...
>>050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
>>050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
>>050612 092330 Sorting url hashes...
>>050612 092331 Deleting url duplicates...
>>050612 092331 Deleted 0 url duplicates.
>>050612 092331 Reading content hashes...
>>050612 092331 Sorting content hashes...
>>050612 092331 Deleting content duplicates...
>>050612 092331 Deleted 309 content duplicates.
>>050612 092332 Duplicate deletion complete locally.  Now returning to 
>>NFS...
>>050612 092332 DeleteDuplicates complete
>>$
>>
>>
>>
>



Re: unable to remove duplicates

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
It looks like the deduplication process removed some (309) duplicates. They 
were content duplicates - different URLs but identical page content. 
There were no URL duplicates (every URL was different). So what do you 
really mean by "duplicate pages" that are returned by your search?
Do they have identical URLs or identical content?
One more thing to remember is that Nutch deduplication currently removes 
pages that have identical content - even the smallest difference in the page 
source (including the URL, comments, etc.) will be treated as a difference.
So please verify that the pages you see as duplicates are really identical.
Regards
Piotr
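
To see how strict that is, the small example below (with made-up page bodies)
hashes two pages that differ in a single character; the digests are completely
different, so a hash-based content check keeps both pages.

// Two page bodies differing in a single character hash to completely different
// values, so a hash-based content check treats them as distinct documents.
import java.security.MessageDigest;

public class ContentHashExample {

    static String hex(byte[] d) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < d.length; i++) {
            String h = Integer.toHexString(d[i] & 0xff);
            if (h.length() == 1) sb.append('0');
            sb.append(h);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        String a = "<html><body>status report 2005-06-12</body></html>";
        String b = "<html><body>status report 2005-06-13</body></html>"; // one character differs
        System.out.println(hex(md5.digest(a.getBytes("UTF-8"))));
        System.out.println(hex(md5.digest(b.getBytes("UTF-8"))));
    }
}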



J S wrote:
> Hi,
> 
> Some of my searches return duplicate pages, so I wanted to remove these. 
> I'm not exactly sure how to do this but tried the command below, and 
> restarted Tomcat, but still got the same results.
> 
> I'm using the Nutch-Nightly from about 2 weeks ago. Just wondered if I'm 
> doing something wrong here?
> 
> Thanks.
> 
> $ nutch dedup -local -workingdir /www/nutch/planetbp 
> /www/nutch/planetbp/segments
> 
> run java in /usr/j2sdk1.4.2_03
> 050612 092326 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611171518/index(/www/nutch/planetbp/segments/20050611171518/index) 
> 
> 050612 092326 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611171700/index(/www/nutch/planetbp/segments/20050611171700/index) 
> 
> 050612 092326 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611173224/index(/www/nutch/planetbp/segments/20050611173224/index) 
> 
> 050612 092326 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611181430/index(/www/nutch/planetbp/segments/20050611181430/index) 
> 
> 050612 092326 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611184455/index(/www/nutch/planetbp/segments/20050611184455/index) 
> 
> 050612 092327 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611185714/index(/www/nutch/planetbp/segments/20050611185714/index) 
> 
> 050612 092327 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611190051/index(/www/nutch/planetbp/segments/20050611190051/index) 
> 
> 050612 092327 Clearing old deletions in 
> /www/nutch/planetbp/segments/20050611190155/index(/www/nutch/planetbp/segments/20050611190155/index) 
> 
> 050612 092327 Reading url hashes...
> 050612 092328 parsing file:/www/nutch-nightly/conf/nutch-default.xml
> 050612 092329 parsing file:/www/nutch-nightly/conf/nutch-site.xml
> 050612 092330 Sorting url hashes...
> 050612 092331 Deleting url duplicates...
> 050612 092331 Deleted 0 url duplicates.
> 050612 092331 Reading content hashes...
> 050612 092331 Sorting content hashes...
> 050612 092331 Deleting content duplicates...
> 050612 092331 Deleted 309 content duplicates.
> 050612 092332 Duplicate deletion complete locally.  Now returning to NFS...
> 050612 092332 DeleteDuplicates complete
> $
> 
> 
>