Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/10/08 16:45:56 UTC

Dedup won't actually dedup

Any idea why dedup won't actually remove the items?

Thoughts?

*** First Pass ************************************

051008 144843 Clearing old deletions in
051008 144843 Reading url hashes...
051008 144902 Sorting url hashes...
051008 144905 Deleting url duplicates...
051008 144907 Deleted 4082 url duplicates.
051008 144907 Reading content hashes...
051008 144918 Sorting content hashes...
051008 144923 Deleting content duplicates...
051008 144925 Deleted 228430 content duplicates.
051008 144925 Duplicate deletion complete locally.  Now returning to NFS...
051008 144925 DeleteDuplicates complete

*** Second Pass ***********************************

051008 144932 Reading url hashes...
051008 144949 Sorting url hashes...
051008 144953 Deleting url duplicates...
051008 144955 Deleted 4082 url duplicates.
051008 144955 Reading content hashes...
051008 145005 Sorting content hashes...
051008 145011 Deleting content duplicates...
051008 145012 Deleted 228430 content duplicates.
051008 145012 Duplicate deletion complete locally.  Now returning to NFS...
051008 145012 DeleteDuplicates complete



Re: Dedup won't actually dedup

Posted by quo vadis <qu...@webmail.co.za>.
Is it possible to get the marked records deleted?

On Sun, 09 Oct 2005 11:05:31 +0200
 Piotr Kosiorowski <pk...@gmail.com> wrote:
>Hello Jon,
>As far as I remember, dedup only marks the records as
>deleted without physically removing them.
>And the first action of dedup is to clear old deletions (as
>written in the log). So if you repeat it you will get the
>same number of deleted records each time.
>Regards
>Piotr


Re: Dedup won't actually dedup

Posted by Michael Ji <fj...@yahoo.com>.
I checked the code of DeleteDuplicates.java. It has a
function called "deleteDuplicates()" that contains the line

"readers[indexedDoc.index].delete(indexedDoc.doc);" // delete it

I believe the physical deletion happens when the readers
are closed. Is that right?

Michael Ji,
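
For what it is worth, the difference between flagging a document and
physically dropping it can be sketched with a toy model. ToySegment,
markDeleted and compact are hypothetical names, not Lucene's API, and
the assumption that space is only reclaimed when the segment is
rewritten is mine, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: class and method names are hypothetical, not Lucene's API.
class ToySegment {
    private final List<String> docs;
    private final BitSet deleted = new BitSet(); // one "deleted" flag per doc

    ToySegment(List<String> docs) {
        this.docs = new ArrayList<>(docs);
    }

    // Loosely analogous to the delete(...) call quoted above: sets a flag only.
    void markDeleted(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        deleted.set(docId);
    }

    // Number of documents physically stored, flagged or not.
    int storedCount() {
        return docs.size();
    }

    // Number of documents a search would still see.
    int liveCount() {
        return docs.size() - deleted.cardinality();
    }

    // Assumed reclamation step: rewrite the segment, copying unflagged docs only.
    ToySegment compact() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        return new ToySegment(live);
    }

    public static void main(String[] args) {
        ToySegment seg = new ToySegment(List.of("a", "b", "c"));
        seg.markDeleted(1);
        System.out.println("stored=" + seg.storedCount() + " live=" + seg.liveCount());
        ToySegment merged = seg.compact();
        System.out.println("stored=" + merged.storedCount() + " live=" + merged.liveCount());
    }
}
```

In this model the stored count only drops after compact(); whether
closing the real readers does anything comparable is exactly the open
question here.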

--- Piotr Kosiorowski <pk...@gmail.com> wrote:

> Hello Jon,
> As far as I remember, dedup only marks the records as
> deleted without physically removing them.
> And the first action of dedup is to clear old deletions
> (as written in the log). So if you repeat it you will get
> the same number of deleted records each time.
> Regards
> Piotr

Re: Dedup won't actually dedup

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello Jon,
As far as I remember, dedup only marks the records as deleted without
physically removing them.
And the first action of dedup is to clear old deletions (as written in
the log). So if you repeat it you will get the same number of deleted
records each time.
Regards
Piotr
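
A toy Java model of the behaviour described above may make it easier to
see why both passes report identical counts. This is only a sketch:
ToyIndex and its methods are made up for illustration and are not Nutch
or Lucene classes, and the real tool compares url and content hashes
rather than whole strings.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: not Nutch or Lucene code. All names are hypothetical.
class ToyIndex {
    private final List<String> docs = new ArrayList<>(); // stored documents
    private final BitSet deleted = new BitSet();         // "marked as deleted" flags

    void add(String doc) {
        docs.add(doc);
    }

    // First action of a pass in this model: forget all earlier deletion marks.
    void clearDeletions() {
        deleted.clear();
    }

    // Flag every later copy of an already-seen document; nothing is removed.
    int markDuplicates() {
        Set<String> seen = new HashSet<>();
        int marked = 0;
        for (int i = 0; i < docs.size(); i++) {
            if (!seen.add(docs.get(i))) { // add() returns false for a repeat
                deleted.set(i);
                marked++;
            }
        }
        return marked;
    }

    // One full pass: clear old marks, then mark duplicates again.
    int dedup() {
        clearDeletions();
        return markDuplicates();
    }

    // Documents physically stored, flagged or not.
    int storedCount() {
        return docs.size();
    }

    // Documents that are not flagged as deleted.
    int liveCount() {
        return docs.size() - deleted.cardinality();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        for (String d : new String[] {"a", "b", "a", "c", "b", "a"}) {
            index.add(d);
        }
        System.out.println("first pass marked:  " + index.dedup());
        System.out.println("second pass marked: " + index.dedup());
        System.out.println("still stored:       " + index.storedCount());
    }
}
```

Because each pass starts by clearing the marks and the stored documents
never change, every run finds the same duplicates and reports the same
number.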
