Posted to user@nutch.apache.org by sdeck <sc...@gmail.com> on 2006/12/20 06:44:19 UTC

Need help with deleteduplicates

Hello,
  I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add a delete-duplicates step based on a similarity algorithm, as opposed
to the hash method that is currently in there.
I would have to say I am pretty lost as to how the DeleteDuplicates class
is working.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, when I do return something, where do I
implement the functionality to say "yes, these are dupes, so remove the
first one"?

Can anyone help out?
Thanks,
S


Re: Need help with deleteduplicates

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Some of it is happening behind the scenes.  A hash of the text of the 
index document is created when the documents are read.  The MapReduce 
process uses the InputFormat static inner class to read documents into a 
Map class.  Here the Map class is not specified, so an IdentityMapper is 
used, which basically just passes content through.  In this 
case it is a url->document mapping because that was the input from the 
RecordReader in the InputFormat.

The values passed to the UrlReducer would have all documents with the 
exact same url.  This class then keeps only the most recent version of 
each url and discards the others, storing the output as a hash->document 
structure.  The hash is the MD5 hash of the contents of the web page or 
document.  This is then used as input for a second MapReduce job.  Again 
no Mapper class is specified, so an IdentityMapper just passes the content 
through.  The values passed to the HashReducer would be all documents with 
the exact same content (i.e. hash).  It keeps the one with the highest 
score (created at indexing time) and discards the others.  The output from 
the HashReducer is then used as input for a third MapReduce job that 
actually deletes from the index files.
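
To make the phase 2 rule concrete, the reduce method of a HashReducer-style 
class boils down to something like the sketch below.  This is only a 
paraphrase of the idea, not the actual Nutch source: it assumes the 
pre-generics mapred Reducer signature of that era, that IndexDoc exposes its 
score and keep flag, and copyOf() is a hypothetical helper that clones the 
Writable (needed because Hadoop reuses the value object between iterations).

// Sketch of "keep the highest-scoring document per content hash".
// All IndexDocs with the same MD5 hash arrive in one reduce call.
public void reduce(WritableComparable hash, Iterator values,
                   OutputCollector output, Reporter reporter)
    throws IOException {
  IndexDoc best = null;
  while (values.hasNext()) {
    IndexDoc doc = copyOf((IndexDoc) values.next());  // hypothetical clone helper
    if (best == null || doc.score > best.score) {
      if (best != null) {
        best.keep = false;               // beaten: mark for deletion in phase 3
        output.collect(hash, best);
      }
      best = doc;
    } else {
      doc.keep = false;                  // lower score: mark for deletion
      output.collect(hash, doc);
    }
  }
  if (best != null) {
    best.keep = true;                    // the highest-scoring copy survives
    output.collect(hash, best);
  }
}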

Forget about the compareTo because it wouldn't be a straight comparison 
like object compareTo object.  What you would want to do is change the 
RecordReader to analyze the document in your specific way and add another 
field to the IndexDoc which is your numeric representation of the 
similarity comparison.  Then in the UrlReducer you would want to collect 
your numeric as the key.  Then in the HashReducer make your comparison of 
which document to keep and which to discard based on the similarity 
numeric.  Remember that similar documents would need to return the same 
numeric so that they get passed to the HashReducer class as a single 
set of values.
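
One way to see why that last constraint matters: reduce groups values by 
exact key equality, so a 90% similarity is never enough on its own; two 
near-duplicate pages have to produce the very same key value or they will 
never meet inside one HashReducer call.  The toy class below (all names 
hypothetical, plain JDK only, nothing from Nutch) shows the kind of 
normalize-then-hash fingerprint that could serve as such a key; a real 
implementation would use shingling, SimHash or something similar.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical fingerprint: lowercase, strip punctuation, collapse
// whitespace, then MD5 the result.  Pages that differ only in whitespace
// or punctuation collapse to one key and therefore one reduce group.
public class SimilarityKeyDemo {

  static String fingerprint(String text) throws NoSuchAlgorithmException {
    String normalized = text.toLowerCase()
        .replaceAll("[^a-z0-9 ]", " ")
        .replaceAll("\\s+", " ")
        .trim();
    byte[] digest = MessageDigest.getInstance("MD5").digest(normalized.getBytes());
    StringBuffer hex = new StringBuffer();
    for (int i = 0; i < digest.length; i++) {
      hex.append(Integer.toHexString((digest[i] & 0xff) | 0x100).substring(1));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    String a = fingerprint("Nutch is a web crawler built on Hadoop.");
    String b = fingerprint("Nutch   is a web crawler, built on Hadoop!");
    // Same fingerprint -> same key -> same reduce group, so the two
    // documents can actually be compared; different fingerprints never meet.
    System.out.println(a.equals(b));   // prints "true" for this toy normalizer
  }
}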

Dennis

sdeck wrote:
> That sort of gets me there in understanding what is going on.
> Still not all the way though.
> So, let's look at the trunk of deleteduplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where url == url, and if so, delete that doc from
> the index.
> So, I am not sure where I would put my code.
>   
> I could possibly modify the hash content reducer.  Basically, here is the
> algorithm approach
>
> start at 1
> loop through 2-N and take the text of 1 and compare to the text of 2, 3,
> 4,...N
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I don't
> really understand where the score is coming from and where the comparison is
> really taking place.
>   
The score is the calculated score of the indexed document.  This score 
is partially created at the time the page was indexed.
> I see this:
> public int compareTo(Object o) {
>       IndexDoc that = (IndexDoc)o;
>       if (this.keep != that.keep) {
>         return this.keep ? 1 : -1; 
>       } else if (!this.hash.equals(that.hash)) {       // order first by
> hash
>         return this.hash.compareTo(that.hash);
> ...
>
>
> So, is that where I would place my similarity score and return that value
> there?
>
>
>
>
> Dennis Kubes wrote:
>   
>> If I am understanding what you are asking, in the getRecordReader method 
>> of the InputFormat inner class in DeleteDuplicates it gets the hash 
>> score from the document.  You could put your algorithm there and return 
>> some type of numeric value based on analysis of the document fields.  
>> You would need to write a different class for HashScore and return it 
>> from the record reader.  You would probably want to keep the IndexDoc 
>> being written out as the value in dedup phase 1 (in the job config) but 
>> change the key to your HashScore replacement class.  You would need to 
>> change HashPartitioner to partition according to your new key numeric.  
>> The HashReducer would also need to be changed to collect only the ones 
>> you want based on your new key numeric. 
>>
>> The dedup phase 2 deletes by url so if you want to remove exact urls 
>> then you would leave it in otherwise you might want to take the job 
>> config section for phase 2 out.
>>
>> Hope this helps.
>>
>> Dennis
>>
>> sdeck wrote:
>>     
>>> Hello,
>>>   I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>>> I want to add a delete-duplicates step based on a similarity algorithm, as
>>> opposed
>>> to the hash method that is currently in there.
>>> I would have to say I am pretty lost as to how the DeleteDuplicates
>>> class
>>> is working.
>>> I would guess that I need to implement a compareTo method, but I am not
>>> really sure what to return. Also, when I do return something, where do I
>>> implement the functionality to say "yes, these are dupes, so remove the
>>> first one"?
>>>
>>> Can anyone help out?
>>> Thanks,
>>> S
>>>   
>>>       
>>     
>
>   

Re: Need help with deleteduplicates

Posted by Doğacan Güney <do...@agmlab.com>.
sdeck wrote:
> That sort of gets me there in understanding what is going on.
> Still not all the way though.
> So, let's look at the trunk of deleteduplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where url == url, and if so, delete that doc from
> the index.
> So, I am not sure where I would put my code.
>
> I could possibly modify the hash content reducer.  Basically, here is the
> algorithm approach
>
> start at 1
> loop through 2-N and take the text of 1 and compare to the text of 2, 3,
> 4,...N
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I don't
> really understand where the score is coming from and where the comparison is
> really taking place.
> I see this:
> public int compareTo(Object o) {
>       IndexDoc that = (IndexDoc)o;
>       if (this.keep != that.keep) {
>         return this.keep ? 1 : -1; 
>       } else if (!this.hash.equals(that.hash)) {       // order first by
> hash
>         return this.hash.compareTo(that.hash);
> ...
>
>
> So, is that where I would place my similarity score and return that value
> there?
>
>
>   

AFAIK DeleteDuplicates works like this:
IndexDoc is a representation of the actual document in your index (IndexDoc 
keeps, among other things, the document's url, boost and digest). It is also 
Writable and Comparable, which means that it can be used both as a key 
and a value in MapReduce.
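
As a stripped-down illustration of what that buys you (this is not the real 
IndexDoc, just a minimal example against the plain Hadoop io classes): 
write/readFields let Hadoop serialize the object between map and reduce, and 
compareTo lets it be sorted and grouped when it is used as a key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Minimal Writable + Comparable example, not the actual Nutch IndexDoc.
public class SimpleIndexDoc implements WritableComparable {
  String url = "";
  float boost;

  public void write(DataOutput out) throws IOException {    // serialize
    out.writeUTF(url);
    out.writeFloat(boost);
  }

  public void readFields(DataInput in) throws IOException { // deserialize
    url = in.readUTF();
    boost = in.readFloat();
  }

  public int compareTo(Object o) {                           // sort/group order
    SimpleIndexDoc that = (SimpleIndexDoc) o;
    return this.url.compareTo(that.url);   // e.g. order by url, as in phase 1
  }
  // a real key class would also override equals() and hashCode()
}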

In the first phase of dedup, the job reads the indexes and outputs 
<IndexDoc.url, IndexDoc> pairs. The job's map is identity, so in reduce, 
IndexDocs with the same url are grouped under the same reduce call. Reduce 
outputs these, marking older versions of the same url to be deleted. (So if 
you fetched the same url more than once, only the newest is kept.)
 
In phase 2, the job reads this output and then outputs <IndexDoc.hash, IndexDoc> 
pairs. Again the map is identity and reduce marks the relevant ones to be 
deleted. (So if you fetched the same document under different urls, only 
the one with the highest boost or the shortest url is kept.)

Phase 3 reads this output then deletes all marked documents.

I think that your version will be somewhat difficult to implement, 
because MapReduce works best on inputs that can be processed 
independently of each other.

Hope that clears things a bit.

--
Dogacan Guney

Re: Need help with deleteduplicates

Posted by sdeck <sc...@gmail.com>.
That sort of gets me there in understanding what is going on.
Still not all the way though.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java

Nowhere in there do I see where url == url, and if so, delete that doc from
the index.
So, I am not sure where I would put my code.

I could possibly modify the hash content reducer.  Basically, here is the
algorithm approach

start at 1
loop through 2-N and take the text of 1 and compare to the text of 2, 3,
4,...N
If the similarity score is > ## then delete that document.

The way I understand the hash reducer, that is what it is doing, but I don't
really understand where the score is coming from and where the comparison is
really taking place.
I see this:
public int compareTo(Object o) {
      IndexDoc that = (IndexDoc)o;
      if (this.keep != that.keep) {
        return this.keep ? 1 : -1; 
      } else if (!this.hash.equals(that.hash)) {       // order first by
hash
        return this.hash.compareTo(that.hash);
...


So, is that where I would place my similarity score and return that value
there?




Dennis Kubes wrote:
> 
> If I am understanding what you are asking, in the getRecordReader method 
> of the InputFormat inner class in DeleteDuplicates it gets the hash 
> score from the document.  You could put your algorithm there and return 
> some type of numeric value based on analysis of the document fields.  
> You would need to write a different class for HashScore and return it 
> from the record reader.  You would probably want to keep the IndexDoc 
> being written out as the value in dedup phase 1 (in the job config) but 
> change the key to your HashScore replacement class.  You would need to 
> change HashPartitioner to partition according to your new key numeric.  
> The HashReducer would also need to be changed to collect only the ones 
> you want based on your new key numeric. 
> 
> The dedup phase 2 deletes by url so if you want to remove exact urls 
> then you would leave it in otherwise you might want to take the job 
> config section for phase 2 out.
> 
> Hope this helps.
> 
> Dennis
> 
> sdeck wrote:
>> Hello,
>>   I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>> I want to add a delete-duplicates step based on a similarity algorithm, as
>> opposed
>> to the hash method that is currently in there.
>> I would have to say I am pretty lost as to how the DeleteDuplicates
>> class
>> is working.
>> I would guess that I need to implement a compareTo method, but I am not
>> really sure what to return. Also, when I do return something, where do I
>> implement the functionality to say "yes, these are dupes, so remove the
>> first one"?
>>
>> Can anyone help out?
>> Thanks,
>> S
>>   
> 
> 



Re: Need help with deleteduplicates

Posted by Dennis Kubes <nu...@dragonflymc.com>.
If I am understanding what you are asking, in the getRecordReader method 
of the InputFormat inner class in DeleteDuplicates it gets the hash 
score from the document.  You could put your algorithm there and return 
some type of numeric value based on analysis of the document fields.  
You would need to write a different class for HashScore and return it 
from the record reader.  You would probably want to keep the IndexDoc 
being written out as the value in dedup phase 1 (in the job config) but 
change the key to your HashScore replacement class.  You would need to 
change HashPartitioner to partition according to your new key numeric.  
The HashReducer would also need to be changed to collect only the ones 
you want based on your new key numeric. 

The dedup phase 2 deletes by url so if you want to remove exact urls 
then you would leave it in otherwise you might want to take the job 
config section for phase 2 out.
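
For the HashPartitioner change, the essential point is that every record with 
the same key numeric has to land on the same reduce task.  A minimal sketch, 
assuming the pre-generics mapred Partitioner interface from that Hadoop 
version; the key class here would be whatever replaces HashScore, and it 
would need a hashCode() based on the similarity numeric:

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes records by the new key's hashCode so equal "similarity numerics"
// always reach the same reducer, and therefore the same set of reduce values.
public class SimilarityPartitioner implements Partitioner {

  public void configure(JobConf job) {
    // no configuration needed for this sketch
  }

  public int getPartition(WritableComparable key, Writable value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}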

Hope this helps.

Dennis

sdeck wrote:
> Hello,
>   I am running Nutch 0.8 against Hadoop 0.4, just for reference.
> I want to add a delete-duplicates step based on a similarity algorithm, as opposed
> to the hash method that is currently in there.
> I would have to say I am pretty lost as to how the DeleteDuplicates class
> is working.
> I would guess that I need to implement a compareTo method, but I am not
> really sure what to return. Also, when I do return something, where do I
> implement the functionality to say "yes, these are dupes, so remove the
> first one"?
>
> Can anyone help out?
> Thanks,
> S
>