Posted to user@nutch.apache.org by sdeck <sc...@gmail.com> on 2006/12/20 06:44:19 UTC
Need help with deleteduplicates
Hello,
I am running Nutch 0.8 against Hadoop 0.4, just for reference.
I want to add duplicate deletion based on a similarity algorithm, as opposed
to the hash method that is currently in there.
I have to say I am pretty lost as to how the DeleteDuplicates class
works.
I would guess that I need to implement a compareTo method, but I am not
really sure what to return. Also, when I do return something, where do I
implement the functionality to say "yes, these are dupes, so remove the
first one"?
Can anyone help out?
Thanks,
S
--
View this message in context: http://www.nabble.com/Need-help-with-deleteduplicates-tf2858127.html#a7985094
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Need help with deleteduplicates
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Some of it is happening behind the scenes. A hash of the text of the
indexed document is created when the documents are read. The MapReduce
process uses the InputFormat static inner class to read documents into a
Map class. Here the Map class is not specified, so it is an
IdentityMapper which basically just passes content through. In this
case it is a url->document mapping because that was the input from the
RecordReader in the InputFormat.
The values passed to the UrlReducer would have all documents with the
exact same url. This class then keeps only the most recent url and
discards the others, storing the output as a hash->document structure.
The hash is the MD5 hash of the contents of the web page or document.
This is then used as input for a second MapReduce job. Again no Mapper
class is specified, so the IdentityMapper just passes through. The values
passed to the HashReducer would be all documents with the exact same
content (i.e. hash). It keeps the one with the highest score (created at
indexing time) and discards the others. The output from the HashReducer is
then used as input for a third MapReduce job that actually deletes from the
index files.
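The HashReducer step described above (keep the highest-scoring document per content hash) can be sketched in plain Java without the Hadoop API. The Doc class and its fields below are illustrative stand-ins for Nutch's IndexDoc, not the actual Nutch classes:

```java
import java.util.*;

// Hadoop-free sketch of the HashReducer logic: for each group of documents
// sharing the same content hash, keep only the one with the highest
// indexing-time score. "Doc" is a stand-in for Nutch's IndexDoc.
public class HashReduceSketch {
    static class Doc {
        final String url;
        final float score;   // score assigned at indexing time
        Doc(String url, float score) { this.url = url; this.score = score; }
    }

    // Emulates the reduce step for one hash key: return the single kept doc.
    static Doc reduce(List<Doc> sameHash) {
        Doc best = sameHash.get(0);
        for (Doc d : sameHash) {
            if (d.score > best.score) best = d;  // highest score wins
        }
        return best;
    }

    public static void main(String[] args) {
        List<Doc> group = Arrays.asList(
            new Doc("http://a.example/page", 0.4f),
            new Doc("http://b.example/copy", 0.9f));
        System.out.println(reduce(group).url);  // the higher-scored copy survives
    }
}
```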
Forget about the compareTo, because it wouldn't be a straightforward
object.compareTo(object) comparison. What you would want to do is
change the RecordReader to analyze the document in your specific way
and add another field to the IndexDoc which is your numeric
representation of the similarity comparison. Then in the UrlReducer you
would want to collect your numeric as the key. Then in the HashReducer,
make your comparison of which document to keep and not to keep based on
the similarity numeric. Remember that similar urls would need to return
the same numeric so that they get passed to the HashReducer class as a
single set of values.
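As a toy illustration of the "similarity numeric as key" idea: the fingerprint below normalizes text and hashes the sorted word set, so reordered or re-punctuated copies collide on the same key and would reach the same reducer. This is a crude stand-in for a real similarity fingerprint such as SimHash, and is not part of Nutch:

```java
import java.util.*;

// Derive a single numeric key from a document's text so that "similar
// enough" documents map to the same key. Normalization here (lowercase,
// strip punctuation, sort unique words) is purely illustrative.
public class SimilarityKeySketch {
    static long similarityKey(String text) {
        String[] words = text.toLowerCase().replaceAll("[^a-z0-9 ]", " ").split("\\s+");
        TreeSet<String> unique = new TreeSet<>(Arrays.asList(words));
        unique.remove("");  // drop empty token from leading separators
        return String.join(" ", unique).hashCode();  // same word set -> same key
    }

    public static void main(String[] args) {
        long k1 = similarityKey("The quick brown fox!");
        long k2 = similarityKey("quick, the BROWN fox");
        System.out.println(k1 == k2);  // true: word sets match, keys collide
    }
}
```

Two pages that differ only in word order or punctuation would then be grouped into a single set of reduce values, where the keep/discard decision is made.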
Dennis
sdeck wrote:
> That sort of gets me there in understanding what is going on.
> Still not all the way though.
> So, let's look at the trunk of deleteduplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where url == url, and if so, delete that doc from
> the index.
> So, I am not sure where I would put my code.
>
> I could possibly modify the hash content reducer. Basically, here is the
> algorithm approach
>
> start at 1
> loop through 2-N and take the text of 1 and compare to the text of 2, 3,
> 4,...N
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I don't
> really understand where the score is coming from and where the comparison is
> really taking place.
>
The score is the calculated score of the indexed document. This score
is partially created at the time the page was indexed.
> I see this:
> public int compareTo(Object o) {
>   IndexDoc that = (IndexDoc)o;
>   if (this.keep != that.keep) {
>     return this.keep ? 1 : -1;
>   } else if (!this.hash.equals(that.hash)) {  // order first by hash
>     return this.hash.compareTo(that.hash);
>   ...
>
>
> So, is that where I would place my similarity score and return that value
> there?
>
>
>
>
> Dennis Kubes wrote:
>
>> If I am understanding what you are asking, in the getRecordReader method
>> of the InputFormat inner class in DeleteDuplicates it gets the hash
>> score from the document. You could put your algorithm there and return
>> some type of numeric value based on analysis of the document fields.
>> You would need to write a different class for HashScore and return it
>> from the record reader. You would probably want to keep the IndexDoc
>> being written out as the value in dedup phase 1 (in the job config) but
>> change the key to your HashScore replacement class. You would need to
>> change HashPartitioner to partition according to your new key numeric.
>> The HashReducer would also need to be changed to collect only the ones
>> you want based on your new key numeric.
>>
>> The dedup phase 2 deletes by url, so if you want to remove exact urls
>> then you would leave it in; otherwise you might want to take the job
>> config section for phase 2 out.
>>
>> Hope this helps.
>>
>> Dennis
>>
>> sdeck wrote:
>>
>>> Hello,
>>> I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>>> I want to add duplicate deletion based on a similarity algorithm, as
>>> opposed to the hash method that is currently in there.
>>> I have to say I am pretty lost as to how the DeleteDuplicates class
>>> works.
>>> I would guess that I need to implement a compareTo method, but I am not
>>> really sure what to return. Also, when I do return something, where do I
>>> implement the functionality to say "yes, these are dupes, so remove the
>>> first one"?
>>>
>>> Can anyone help out?
>>> Thanks,
>>> S
>>>
>>>
>>
>
>
Re: Need help with deleteduplicates
Posted by Doğacan Güney <do...@agmlab.com>.
sdeck wrote:
> That sort of gets me there in understanding what is going on.
> Still not all the way though.
> So, let's look at the trunk of deleteduplicates:
> http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>
> Nowhere in there do I see where url == url, and if so, delete that doc from
> the index.
> So, I am not sure where I would put my code.
>
> I could possibly modify the hash content reducer. Basically, here is the
> algorithm approach
>
> start at 1
> loop through 2-N and take the text of 1 and compare to the text of 2, 3,
> 4,...N
> If the similarity score is > ## then delete that document.
>
> The way I understand the hash reducer, that is what it is doing, but I don't
> really understand where the score is coming from and where the comparison is
> really taking place.
> I see this:
> public int compareTo(Object o) {
>   IndexDoc that = (IndexDoc)o;
>   if (this.keep != that.keep) {
>     return this.keep ? 1 : -1;
>   } else if (!this.hash.equals(that.hash)) {  // order first by hash
>     return this.hash.compareTo(that.hash);
>   ...
>
>
> So, is that where I would place my similarity score and return that value
> there?
>
>
>
AFAIK DeleteDuplicates works like this:
IndexDoc is a representation of the actual document in your index (IndexDoc
keeps, among other things, the document's url, boost and digest). It is also
Writable and Comparable, which means that it can be used both as a key
and a value in MapReduce.
In the first phase of dedup, the job reads the indexes and outputs
<IndexDoc.url, IndexDoc> pairs. The job's map is identity, so in reduce,
IndexDocs with the same url are grouped under the same reduce. Reduce outputs
these, marking older versions of the same url to be deleted. (So if you
fetched the same url more than once, only the newest is kept.)
In phase 2, the job reads this output and outputs <IndexDoc.hash, IndexDoc>
pairs. Again the map is identity, and reduce marks the relevant ones to be
deleted. (So if you fetched the same document under different urls, only
the one with the highest boost or the shortest url is kept.)
Phase 3 reads this output then deletes all marked documents.
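A Hadoop-free sketch of that first phase (group by url, keep the newest fetch) might look like this; Doc and fetchTime are illustrative stand-ins, not Nutch's actual IndexDoc:

```java
import java.util.*;

// Sketch of dedup phase 1: documents are grouped by url and only the most
// recently fetched version is kept; older ones would be marked deleted.
public class UrlDedupSketch {
    static class Doc {
        final String url;
        final long fetchTime;  // epoch millis of the fetch
        Doc(String url, long fetchTime) { this.url = url; this.fetchTime = fetchTime; }
    }

    // For each url, keep the newest fetch.
    static Map<String, Doc> phase1(List<Doc> docs) {
        Map<String, Doc> newest = new HashMap<>();
        for (Doc d : docs) {
            Doc cur = newest.get(d.url);
            if (cur == null || d.fetchTime > cur.fetchTime) newest.put(d.url, d);
        }
        return newest;
    }

    public static void main(String[] args) {
        Map<String, Doc> kept = phase1(Arrays.asList(
            new Doc("http://example.com/", 100L),
            new Doc("http://example.com/", 200L)));
        System.out.println(kept.get("http://example.com/").fetchTime);  // 200
    }
}
```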
I think that your version will be somewhat difficult to implement,
because MapReduce works best on input records that can be processed
independently of each other.
Hope that clears things a bit.
--
Dogacan Guney
Re: Need help with deleteduplicates
Posted by sdeck <sc...@gmail.com>.
That sort of gets me there in understanding what is going on.
Still not all the way though.
So, let's look at the trunk of deleteduplicates:
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
Nowhere in there do I see where url == url, and if so, delete that doc from
the index.
So, I am not sure where I would put my code.
I could possibly modify the hash content reducer. Basically, here is the
algorithm approach
start at 1
loop through 2-N and take the text of 1 and compare to the text of 2, 3,
4,...N
If the similarity score is > ## then delete that document.
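A plain-Java rendering of that pairwise loop, using Jaccard similarity over word sets purely as a placeholder for whatever similarity measure gets plugged in (none of these names are from Nutch):

```java
import java.util.*;

// Toy pairwise pass: compare document 1's text against documents 2..N and
// drop any whose similarity to it exceeds a threshold.
public class PairwiseDedupSketch {
    // Jaccard similarity of the two word sets: |intersection| / |union|.
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Keep doc 0; delete any later doc more similar than `threshold` to it.
    static List<String> dedup(List<String> texts, double threshold) {
        List<String> kept = new ArrayList<>();
        kept.add(texts.get(0));
        for (int i = 1; i < texts.size(); i++) {
            if (jaccard(texts.get(0), texts.get(i)) <= threshold) kept.add(texts.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> kept = dedup(Arrays.asList(
            "the quick brown fox", "the quick brown fox jumps",
            "something else entirely"), 0.7);
        System.out.println(kept.size());  // near-duplicate dropped, 2 remain
    }
}
```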
The way I understand the hash reducer, that is what it is doing, but I don't
really understand where the score is coming from and where the comparison is
really taking place.
I see this:
public int compareTo(Object o) {
  IndexDoc that = (IndexDoc)o;
  if (this.keep != that.keep) {
    return this.keep ? 1 : -1;
  } else if (!this.hash.equals(that.hash)) {  // order first by hash
    return this.hash.compareTo(that.hash);
  ...
So, is that where I would place my similarity score and return that value
there?
Dennis Kubes wrote:
>
> If I am understanding what you are asking, in the getRecordReader method
> of the InputFormat inner class in DeleteDuplicates it gets the hash
> score from the document. You could put your algorithm there and return
> some type of numeric value based on analysis of the document fields.
> You would need to write a different class for HashScore and return it
> from the record reader. You would probably want to keep the IndexDoc
> being written out as the value in dedup phase 1 (in the job config) but
> change the key to your HashScore replacement class. You would need to
> change HashPartitioner to partition according to your new key numeric.
> The HashReducer would also need to be changed to collect only the ones
> you want based on your new key numeric.
>
> The dedup phase 2 deletes by url, so if you want to remove exact urls
> then you would leave it in; otherwise you might want to take the job
> config section for phase 2 out.
>
> Hope this helps.
>
> Dennis
>
> sdeck wrote:
>> Hello,
>> I am running Nutch 0.8 against Hadoop 0.4, just for reference.
>> I want to add duplicate deletion based on a similarity algorithm, as
>> opposed to the hash method that is currently in there.
>> I have to say I am pretty lost as to how the DeleteDuplicates class
>> works.
>> I would guess that I need to implement a compareTo method, but I am not
>> really sure what to return. Also, when I do return something, where do I
>> implement the functionality to say "yes, these are dupes, so remove the
>> first one"?
>>
>> Can anyone help out?
>> Thanks,
>> S
>>
>
>
--
View this message in context: http://www.nabble.com/Need-help-with-deleteduplicates-tf2858127.html#a8058635
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Need help with deleteduplicates
Posted by Dennis Kubes <nu...@dragonflymc.com>.
If I am understanding what you are asking, in the getRecordReader method
of the InputFormat inner class in DeleteDuplicates it gets the hash
score from the document. You could put your algorithm there and return
some type of numeric value based on analysis of the document fields.
You would need to write a different class for HashScore and return it
from the record reader. You would probably want to keep the IndexDoc
being written out as the value in dedup phase 1 (in the job config) but
change the key to your HashScore replacement class. You would need to
change HashPartitioner to partition according to your new key numeric.
The HashReducer would also need to be changed to collect only the ones
you want based on your new key numeric.
The dedup phase 2 deletes by url, so if you want to remove exact urls
then you would leave it in; otherwise you might want to take the job
config section for phase 2 out.
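The partitioning change mentioned above amounts to routing records by the new numeric key so equal keys always meet in the same reducer. A minimal plain-Java sketch, with a long standing in for the hypothetical HashScore replacement class:

```java
// Hadoop-free sketch of what a partitioner for a numeric key does: map the
// key to a non-negative partition index. The sign-bit masking mirrors the
// common hashCode & Integer.MAX_VALUE idiom used by hash partitioners.
public class NumericPartitionSketch {
    static int getPartition(long key, int numPartitions) {
        // mask the sign bit so negative keys still yield a valid index
        return (int) ((key & Long.MAX_VALUE) % numPartitions);
    }

    public static void main(String[] args) {
        // equal keys always land in the same partition
        System.out.println(getPartition(42L, 4) == getPartition(42L, 4));  // true
        // negative keys still get a non-negative partition index
        System.out.println(getPartition(-7L, 4) >= 0);  // true
    }
}
```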
Hope this helps.
Dennis
sdeck wrote:
> Hello,
> I am running Nutch 0.8 against Hadoop 0.4, just for reference.
> I want to add duplicate deletion based on a similarity algorithm, as opposed
> to the hash method that is currently in there.
> I have to say I am pretty lost as to how the DeleteDuplicates class works.
> I would guess that I need to implement a compareTo method, but I am not
> really sure what to return. Also, when I do return something, where do I
> implement the functionality to say "yes, these are dupes, so remove the
> first one"?
>
> Can anyone help out?
> Thanks,
> S
>