Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/10/24 20:48:12 UTC

status dedub

Hi,
what is the status of the dedup tool in the mapreduce branch?
The Javadoc mentions that the second part isn't implemented, but that the
indexer will take care of this issue anyway.
However, I tried this tool and it looks like it does not work
correctly.

Thanks for any comments.

Stefan

Re: status dedub

Posted by Doug Cutting <cu...@nutch.org>.

Stefan Groschupf wrote:
> I copied a working index and merged the original and the copy together.
> Then I ran dedup over this index. Shouldn't the dedup tool remove
> the duplicates in the merged index?

I usually dedup before index merge, so that the merged index contains no
duplicates.  The mapred dedup tool should work after merging too, but
it expects a directory of indexes, not a single index.
Note again that it does not yet dedup by url, only by md5 of content.

Doug
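
The order Doug describes (dedup first, then merge) can be sketched with the same dedup and merge invocations that Marko's script uses further down in this thread. The NUTCH, INDEXES and MERGED paths, the DRY_RUN switch and the function names are assumptions for illustration only; the argument order is copied from that script, not checked against the tools' usage text.

```shell
#!/bin/sh
# Sketch: dedup the directory of part indexes first, then merge them.
NUTCH=${NUTCH:-$HOME/nutch}             # assumed install location
INDEXES=${INDEXES:-$NUTCH/tmp_index}    # assumed directory of part-* indexes
MERGED=${MERGED:-$NUTCH/indexes/index}  # assumed output of the merge

# Hypothetical helper: with DRY_RUN=1 only print the command, else run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: merge only runs if dedup succeeded.
dedup_then_merge() {
    run "$NUTCH/bin/nutch" dedup "$INDEXES" &&
    run "$NUTCH/bin/nutch" merge "$MERGED" "$INDEXES"
}
```

Calling dedup_then_merge with DRY_RUN=1 prints the two commands without touching any index, which is a cheap way to check the paths before a real run.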

Re: status dedub

Posted by Marko Bauhardt <mb...@media-style.com>.

Hi,
here is a shell script that reproduces the problem.
We noticed that after dedup, the merged index has fewer documents
than the original index.

Number of Documents in
Original Index: 42
Dedup Index: 17

Could there be a mistake in the script or in the process itself?

Regards,
Marko.

Here is the script. Before you run it, please delete the folders
indexes, segments, tmp_index (if they exist) and urls.


#!/bin/bash
# bash is needed for the C-style for loop below.

NUTCH=$HOME/nutch
DB=$NUTCH/db
SEGMENT=$NUTCH/segments
INDEX=$NUTCH/indexes
TMP_INDEX=$NUTCH/tmp_index

# Seed list with a single URL.
mkdir urls
echo 'http://www.apache.org' > urls/links.txt

$NUTCH/bin/nutch inject $DB urls

# Two generate/fetch/updatedb rounds.
REPEATS=2
for ((a=1; a <= REPEATS; a++))
do
    $NUTCH/bin/nutch generate $DB $SEGMENT
    s1=`ls -d $SEGMENT/2* | tail -1`    # newest segment
    $NUTCH/bin/nutch fetch $s1
    $NUTCH/bin/nutch updatedb $DB $s1
done

# Index all segments, duplicate one part index, then merge and dedup.
s1=`ls -d $SEGMENT/2*`
$NUTCH/bin/nutch index $TMP_INDEX $DB $s1
s1=`ls -d $TMP_INDEX/part-* | tail -1`
cp -r $s1 $TMP_INDEX/copyOfIndex
$NUTCH/bin/nutch merge $INDEX/index $TMP_INDEX
$NUTCH/bin/nutch dedup $INDEX
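
The cleanup requested before each run of the script can be sketched as below. The paths mirror the variables in the script; clean_run_dirs is a hypothetical helper name, and the default location is an assumption. Note that rm -rf deletes without asking, so check the Nutch directory you pass in first.

```shell
#!/bin/sh
# Remove the output of a previous run so the reproduction starts clean.
# clean_run_dirs is a hypothetical helper; the db directory is left alone
# because it is not in the list of folders to delete.
clean_run_dirs() {
    nutch_home=${1:-$HOME/nutch}
    # ${var:?} aborts instead of expanding to "/segments" if the value is empty.
    rm -rf "${nutch_home:?}/segments" \
           "${nutch_home:?}/indexes" \
           "${nutch_home:?}/tmp_index" \
           ./urls
}
```

The urls folder is removed relative to the current directory because the script creates it there with mkdir urls.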

On 25.10.2005 at 16:28, Stefan Groschupf wrote:

> Hi Doug,
>
> I copied a working index and merged the original and the copy together.
> Then I ran dedup over this index. Shouldn't the dedup tool
> remove the duplicates in the merged index?
> Thanks,
> Stefan
>
>
> On 24.10.2005 at 21:25, Doug Cutting wrote:
>
>
>> It works for me.  It currently only deletes md5 duplicates, but  
>> url duplicates are currently handled elsewhere in the mapred  
>> branch.  What problems did you see?
>>
>> Doug
>>
>> Stefan Groschupf wrote:
>>
>>
>>> Hi,
>>> what is the status of the dedup tool in the mapreduce branch?
>>> The Javadoc mentions that the second part isn't implemented, but
>>> that the indexer will take care of this issue anyway.
>>> However, I tried this tool and it looks like it does not
>>> work correctly.
>>> Thanks for any comments.
>>> Stefan

Re: status dedub

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi Doug,

I copied a working index and merged the original and the copy together.
Then I ran dedup over this index. Shouldn't the dedup tool
remove the duplicates in the merged index?
Thanks,
Stefan

On 24.10.2005 at 21:25, Doug Cutting wrote:

> It works for me.  It currently only deletes md5 duplicates, but url  
> duplicates are currently handled elsewhere in the mapred branch.   
> What problems did you see?
>
> Doug
>
> Stefan Groschupf wrote:
>
>> Hi,
>> what is the status of the dedup tool in the mapreduce branch?
>> The Javadoc mentions that the second part isn't implemented, but
>> that the indexer will take care of this issue anyway.
>> However, I tried this tool and it looks like it does not work
>> correctly.
>> Thanks for any comments.
>> Stefan

Re: status dedub

Posted by Doug Cutting <cu...@nutch.org>.

It works for me.  It currently only deletes md5 duplicates, but url 
duplicates are currently handled elsewhere in the mapred branch.  What 
problems did you see?

Doug
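
To illustrate what deleting "md5 duplicates" means, here is a minimal sketch outside of Nutch: it hashes the content of plain files and keeps one path per distinct hash. This is not the mapred dedup implementation. The function name dedup_by_md5 is hypothetical, and the sketch assumes GNU md5sum is installed and that file names contain no newlines or backslashes.

```shell
#!/bin/sh
# Illustration only: keep the first file seen for each distinct content hash.
# dedup_by_md5 is a hypothetical helper, not part of Nutch.
dedup_by_md5() {
    # md5sum prints "<hash>  <path>" per file. awk keeps the first line seen
    # for each hash, strips the hash column, and prints the remaining path.
    md5sum "$@" | awk '!seen[$1]++ { sub(/^[^ ]+ +/, ""); print }'
}
```

With this kind of key, two files with identical content under different names collapse to one entry, while the same name with changed content would not be treated as a duplicate. That second case is what url-based dedup covers.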

Stefan Groschupf wrote:
> Hi,
> what is the status of the dedup tool in the mapreduce branch?
> The Javadoc mentions that the second part isn't implemented, but that the
> indexer will take care of this issue anyway.
> However, I tried this tool and it looks like it does not work
> correctly.
> 
> Thanks for any comments.
> 
> Stefan
>