Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/10/24 20:48:12 UTC
status dedub
Hi,
What is the status of the dedup tool in the mapreduce branch?
The javadoc mentions that the second part isn't implemented, but that the
indexer takes care of this issue anyway.
However, I tried the tool and it looks like it does not work
correctly.
Thanks for any comments.
Stefan
Re: status dedub
Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> I copied a working index and merged the original and the copy together.
> Then I ran the dedup tool over this index. Shouldn't the dedup tool remove
> the duplicates in the merged index?
I usually dedup before the index merge, so that the merged index contains no
duplicates. The mapred dedup tool should work after merging too,
although it expects a directory of indexes, not a single index.
Note again that it does not yet dedup by url, only by md5 of content.
Doug
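
As a rough illustration of the difference between deduplicating by content hash and by url, here is a small sketch. It is not Nutch's implementation: the record format ("<url> <content-hash>"), the sample records, and the function names dedup_by_hash and dedup_by_url are all made up for this example.

```sh
#!/bin/sh
# Illustration only: each record is "<url> <content-hash>" (hypothetical
# format, not how Nutch stores documents).
records='http://a.example/ h1
http://b.example/ h1
http://a.example/ h2'

# Keep the first record seen for each content hash (column 2).
dedup_by_hash() {
  awk '!seen[$2]++'
}

# Keep the first record seen for each url (column 1).
dedup_by_url() {
  awk '!seen[$1]++'
}

echo "by hash:"
printf '%s\n' "$records" | dedup_by_hash
echo "by url:"
printf '%s\n' "$records" | dedup_by_url
```

With hash-only dedup, both records for http://a.example/ survive because their content differs; only a url pass would collapse them. Which copy a real tool keeps is a separate policy question that this sketch does not address.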
Re: status dedub
Posted by Marko Bauhardt <mb...@media-style.com>.
Hi,
Here is a shell script that reproduces the problem.
We noticed that after running dedup on the merged index, it contains fewer
documents than the original index.
Number of Documents in
Original Index: 42
Dedup Index: 17
Have we made a mistake somewhere in the script or in the process itself?
Regards,
Marko.
Here is the script. Before you try the script, please delete the folders
indexes, segments, tmp_index (if they exist) and urls.
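#!/bin/bash
# bash is needed for the C-style for loop below; plain sh may reject it.
NUTCH=$HOME/nutch
DB=$NUTCH/db
SEGMENT=$NUTCH/segments
INDEX=$NUTCH/indexes
TMP_INDEX=$NUTCH/tmp_index
# Seed the db with a single start url.
mkdir urls
echo 'http://www.apache.org' > urls/links.txt
"$NUTCH/bin/nutch" inject "$DB" urls
# Run the generate/fetch/updatedb cycle REPEATS times.
REPEATS=2
for ((a=1; a <= REPEATS; a++))
do
  "$NUTCH/bin/nutch" generate "$DB" "$SEGMENT"
  # Pick the newest segment (last entry matching 2*).
  s1=$(ls -d "$SEGMENT"/2* | tail -1)
  "$NUTCH/bin/nutch" fetch "$s1"
  "$NUTCH/bin/nutch" updatedb "$DB" "$s1"
done
# Index all segments; $s1 stays unquoted so each segment is its own argument.
s1=$(ls -d "$SEGMENT"/2*)
"$NUTCH/bin/nutch" index "$TMP_INDEX" "$DB" $s1
# Copy the last index part so the merged index contains known duplicates.
s1=$(ls -d "$TMP_INDEX"/part-* | tail -1)
cp -r "$s1" "$TMP_INDEX/copyOfIndex"
# Merge everything into one index, then dedup the directory that holds it.
"$NUTCH/bin/nutch" merge "$INDEX/index" "$TMP_INDEX"
"$NUTCH/bin/nutch" dedup "$INDEX"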
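
The cleanup requested above could be scripted along these lines. This is only a sketch: clean_run_dirs is a hypothetical helper name, and it assumes the same layout as the script (indexes, segments and tmp_index under the Nutch directory, urls in the current directory).

```sh
#!/bin/sh
# Hypothetical helper: remove the folders the message asks to delete before
# a run. The first argument is the Nutch directory; it defaults to
# $HOME/nutch as in the script. urls is relative to the current directory.
clean_run_dirs() {
  nutch_dir=${1:-$HOME/nutch}
  # rm -rf does not fail when a folder is already absent.
  rm -rf "$nutch_dir/indexes" "$nutch_dir/segments" "$nutch_dir/tmp_index" urls
}

# Example call (commented out so that sourcing this file deletes nothing):
# clean_run_dirs "$HOME/nutch"
```

The db folder is left alone here because the message does not list it among the folders to delete.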
On 25.10.2005 at 16:28, Stefan Groschupf wrote:
> Hi Doug,
>
> I copied a working index and merged the original and the copy together.
> Then I ran the dedup tool over this index. Shouldn't the dedup tool
> remove the duplicates in the merged index?
> Thanks,
> Stefan
>
> On 24.10.2005 at 21:25, Doug Cutting wrote:
>
>> It works for me. It currently only deletes md5 duplicates, but
>> url duplicates are currently handled elsewhere in the mapred
>> branch. What problems did you see?
>>
>> Doug
>>
>> Stefan Groschupf wrote:
>>
>>> Hi,
>>> What is the status of the dedup tool in the mapreduce branch?
>>> The javadoc mentions that the second part isn't implemented, but
>>> that the indexer takes care of this issue anyway.
>>> However, I tried the tool and it looks like it does not
>>> work correctly.
>>> Thanks for any comments.
>>> Stefan
Re: status dedub
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Doug,
I copied a working index and merged the original and the copy together.
Then I ran the dedup tool over this index. Shouldn't the dedup tool
remove the duplicates in the merged index?
Thanks,
Stefan
On 24.10.2005 at 21:25, Doug Cutting wrote:
> It works for me. It currently only deletes md5 duplicates, but url
> duplicates are currently handled elsewhere in the mapred branch.
> What problems did you see?
>
> Doug
>
> Stefan Groschupf wrote:
>
>> Hi,
>> What is the status of the dedup tool in the mapreduce branch?
>> The javadoc mentions that the second part isn't implemented, but
>> that the indexer takes care of this issue anyway.
>> However, I tried the tool and it looks like it does not work
>> correctly.
>> Thanks for any comments.
>> Stefan
Re: status dedub
Posted by Doug Cutting <cu...@nutch.org>.
It works for me. It currently only deletes md5 duplicates, but url
duplicates are currently handled elsewhere in the mapred branch. What
problems did you see?
Doug
Stefan Groschupf wrote:
> Hi,
> What is the status of the dedup tool in the mapreduce branch?
> The javadoc mentions that the second part isn't implemented, but that the
> indexer takes care of this issue anyway.
> However, I tried the tool and it looks like it does not work
> correctly.
>
> Thanks for any comments.
>
> Stefan