Posted to user@nutch.apache.org by Dima Mazmanov <nu...@proservice.ge> on 2006/04/18 11:04:18 UTC

Nutch shows same results multiple times.

Hi all!!
I'm running nutch-0.7.1.

Here is the result of my search.

ArGo Software Design Homepage 
[html] - 30.2 k - 
... Look of our Web Site Our web site has new look and ... link on the ... 
http://www.argosoft.org/RootPages/Default.aspx (Cached) 

ArGo Software Design Homepage 
[html] - 30.2 k - 
... Look of our Web Site Our web site has new look and ... link on the ... 
http://www.argosoft.com/rootpages/Default.aspx (Cached) 

ArGo Software Design Homepage 
[html] - 30.2 k - 
... Look of our Web Site Our web site has new look and ... link on the ... 
http://www.argosoft.com/RootPages/Default.aspx (Cached) 

ArGo Software Design Homepage 
[html] - 30.2 k - 
... Look of our Web Site Our web site has new look and ... link on the ... 
http://www.argosoft.org/rootpages/Default.aspx (Cached) 

As you can see, one result is shown multiple times.
Why is that?
What is the difference between these links? I don't see any.
So, how can I avoid this problem?
Thanks, 
Regards, Dima


Re: Nutch shows same results multiple times.

Posted by Dima Mazmanov <nu...@proservice.ge>.
Well, my script already contains this command.



> Run bin/nutch dedup segments dedup.tmp


Re: Nutch shows same results multiple times.

Posted by "Håvard W. Kongsgård" <h....@niap.no>.
Run bin/nutch dedup segments dedup.tmp


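
On the question of what differs between the four links: as plain strings they are all distinct. They vary in the host (argosoft.org vs. argosoft.com) and in the letter case of the first path segment (RootPages vs. rootpages). The sketch below only compares the URLs quoted in the original post. The case_folded_key helper is a hypothetical illustration, not part of Nutch, and nothing here shows how the dedup command itself compares pages.

```python
from urllib.parse import urlsplit

# The four result URLs copied from the original post.
urls = [
    "http://www.argosoft.org/RootPages/Default.aspx",
    "http://www.argosoft.com/rootpages/Default.aspx",
    "http://www.argosoft.com/RootPages/Default.aspx",
    "http://www.argosoft.org/rootpages/Default.aspx",
]


def case_folded_key(url):
    """Hypothetical comparison key: host plus lower-cased path.

    Illustration only; lower-casing paths is not safe in general,
    because many servers treat paths as case-sensitive.
    """
    parts = urlsplit(url)
    return (parts.hostname, parts.path.lower())


distinct_exact = set(urls)                       # exact string comparison
hosts = {urlsplit(u).hostname for u in urls}     # distinct host names
paths = {urlsplit(u).path for u in urls}         # distinct paths, case kept
folded = {case_folded_key(u) for u in urls}      # distinct after case folding

print("distinct URL strings:", len(distinct_exact))
print("hosts:", sorted(hosts))
print("paths:", sorted(paths))
print("distinct after folding path case:", len(folded))
```

Even with the path case folded, two entries would remain, because the .org and .com hosts are different names. Whether they serve the same page is something a URL comparison alone cannot tell.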