
Posted to user@nutch.apache.org by Dima Mazmanov <nu...@proservice.ge> on 2006/04/20 14:46:43 UTC

Re[2]: Nutch shows same results multiple times.

Hi, Håvard.

OK, thanks a lot! I'll apply this filter now.
One more thing:
if I disallow the 'com' zone and my URL file contains some .com domains,
would they be indexed or not?



> Like this

> +http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
> -.*

> see:
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html
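
To make the effect of the two rules above concrete, here is a minimal Python sketch of a first-match-wins URL filter. It is an illustration only, not Nutch's code: it assumes rules are tried top to bottom, that a pattern counts as a hit when it is found anywhere in the URL, and that a URL matching no rule is rejected. The function name `accepts` and the example.ge URL are made up for the example.

```python
import re

# Rules in file order: (sign, pattern). '+' accepts, '-' rejects.
RULES = [
    ("+", r"http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/"),
    ("-", r".*"),
]

def accepts(url, rules=RULES):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

if __name__ == "__main__":
    for url in (
        "http://www.argosoft.com/RootPages/Default.aspx",
        "http://www.example.ge/",  # hypothetical URL outside the listed zones
    ):
        print(url, "->", "accept" if accepts(url) else "reject")
```

Under these assumptions the trailing `-.*` rule acts as a catch-all, so any URL whose host does not end in one of the listed zones is rejected.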

> Dima Mazmanov wrote:
>> I'm not adding urls into urlfilter files.
>> Besides, I still don't understand how to allow only one zone in 
>> urlfilter.
>> Let's say I want to index only ".ge" zone.
>> Which one of the following filters is correct?
>>
>> +^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/
>> +^http://([a-z0-9\-\.]*\.)*.ge/
>> +^http://([a-z0-9\-\.])*.ge/
>> +^http://www\..*\.ge/
>> +^http://www\..*\.*\.ge/
>>
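
One way to compare the five candidates above is to run them against a few sample URLs. The sketch below uses Python's `re` module as a stand-in for the filter's regex engine (an assumption; Nutch itself is Java), and the example.ge URLs and the extra "strict" pattern are hypothetical, added only for comparison.

```python
import re

# The five candidates from the message, plus one hypothetical stricter pattern.
CANDIDATES = {
    "A": r"^http://([a-z0-9]*\.)*([a-z0-9]*\.).ge/",
    "B": r"^http://([a-z0-9\-\.]*\.)*.ge/",
    "C": r"^http://([a-z0-9\-\.])*.ge/",
    "D": r"^http://www\..*\.ge/",
    "E": r"^http://www\..*\.*\.ge/",
    "strict": r"^http://([a-z0-9-]+\.)+ge/",  # not from the thread
}

# Made-up sample URLs chosen to expose the differences.
URLS = [
    "http://www.example.ge/",
    "http://example.ge/",
    "http://www.example.com/foo.ge/",
    "http://storage/",
]

def matches(pattern, url):
    """True if the pattern is found in the URL (search, not full match)."""
    return re.search(pattern, url) is not None

if __name__ == "__main__":
    for name, pattern in CANDIDATES.items():
        hits = [url for url in URLS if matches(pattern, url)]
        print(name, hits)
```

In this sketch A and B miss www.example.ge, because the unescaped `.` before `ge` demands one extra character after the last dot. C accepts the .ge hosts but also a host such as `storage`. D and E require a `www.` prefix and let `.*` run past the first slash.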
>> By the way, if the site you are indexing is dynamic, you could simply
>> exclude www.bbc.co.uk from indexing and index only the other one.
>>
>>
>>> So what filter settings do you use?
>>> Like this: +^http://([a-z0-9]*\.)*bbc.co.uk/
>>> Then you will get both bbc.co.uk and www.bbc.co.uk <http://www.bbc.co.uk/>,
>>> and since this site is dynamic, the content might be different.
>>> Have the same problem myself :-(
>>>
>>>
>>>
>>>
>>> -----------------------------------
>>> Well my script already contains this command....
>>>
>>>
>>>
>>>
>>>    Run bin/nutch dedup segments dedup.tmp
>>>
>>>
>>>    Dima Mazmanov wrote:
>>>
>>>        Hi all! I'm running nutch-0.7.1.
>>>
>>>        Here is the result of my search.
>>>
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.org/RootPages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.com/rootpages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.com/RootPages/Default.aspx (Cached)
>>>
>>>        ArGo Software Design Homepage [html] - 30.2 k - ... Look of our
>>>        Web Site Our web site has new look and ... link on the ...
>>>        http://www.argosoft.org/rootpages/Default.aspx (Cached)
>>>
>>>        As you can see, one result is shown multiple times.
>>>        Why is that? What is the difference between these links? I don't
>>>        see any.
>>>        So, how can I avoid this problem?
>>>        Thanks, Regards, Dima
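
The four URLs in the listing differ in two small ways: the host ends in .org or .com, and one path segment is spelled RootPages or rootpages. A minimal Python sketch that makes this visible is below. It only compares the strings and is not a description of how Nutch compares or deduplicates pages.

```python
from urllib.parse import urlsplit

# The four result URLs quoted in the message.
URLS = [
    "http://www.argosoft.org/RootPages/Default.aspx",
    "http://www.argosoft.com/rootpages/Default.aspx",
    "http://www.argosoft.com/RootPages/Default.aspx",
    "http://www.argosoft.org/rootpages/Default.aspx",
]

# Distinct host names and distinct paths, compared case-sensitively.
hosts = {urlsplit(url).hostname for url in URLS}
paths = {urlsplit(url).path for url in URLS}

# Lowercasing the whole URL removes the path-case difference only.
lowercased = {url.lower() for url in URLS}

if __name__ == "__main__":
    print("hosts:", sorted(hosts))
    print("paths:", sorted(paths))
    print("distinct after lowercasing:", len(lowercased))
```

Two hosts times two path spellings gives the four entries. Ignoring case would still leave two, one per domain.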
>>>
>>>
>>>
>>>
>>>
>>>
>>> __________ NOD32 1.1497 (20060419) Information __________
>>>
>>> This message was checked by NOD32 antivirus system.
>>> http://www.eset.com
>>>
>>>
>>
>>







-- 
Regards,
 Dima                          mailto:nuther@proservice.ge

Re: Re[2]: Nutch shows same results multiple times.

Posted by Dima Mazmanov <nu...@proservice.ge>.

Thank you very much.
It really worked!!!!

> Like this

> +http://[^/]*\.(com|org|net|biz|mil|us|info|cc)/
> -.*

> see:
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00479.html





-- 
Regards,
 Dima                          mailto:nuther@proservice.ge





Re: Nutch shows same results multiple times.

Posted by "Håvard W. Kongsgård" <h....@niap.no>.

I don't know, but you could try upgrading to 0.7.2.


See Nutch Change Log:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158

Dima Mazmanov wrote:
> Hi, Håvard.
> Thank you again for your help.
> Hmm, there is one more thing I'm curious about.
> The search results for several sites display content like the following:
>
> Cool-Warez
> [html] - 19.1 k - 11/3/2006
> ... Avatars   გართობა   კონტაქტი                                     
> როგორ მოვხსნათ www.sendspace.com Многие из Вас ... вопрос: "Как качать 
> с    http://www
> http://www.cool.caucasus.net/index_moxsna_2.htm (Cached) (More from 
> www.cool.caucasus.net)
>
> As you can see, there are a lot of spaces between the words. Is this a bug
> or what?
> Maybe it's because of the different borders in the web page, and Nutch
> inserts the spaces on its own?
> Is there any way to avoid this problem?
>

Re: Re[2]: Nutch shows same results multiple times.

Posted by Dima Mazmanov <nu...@proservice.ge>.

Hi, Håvard.
Thank you again for your help.
Hmm, there is one more thing I'm curious about.
The search results for several sites display content like the following:

Cool-Warez
[html] - 19.1 k - 11/3/2006
... Avatars   გართობა   კონტაქტი                                     როგორ 
მოვხსნათ 
www.sendspace.com 
Многие из Вас ... вопрос: "Как качать с    http://www
http://www.cool.caucasus.net/index_moxsna_2.htm (Cached) (More from 
www.cool.caucasus.net)

As you can see, there are a lot of spaces between the words. Is this a bug
or what?
Maybe it's because of the different borders in the web page, and Nutch
inserts the spaces on its own?
Is there any way to avoid this problem?