Posted to user@nutch.apache.org by Aled Jones <Al...@comtec-europe.co.uk> on 2005/12/07 11:32:28 UTC

Nutch returns irrelevant site

Hi

I'm currently setting up a Nutch search engine that searches travel
websites.  It works quite well but sometimes returns odd results.
One good example:
One of the 100 or so sites I've asked it to crawl is
http://www.hfholidays.co.uk/ .  This site is mainly about walking
holidays and has many pages containing the word "walking", so when I
search Nutch for "walking" I'd expect the site to turn up.  However, the
first result I get back for the keyword "walking" is
http://www.hfholidays.co.uk/email.asp .  This page doesn't have the word
"walking" in it anywhere.
Could someone please explain whether this is a bug or the way Nutch works?
I have an idea of how Google works. If Nutch works in a similar fashion,
does this page appear because it is linked from many pages with the word
"walking" in them?

Thanks
Aled




************************************************************************
This e-mail and any attachments are strictly confidential and intended solely for the addressee. They may contain information which is covered by legal, professional or other privilege. If you are not the intended addressee, you must not copy the e-mail or the attachments, or use them for any purpose or disclose their contents to any other person. To do so may be unlawful. If you have received this transmission in error, please notify us as soon as possible and delete the message and attachments from all places in your computer where they are stored. 

Although we have scanned this e-mail and any attachments for viruses, it is your responsibility to ensure that they are actually virus free.
 


Re: Nutch returns irrelevant site

Posted by Piotr Kosiorowski <pk...@gmail.com>.
You can use the explain page to find out why this page is scored the way
it is. I would expect anchor text to be the main component of it.
Regards
Piotr
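
To illustrate the anchor-text effect, here is a minimal toy sketch in Python. It is not Nutch's or Lucene's actual scoring code: the field boosts, the example.com URLs, and the function names build_index and search are all assumptions made up for illustration. It only shows the general idea that the text of a link is indexed against the page the link points to, so a page can match a term that never appears in its own body.

```python
from collections import defaultdict

# Hypothetical field weights, chosen only for this illustration.
# Real Nutch/Lucene boosts are configurable and computed differently.
FIELD_BOOST = {"content": 1.0, "anchor": 2.0}


def build_index(pages, links):
    """Build a toy inverted index: term -> {(url, field): count}.

    pages: {url: body text}
    links: list of (source_url, target_url, anchor_text)
    """
    index = defaultdict(lambda: defaultdict(int))
    for url, body in pages.items():
        for term in body.lower().split():
            index[term][(url, "content")] += 1
    # Anchor text is credited to the link's target page,
    # not to the page that contains the link.
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[term][(target, "anchor")] += 1
    return index


def search(index, term):
    """Return (url, score) pairs, highest score first."""
    scores = defaultdict(float)
    for (url, field), count in index.get(term.lower(), {}).items():
        scores[url] += FIELD_BOOST[field] * count
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data: the sign-up page never mentions "walking" itself.
pages = {
    "http://example.com/walks": "walking holidays in the hills",
    "http://example.com/email.asp": "enter your address to subscribe",
}
links = [
    ("http://example.com/walks", "http://example.com/email.asp",
     "walking newsletter"),
    ("http://example.com/home", "http://example.com/email.asp",
     "walking offers by email"),
]

results = search(build_index(pages, links), "walking")
for url, score in results:
    print(url, score)
```

In this toy data the sign-up page ranks first for "walking" purely because two inbound links use that word as anchor text. The explain page should show whether something similar is happening in the real index.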
