Posted to user@nutch.apache.org by Gora Mohanty <go...@mimirtech.com> on 2011/01/04 12:27:46 UTC

Re: unnecessary results in search

On Tue, Jan 4, 2011 at 5:40 AM,  <al...@aim.com> wrote:
> Hello,
>
> I used nutch-1.2 to index a few domains. I noticed that Nutch correctly crawled all sub-pages of the domains. By sub-pages I mean, for example, all links inside a domain mydomain.com, such as
> mydomain.com/show/photos/1, etc. I also noticed in our Apache logs that Googlebot also crawled all sub-pages.
> However, in a search for mydomain.com, Google gives mydomain.com on the first page and almost no sub-pages, but Nutch gives all sub-pages. If a domain has, let's say, 200 sub-pages and we display 10 results per page, then we would have to go forward 20 pages to see results from other domains. By contrast, Google displays results from other domains in second place.
[...]

It is not entirely clear what you want:
* If your goal is to only crawl to a certain depth on a domain, you can
  use the -depth argument for the Nutch crawl, or use the -topN option
  to specify the max. number of pages to retrieve at each level.
* Can you give an actual example of what you are searching for?
  It is difficult to understand your description above. E.g., searching
  Google for "yahoo.com" returns many, many links from yahoo.com.
* If you mean that a search with any query string returns different
  results between Google and Nutch, that could be due to many
  reasons. In both cases, the returned pages are ranked by relevance,
  but the algorithm is different. Also, Google has probably indexed many
  more sites than your Nutch crawl.

Regards,
Gora
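
To make the effect of the two crawl limits mentioned above concrete, here is a toy sketch in Python. It is not Nutch code: the function name bounded_crawl, the in-memory link graph, and the "first come, first served" selection are all assumptions made for illustration (Nutch selects the pages for each round by score). It only shows that depth bounds the number of rounds and top_n bounds the pages fetched per round.

```python
def bounded_crawl(links, seeds, depth, top_n):
    """Toy crawl: run `depth` rounds, fetching at most `top_n` pages per round.

    `links` maps a URL to the URLs it links to (a hypothetical in-memory graph).
    """
    fetched = []
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        if not frontier:
            break
        batch = frontier[:top_n]      # stand-in for a score-based selection
        frontier = frontier[top_n:]   # unselected pages stay queued
        fetched.extend(batch)
        for url in batch:
            for target in links.get(url, []):
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return fetched


links = {
    "mydomain.com": ["mydomain.com/show/photos/1", "mydomain.com/forum/id/5",
                     "mydomain.com/about"],
    "mydomain.com/show/photos/1": ["mydomain.com/show/photos/2"],
}
print(bounded_crawl(links, ["mydomain.com"], depth=2, top_n=2))
```

With these limits the crawl can never fetch more than depth * top_n pages, which is why they restrict what ends up in the index rather than how results are ranked.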

Re: unnecessary results in search

Posted by al...@aim.com.
Hello,

I just noticed that Google actually has results from all subpages of mydomain.com for the keyword mydomain.com, but they are hidden behind a link, "show more results from mydomain.com". Is there a way to put the additional results from the same domain behind such a link in the Nutch RSS feed? I use OpenSearch to display results from Nutch.
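
The grouping described here can be sketched as a post-processing step over the list of result URLs. This is a minimal Python sketch under stated assumptions, not a Nutch or OpenSearch API: the function name collapse_by_host and the per_host parameter are hypothetical, and the input is assumed to be absolute URLs already in ranked order.

```python
from urllib.parse import urlsplit


def collapse_by_host(hits, per_host=2):
    """Keep the first `per_host` hits of each host; park the rest per host.

    Returns (shown, hidden): `shown` preserves the ranked order, and
    `hidden` maps a host to the hits a "more results from <host>" link
    would reveal.
    """
    shown = []
    hidden = {}
    counts = {}
    for url in hits:
        host = urlsplit(url).hostname or ""
        counts[host] = counts.get(host, 0) + 1
        if counts[host] <= per_host:
            shown.append(url)
        else:
            hidden.setdefault(host, []).append(url)
    return shown, hidden


hits = [
    "http://mydomain.com/",
    "http://mydomain.com/show/photos/1",
    "http://mydomain.com/forum/id/5",
    "http://other.org/",
]
shown, hidden = collapse_by_host(hits, per_host=2)
print(shown)
print(hidden)
```

Because the ranked order is preserved, results from other domains move up as soon as a host has used its quota.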

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <go...@mimirtech.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Jan 5, 2011 10:20 am
Subject: Re: unnecessary results in search

On Wed, Jan 5, 2011 at 11:25 PM,  <al...@aim.com> wrote:
> I do search directly in Nutch version 1.2.
> I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword.

That is possible, though I am not sure why the situation is different with
non-popular domains.

> This must be so, because if mydomain.com has, let's say, 2000 subpages, then in the search results for the keyword mydomain.com the next 200 pages will all be subpages of mydomain.com.
>
> If someone could direct me to the part of the source code where Nutch assigns scores to pages, I can take a look at it.

If you are using Nutch for search also, I am afraid that someone else
will have to help you. I have no experience there.

Regards,
Gora

Re: unnecessary results in search

Posted by al...@aim.com.
One more thing I just noticed is that Nutch search results do not display information from the meta tag.
Google and Yahoo do.
In more detail, Nutch search results for the keyword mydomain.com display some short text from the page mydomain.com. By contrast, Google and Yahoo search results for the same keyword display words from the meta tag.

How can this be fixed in Nutch?
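
As an illustration of the behaviour being asked for, the following Python sketch prefers a page's meta description as the result snippet and falls back to text from the page otherwise. It is only a sketch of the idea, not how Nutch builds its summaries: the class MetaDescriptionParser and the function snippet_for are hypothetical names, and the fallback text is assumed to come from whatever summary is already available.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collect the content of the first non-empty <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)  # attribute names arrive lower-cased
        if (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip() or None


def snippet_for(html, fallback):
    """Return the meta description if the page has one, else `fallback`."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description or fallback


page = ('<html><head><meta name="Description" content=" Photos and forums. ">'
        '</head><body>Hello</body></html>')
print(snippet_for(page, "Hello"))
```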

Thanks.
Alex.

Re: unnecessary results in search

Posted by Gora Mohanty <go...@mimirtech.com>.
On Wed, Jan 5, 2011 at 11:25 PM,  <al...@aim.com> wrote:
> I do search directly in Nutch version 1.2.
> I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword.

That is possible, though I am not sure why the situation is different with
non-popular domains.

> This must be so, because if mydomain.com has, let's say, 2000 subpages, then in the search results for the keyword mydomain.com the next 200 pages will all be subpages of mydomain.com.
>
> If someone could direct me to the part of the source code where Nutch assigns scores to pages, I can take a look at it.

If you are using Nutch for search also, I am afraid that someone else
will have to help you. I have no experience there.

Regards,
Gora

Re: unnecessary results in search

Posted by al...@aim.com.
I do search directly in Nutch version 1.2.
I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword.
This must be so, because if mydomain.com has, let's say, 2000 subpages, then in the search results for the keyword mydomain.com the next 200 pages will all be subpages of mydomain.com.

If someone could direct me to the part of the source code where Nutch assigns scores to pages, I can take a look at it.

To test this issue, you can index a domain with a few subpages and compare the search results with those from Google.

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <go...@mimirtech.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Jan 5, 2011 4:10 am
Subject: Re: unnecessary results in search

On Tue, Jan 4, 2011 at 11:36 PM,  <al...@aim.com> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail on the issue that I have.
> First, definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com.
> Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com, etc.
> Call the following subpages: mydomain.com/show/photos/1, mydomain.com/forum/id/5, etc.
>
> With these definitions, I have observed by examining Apache log files that the Google and Nutch crawlers crawled all subpages of mydomain.com.
> However, if we search in Google for the keyword mydomain.com, the results include all subdomains of mydomain.com but not all subpages, maybe only some of them. If we search in Nutch for the keyword mydomain.com, it gives all subdomains and subpages. My concern is not to include all subpages in a search for the keyword mydomain.com. Of course, we must still see a subpage for keywords that appear in that subpage. This means we must not remove subpages from the index.
[...]

OK, the above description makes more sense, after looking
through Google results for "yahoo.com". I do not have the
results of an equivalent Nutch crawl to compare, but I do
imagine that the result would be what you describe above.

What Google seems to be doing here is some special-case
processing for when it recognises that the search is a primary
domain. Interestingly, while it does this for a popular domain
name, searching for more obscure domain names does not
seem to work in the same manner.

You could probably implement a similar special-case handling
of domain names. How are you searching with Nutch? Directly,
or via indexing through Solr?

Regards,
Gora

Re: unnecessary results in search

Posted by Gora Mohanty <go...@mimirtech.com>.
On Tue, Jan 4, 2011 at 11:36 PM,  <al...@aim.com> wrote:
> Hello,
>
> Thank you for your response.
>
> Let me give you more detail on the issue that I have.
> First, definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com.
> Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com, etc.
> Call the following subpages: mydomain.com/show/photos/1, mydomain.com/forum/id/5, etc.
>
> With these definitions, I have observed by examining Apache log files that the Google and Nutch crawlers crawled all subpages of mydomain.com.
> However, if we search in Google for the keyword mydomain.com, the results include all subdomains of mydomain.com but not all subpages, maybe only some of them. If we search in Nutch for the keyword mydomain.com, it gives all subdomains and subpages. My concern is not to include all subpages in a search for the keyword mydomain.com. Of course, we must still see a subpage for keywords that appear in that subpage. This means we must not remove subpages from the index.
[...]

OK, the above description makes more sense, after looking
through Google results for "yahoo.com". I do not have the
results of an equivalent Nutch crawl to compare, but I do
imagine that the result would be what you describe above.

What Google seems to be doing here is some special-case
processing for when it recognises that the search is a primary
domain. Interestingly, while it does this for a popular domain
name, searching for more obscure domain names does not
seem to work in the same manner.

You could probably implement a similar special-case handling
of domain names. How are you searching with Nutch? Directly,
or via indexing through Solr?

Regards,
Gora
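
One possible shape for the special-case handling suggested above, sketched in Python as a re-ranking step applied to the result URLs after the search. Everything here is an assumption for illustration: the function name rerank_for_domain_query, the regular expression used to decide that a query "looks like" a domain name, and the choice to demote sub-pages to the end rather than drop them. It is not something Nutch or Google is known to do in this form.

```python
import re
from urllib.parse import urlsplit

# Hypothetical test for "the query is a bare domain name", e.g. mydomain.com
DOMAIN_QUERY = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def rerank_for_domain_query(query, hits):
    """For a domain-name query, list front pages of the domain and its
    subdomains first, then hits from other domains, then the domain's
    sub-pages. Any other query leaves the order untouched."""
    domain = query.strip().lower()
    if not DOMAIN_QUERY.match(domain):
        return list(hits)
    roots, others, subpages = [], [], []
    for url in hits:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host != domain and not host.endswith("." + domain):
            others.append(url)
        elif parts.path in ("", "/"):
            roots.append(url)
        else:
            subpages.append(url)
    return roots + others + subpages


hits = [
    "http://mydomain.com/show/photos/1",
    "http://mydomain.com/",
    "http://answers.mydomain.com/",
    "http://other.org/about",
    "http://mydomain.com/forum/id/5",
]
print(rerank_for_domain_query("mydomain.com", hits))
```

Sub-pages stay in the index and are still returned for ordinary keyword queries, since only queries matching the domain pattern are reordered.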

Re: unnecessary results in search

Posted by al...@aim.com.
Hello,

Thank you for your response.

Let me give you more detail on the issue that I have.
First, definitions. Let's say I have my own domain that I host on a dedicated server, and call it mydomain.com.
Next, call the following subdomains: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com, etc.
Call the following subpages: mydomain.com/show/photos/1, mydomain.com/forum/id/5, etc.

With these definitions, I have observed by examining Apache log files that the Google and Nutch crawlers crawled all subpages of mydomain.com.
However, if we search in Google for the keyword mydomain.com, the results include all subdomains of mydomain.com but not all subpages, maybe only some of them. If we search in Nutch for the keyword mydomain.com, it gives all subdomains and subpages. My concern is not to include all subpages in a search for the keyword mydomain.com. Of course, we must still see a subpage for keywords that appear in that subpage. This means we must not remove subpages from the index.

I hope this gives you more detail of the issue that I have.

Thanks.
Alex.

-----Original Message-----
From: Gora Mohanty <go...@mimirtech.com>
To: user <us...@nutch.apache.org>
Sent: Tue, Jan 4, 2011 3:28 am
Subject: Re: unnecessary results in search

On Tue, Jan 4, 2011 at 5:40 AM,  <al...@aim.com> wrote:
> Hello,
>
> I used nutch-1.2 to index a few domains. I noticed that Nutch correctly crawled all sub-pages of the domains. By sub-pages I mean, for example, all links inside a domain mydomain.com, such as
> mydomain.com/show/photos/1, etc. I also noticed in our Apache logs that Googlebot also crawled all sub-pages.
> However, in a search for mydomain.com, Google gives mydomain.com on the first page and almost no sub-pages, but Nutch gives all sub-pages. If a domain has, let's say, 200 sub-pages and we display 10 results per page, then we would have to go forward 20 pages to see results from other domains. By contrast, Google displays results from other domains in second place.
[...]

It is not entirely clear what you want:
* If your goal is to only crawl to a certain depth on a domain, you can
  use the -depth argument for the Nutch crawl, or use the -topN option
  to specify the max. number of pages to retrieve at each level.
* Can you give an actual example of what you are searching for?
  It is difficult to understand your description above. E.g., searching
  Google for "yahoo.com" returns many, many links from yahoo.com.
* If you mean that a search with any query string returns different
  results between Google and Nutch, that could be due to many
  reasons. In both cases, the returned pages are ranked by relevance,
  but the algorithm is different. Also, Google has probably indexed many
  more sites than your Nutch crawl.

Regards,
Gora