Posted to dev@nutch.apache.org by Massimo Miccoli <mm...@iltrovatore.it> on 2005/06/28 17:20:51 UTC

Re: [Nutch-dev] Copy DB by the piece

Dear Nutch dev,

I want to know whether the boost calculated for pages from the inlink count
at indexing and fetching time is used at search time.
With DistributedSearch it seems that the page boost is not used to calculate
the rank of pages. What I see in my result pages
is that most pages with a low page boost are on top and some with a high boost are below.
For example, from explain.jsp:


1) boost = 5.3968873   score for query = 50.692223
2) boost = 5.586193    score for query = 46.90389
3) boost = 6.0371985   score for query = 43.306103
4) boost = 7.388178    score for query = 37.984783
....

So is only the score for the query considered when sorting (ranking) the
hits? I think the rank of a hit should be boost * score for query, or
am I wrong?

Thanks,

Massimo

RE: [Nutch-dev] Copy DB by the piece

Posted by Chirag Chaman <de...@filangy.com>.
Boosts are multiplied into the "match score" (a.k.a. the tf-idf score).

Thus, pages are not sorted by boost, but by the final score.

Here's an example:

You have 3 pages:

- news.google.com
- www.blogspot.com/googguy (blog talking about google)
- www.yahoo.com/google-launches-ship-into-space.html

Let's say the boost factors are 1, 2 and 3 respectively.

Now, you do a search for "google".
Let's take the raw scores to be 50, 20 and 15 for the 3 URLs.

After boosts are applied:

news.google.com  - 50 * 1 = 50
www.blogspot.com - 20 * 2 = 40
www.yahoo.com    - 15 * 3 = 45

Thus, you'll get the ranking:

news.google.com
www.yahoo.com...
www.blogspot.com...
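
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of "multiply the raw score by the boost, then sort by the product"; the tuple layout and the final_score helper are made up for this sketch and are not Nutch or Lucene APIs.

```python
# Hypothetical illustration of the example above; not Nutch or Lucene code.
# Each entry is (url, raw match score, boost factor), using the numbers
# from the example.
pages = [
    ("news.google.com", 50.0, 1.0),
    ("www.blogspot.com/googguy", 20.0, 2.0),
    ("www.yahoo.com/google-launches-ship-into-space.html", 15.0, 3.0),
]


def final_score(raw_score, boost):
    """Final score as described in the example: raw score times boost."""
    return raw_score * boost


# Hits are ordered by the final score, highest first, not by the boost alone.
ranked = sorted(pages, key=lambda p: final_score(p[1], p[2]), reverse=True)

for url, raw, boost in ranked:
    print(f"{url}: {raw} * {boost} = {final_score(raw, boost)}")
```

The page with the highest boost (3) does not come first, because its raw score is too low for the product to overtake the first page.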


HTH!





 

-----Original Message-----
From: Massimo Miccoli [mailto:mmiccoli@iltrovatore.it] 
Sent: Tuesday, June 28, 2005 12:51 PM
To: dev@nutch.org
Subject: Re: [Nutch-dev] Copy DB by the piece

Chirag,

So the boost at the top of explain.jsp is used for sorting results, as the
final value for rank? If so, the hits on the results pages are not ordered
by boost, because I have hits with a low boost in the first positions.

Thanks



Re: [Nutch-dev] Copy DB by the piece

Posted by Massimo Miccoli <mm...@iltrovatore.it>.
Chirag,

So the boost at the top of explain.jsp is used for sorting results, as the
final value for rank? If so, the hits on the results pages are not ordered
by boost, because I have hits with a low boost in the first positions.

Thanks


RE: [Nutch-dev] Copy DB by the piece

Posted by Chirag Chaman <de...@filangy.com>.
Massimo,

The boost gets multiplied at search time.

This boost has already been applied to the "field norms" -- a good way to
confirm this is to look at a field norm that was originally one (URL or
anchor is a good one); it should now be higher. Many of the other fields,
like "content", are way too small to begin with to show any difference.

In short, if you see the boost at the top of the explain page, it's
definitely there in the field norms -- and thus being applied.
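
As a rough sketch of why the boost is easy to spot in a short field and hard to spot in "content": the code below assumes the classic Lucene default, where a field norm is the document boost times the field boost times a length normalization of 1/sqrt(number of terms). That formula is an assumption here; the Similarity class Nutch actually uses may normalize fields differently, and the function names and term counts are made up for illustration.

```python
import math


def length_norm(num_terms):
    """Assumed length normalization: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def field_norm(doc_boost, num_terms, field_boost=1.0):
    """Assumed field norm: document boost * field boost * length norm."""
    return doc_boost * field_boost * length_norm(num_terms)


# Page boost of the first hit in the explain.jsp listing from this thread.
boost = 5.3968873

# Term counts are hypothetical: a one-term field versus a long "content" field.
for name, num_terms in [("url", 1), ("content", 1000)]:
    before = field_norm(1.0, num_terms)
    after = field_norm(boost, num_terms)
    print(f"{name}: norm without boost = {before:.4f}, with boost = {after:.4f}")
```

Under these assumptions a one-term field goes from a norm of exactly 1 to the boost value itself, while a 1000-term field stays a small number either way.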

CC-

--------------------------------------------
Filangy, Inc.
We're Improving Search!
www.filangy.com

