Posted to dev@nutch.apache.org by Jakob Heidebrecht <Ja...@gmx.de> on 2005/06/28 11:05:50 UTC

Copy DB by the piece

Hi,

I'm trying to copy the Nutch database.

It seems to be enough to list all pages by MD5
and get all links of those pages.

I open a reader on the db directory, create a new db directory and open a
writer for it.

When I copy the whole database in one pass, there isn't enough disk space to
merge the temp file for big databases, but it works for small ones.

So I tried to do it piece by piece: close the writer after a number of pages
and reopen it again.
That works for pages, but now links are missing from the new db.
The more pages and links I write in one round, the more links end up in the
new db.

Can somebody help me with this?
Is there a way to avoid it?

Regards,
Jakob


Hits Rank and Page Boost problem

Posted by Massimo Miccoli <mm...@iltrovatore.it>.
Dear Nutch dev,

I want to know whether the boost calculated for pages from the inlink count at
indexing and fetching time is used at search time.
With DistributedSearch it seems that the page boost is not used to calculate
the rank of pages. In my result pages, most pages with a low page boost are
on top and some with a high boost are below.
For example, from explain.jsp:


1) boost = 5.3968873   score for query = 50.692223
2) boost = 5.586193    score for query = 46.90389
3) boost = 6.0371985   score for query = 43.306103
4) boost = 7.388178    score for query = 37.984783
....

So is only the score for the query considered when sorting (ranking) the
hits? I think the rank of a hit should be boost * score for query, or am I
wrong?

Thanks,

Massimo


RE: [Nutch-dev] Copy DB by the piece

Posted by Chirag Chaman <de...@filangy.com>.
Boosts are multiplied into the "match score" (a.k.a. the tf-idf score).

Thus, pages are not sorted by boost, but by the final score.

Here's an example:

You have 3 pages:

- news.google.com
- www.blogspot.com/googguy (blog talking about google)
- www.yahoo.com/google-launches-ship-into-space.html

Let's say the boost factors are 1, 2 and 3 respectively.

Now, you do a search for "google".
Let's take the raw scores to be 50, 20 and 15 for the 3 URLs.

After boosts are applied:

news.google.com - 50 * 1 = 50
www.blogspot.com - 20 * 2 = 40
www.yahoo.com - 15 * 3 = 45

Thus, you'll get the ranking:

news.google.com
www.yahoo.com...
www.blogspot.com...
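
The arithmetic above can be sketched in a few lines of plain Python. This is
only an illustration of "final score = raw score * boost, then sort"; it is
not Nutch or Lucene code, and the names `pages` and `rank` are made up for the
example.

```python
# (url, raw match score, boost) triples taken from the example above.
pages = [
    ("news.google.com", 50.0, 1.0),
    ("www.blogspot.com/googguy", 20.0, 2.0),
    ("www.yahoo.com/google-launches-ship-into-space.html", 15.0, 3.0),
]


def rank(pages):
    """Return (url, final_score) pairs, highest final score first."""
    scored = [(url, raw * boost) for url, raw, boost in pages]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank(pages)
for url, score in ranking:
    print(f"{score:5.1f}  {url}")
```

Note that the page with the highest boost (3) still ends up second, because
its raw score is much lower.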


HTH!





 


Re: [Nutch-dev] Copy DB by the piece

Posted by Massimo Miccoli <mm...@iltrovatore.it>.
Chirag,

So the boost at the top of explain.jsp is what is used for sorting results,
i.e. the final value for the rank? If so, the hits on my results pages are
not ordered by boost, because hits with a low boost are in the first
positions.

Thanks


RE: [Nutch-dev] Copy DB by the piece

Posted by Chirag Chaman <de...@filangy.com>.
Massimo,

The boost gets multiplied in at search time.

This boost has already been applied to the "field norms" -- a good way to
confirm this is to look at a field norm that was originally one (URL or
anchor is a good one); it should now be higher. Many of the other fields,
like "content", have norms that are way too small to begin with to show any
difference.

In short, if you see the boost at the top of the explain page, it's
definitely there in the field norms -- and thus being applied.
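
A rough sketch of why the boost is easy to spot in a short field but not in
"content". It assumes the classic Lucene formula, in which a field's norm is
document boost * field boost * length norm, with a default length norm of
1/sqrt(number of terms). Nutch's own Similarity may normalize some fields
differently, and stored norms are rounded into a single byte, which this
sketch ignores. The function name `field_norm` and the term counts are
illustrative assumptions; the boost value is one of those reported from
explain.jsp.

```python
import math


def field_norm(doc_boost, num_terms, field_boost=1.0):
    """Index-time norm under the assumed classic Lucene formula."""
    length_norm = 1.0 / math.sqrt(num_terms)  # assumed default length norm
    return doc_boost * field_boost * length_norm


boost = 5.586193  # a page boost as shown at the top of explain.jsp

# One-term field (e.g. a short anchor): the unboosted norm is 1.0,
# so the stored norm is simply the boost.
short_norm = field_norm(boost, num_terms=1)

# Long field (e.g. "content" with 10,000 terms): the unboosted norm is
# already 0.01, so even the boosted norm stays far below 1.
long_norm = field_norm(boost, num_terms=10_000)

print(short_norm, long_norm)
```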

CC-

--------------------------------------------
Filangy, Inc.
We're Improving Search!
www.filangy.com



