
Posted to java-user@lucene.apache.org by Xiaocheng Luan <je...@yahoo.com> on 2006/03/08 19:58:56 UTC

Does Lucene support on-disk search?

Hi,
   
  I heard that Lucene loads the index into memory to do a search, which does not sound quite right to me. I would not be surprised if Lucene is smart enough to load
  the index into memory when that is feasible, but I'd be surprised if it ALWAYS loads the index into memory to do the search, which I think would cause scalability problems.
   
  Could someone clarify this? Thanks!
   
  By the way, could someone please share some experience with Lucene's performance? Say, on a data set of a few gigabytes and a "reasonable" query, what would
  the average search time be?
   
  Xiaocheng

		

Re: Does Lucene support on-disk search?

Posted by Grant Ingersoll <gs...@syr.edu>.

Lucene _can_ load the index into memory, but it doesn't have to. If you 
want further details, see the Javadocs on RAMDirectory versus 
FSDirectory.  I think you will find it performs well on a few 
gigs of data.  Results, of course, vary based on what you are asking it 
to do and what kind of hardware you have.

-Grant


-- 
------------------------------------------------------------------- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene Ranking/scoring

Posted by Chris Hostetter <ho...@fucit.org>.

: Do I have this right? I got a bit confused at first because I assumed that the
: actual field values were being used in the computation, but you really need
: to know the unique term count in order to get the score 'right'.

you can use the actual values in FunctionQueries, except that:
  1) dates aren't numeric values that lend themselves well to functions
  2) the ReverseOrdinalValueSource comes in handy when you want the docs
with the highest value (ie: most recent date) to be "special" (ie: to plug
into your reciprocal function and get the max value).

i suppose you could write a ValueSource that finds the max value of a
field and then a ValueSource that normalizes all the values of one
valuesource against the value(s) of another value source ... but no one
has done that yet (and it still wouldn't have a lot of meaning for dates)
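
A rough sketch of the normalization idea described above, written against plain float arrays rather than Solr's ValueSource API. The class and method names are hypothetical, and the values are assumed to be non-negative; a real implementation would have to wrap these steps in ValueSource subclasses.

```java
public class MaxNormalizeSketch {
    // Finds the largest value in the array (stands in for a
    // "max of field" value source; hypothetical helper).
    static float max(float[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        float m = values[0];
        for (float v : values) {
            if (v > m) {
                m = v;
            }
        }
        return m;
    }

    // Divides every value by the maximum, so the largest becomes 1.0.
    // Assumes non-negative inputs; returns a copy unchanged if the max is 0.
    static float[] normalize(float[] values) {
        float m = max(values);
        float[] out = values.clone();
        if (m == 0f) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] = out[i] / m;
        }
        return out;
    }

    public static void main(String[] args) {
        for (float v : normalize(new float[] {2f, 4f, 8f})) {
            System.out.println(v);
        }
    }
}
```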


-Hoss



Re: Lucene Ranking/scoring

Posted by Peter Keegan <pe...@gmail.com>.

I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be
used to rank documents by score and date (solr.search.function contains
great stuff!). The values in the date field that are used for the
ValueSource are not actually used as 'floats', but rather their ordinal term
values from the FieldCache string index. This means that if the 'date' field
has 3000 unique string 'values' in the index, the values for 'x' in
ReciprocalFloatFunction could be 0-2999. So if I want the most recent 'date'
to return a score of 1.0, I could set 'a' and 'b' in the function to
2999.
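
To make the arithmetic above concrete, here is a small standalone sketch. It does not use Solr; it assumes the reciprocal function has the form a/(m*x+b) and that reverse ordinals are zero-based as described in this message (the numbering Solr actually uses may start at 1). The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class RecipRordSketch {
    // Assumed form of the reciprocal function: a / (m * x + b).
    static float recip(float x, float m, float a, float b) {
        return a / (m * x + b);
    }

    // Zero-based reverse ordinal of a term among the unique, sorted terms:
    // the largest term gets 0, the next largest 1, and so on.
    static int reverseOrdinal(String[] terms, String value) {
        String[] sorted =
            new TreeSet<String>(Arrays.asList(terms)).toArray(new String[0]);
        int ord = Arrays.binarySearch(sorted, value);
        if (ord < 0) {
            throw new IllegalArgumentException("unknown term: " + value);
        }
        return sorted.length - 1 - ord;
    }

    public static void main(String[] args) {
        // Four documents, three unique date terms.
        String[] dates = {"20060301", "20060215", "20060309", "20060215"};
        for (String d : new TreeSet<String>(Arrays.asList(dates)).descendingSet()) {
            int x = reverseOrdinal(dates, d);
            // a and b are set to (unique term count - 1), mirroring the 2999 above.
            System.out.println(d + " rord=" + x + " score=" + recip(x, 1f, 2f, 2f));
        }
    }
}
```

Under these assumptions, with 3000 unique terms and a = b = 2999 (m = 1), the most recent date (x = 0) scores 1.0 and the oldest (x = 2999) scores 0.5. The score depends only on a term's position among the unique terms, not on the date value itself.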

Do I have this right? I got a bit confused at first because I assumed that the
actual field values were being used in the computation, but you really need
to know the unique term count in order to get the score 'right'.

By the way, as I try to get my head around the Score, Weight, and Boolean*
classes (and next(), skipTo()), I nominate these for discussion in Lucene In
Action II.

Peter


Re: Lucene Ranking/scoring

Posted by Yonik Seeley <ys...@gmail.com>.

On 3/9/06, Yang Sun <ys...@ist.psu.edu> wrote:
> Hi Yonik,
> Thanks very much for your suggestion. The query boost works great for
> keyword matching. But in my case, I need to rank the results by date and
> title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only boost
> the document with date=2004. What I need is boosting the "distance" from the
> specified date

If all you need to do is boost more recent documents (and a single
fixed boost will always work), then you can do that boosting at index
time.

> which means 2003 will have a better ranking than 2002,
> 2002>2001, etc.
> I implemented a customized ScoreDocComparator class which works fine for one
> field. But I met some trouble when trying to combine other fields together.
> I'm still looking at FunctionQuery. Don't know if I can figure out
> something.

FunctionQuery support is integrated into Solr (or currently hacked-in,
as the case may be),  and can be useful for debugging and trying out
query types even if you don't use it for your runtime.

ReciprocalFloatFunction might meet your needs for increasing the score
of more recent documents:
http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html

The SolrQueryParser can make
ReciprocalFloatFunction(new ReverseOrdFieldSource("my_date"),1,1000,1000)
out of _val_:"recip(rord(my_date),1,1000,1000)"

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server


RE: Lucene Ranking/scoring

Posted by Yang Sun <ys...@ist.psu.edu>.

Hi Yonik,
Thanks very much for your suggestion. The query boost works great for
keyword matching. But in my case, I need to rank the results by date and
title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only boost
documents with date=2004. What I need is to boost by the "distance" from the
specified date, which means 2003 will rank better than 2002, 2002 better
than 2001, etc.
I implemented a customized ScoreDocComparator class, which works fine for one
field, but I ran into trouble when trying to combine other fields.
I'm still looking at FunctionQuery. I don't know yet whether I can figure
something out.
Any suggestions? Thanks.

Yang

-----Original Message-----
From: Yonik Seeley [mailto:yseeley@gmail.com] 
Sent: March 8, 2006 21:35
To: java-user@lucene.apache.org
Subject: Re: Lucene Ranking/scoring

Hi Yang,

Boosting works at query time as well as index time.
If you are using the QueryParser, specify boosts like so:
title:foo^2 abstract:foo^1.5 date:mydate^3

If you are building queries programmatically, then use the Query.setBoost()
method.

That will boost relative to how a non-boosted query would score, but
keep in mind that you still have tf/idf factors in the score.  If you
need to get rid of the tf/idf factors, either write your own
ScoreDocComparator, or use a FunctionQuery.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server

On 3/8/06, Yang Sun <ys...@ist.psu.edu> wrote:
> Hi,
> Just wondering how I can rank search result by a combination of fields. I
> know there is a multi-field sort, but it is just a sorting method. It is
> sorted by the first field and then the second field ...
> What I need is a weighted combination. For example, I want to assign a
> weight of 2 to title match, 1.5 to abstract match, and 3 to date match (i.e.
> How close the last modified date). The final score will be
> 2*inTitle+1.5*inAbstract+3*date instead of sorting by date and then sorting
> by title within the same date.
> I checked lucene Score, Similarity, and SortDocComparator and can't find an
> answer. Implements the SortDocComparator seems the closest, but it can only
> sort the result by one field. The Field boost does not work because the
> boosting factor has to be set during index time. What I need is setting the
> weight at query time.
> Please help. Thanks.
>
> Yang


Re: Lucene Ranking/scoring

Posted by Yonik Seeley <ys...@gmail.com>.

Hi Yang,

Boosting works at query time as well as index time.
If you are using the QueryParser, specify boosts like so:
title:foo^2 abstract:foo^1.5 date:mydate^3

If you are building queries programmatically, then use the Query.setBoost() method.

That will boost relative to how a non-boosted query would score, but
keep in mind that you still have tf/idf factors in the score.  If you
need to get rid of the tf/idf factors, either write your own
ScoreDocComparator, or use a FunctionQuery.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server


On 3/8/06, Yang Sun <ys...@ist.psu.edu> wrote:
> Hi,
> Just wondering how I can rank search result by a combination of fields. I
> know there is a multi-field sort, but it is just a sorting method. It is
> sorted by the first field and then the second field ...
> What I need is a weighted combination. For example, I want to assign a
> weight of 2 to title match, 1.5 to abstract match, and 3 to date match (i.e.
> How close the last modified date). The final score will be
> 2*inTitle+1.5*inAbstract+3*date instead of sorting by date and then sorting
> by title within the same date.
> I checked lucene Score, Similarity, and SortDocComparator and can't find an
> answer. Implements the SortDocComparator seems the closest, but it can only
> sort the result by one field. The Field boost does not work because the
> boosting factor has to be set during index time. What I need is setting the
> weight at query time.
> Please help. Thanks.
>
> Yang


Lucene Ranking/scoring

Posted by Yang Sun <ys...@ist.psu.edu>.

Hi,
Just wondering how I can rank search results by a combination of fields. I
know there is a multi-field sort, but it is just a sorting method: results
are sorted by the first field and then by the second field ...
What I need is a weighted combination. For example, I want to assign a
weight of 2 to a title match, 1.5 to an abstract match, and 3 to a date match
(i.e., how close the last modified date is). The final score would be
2*inTitle+1.5*inAbstract+3*date instead of sorting by date and then sorting
by title within the same date.
I checked Lucene's Score, Similarity, and SortDocComparator and can't find an
answer. Implementing SortDocComparator seems the closest, but it can only
sort the results by one field. The Field boost does not work because the
boosting factor has to be set at index time. What I need is to set the
weight at query time.
Please help. Thanks.

Yang
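
For readers following the thread, the weighted combination asked about above can be sketched in plain Java, independent of Lucene. This is only an illustration of the formula 2*inTitle+1.5*inAbstract+3*date. Treating inTitle and inAbstract as 0/1 match indicators and date as a recency value between 0 and 1 is an assumption, and the class and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class WeightedRankSketch {
    // One search hit with per-field signals (hypothetical structure).
    static final class Doc {
        final String id;
        final double inTitle;    // assumed: 1 if the title matched, else 0
        final double inAbstract; // assumed: 1 if the abstract matched, else 0
        final double date;       // assumed: recency in [0, 1], 1 = most recent

        Doc(String id, double inTitle, double inAbstract, double date) {
            this.id = id;
            this.inTitle = inTitle;
            this.inAbstract = inAbstract;
            this.date = date;
        }
    }

    // The weighted combination from the message, with weights chosen at query time.
    static double score(Doc d) {
        return 2.0 * d.inTitle + 1.5 * d.inAbstract + 3.0 * d.date;
    }

    // Returns document ids ordered by descending combined score.
    static List<String> rank(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<Doc>(docs);
        sorted.sort((x, y) -> Double.compare(score(y), score(x)));
        List<String> ids = new ArrayList<String>();
        for (Doc d : sorted) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = new ArrayList<Doc>();
        docs.add(new Doc("a", 1, 0, 0.0));
        docs.add(new Doc("b", 0, 1, 1.0));
        docs.add(new Doc("c", 1, 1, 0.5));
        System.out.println(rank(docs));
    }
}
```

Unlike a multi-field sort, every field contributes to a single number here, so a very recent document with only an abstract match can outrank an older one with a title match.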

