Posted to dev@lucene.apache.org by Arjun Dhar <dh...@yahoo.com> on 2012/06/29 15:02:17 UTC

Query, Searcher, Weight, Similarity = ?

Hi,
I'm new and that is my disclaimer to the stupid question I am about to ask.

Am trying to form a conceptual picture of the relation between Query <-->
Weight <--> IndexReader, Scorer, Searcher <--> Similarity

*From what I gather : (and someone please validate or correct me) *
1. We want *Queries* to be RE-USABLE instances hence *Weight* is a specific
Query's state !?
2. *Searcher* is STATEFUL, and though it processes a *Query*, the state for
that *Searcher* is delegated to the WEIGHT !?
3. *IndexReader* Reads an Index, and the *Searcher* uses the Reader to
SEARCH, using a QUERY
4. From the JavaDocs of Weight class ----> "IndexReader dependent state
should reside in the Scorer. " -- Means, when *weights* are calculated, the
final result of the Calculation goes into a STATEFUL object represented by
the *Scorer* which is also Iterable !?
5. *Searcher* can be assigned a *Similarity* algorithm. ... hence using that
algorithm, it calculates *Weight*, which eventually leads to the
construction of an Iterable *Scorer* !?

6. While indexing, it's simple: there is a direct relation between
IndexWriterConfig <--> Similarity

+Q) Apart from the validation of my understanding, is there a Sequence
Diagram explaining the process of calculation, during a Query?

+Q) There are different implementations of Queries. Do they differ in how
they mash up all the other stuff?
Looks like if I mess with each of the other entities, I can produce pretty
much whatever Query?!

thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Query-Searcher-Weight-Similarity-tp3992080.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Query, Searcher, Weight, Similarity = ?

Posted by Mike Sokolov <so...@ifactory.com>.


On 06/29/2012 02:55 PM, Mike Sokolov wrote:
>
> On 06/29/2012 02:17 PM, Robert Muir wrote:
>> On Fri, Jun 29, 2012 at 2:12 PM, Mike Sokolov<so...@ifactory.com>  
>> wrote:
>>> This has been elucidating, thanks!
>>>
>>> On a related topic:
>>>
>>> I need the ability to pull results lazily, so that I can decide 
>>> whether to
>>> terminate the search iteration early, and ultimately I need to 
>>> delegate that
>>> decision to callers of *my* API.
>>>
>> The typical solution to this is to just throw an exception from your
>> collector when you are satisfied: see TimeLimitingCollector.
>>
> I can't do that because I am writing a middle layer that is called by a 
> consumer that I can't change and that doesn't know anything about 
> Lucene.  I need to implement an Iterator-style interface.  My consumer 
> will simply call next() repeatedly, and then stop at some point, and 
> close() me so I can clean up.
>
> So to satisfy that I think I would have to collect all results.
Or the other solution I considered was a multi-threaded collector, but 
all-in-all it seemed simplest to write a pull-style iterator :)


Re: Query, Searcher, Weight, Similarity = ?

Posted by Mike Sokolov <so...@ifactory.com>.

On 06/29/2012 02:17 PM, Robert Muir wrote:
> On Fri, Jun 29, 2012 at 2:12 PM, Mike Sokolov<so...@ifactory.com>  wrote:
>    
>> This has been elucidating, thanks!
>>
>> On a related topic:
>>
>> I need the ability to pull results lazily, so that I can decide whether to
>> terminate the search iteration early, and ultimately I need to delegate that
>> decision to callers of *my* API.
>>
>>      
> The typical solution to this is to just throw an exception from your
> collector when you are satisfied: see TimeLimitingCollector.
>
>    
I can't do that because I am writing a middle layer that is called by a 
consumer that I can't change and that doesn't know anything about 
Lucene.  I need to implement an Iterator-style interface.  My consumer 
will simply call next() repeatedly, and then stop at some point, and 
close() me so I can clean up.

So to satisfy that I think I would have to collect all results.

-Mike


Re: Query, Searcher, Weight, Similarity = ?

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Jun 29, 2012 at 2:12 PM, Mike Sokolov <so...@ifactory.com> wrote:
> This has been elucidating, thanks!
>
> On a related topic:
>
> I need the ability to pull results lazily, so that I can decide whether to
> terminate the search iteration early, and ultimately I need to delegate that
> decision to callers of *my* API.
>

The typical solution to this is to just throw an exception from your
collector when you are satisfied: see TimeLimitingCollector.

>
> Basically what I wanted was a method nextDoc() that I could call repeatedly
> to retrieve all of the docIDs returned by the search, or at any rate, as
> many as needed.

Push and pull are no different here, I think, though I don't know
your use case. When your collector is satisfied, just throw an
exception :)
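
A minimal sketch of the "throw an exception from your collector" idea, in
plain Java with no Lucene dependency. HitCollector, EnoughHitsException,
FirstNCollector and the search() loop are hypothetical stand-ins invented for
illustration; they are not Lucene's Collector API, and the real
TimeLimitingCollector terminates on elapsed time rather than on a hit count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical push-style callback, loosely modelled on a collector.
interface HitCollector {
    void collect(int doc);
}

// Unchecked exception used purely as a control-flow signal.
final class EnoughHitsException extends RuntimeException {
    EnoughHitsException() {
        // Suppression and stack trace are disabled: this is not an error.
        super("enough hits", null, false, false);
    }
}

// Keeps the first `limit` hits (limit >= 1), then aborts the search loop.
final class FirstNCollector implements HitCollector {
    private final int limit;
    final List<Integer> docs = new ArrayList<>();

    FirstNCollector(int limit) {
        this.limit = limit;
    }

    @Override
    public void collect(int doc) {
        docs.add(doc);
        if (docs.size() >= limit) {
            throw new EnoughHitsException();
        }
    }
}

final class EarlyExitDemo {
    // Stand-in for the search loop, which pushes every match to the collector.
    static void search(int[] matches, HitCollector collector) {
        for (int doc : matches) {
            collector.collect(doc);
        }
    }

    // Runs the search and swallows the expected termination signal.
    static List<Integer> firstN(int[] matches, int n) {
        FirstNCollector collector = new FirstNCollector(n);
        try {
            search(matches, collector);
        } catch (EnoughHitsException expected) {
            // The collector was satisfied; the hits gathered so far are valid.
        }
        return collector.docs;
    }

    public static void main(String[] args) {
        System.out.println(firstN(new int[] {2, 5, 8, 13, 21}, 3)); // prints [2, 5, 8]
    }
}
```

The caller that owns the try/catch decides what "satisfied" means, so the
search loop itself needs no knowledge of the stopping rule.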

>
> Last question: what order are documents returned in if you create Scorers
> with ordered=true - is that always ascending docID order?
>

Yes. If you allow out-of-order scoring, currently only
BooleanScorer does anything with that (I think only when it's a
top-level scorer), returning out-of-order hits within each window of
2k docs.


-- 
lucidimagination.com


Re: Query, Searcher, Weight, Similarity = ?

Posted by Mike Sokolov <so...@ifactory.com>.

This has been elucidating, thanks!

On a related topic:

I need the ability to pull results lazily, so that I can decide whether 
to terminate the search iteration early, and ultimately I need to 
delegate that decision to callers of *my* API.

My first question is: did I overlook support for this in Lucene's 
user-facing API?  What I see in Lucene (Collector-based) seems to be all 
push-style; the Searcher retrieves all, or the top N results, stores 
them and returns them all at once.

Basically what I wanted was a method nextDoc() that I could call 
repeatedly to retrieve all of the docIDs returned by the search, or at 
any rate, as many as needed.

Not finding this, I wrote a class that would do.  What I ended up doing 
was subclassing DocIdSetIterator and copying some of the logic I saw in 
IndexSearcher (IIRC) (create a Weight, iterate over subReaders creating 
Scorers, retrieve docID from scorers, correcting for docBase offset).
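
The approach described above can be sketched roughly as follows, in plain
Java with no Lucene dependency. SegmentHits and LazyDocIterator are
hypothetical names made up for this illustration, and the per-segment arrays
stand in for what a real Scorer would produce; this is not the class from the
message, only a model of the docBase correction and the lazy pull loop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one index segment: the local docIDs that match
// the query (ascending), plus the segment's offset within the whole index.
final class SegmentHits {
    final int docBase;
    final int[] localDocs;

    SegmentHits(int docBase, int... localDocs) {
        this.docBase = docBase;
        this.localDocs = localDocs;
    }
}

// Pull-style iterator: each call to nextDoc() does only the work needed to
// find one more hit, so the caller can stop at any point.
final class LazyDocIterator {
    // Sentinel with the same value Lucene's DocIdSetIterator uses.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final List<SegmentHits> segments;
    private int segmentIndex = 0;
    private int positionInSegment = 0;

    LazyDocIterator(List<SegmentHits> segments) {
        this.segments = segments;
    }

    int nextDoc() {
        while (segmentIndex < segments.size()) {
            SegmentHits segment = segments.get(segmentIndex);
            if (positionInSegment < segment.localDocs.length) {
                // Convert the segment-local docID to an index-wide docID.
                return segment.docBase + segment.localDocs[positionInSegment++];
            }
            // This segment is exhausted; move on to the next one.
            segmentIndex++;
            positionInSegment = 0;
        }
        return NO_MORE_DOCS;
    }
}

final class LazyDocIteratorDemo {
    public static void main(String[] args) {
        LazyDocIterator it = new LazyDocIterator(Arrays.asList(
                new SegmentHits(0, 1, 4),     // segment starting at doc 0
                new SegmentHits(10, 0, 3)));  // segment starting at doc 10
        for (int doc = it.nextDoc(); doc != LazyDocIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc); // prints 1, 4, 10, 13 on separate lines
        }
    }
}
```

Because segments are visited in order and each one yields ascending local
docIDs, the index-wide docIDs also come out in ascending order in this model.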

So my second question is, assuming this isn't already provided 
somewhere, does it belong in IndexSearcher?  Is it worth posting a 
patch?  I'm a little concerned that because I ended up having to access 
some internal members marked as experimental (subReaders, I think?), 
this might end up not supported or having to track changes in Lucene's 
internal API.

Last question: what order are documents returned in if you create 
Scorers with ordered=true - is that always ascending docID order?

-Mike



Re: Query, Searcher, Weight, Similarity = ?

Posted by Robert Muir <rc...@gmail.com>.

On Fri, Jun 29, 2012 at 9:02 AM, Arjun Dhar <dh...@yahoo.com> wrote:
> Hi,
> I'm new and that is my disclaimer to the stupid question I am about to ask.
>
> Am trying to form a conceptual picture of the relation between Query <-->
> Weight <--> IndexReader, Scorer, Searcher <--> Similarity
>
> *From what I gather : (and someone please validate or correct me) *
> 1. We want *Queries* to be RE-USABLE instances hence *Weight* is a specific
> Query's state !?

Queries are independent of a Searcher. When a Query is executed, it
creates a Weight specifically for that searcher. The Weight contains
things like IDF computations: collection-wide state.

> 2. *Searcher* is STATEFUL, and though it processes a *Query*, the state for
> that *Searcher* is delegated to the WEIGHT !?

Searcher wraps an IndexReader (usually a composite IndexReader
containing multiple segments, like a DirectoryReader) to provide search
capabilities. It also has extension points that are search-specific:
one of these is Similarity, but there are others. For example, in 4.0
you can override methods to provide collection-wide stats where the
collection is distributed, consisting of indexes across multiple
machines.

> 3. *IndexReader* Reads an Index, and the *Searcher* uses the Reader to
> SEARCH, using a QUERY

yes.

> 4.  From the JavaDocs of Weight class ----> "IndexReader dependent state
> should reside in the Scorer. " -- Means, when *weights* are calculated, the
> final result of the Calculation goes into a STATEFUL object represented by
> the *Scorer* which is also Iterable !?

This could maybe be clarified to say per-segment state. So if you have
an IndexSearcher wrapping a DirectoryReader with 4 index segments, in
the typical case the Weight holds the state of the entire collection:
e.g. IDF across all 4 segments. The Weight creates 4 Scorers: a Scorer
for each segment in that DirectoryReader. Any per-segment information
such as the document length normalization ("norms") array resides in
each of those Scorers.
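
A toy model of that split, in plain Java with no Lucene dependency. Segment,
ToyWeight and ToyScorer are hypothetical names, and the scoring formula
(sqrt(tf) * idf * norm, with idf = 1 + ln(maxDoc / (docFreq + 1))) is only an
illustrative TF-IDF-style choice, not a claim about what any particular
Lucene Scorer computes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical per-segment data for a single query term.
final class Segment {
    final int maxDoc;       // number of documents in this segment
    final int[] termFreqs;  // term frequency per local docID (0 = no match)
    final float[] norms;    // per-document length normalization

    Segment(int[] termFreqs, float[] norms) {
        this.maxDoc = termFreqs.length;
        this.termFreqs = termFreqs;
        this.norms = norms;
    }

    int docFreq() {
        int count = 0;
        for (int tf : termFreqs) {
            if (tf > 0) count++;
        }
        return count;
    }
}

// Collection-wide state: computed once across all segments.
final class ToyWeight {
    final double idf;

    ToyWeight(List<Segment> segments) {
        int docFreq = 0;
        int maxDoc = 0;
        for (Segment s : segments) {
            docFreq += s.docFreq();
            maxDoc += s.maxDoc;
        }
        this.idf = 1.0 + Math.log((double) maxDoc / (docFreq + 1));
    }

    // One scorer per segment; all of them share this weight's idf.
    ToyScorer scorer(Segment segment) {
        return new ToyScorer(this, segment);
    }
}

// Per-segment state: the iteration position and the segment's norms.
final class ToyScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final ToyWeight weight;
    private final Segment segment;
    private int doc = -1;

    ToyScorer(ToyWeight weight, Segment segment) {
        this.weight = weight;
        this.segment = segment;
    }

    // Advances to the next matching local docID in this segment.
    int nextDoc() {
        while (++doc < segment.maxDoc) {
            if (segment.termFreqs[doc] > 0) return doc;
        }
        return NO_MORE_DOCS;
    }

    // Scores the current document; valid only after a successful nextDoc().
    double score() {
        return Math.sqrt(segment.termFreqs[doc]) * weight.idf * segment.norms[doc];
    }
}

final class ToyWeightDemo {
    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(new int[] {1, 0, 4}, new float[] {1f, 1f, 0.5f}),
                new Segment(new int[] {0, 9}, new float[] {1f, 0.25f}));
        ToyWeight weight = new ToyWeight(segments); // sees every segment
        for (Segment segment : segments) {
            ToyScorer scorer = weight.scorer(segment); // sees one segment
            for (int d = scorer.nextDoc(); d != ToyScorer.NO_MORE_DOCS; d = scorer.nextDoc()) {
                System.out.println("local doc " + d + " score " + scorer.score());
            }
        }
    }
}
```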

> 5. *Searcher* can be assigned a *Similarity* algorithm. ... hence using that
> algorithm, it calculates *Weight*, which eventually leads to the
> construction of an Iterable *Scorer* !?

A Similarity is a hook for term weighting. But term weighting is not
the entire scoring algorithm in many cases: Scorers don't have to use
Similarity to compute things: they can use whatever logic they want.

>
> 6. While indexing, it's simple: there is a direct relation between
> IndexWriterConfig <--> Similarity

This is for computing document length normalization information
("norms") at indexing time. Currently that's the only way that
IndexWriter interacts with Similarity.
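
As a rough illustration of that one interaction, here is a plain-Java sketch
with no Lucene dependency. ToySimilarity and ToyIndexWriter are hypothetical
names, whitespace splitting stands in for real analysis, and the
1/sqrt(numTerms) formula is the classic length normalization used here only
as an example; real norms are also encoded lossily, which this sketch skips.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook: the only thing the writer asks of it is a length norm.
interface ToySimilarity {
    float lengthNorm(int numTerms);
}

// Hypothetical writer that records one norm per added document.
final class ToyIndexWriter {
    private final ToySimilarity similarity;
    private final List<Float> norms = new ArrayList<>();

    ToyIndexWriter(ToySimilarity similarity) {
        this.similarity = similarity;
    }

    // Counts whitespace-separated terms and stores the resulting norm.
    // Returns the docID assigned to the new document.
    int addDocument(String fieldText) {
        String trimmed = fieldText.trim();
        int numTerms = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        norms.add(numTerms == 0 ? 0f : similarity.lengthNorm(numTerms));
        return norms.size() - 1;
    }

    float norm(int doc) {
        return norms.get(doc);
    }
}

final class ToyNormsDemo {
    public static void main(String[] args) {
        // Shorter fields get a larger norm, so a match in them counts for more.
        ToySimilarity classic = numTerms -> (float) (1.0 / Math.sqrt(numTerms));
        ToyIndexWriter writer = new ToyIndexWriter(classic);
        int shortDoc = writer.addDocument("lucene");
        int longDoc = writer.addDocument("lucene in action again");
        System.out.println(writer.norm(shortDoc)); // prints 1.0
        System.out.println(writer.norm(longDoc));  // prints 0.5
    }
}
```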

>
> +Q) Apart from the validation of my understanding, is there a Sequence
> Diagram explaining the process of calculation, during a Query?

Have a look at https://builds.apache.org/job/Lucene-trunk/javadoc/ and
click "Searching and Scoring in Lucene". I don't think there are any
diagrams there, but there is more information available.

>
> +Q) There are different implementations of Queries. Do they differ in how
> they mash up all the other stuff?
> Looks like if I mess with each of the other entities, I can produce pretty
> much whatever Query?!

See the link above for more information, especially the section on
writing custom queries.

-- 
lucidimagination.com
