You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Yang <te...@gmail.com> on 2011/04/08 06:17:31 UTC

some basic questions on how Lucene/search engines work

I'm new to lucene/search engine , and have been struggling with these
questions recently.
I'd appreciate a lot of you could shed some light on this.


let's say I do a query on

dog   greyhound

note that I did not quote them, i.e. this is not a phrase search.

what happens under the hood ?

which term does Lucene use to look up the inverted Index ?
I read somewhere that Lucene uses the term with the higher IDF (i.e.
the more distinguishing term), i.e. in this case
"greyhound", but what about dog? does Lucene traverse down the doclist
of  "dog" at all? if I provide multiple terms in my query,
generally how does Lucene decide how many doclists to travel down?


I read that Lucene uses a combination of "binary model" and  VSM, then
it seems that in the above case, it finds
the full doclist of dog , and that of "greyhound", (the binary model
part), then find the common docs from the two doclists,
then order them by scores ( the VSM part).  is it true that the FULL
doclists are fetched first? or is some pruning done on the individual
doclists? I see the
talk in http://www.slideshare.net/abial/eurocon2010  that talks about
pruning and tiered search, but is this the default behavior of Lucene?
how are the doclists sorted? (by  idf ?? --- sorry I'm just beginning
to sift through a lot of docs online, somehow got this impression but
can't form a precise conclusion)



also generally, could you please provide some good articles on how
lucene/search engines work? I've read the "anatomy of a search engine"
(google Sergey Brin & Larry Page paper),
"introduction to information retrieval (Manning et al ) "  , "Lucene
in action" ....


Thanks
Yang

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: some basic questions on how Lucene/search engines work

Posted by Yang <te...@gmail.com>.

thanks a lot for the detailed info!

On Wed, Apr 13, 2011 at 4:43 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On Apr 7, 2011, at 9:17 PM, Yang wrote:
>
>> I'm new to lucene/search engine , and have been struggling with these
>> questions recently.
>> I'd appreciate a lot of you could shed some light on this.
>>
>>
>> let's say I do a query on
>>
>> dog   greyhound
>>
>> note that I did not quote them, i.e. this is not a phrase search.
>>
>> what happens under the hood ?
>
> A lot and it is really beyond the scope of an email, but...
>
> Assuming dog and greyhound are not removed by analysis, it will go to the inverted index and look up each of the terms and the documents for each of the terms and score them (see BooleanQuery for how it combines them).  I would suggest setting up a really simple program that does this query and then just step through the code.  You may even want to start a little simpler with a single term.
>
>>
>> which term does Lucene use to look up the inverted Index ?
>
> Both
>
>
>> I read somewhere that Lucene uses the term with the higher IDF (i.e.
>> the more distinguishing term), i.e. in this case
>> "greyhound", but what about dog? does Lucene traverse down the doclist
>> of  "dog" at all?
>
> Yes, it does.
>
>> if I provide multiple terms in my query,
>> generally how does Lucene decide how many doclists to travel down?
>>
>
> I don't believe we currently do any pruning, so all will be visited.
>
>>
>> I read that Lucene uses a combination of "binary model"
>
> I would say "Boolean Model"
>
>> and  VSM, then
>> it seems that in the above case, it finds
>> the full doclist of dog , and that of "greyhound", (the binary model
>> part), then find the common docs from the two doclists,
>> then order them by scores ( the VSM part).
>
> Yes, assuming an AND combines them.  If it's an OR, then it will union the lists.
>
>>  is it true that the FULL
>> doclists are fetched first? or is some pruning done on the individual
>> doclists?
>
> All docs that contain the word are scored and can potentially be returned if you have enough time/memory (not recommended)
>
>> I see the
>> talk in http://www.slideshare.net/abial/eurocon2010  that talks about
>> pruning and tiered search, but is this the default behavior of Lucene?
>
> That is more advanced and is something you can do w/ Lucene, but is not out of the box.
>
>> how are the doclists sorted? (by  idf ?? --- sorry I'm just beginning
>> to sift through a lot of docs online, somehow got this impression but
>> can't form a precise conclusion)
>>
>
> It depends on the query, etc.  Have a look at the Scoring page on the website for more info.
>
>>
>>
>> also generally, could you please provide some good articles on how
>> lucene/search engines work? I've read the "anatomy of a search engine"
>> (google Sergey Brin & Larry Page paper),
>> "introduction to information retrieval (Manning et al ) "  , "Lucene
>> in action" ....
>
> I'd start w/ Lucene in Action 2nd ed.  Brin and Page paper is good.  As is the Manning book, Baeza Yates, Grossman, etc.  I believe we have a resources page on our Wiki that lists out a lot of books and talks.
>
> I would recommend, however, just trying out small examples of different queries and stepping through the code if you really want to know the guts of Lucene.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: some basic questions on how Lucene/search engines work

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 7, 2011, at 9:17 PM, Yang wrote:

> I'm new to lucene/search engine , and have been struggling with these
> questions recently.
> I'd appreciate a lot of you could shed some light on this.
> 
> 
> let's say I do a query on
> 
> dog   greyhound
> 
> note that I did not quote them, i.e. this is not a phrase search.
> 
> what happens under the hood ?

A lot and it is really beyond the scope of an email, but...

Assuming dog and greyhound are not removed by analysis, it will go to the inverted index and look up each of the terms and the documents for each of the terms and score them (see BooleanQuery for how it combines them).  I would suggest setting up a really simple program that does this query and then just step through the code.  You may even want to start a little simpler with a single term.

> 
> which term does Lucene use to look up the inverted Index ?

Both

> I read somewhere that Lucene uses the term with the higher IDF (i.e.
> the more distinguishing term), i.e. in this case
> "greyhound", but what about dog? does Lucene traverse down the doclist
> of  "dog" at all?

Yes, it does.

> if I provide multiple terms in my query,
> generally how does Lucene decide how many doclists to travel down?
> 

I don't believe we currently do any pruning, so all will be visited.

> 
> I read that Lucene uses a combination of "binary model"

I would say "Boolean Model"

> and  VSM, then
> it seems that in the above case, it finds
> the full doclist of dog , and that of "greyhound", (the binary model
> part), then find the common docs from the two doclists,
> then order them by scores ( the VSM part).

Yes, assuming an AND combines them.  If it's an OR, then it will union the lists.  

>  is it true that the FULL
> doclists are fetched first? or is some pruning done on the individual
> doclists?

All docs that contain the word are scored and can potentially be returned if you have enough time/memory (not recommended)

> I see the
> talk in http://www.slideshare.net/abial/eurocon2010  that talks about
> pruning and tiered search, but is this the default behavior of Lucene?

That is more advanced and is something you can do w/ Lucene, but is not out of the box.

> how are the doclists sorted? (by  idf ?? --- sorry I'm just beginning
> to sift through a lot of docs online, somehow got this impression but
> can't form a precise conclusion)
> 

It depends on the query, etc.  Have a look at the Scoring page on the website for more info.

> 
> 
> also generally, could you please provide some good articles on how
> lucene/search engines work? I've read the "anatomy of a search engine"
> (google Sergey Brin & Larry Page paper),
> "introduction to information retrieval (Manning et al ) "  , "Lucene
> in action" ....

I'd start w/ Lucene in Action 2nd ed.  Brin and Page paper is good.  As is the Manning book, Baeza Yates, Grossman, etc.  I believe we have a resources page on our Wiki that lists out a lot of books and talks.

I would recommend, however, just trying out small examples of different queries and stepping through the code if you really want to know the guts of Lucene.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org