Posted to java-user@lucene.apache.org by Yang <te...@gmail.com> on 2012/04/25 23:13:06 UTC

lucene algorithm ?

I read the paper by Doug, "Space optimizations for total ranking".

Since it was written a long time ago, I wonder what algorithms Lucene uses today
(regarding postings-list traversal, score calculation, and ranking).


In particular, the total-ranking algorithm described there needs to traverse
the entire postings list for every query term.
So in the case of very common query terms like "yellow dog", either of the two
terms may have a very long postings list in a web-search setting.
Are those lists really traversed in full in current Lucene/Solr, or are any
heuristics actually employed to truncate them?

In the case of returning the top-k results, I can understand that partitioning
the postings lists across multiple machines and then combining the top-k
from each would work.
But if we are required to return "the 100th result page", i.e. results
ranked 991st through 1000th, then each partition would still have to find
its own top 1000, so
partitioning would not help much.


Overall, is there any up-to-date, detailed documentation on the internal
algorithms of Lucene?

Thanks a lot
Yang

Re: lucene algorithm ?

Posted by Yang <te...@gmail.com>.
Thanks Ralf.

Basically you are talking about the selectivity of columns in a JOIN, right?


But in my example above, "yellow dog", both terms are very common, and both
have long postings lists.

Yang

Re: lucene algorithm ?

Posted by Ralf Heyde <ra...@gmx.de>.
Hi,

I don't know the exact implementation. All I can say is that you should take a look at query optimization in state-of-the-art database systems. The general solution is to evaluate first the part of a query that reduces the result set the most.

For example, suppose you search for a phrase like "I'm looking for a term" - each token has a selectivity (TF/IDF, histograms, e.g. a Zipf distribution):

Term / Frequency
I'm - frequent
looking - average
for - stopword (too frequent and maybe ignored)
a - stopword (too frequent and maybe ignored)
term - rare

So Lucene might select all documents that contain "term", then search that "small" result set for documents matching "looking", and so on.
For that purpose Lucene might store information like histograms and other statistics.
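The rarest-first filtering described above can be sketched in a few lines. This is illustrative only: plain sorted doc-id arrays and hypothetical names, not Lucene's actual data structures.

```java
import java.util.*;

// Sketch of the "rarest term first" heuristic: intersect sorted doc-id lists
// starting from the shortest (most selective) list, so the candidate set
// shrinks as fast as possible. Illustration only, not Lucene's implementation.
public class RarestFirstIntersection {

    public static List<Integer> intersect(List<int[]> postings) {
        // Most selective (shortest) postings list first.
        postings.sort(Comparator.comparingInt((int[] p) -> p.length));
        List<Integer> candidates = new ArrayList<>();
        for (int doc : postings.get(0)) candidates.add(doc);
        // Filter the shrinking candidate set against each longer list.
        for (int i = 1; i < postings.size(); i++) {
            int[] list = postings.get(i);  // sorted, so binary search works
            List<Integer> kept = new ArrayList<>();
            for (int doc : candidates) {
                if (Arrays.binarySearch(list, doc) >= 0) kept.add(doc);
            }
            candidates = kept;
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<int[]> postings = new ArrayList<>();
        postings.add(new int[]{1, 3, 5, 7, 9, 11});  // common term, long list
        postings.add(new int[]{3, 7});               // rare term, short list
        System.out.println(intersect(postings));     // [3, 7]
    }
}
```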

I hope this gives you a simple example of the kind of strategy Lucene might use.

Greets from Berlin,

Ralf



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: lucene algorithm ?

Posted by Yang <te...@gmail.com>.
Additionally, does anybody know roughly (of course the details are secret,
but I guess the main ideas should be
common knowledge these days) how Google does fast ranking for
multi-term queries with AND?
(If their postings are sorted in PageRank order, then it's understandable
that a single-term query would quickly return the top-k; but for a
multi-term query they would have to traverse the entire lists to find the
intersection set, because the lists would not be sorted by docId, as in the
Lucene paper's case.)




Re: lucene algorithm ?

Posted by Yang <te...@gmail.com>.
> yes, that's why many search engines will not allow users to visit page
> numbers greater than a threshold. for most applications, users usually
> only visit the top results. That's why the ranking algorithm is important. if
> you find your users always turn to the next page, you should
> reconsider your application: provide more filter conditions or
> improve the ranking algorithm.


Interesting: I just tried Google with "yellow dog" and asked it to retrieve from
around the 900th result. The access time was 0.6 seconds, while the first page
comes out in 0.33 seconds. (Of course the first-page result may well be cached,
but as I go from the first page to the 8th page, the time does
consistently increase.)


https://www.google.com/#q=yellow+dog&hl=en&prmd=imvns&ei=Q_maT433OombiQeB3dm1Dg&start=900&fp=a2cbb8d5ae567362

Re: lucene algorithm ?

Posted by Li Li <fa...@gmail.com>.
On Thu, Apr 26, 2012 at 5:13 AM, Yang <te...@gmail.com> wrote:
>
> I read the paper by Doug "Space optimizations for total ranking",
>
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)
>
>
> particularly the total ranking algorithm described there needs to traverse
> down the entire postings list for all the query terms,
> so in case of very common query terms like "yellow dog", either of the 2
> terms may have a very very long postings list in case of web search,
> are they all really traversed in current lucene/Solr ? or  any heuristics
> to truncate the list are actually employed?
You can read the related papers on early termination; they are closely
tied to the ranking algorithm. So far Lucene has done little in this
area, and the same goes for its ranking algorithm.
>
> in the case of returning top-k results, I can understand that partitioning
> the postings list into multiple machines, and then combining the  top-k
That's distributed searching; Solr has this ability. Even on a single
node, for a conjunction query (AND query), Lucene uses skip lists in the
postings to speed things up. For a disjunction query (OR query), Lucene uses
BooleanScorer rather than BooleanScorer2: BooleanScorer implements a TAAT (term-
at-a-time) algorithm, while BooleanScorer2 is DAAT (document-at-a-time).
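The DAAT conjunction mentioned above can be sketched roughly as follows. The advance(target) here is a plain linear scan standing in for the skip-list jump a real postings format provides, and all names are hypothetical rather than Lucene's API:

```java
import java.util.*;

// Document-at-a-time (DAAT) conjunction, as used for AND queries: each
// postings iterator exposes advance(target), which a real index accelerates
// with skip lists. The leapfrog control flow is the point of the sketch.
public class DaatConjunction {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    static class PostingsIterator {
        private final int[] docs;  // sorted doc ids
        private int pos = -1;

        PostingsIterator(int[] docs) { this.docs = docs; }

        int docID() { return pos < 0 ? -1 : pos < docs.length ? docs[pos] : NO_MORE_DOCS; }

        // Move to the first doc >= target (a skip list would make this a jump).
        int advance(int target) {
            do { pos++; } while (pos < docs.length && docs[pos] < target);
            return docID();
        }
    }

    public static List<Integer> intersect(PostingsIterator[] its) {
        List<Integer> hits = new ArrayList<>();
        int doc = its[0].advance(0);
        while (doc != NO_MORE_DOCS) {
            boolean agreed = true;
            for (int i = 1; i < its.length; i++) {
                int d = its[i].docID() < doc ? its[i].advance(doc) : its[i].docID();
                if (d != doc) {                  // leapfrog: chase the larger doc id
                    doc = its[0].advance(d);
                    agreed = false;
                    break;
                }
            }
            if (agreed) {                        // all iterators aligned on doc
                hits.add(doc);
                doc = its[0].advance(doc + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PostingsIterator[] its = {
            new PostingsIterator(new int[]{1, 4, 6, 9, 12}),
            new PostingsIterator(new int[]{2, 4, 9, 13}),
        };
        System.out.println(intersect(its));  // [4, 9]
    }
}
```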
> from each would work,
> but if we are required to return "the 100th result page", i.e. results
> ranked from 990--1000th, then each partition would still have to find out
> the top 1000, so
> partitioning would not help much.
>
yes, that's why many search engines will not allow users to visit page
numbers greater than a threshold. for most applications, users usually
only visit the top results. That's why the ranking algorithm is important. if
you find your users always turn to the next page, you should
reconsider your application: provide more filter conditions or
improve the ranking algorithm.

>
>
> overall, is there any up-to-date detailed docs on the internal algorithms
> of lucene?
If you can read Chinese, I recommend
http://www.cnblogs.com/forfuture1978/category/300665.htm. You may also
find some of my blog posts about lucene/solr at
blog.csdn.net/fancyerII (I am not a persistent person, and my plan of
writing blog posts about lucene/solr did not continue).
Anyhow, the source code is the best resource.
>
> Thanks a lot
> Yang



Re: lucene algorithm ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 26/04/2012 09:49, Uwe Schindler wrote:

> There are possibilities to truncate those lists, but this is not implemented
> in Lucene core. The main problem with Lucene's segmented index structure is,
> that you cannot early exit, because the very last document in the posting
> list could be one with a very high score. Techniques like index pruning or
> index sorting can optimize all this, but need index preprocessing and loose
> updatability of the index while in use.
>
> The Top-N result collection is optimized in Lucene (item #3), to throw away
> all collected documents, where the score is never be able to get into the
> top-n priority queue (too small score, once the queue is full). But the
> score has to be calculated.

This paper presents yet another option: keep a value that represents the
maximum impact of the postings in a block, and then skip whole blocks whose
max impact is too low to get any of the block's documents into the queue:

http://cis.poly.edu/suel/papers/bmw.pdf

Implementing this in Lucene would require changing the iterators so that
they can take a threshold value, i.e. the current lowest score in the
queue, so that nextDoc() and advance() could skip whole blocks whose
max impact is lower than the current lowest score.
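A minimal sketch of that block-max idea follows; it is a standalone illustration with hypothetical names, not a Lucene patch:

```java
import java.util.*;

// Block-max skipping in the spirit of the BMW paper: postings are stored in
// blocks, each annotated at index time with the maximum score ("impact") of
// any posting inside it. Given the current lowest score in the top-N queue,
// whole blocks whose max impact is below that threshold are skipped without
// scoring a single document.
public class BlockMaxSkipping {

    static class Block {
        final int[] docs;
        final float[] scores;
        final float maxImpact;  // precomputed at index time

        Block(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
            float max = 0f;
            for (float s : scores) max = Math.max(max, s);
            this.maxImpact = max;
        }
    }

    // Returns {blocks skipped on their annotation alone, docs above threshold}.
    public static int[] skippedAndHits(List<Block> blocks, float threshold) {
        int skipped = 0, hits = 0;
        for (Block b : blocks) {
            if (b.maxImpact <= threshold) { skipped++; continue; }  // skip whole block
            for (int i = 0; i < b.docs.length; i++) {
                if (b.scores[i] > threshold) hits++;
            }
        }
        return new int[]{skipped, hits};
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(new int[]{1, 2, 3}, new float[]{0.1f, 0.2f, 0.3f}),
            new Block(new int[]{4, 5, 6}, new float[]{0.9f, 0.1f, 0.8f}),
            new Block(new int[]{7, 8, 9}, new float[]{0.2f, 0.4f, 0.3f}));
        int[] r = skippedAndHits(blocks, 0.5f);
        System.out.println("blocks skipped: " + r[0] + ", hits: " + r[1]);  // skipped 2, hits 2
    }
}
```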

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: lucene algorithm ?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,
 
> I read the paper by Doug "Space optimizations for total ranking",
> 
> since it was written a long time ago, I wonder what algorithms lucene uses
> (regarding postings list traversal and score calculation, ranking)

The algorithms described in that paper are still in use to merge the postings
lists of several terms. The "total ranking" described there is implemented
in several key parts of Lucene:

#1 Section 4.1 (Parallel Merge) is done in BooleanScorer2.java
#2 Section 4.2 (Block Processing) is done in BooleanScorer.java
#3 Maintaining the queue of top-N results is implemented in
TopScoreDocCollector.java; the above scorer implementations call the
collector's collect(docid, score) method for each hit found.
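The term-at-a-time block processing in item #2 can be sketched roughly like this; it is a simplified illustration with hypothetical names, not BooleanScorer's actual code:

```java
import java.util.*;

// Term-at-a-time (TAAT) block processing: instead of merging iterators
// document by document, each term's postings are scored in turn into an array
// of score accumulators covering a window of doc ids; results are read out
// only after every term has contributed.
public class TaatBlockScoring {

    // Score the window [base, base + windowSize) of doc ids: every term adds
    // its per-document contribution into the shared accumulator array.
    public static float[] scoreWindow(List<Map<Integer, Float>> termPostings,
                                      int base, int windowSize) {
        float[] acc = new float[windowSize];
        for (Map<Integer, Float> postings : termPostings) {  // one term at a time
            for (Map.Entry<Integer, Float> e : postings.entrySet()) {
                int doc = e.getKey();
                if (doc >= base && doc < base + windowSize) {
                    acc[doc - base] += e.getValue();         // accumulate score
                }
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Map<Integer, Float>> terms = new ArrayList<>();
        terms.add(Map.of(0, 0.5f, 2, 0.25f));  // "yellow": doc id -> term score
        terms.add(Map.of(2, 0.5f, 3, 0.75f));  // "dog"
        System.out.println(Arrays.toString(scoreWindow(terms, 0, 4)));  // [0.5, 0.0, 0.75, 0.75]
    }
}
```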

> particularly the total ranking algorithm described there needs to traverse
> down the entire postings list for all the query terms, so in case of very
> common query terms like "yellow dog", either of the 2 terms may have a very
> very long postings list in case of web search, are they all really traversed
> in current lucene/Solr ? or any heuristics to truncate the list are actually
> employed?

There are possibilities to truncate those lists, but this is not implemented
in Lucene core. The main problem with Lucene's segmented index structure is
that you cannot exit early, because the very last document in a postings
list could be one with a very high score. Techniques like index pruning or
index sorting can optimize all this, but they need index preprocessing and
lose the updatability of the index while it is in use.

The top-N result collection is optimized in Lucene (item #3) to throw away
all collected documents whose score can never make it into the top-N
priority queue (too small a score, once the queue is full). But the score
still has to be calculated.
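That top-N collection can be sketched with a bounded min-heap; this is a simplified illustration, not TopScoreDocCollector's actual code:

```java
import java.util.*;

// A bounded min-heap keeps the N best (doc, score) pairs seen so far. Once
// the heap is full, any hit scoring no better than the current minimum is
// dropped immediately: its score was computed, but it never enters the queue.
public class TopNCollector {

    static class ScoreDoc {
        final int doc; final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    private final PriorityQueue<ScoreDoc> heap =
        new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));

    TopNCollector(int n) { this.n = n; }

    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new ScoreDoc(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                       // evict the current minimum
            heap.add(new ScoreDoc(doc, score));
        }                                      // else: discard the hit
    }

    // Doc ids in descending score order, as a search would return them.
    List<Integer> topDocs() {
        List<ScoreDoc> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        List<Integer> docs = new ArrayList<>();
        for (ScoreDoc sd : out) docs.add(sd.doc);
        return docs;
    }

    public static void main(String[] args) {
        TopNCollector c = new TopNCollector(2);
        c.collect(10, 0.3f); c.collect(11, 0.9f); c.collect(12, 0.5f); c.collect(13, 0.1f);
        System.out.println(c.topDocs());  // [11, 12]
    }
}
```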

> in the case of returning top-k results, I can understand that partitioning
> the postings list into multiple machines, and then combining the top-k from
> each would work, but if we are required to return "the 100th result page",
> i.e. results ranked from 990--1000th, then each partition would still have
> to find out the top 1000, so partitioning would not help much.

Yes, and because of that almost no search engine supports deep paging. Lucene
uses lots of memory when trying to do this, although there are new collector
implementations in Lucene that allow "optimized" deep paging (included in
item #3): if the top-N of the 99th result page has already been found, you
can throw away, while collecting page 100, all documents whose score is
higher than that of the last hit on page 99.
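A standalone sketch of that idea: collection for page 100 ignores everything ranking at or before the last hit of page 99, using doc id to break score ties. Names here are hypothetical; later Lucene versions expose a similar notion as IndexSearcher.searchAfter, but this sketch is independent of that API:

```java
import java.util.*;

// "Optimized" deep paging: given the last hit of the previous page (its score
// plus doc id as tie-break), discard every document ranking at or above it
// during collection, instead of keeping a queue of 1000 entries.
public class SearchAfterSketch {

    // true if (score, doc) ranks strictly after the previous page's last hit:
    // a lower score, or an equal score with a larger doc id.
    static boolean ranksAfter(float score, int doc, float afterScore, int afterDoc) {
        return score < afterScore || (score == afterScore && doc > afterDoc);
    }

    // Collect the next `pageSize` docs ranking after (afterScore, afterDoc);
    // scores[d] is the score of doc id d.
    static List<Integer> nextPage(float[] scores, float afterScore, int afterDoc, int pageSize) {
        List<Integer> candidates = new ArrayList<>();
        for (int doc = 0; doc < scores.length; doc++) {
            if (ranksAfter(scores[doc], doc, afterScore, afterDoc)) candidates.add(doc);
        }
        // Rank survivors by descending score, doc id as tie-break.
        candidates.sort((a, b) -> {
            int c = Float.compare(scores[b], scores[a]);
            return c != 0 ? c : Integer.compare(a, b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(pageSize, candidates.size())));
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.7f, 0.8f, 0.7f, 0.4f};  // score per doc id
        // Previous page ended at doc 1 (score 0.7); fetch the next 2 results.
        System.out.println(nextPage(scores, 0.7f, 1, 2));  // [3, 4]
    }
}
```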

> overall, is there any up-to-date detailed docs on the internal algorithms
> of lucene?

Reading the code is the best documentation. But the Javadocs also help; we
recently updated the documentation of Lucene's internals for the coming
version 4: https://builds.apache.org/job/Lucene-trunk/javadoc/

> Thanks a lot
> Yang

Uwe

