You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Sriram Sankar <sa...@gmail.com> on 2013/06/13 04:56:37 UTC

posting list traversal code

Can someone point me to the code that traverses the posting lists?  I
trying to understand how it works.

Thanks,

Sriram

Re: posting list traversal code

Posted by Adrien Grand <jp...@gmail.com>.

On Thu, Jun 13, 2013 at 7:56 PM, Sriram Sankar <sa...@gmail.com> wrote:
> Thank you very much.  I think I need to play a bit with the code before
> asking more questions.  Here is the context for my questions:
>
> I was at Facebook until recently and worked extensively on the Unicorn
> search backend.  Unicorn allows documents to be ordered by a static rank in
> the posting lists and retrieval (based on a query) will therefore return
> documents in decreasing order of importance (by static rank).  We only
> retrieved a limited number of these documents - so static rank is the first
> filter.  These documents are scored which becomes the second filter.  This
> two level filtering ended up working very well.
>
> I am trying to see how we can do the same with Lucene.  It looks like we
> can build an index offline (say in Hadoop) and use SortingAtomicReader,
> etc. to get the posting lists in static rank order.  Then everything works
> great. However, as we update the index with new data, we end up with
> additional segments.

The use-case you are describing is why we built SortingMergePolicy:
this merge policy sorts segments at merging time. The consequence is
that segments resulting from a flush are unsorted while segments
resulting from a merge are sorted. This is an interesting property
because one can then early-terminate collection on the large segments
(which necessarily result from a merge) which are the most costly to
collect.

> To get the same functionality as Unicorn, it looks
> like we will have to sort these additional segments by the same static
> rank, and then traverse the posting lists in these multiple segments
> simultaneously (rather than segment by segment).  I am trying to see how
> easy this will be to achieve.

Traversing the postings lists simultaneously to collect documents by
strict increasing rank is an option, but segment-by-segment collection
might be faster (depending on the number of segments to collect, the
likeliness of new documents to have better ranks, etc.): although you
will over-collect some segments, it is no more needed to make a
decision after each collected document to know which segment to
collect. I think both options are worth trying.

> To get more specific, the all documents in Unicorn have two attributes - an
> id (called fbid) which is unique and never changes - so for example the
> fbid for the Lucene page is 104032849634513 (so you can always get it as
> www.facebook.com/104032849634513), and a static rank.  The static rank may
> not be unique (multiple docs may share the same static rank, in which case
> we could order them arbitrarily so long as they are ordered the same way in
> all posting lists).

SortingAtomicReader and SortingMergePolicy ensure this as well by
using a stable sorting algorithm to sort documents. Documents which
have the same rank will remain in index order through the sorting
process.

> It looks like we will need an fbid to docid mapping in Lucene for each
> segment so that the simultaneous calls to advance() in each segment can be
> in sync.

Although I understand why you would need a docid -> fbid mapping
(likely through a stored field or a doc values field), I'm not sure to
understand why you need a fbid -> docid mapping.

> This is how far I've thought things out - let me play with the code a bit
> and I may have more questions in a couple of days.

You are welcome. Online sorting is rather new in Lucene so there are
likely many things that could be improved!

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: posting list traversal code

Posted by Sriram Sankar <sa...@gmail.com>.

Thank you very much.  I think I need to play a bit with the code before
asking more questions.  Here is the context for my questions:

I was at Facebook until recently and worked extensively on the Unicorn
search backend.  Unicorn allows documents to be ordered by a static rank in
the posting lists and retrieval (based on a query) will therefore return
documents in decreasing order of importance (by static rank).  We only
retrieved a limited number of these documents - so static rank is the first
filter.  These documents are scored which becomes the second filter.  This
two level filtering ended up working very well.

I am trying to see how we can do the same with Lucene.  It looks like we
can build an index offline (say in Hadoop) and use SortingAtomicReader,
etc. to get the posting lists in static rank order.  Then everything works
great.  However, as we update the index with new data, we end up with
additional segments.  To get the same functionality as Unicorn, it looks
like we will have to sort these additional segments by the same static
rank, and then traverse the posting lists in these multiple segments
simultaneously (rather than segment by segment).  I am trying to see how
easy this will be to achieve.

To get more specific, the all documents in Unicorn have two attributes - an
id (called fbid) which is unique and never changes - so for example the
fbid for the Lucene page is 104032849634513 (so you can always get it as
www.facebook.com/104032849634513), and a static rank.  The static rank may
not be unique (multiple docs may share the same static rank, in which case
we could order them arbitrarily so long as they are ordered the same way in
all posting lists).

It looks like we will need an fbid to docid mapping in Lucene for each
segment so that the simultaneous calls to advance() in each segment can be
in sync.

This is how far I've thought things out - let me play with the code a bit
and I may have more questions in a couple of days.

Thanks again for all your help!

Srirarm.

On Thu, Jun 13, 2013 at 12:20 AM, Adrien Grand <jp...@gmail.com> wrote:

> Hi,
>
> On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov <do...@gmail.com> wrote:
> > Document id on the index level is offset of the document in the index.
> It can change over time for the same document, for example when merging
> several segments. They are also stored in order in posting lists. This
> allows fast posting list intersection. Some Lucene API's explicitly state
> that they operate on the document ids in order (like TermDocs), some allows
> out of order processing (like Collector). So it really depends.
> >
> > In case of SortingAtomicReader, as far as I know, it calculate document
> permutation, which allows to have sorted docIDs on the output. So, it
> basically relabel documents.
>
> This is correct. The org.apache.lucene.index.sorter.Sorter.sort method
> computes a permutation of the doc IDs which makes doc IDs sorted
> according to the sort order. SortingAtomicReader is just a view over
> an AtomicReader which uses this permutation to relabel doc IDs and
> give the impression that the index is sorted. But this class is not
> very interesting by itself can can be very slow to decode postings:
> for each term it needs to load all postings into memory and sort them
> before returning an enumeration of the doc IDs (see the
> SortingDocsEnum class), it is only useful to sort indices offline with
> IndexWriter.addIndexes or online with SortingMergePolicy.
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: posting list traversal code

Posted by Adrien Grand <jp...@gmail.com>.

Hi,

On Thu, Jun 13, 2013 at 8:24 AM, Denis Bazhenov <do...@gmail.com> wrote:
> Document id on the index level is offset of the document in the index. It can change over time for the same document, for example when merging several segments. They are also stored in order in posting lists. This allows fast posting list intersection. Some Lucene API's explicitly state that they operate on the document ids in order (like TermDocs), some allows out of order processing (like Collector). So it really depends.
>
> In case of SortingAtomicReader, as far as I know, it calculate document permutation, which allows to have sorted docIDs on the output. So, it basically relabel documents.

This is correct. The org.apache.lucene.index.sorter.Sorter.sort method
computes a permutation of the doc IDs which makes doc IDs sorted
according to the sort order. SortingAtomicReader is just a view over
an AtomicReader which uses this permutation to relabel doc IDs and
give the impression that the index is sorted. But this class is not
very interesting by itself can can be very slow to decode postings:
for each term it needs to load all postings into memory and sort them
before returning an enumeration of the doc IDs (see the
SortingDocsEnum class), it is only useful to sort indices offline with
IndexWriter.addIndexes or online with SortingMergePolicy.

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: posting list traversal code

Posted by Denis Bazhenov <do...@gmail.com>.

Document id on the index level is offset of the document in the index. It can change over time for the same document, for example when merging several segments. They are also stored in order in posting lists. This allows fast posting list intersection. Some Lucene API's explicitly state that they operate on the document ids in order (like TermDocs), some allows out of order processing (like Collector). So it really depends.

In case of SortingAtomicReader, as far as I know, it calculate document permutation, which allows to have sorted docIDs on the output. So, it basically relabel documents.

On Jun 13, 2013, at 4:38 PM, Sriram Sankar <sa...@gmail.com> wrote:

> Thanks Denis.  I've been looking at the code in more detail now.  I'm
> interested in how the new SortingAtomicReader works.  Suppose I build an
> index and sort the documents using my own sorting function - as shown in
> the docs:
> 
> AtomicReader sortingReader = new SortingAtomicReader(reader, sorter);
> 
> writer.addIndexes(sortingReader);
> 
> When the docs are sorted using my function, I assume the docids are not
> going to be in order any more?  Unless the docids change to maintain the
> sorted order.
> 
> If you look at the code in (for example) ConjunctionScorer.doNext(doc),
> what is the "doc" that gets used here?  If it is the docid (and they are
> out of order), this method will not work.  So either the docids have to be
> in order, or the "doc" here is some other number that defines the position
> of the document in the posting list.
> 
> I'm trying to read the code to understand this - I'd really appreciate
> someone with more indepth knowledge of this explaining this and also
> pointing me to somewhere in the code where the magic happens.
> 
> Thanks,
> 
> Sriram.
> 
> 
> 
> 
> On Wed, Jun 12, 2013 at 9:33 PM, Denis Bazhenov <do...@gmail.com> wrote:
> 
>> I'm not quite sure, what you really need. But as far as I understand, you
>> want to get all document id's for a given term. If so, the following code
>> will work for you:
>> 
>> Term term = new Term("fieldName", "fieldValue");
>> TermDocs termDocs = indexReader.termDocs(term);
>> while (termDocs.next()) {
>>        int docId = termDocs.doc();
>>        // work with the document...
>> }
>> On Jun 13, 2013, at 1:56 PM, Sriram Sankar <sa...@gmail.com> wrote:
>> 
>>> Can someone point me to the code that traverses the posting lists?  I
>>> trying to understand how it works.
>>> 
>>> Thanks,
>>> 
>>> Sriram
>> 
>> ---
>> Denis Bazhenov <do...@gmail.com>
>> 
>> 
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 

---
Denis Bazhenov <do...@gmail.com>






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: posting list traversal code

Posted by Sriram Sankar <sa...@gmail.com>.

Thanks Denis.  I've been looking at the code in more detail now.  I'm
interested in how the new SortingAtomicReader works.  Suppose I build an
index and sort the documents using my own sorting function - as shown in
the docs:

AtomicReader sortingReader = new SortingAtomicReader(reader, sorter);

writer.addIndexes(sortingReader);

When the docs are sorted using my function, I assume the docids are not
going to be in order any more?  Unless the docids change to maintain the
sorted order.

If you look at the code in (for example) ConjunctionScorer.doNext(doc),
what is the "doc" that gets used here?  If it is the docid (and they are
out of order), this method will not work.  So either the docids have to be
in order, or the "doc" here is some other number that defines the position
of the document in the posting list.

I'm trying to read the code to understand this - I'd really appreciate
someone with more indepth knowledge of this explaining this and also
pointing me to somewhere in the code where the magic happens.

Thanks,

Sriram.

On Wed, Jun 12, 2013 at 9:33 PM, Denis Bazhenov <do...@gmail.com> wrote:

> I'm not quite sure, what you really need. But as far as I understand, you
> want to get all document id's for a given term. If so, the following code
> will work for you:
>
> Term term = new Term("fieldName", "fieldValue");
> TermDocs termDocs = indexReader.termDocs(term);
> while (termDocs.next()) {
>         int docId = termDocs.doc();
>         // work with the document...
> }
> On Jun 13, 2013, at 1:56 PM, Sriram Sankar <sa...@gmail.com> wrote:
>
> > Can someone point me to the code that traverses the posting lists?  I
> > trying to understand how it works.
> >
> > Thanks,
> >
> > Sriram
>
> ---
> Denis Bazhenov <do...@gmail.com>
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: posting list traversal code

Posted by Denis Bazhenov <do...@gmail.com>.

I'm not quite sure, what you really need. But as far as I understand, you want to get all document id's for a given term. If so, the following code will work for you:

Term term = new Term("fieldName", "fieldValue");
TermDocs termDocs = indexReader.termDocs(term);
while (termDocs.next()) {
	int docId = termDocs.doc();
	// work with the document...
}
On Jun 13, 2013, at 1:56 PM, Sriram Sankar <sa...@gmail.com> wrote:

> Can someone point me to the code that traverses the posting lists?  I
> trying to understand how it works.
> 
> Thanks,
> 
> Sriram

---
Denis Bazhenov <do...@gmail.com>






---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org