You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2015/08/07 08:30:19 UTC

Mapping doc values back to doc ID (in decent time)

Hi all.

It's that time again.

I'm trying to kill off our long-standing reliance on stable doc IDs.
To that end, I am adding an additional field which contains the ID.
But we use these IDs a lot and for all kinds of purposes, and in some
of these purposes, many lookups are done at once, so performance
starts to matter.

A good example of this is the use case of doing a query in some
external source to get matching IDs, which then have to be converted
to DocIdSet to use them as a Filter.

For doc ID -> our ID, it's simple.

    NumericDocValues values = MultiDocValues.getNumericValues(reader, "doc-id");
    for (int docId = 0; docId < count; docId++)
    {
       int ourId = values.get(docId); // only ever one value
    }

This is pretty quick. 106.7 ms for 10 million lookups.

For our ID -> doc ID, I can't figure out how to get decent speed.
IndexSearcher is phenomenally slow (as expected), so I tried working
with Terms and PostingsEnum directly:

    Terms terms = MultiFields.getTerms(reader, "doc-id");
    assertNotNull(terms);
    BytesRefBuilder builder = new BytesRefBuilder();
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postingsEnum = null;

    for (int ourId = 0; ourId < count; ourId++)
    {
        builder.clear();
        NumericUtils.longToPrefixCoded(ourId, 0, builder);
        termsEnum.seekExact(builder.get());
        postingsEnum = termsEnum.postings(null, postingsEnum);
        int docId = postingsEnum.nextDoc(); // only ever one value
    }

This is slow - 38 seconds for the same items. Of course, I tried using
MemoryPostingsFormat to speed this up. It did help a little, bringing
the time down to around 10 seconds.

But that is still slow. The SQL query returning the same items,
iterating the result set to pull the results out and stuffing them
into the bit set takes 260ms. I'd rather not map them to actual doc
IDs if it's going to make queries an order of magnitude slower.

If I look at what Lucene is doing, most of the time seems to go here:

    at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:1097)
    at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1271)
    at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1195)
    at org.apache.lucene.util.fst.FSTEnum.doSeekExact(FSTEnum.java:441)
    at org.apache.lucene.util.fst.BytesRefFSTEnum.seekExact(BytesRefFSTEnum.java:84)
    at org.apache.lucene.codecs.memory.MemoryPostingsFormat$FSTTermsEnum.seekExact(MemoryPostingsFormat.java:800)
    at org.apache.lucene.index.MultiTermsEnum.seekExact(MultiTermsEnum.java:159)

The next approach on my list is the sledgehammer - walk the entire
DocValues, build the reverse mapping myself and stuff a cache of that
somewhere tied to the LeafReader. Maybe try to implement PostingsEnum
and Terms for that and then put it on our reader which wraps the real
one.

Before I resort to that, though, I want to check whether Lucene has a
better way to do this than what I can find by reading the docs.
I had a look at SortedNumericDocValues but the sorting it has doesn't
appear to be the sort which I would be able to use for this.

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Mapping doc values back to doc ID (in decent time)

Posted by András Péteri <ap...@b2international.com>.

If I understand it correctly, the Zoie library [1][2] implements the
"sledgehammer" approach by collecting docValues for all documents when a
segment reader is opened. If you have some RAM to throw at the problem,
this could indeed bring you an acceptable level of performance.

[1] http://senseidb.github.io/zoie/
[2]
https://github.com/senseidb/zoie/blob/master/zoie-core/src/main/java/proj/zoie/api/impl/DocIDMapperImpl.java

On Sun, Aug 9, 2015 at 9:41 AM, Trejkaz <tr...@trypticon.org> wrote:

> On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand <jp...@gmail.com> wrote:
> > Does your application actually iterate in order over dense ids, or is
> > it just for benchmarking purposes? Because if it does, you probably
> > don't actually need seeking, you could just see what the current ID in
> > the terms enum is.
>
> Both dense ID fetches and individual ID fetches exist in the
> application. I put them in a benchmark deliberately doing it as
> individual fetches to get an idea of average timing for a single
> operation.
>
> There are so many use cases of doing the individual fetches that it's
> tough to enumerate. The first one I found was "fetch the term vector
> for ID + field" but I'm sure there will be tons of them.
>
> For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
> would probably use something like DocValuesTermsQuery for that to get
> them all in one shot. I also wondered whether writing our filters as
> queries would help, but I think it would turn out to be about as fast
> as DocValuesTermsQuery even if I did that.
>
> I'm sure the only way to really improve the speed of these filters is
> to start storing these things in the text index and use query-time
> joins, but I can't do that until I solve the issue of relying on
> stable doc IDs and it seems like trying to solve two large problems in
> a single commit would be biting off more than I can chew.
>
> > If you actually need seeking, then you should try
> > to avoid MultiFields, it will call seedExact on each segment, while
> > given what I see you could just stop after you found one segment with
> > the value.
>
> Ah, I did wonder whether MultiFields had any behaviour like that, so
> that definitely means that I will avoid using it. Then I can try other
> tricks, like trying the seeks in order of segment size (the largest
> segment is most likely to contain the hit.)
>
> TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
-- 
András

Re: Mapping doc values back to doc ID (in decent time)

Posted by Trejkaz <tr...@trypticon.org>.

On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand <jp...@gmail.com> wrote:
> Does your application actually iterate in order over dense ids, or is
> it just for benchmarking purposes? Because if it does, you probably
> don't actually need seeking, you could just see what the current ID in
> the terms enum is.

Both dense ID fetches and individual ID fetches exist in the
application. I put them in a benchmark deliberately doing it as
individual fetches to get an idea of average timing for a single
operation.

There are so many use cases of doing the individual fetches that it's
tough to enumerate. The first one I found was "fetch the term vector
for ID + field" but I'm sure there will be tons of them.

For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
would probably use something like DocValuesTermsQuery for that to get
them all in one shot. I also wondered whether writing our filters as
queries would help, but I think it would turn out to be about as fast
as DocValuesTermsQuery even if I did that.

I'm sure the only way to really improve the speed of these filters is
to start storing these things in the text index and use query-time
joins, but I can't do that until I solve the issue of relying on
stable doc IDs and it seems like trying to solve two large problems in
a single commit would be biting off more than I can chew.

> If you actually need seeking, then you should try
> to avoid MultiFields, it will call seedExact on each segment, while
> given what I see you could just stop after you found one segment with
> the value.

Ah, I did wonder whether MultiFields had any behaviour like that, so
that definitely means that I will avoid using it. Then I can try other
tricks, like trying the seeks in order of segment size (the largest
segment is most likely to contain the hit.)

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Mapping doc values back to doc ID (in decent time)

Posted by Adrien Grand <jp...@gmail.com>.

On Fri, Aug 7, 2015 at 8:30 AM, Trejkaz <tr...@trypticon.org> wrote:
>     for (int ourId = 0; ourId < count; ourId++)
>     {
>         builder.clear();
>         NumericUtils.longToPrefixCoded(ourId, 0, builder);
>         termsEnum.seekExact(builder.get());
>         postingsEnum = termsEnum.postings(null, postingsEnum);
>         int docId = postingsEnum.nextDoc(); // only ever one value
>     }

Does your application actually iterate in order over dense ids, or is
it just for benchmarking purposes? Because if it does, you probably
don't actually need seeking, you could just see what the current ID in
the terms enum is. If you actually need seeking, then you should try
to avoid MultiFields, it will call seedExact on each segment, while
given what I see you could just stop after you found one segment with
the value.

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org