Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2015/08/24 03:27:10 UTC

Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)

There is a MultiPhraseQuery we use which looks a bit like:

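    MultiPhraseQuery query = new MultiPhraseQuery();
    // "field" stands in for the actual field name
    query.add(new Term[] { new Term("field", "first") });
    query.add(new Term[] { new Term("field", "second1"), new Term("field", "second2"), ... });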

The actual number of terms in this particular case is 207,087. The size
of the index itself is 21GB or so, with around 1,300,000 docs. Large
but not gigantic. I ran the test with 2GB of RAM which was certainly
enough for Lucene 3.

Although I do think that this is abusing MultiPhraseQuery and that
SpanQuery is probably a better fit, I think that back in Lucene 3,
there were problems with SpanQuery performance which resulted in
switching to this as a performance hack.

Anyway, we now get an OOME when running this query and the heap
histogram comes out sort of like this:

    Class                                       Instances         Size in bytes
    int[]                                       995,093 (5.2%)    617,539,592 (31.6%)
    byte[]                                      1,065,597 (5.6%)  434,990,616 (22.3%)
    DocIdSet[]                                  777,620 (4.1%)    149,303,040 (7.6%)
    Lucene50PostingsReader$BlockPostingsEnum    326,022 (1.7%)    67,486,554 (3.5%)
    Lucene50PostingsFormat$IntBlockTermState    621,265 (3.2%)    57,777,645 (3%)
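
A rough way to read the histogram is to divide the total size by the
instance count for each row. The sketch below does that for the first three
rows. It assumes the first number in each row is the instance count and the
second is the total size in bytes; the class and method names are made up
for illustration.

```java
// Average object size from the heap histogram figures quoted above.
// Assumption: per row, the first number is the instance count and the
// second is the total size in bytes.
class HistogramAverages {

    // Average bytes per instance for one histogram row.
    static double averageBytes(long instances, long totalBytes) {
        return (double) totalBytes / instances;
    }

    public static void main(String[] args) {
        System.out.printf("int[]      avg: %.1f bytes%n", averageBytes(995_093L, 617_539_592L));
        System.out.printf("byte[]     avg: %.1f bytes%n", averageBytes(1_065_597L, 434_990_616L));
        System.out.printf("DocIdSet[] avg: %.1f bytes%n", averageBytes(777_620L, 149_303_040L));
    }
}
```

The DocIdSet[] row divides evenly, at 192 bytes per array, which is
consistent with this reading of the columns. The int[] row works out to a
little over 620 bytes per array, so the memory is going to a very large
number of smallish buffers rather than a few huge ones.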

I went looking for the owner of these int arrays and it turns out to
be a postings reader which is ultimately (unsurprisingly) being held
by the MultiPhraseQuery.

What I'm wondering is:
- Why the increase in memory cost?
- Is our performance hack of using MultiPhraseQuery over SpanQuery
really warranted anymore?
- Is there a better way to do this particular query?

Also, just in case this is an X-Y problem, what we're actually
implementing here is simulating a large number of integer fields
without using a large number of fields. We index the name of the
sub-field followed by the value and then use this as a proximity query
to say "find values in range X to Y with the sub-field immediately in
front". This was done because there was some conventional wisdom
saying that having a large number of fields in Lucene is problematic,
although whether this still applies is unknown.
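
To make the trick concrete, here is a minimal sketch in plain Java with no
Lucene classes involved. It models a document as a list of tokens, expands a
value range into the distinct values that actually occur, and then checks
for a sub-field token immediately followed by one of those values. All names
(SubFieldSketch, tokensFor, expandRange, matches) are hypothetical, and the
real index and query code will differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Plain-Java illustration of the sub-field trick; no Lucene classes are used.
// Every name here is hypothetical.
class SubFieldSketch {

    // "Index" one sub-field: the sub-field name token, then the value token
    // right after it.
    static List<String> tokensFor(String subField, int value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(subField);
        tokens.add(Integer.toString(value));
        return tokens;
    }

    // Expand the inclusive range [low, high] into the distinct values that
    // actually occur. Each returned value stands for one term in the second
    // phrase position.
    static List<String> expandRange(TreeSet<Integer> knownValues, int low, int high) {
        List<String> terms = new ArrayList<>();
        for (int value : knownValues.subSet(low, true, high, true)) {
            terms.add(Integer.toString(value));
        }
        return terms;
    }

    // True if subField is immediately followed by one of the expanded terms.
    static boolean matches(List<String> docTokens, String subField, List<String> valueTerms) {
        for (int i = 0; i + 1 < docTokens.size(); i++) {
            if (docTokens.get(i).equals(subField)
                    && valueTerms.contains(docTokens.get(i + 1))) {
                return true;
            }
        }
        return false;
    }
}
```

With many distinct values inside the range, the expanded list grows
accordingly, which is presumably how the second position of the query ends
up with so many terms.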

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)

Posted by Trejkaz <tr...@trypticon.org>.

Thought I would try some thread necromancy here, because nobody
replied about this a year ago.

Now we're on 5.4.1 and the numbers have changed a bit again. These are the
best times recorded for each operation:

    Indexing: 5.723 s
    SpanQuery: 25.13 s
    MultiPhraseQuery: (waited 10 minutes and it hasn't completed yet)
    TermAutomatonQuery: 19.72 s

So it seems like span query performance is slightly better than it was
in 5.2, but MultiPhraseQuery is still no good, and TermAutomatonQuery
might be better than both.
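
For anyone wanting to reproduce "best time" figures like these, one simple
approach is to run each operation several times and keep the fastest run.
The helper below is only a sketch of that idea, not necessarily how the
numbers above were obtained; the class name, method name and run count are
arbitrary choices.

```java
// Best-of-N wall-clock timing. The helper name and run count are illustrative.
class BestTime {

    // Runs the task `runs` times and returns the fastest run in nanoseconds.
    static long bestOfNanos(int runs, Runnable task) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        long best = bestOfNanos(5, () -> {
            // Placeholder workload; a real run would execute the query here.
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.printf("best: %.3f s%n", best / 1e9);
    }
}
```

Taking the minimum rather than the mean reduces the influence of JIT warm-up
and GC pauses on the reported figure, at the cost of hiding how variable the
runs were.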

TX

Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)

Posted by Trejkaz <tr...@trypticon.org>.

I spent some time carving out a quick test of the bits that matter and
put them up here:
https://gist.github.com/trejkaz/a72b87277b1aec800c2e

The tests index 1,000,000 docs with just one instance of the
field/sub-field trick we're using, plus one unique value. So it's a
bit of an artificial test, but benchmarks tend to be like that.

Times for Lucene 3.6:
    Indexing: 3.365 s
    SpanQuery: 20.48 s
    MultiPhraseQuery: 9.641 s

Times for Lucene 5.2:
    Indexing: 4.423 s
    SpanQuery: 31.94 s
    MultiPhraseQuery: (never completes due to OOME)

An aside which is totally a red herring: it seems there is quite a bit
of slowdown on indexing and SpanQuery as well, which makes me wonder
whether I have incorrectly configured the FieldType when compared with
how the same field was indexed for 3.6.

You can also see from these numbers how MultiPhraseQuery used to be
much faster than SpanQuery, which was why we stopped using SpanQuery
for this particular query in the first place.

Timings aside, MultiPhraseQuery used to complete but now gets an OOME
when provided 2GB of RAM for this particular case.

I also tried hacking together a TermAutomatonQuery to see what
happened with that, and it gets an OOME as well.

TX
