You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Stamatis Zampetakis <za...@gmail.com> on 2019/04/01 14:04:56 UTC

Clarification regarding BlockTree implementation of IntersectTermsEnum

Hi all,

I am currently working on improving the performance of range queries on
strings. I've noticed that using TermRangeQuery with low-selective queries
is a very bad idea in terms of performance but I cannot clearly explain why
since it seems related with how the IntersectTermsEnum#next method is
implemented.

The Javadoc of the class says that the terms index (the burst-trie
datastructure) is not used by this implementation of TermsEnum. However,
when I see the implementation of the next method I get the impression that
this is not accurate. Aren't we using the trie structure to skip parts of
the data when  the automaton states do not match?

Can somebody provide a high-level intutition of what
IntersectTermsEnum#next does? Initially, I thought that it is traversing
the whole trie structure (skipping some branches when necessary) but I may
be wrong.

Thanks in advance,
Stamatis

RE: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

in fact I was also wondering why the TermRangeQuery is now a subclass of AutomatonQuery, but this was changed for the reasons that Robert mentioned in his e-mail (https://issues.apache.org/jira/browse/LUCENE-5879). For sure, the easiest is to just start and seek to the first term in the enum, then iterate over all terms and stop when you reach the last term.

The problem with TermRangeQueries is actually not the iteration over the term index. The slowness comes from the fact that all terms between start and end have to be iterated and their postings be fetched and those postings be merged together. If the "source of terms" for doing this is just a simple linear iteration of all terms from/to or the automaton intersection does not really matter for the query execution. The change to prefer the automaton instead of a simple term iteration is just to allow further optimizations, for more info see https://issues.apache.org/jira/browse/LUCENE-5879

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Robert Muir <rc...@gmail.com>
> Sent: Monday, April 1, 2019 6:30 PM
> To: java-user <ja...@lucene.apache.org>
> Subject: Re: Clarification regarding BlockTree implementation of
> IntersectTermsEnum
> 
> The regular TermsEnum is really designed for walking terms in linear order.
> it does have some ability to seek/leapfrog. But this means paths in a query
> automaton that match no terms result in a wasted seek and cpu, because the
> api is designed to return the next term after regardless.
> 
> On the other hand the intersect() is for intersecting two automata: query
> and index. Presumably it can also remove more inefficiencies than just the
> wasted seeks for complex wildcards and fuzzies and stuff, since it can
> "see" the whole input as an automaton. so for example it might be able to
> work on blocks of terms at a time and so on.
> 
> On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> 
> > Yes it is used.
> >
> > I think there are simpler and possibly more efficient ways to implement a
> > TermRangeQuery and that is why I am looking into this.
> > But I am also curious to understand what IntersectTermsEnum is supposed
> to
> > do.
> >
> > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <rc...@gmail.com>
> > έγραψε:
> >
> > > Is this IntersectTermsEnum really being used for term range query?
> Seems
> > > like using a standard TermsEnum, seeking to the start of the range, then
> > > calling next until the end would be easier.
> > >
> > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> <za...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am currently working on improving the performance of range queries
> on
> > > > strings. I've noticed that using TermRangeQuery with low-selective
> > > queries
> > > > is a very bad idea in terms of performance but I cannot clearly explain
> > > why
> > > > since it seems related with how the IntersectTermsEnum#next method
> is
> > > > implemented.
> > > >
> > > > The Javadoc of the class says that the terms index (the burst-trie
> > > > datastructure) is not used by this implementation of TermsEnum.
> > However,
> > > > when I see the implementation of the next method I get the impression
> > > that
> > > > this is not accurate. Aren't we using the trie structure to skip parts
> > of
> > > > the data when  the automaton states do not match?
> > > >
> > > > Can somebody provide a high-level intutition of what
> > > > IntersectTermsEnum#next does? Initially, I thought that it is
> > traversing
> > > > the whole trie structure (skipping some branches when necessary) but I
> > > may
> > > > be wrong.
> > > >
> > > > Thanks in advance,
> > > > Stamatis
> > > >
> > >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Stamatis Zampetakis <za...@gmail.com>.

I am mostly referring to keyset pagination so sorting the field is always
necessary.
For the same reason range queries are inevitable.

Στις Τρί, 2 Απρ 2019 στις 5:08 μ.μ., ο/η Robert Muir <rc...@gmail.com>
έγραψε:

> I think instead of using range queries for custom pagination, it would
> be best to simply sort by the field normally (with or without
> searchAfter)
>
> On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> >
> > Thanks for the explanation regarding IntersectTermsEnum.next(). I
> > understand better now.
> >
> > How about indexing a string field containing the firstname and lastname
> of
> > a person or possibly its address.
> > Values for these fields have variable length and usually they have more
> > than 16 bytes. If my understanding is correct then they should (can) not
> be
> > indexed as points.
> > Range queries on this fields appear mostly when users implement custom
> > pagination (without using searchAfter) on the result set.
> >
> > The alternative that I consider and seems to work well for the use-cases
> I
> > mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.
> >
> >
> >
> >
> >
> > Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir <rc...@gmail.com>
> > έγραψε:
> >
> > > Can you explain a little more about your use-case? I think that's the
> > > biggest problem here for term range query. Pretty much all range
> > > use-cases are converted over to space partitioned data structures
> > > (Points) so its unclear why this query would even be used for anything
> > > serious.
> > >
> > > To answer your question, IntersectTermsEnum.next() doesn't iterate
> > > over the whole index, just the term dictionary. But the lines are
> > > blurred since if there's only one posting for a term, that single
> > > posting is inlined directly into the term dictionary.... and often a
> > > ton of terms fit into this category, so seeing 80% time in the term
> > > dictionary wouldn't surprise me.
> > >
> > > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <za...@gmail.com>
> > > wrote:
> > > >
> > > > Thanks Robert and Uwe for your feedback!
> > > >
> > > > I am not sure what you mean that the slowness does not come from the
> > > iteration over the term index.
> > > > I did a small profiling (screenshot attached) of sending repeatedly a
> > > TermRangeQuery that matches almost the whole index and I observe that
> 80%
> > > of the time is spend on IntersectTermsEnum.next() method.
> > > > Isn't this the method which iterates over the index?
> > > >
> > > >
> > > >
> > > > Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <
> uwe@thetaphi.de>
> > > έγραψε:
> > > >>
> > > >> Hi again,
> > > >>
> > > >> > The problem with TermRangeQueries is actually not the iteration
> over
> > > the
> > > >> > term index. The slowness comes from the fact that all terms
> between
> > > start
> > > >> > and end have to be iterated and their postings be fetched and
> those
> > > postings
> > > >> > be merged together. If the "source of terms" for doing this is
> just a
> > > simple
> > > >> > linear iteration of all terms from/to or the automaton
> intersection
> > > does not
> > > >> > really matter for the query execution. The change to prefer the
> > > automaton
> > > >> > instead of a simple term iteration is just to allow further
> > > optimizations, for
> > > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> > > >>
> > > >> So the above issue brings no further improvements (currently). So as
> > > said before, a simple enumeration of all all terms and fetching their
> > > postings will be as fast. By using an automaton, the idea here is that
> the
> > > term dictionary may have some "percalculated" postings list for prefix
> > > terms. So instead of iterating over all terms and merging their
> postings,
> > > the terms dictionary could return a "virtual term" that contains all
> > > documents for a whole prefix and store the merged postings list in the
> > > index file. This was not yet implemented but is the place to hook
> into. You
> > > could create an improved BlockTermsDict implementation, that allows to
> get
> > > a PostingsEnum for a whole prefix of terms.
> > > >>
> > > >> Not sure how much of that was already implemented by LUCENE-5879,
> but
> > > it allows to do this. So here is where you could step in and improve
> the
> > > terms dictionary!
> > > >>
> > > >> Uwe
> > > >>
> > > >> > Uwe
> > > >> >
> > > >> > -----
> > > >> > Uwe Schindler
> > > >> > Achterdiek 19, D-28357 Bremen
> > > >> > http://www.thetaphi.de
> > > >> > eMail: uwe@thetaphi.de
> > > >> >
> > > >> > > -----Original Message-----
> > > >> > > From: Robert Muir <rc...@gmail.com>
> > > >> > > Sent: Monday, April 1, 2019 6:30 PM
> > > >> > > To: java-user <ja...@lucene.apache.org>
> > > >> > > Subject: Re: Clarification regarding BlockTree implementation of
> > > >> > > IntersectTermsEnum
> > > >> > >
> > > >> > > The regular TermsEnum is really designed for walking terms in
> > > linear order.
> > > >> > > it does have some ability to seek/leapfrog. But this means
> paths in
> > > a query
> > > >> > > automaton that match no terms result in a wasted seek and cpu,
> > > because
> > > >> > the
> > > >> > > api is designed to return the next term after regardless.
> > > >> > >
> > > >> > > On the other hand the intersect() is for intersecting two
> automata:
> > > query
> > > >> > > and index. Presumably it can also remove more inefficiencies
> than
> > > just the
> > > >> > > wasted seeks for complex wildcards and fuzzies and stuff, since
> it
> > > can
> > > >> > > "see" the whole input as an automaton. so for example it might
> be
> > > able to
> > > >> > > work on blocks of terms at a time and so on.
> > > >> > >
> > > >> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > > >> > <za...@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Yes it is used.
> > > >> > > >
> > > >> > > > I think there are simpler and possibly more efficient ways to
> > > implement a
> > > >> > > > TermRangeQuery and that is why I am looking into this.
> > > >> > > > But I am also curious to understand what IntersectTermsEnum is
> > > >> > supposed
> > > >> > > to
> > > >> > > > do.
> > > >> > > >
> > > >> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> > > rcmuir@gmail.com>
> > > >> > > > έγραψε:
> > > >> > > >
> > > >> > > > > Is this IntersectTermsEnum really being used for term range
> > > query?
> > > >> > > Seems
> > > >> > > > > like using a standard TermsEnum, seeking to the start of the
> > > range, then
> > > >> > > > > calling next until the end would be easier.
> > > >> > > > >
> > > >> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > > >> > > <za...@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi all,
> > > >> > > > > >
> > > >> > > > > > I am currently working on improving the performance of
> range
> > > >> > queries
> > > >> > > on
> > > >> > > > > > strings. I've noticed that using TermRangeQuery with
> > > low-selective
> > > >> > > > > queries
> > > >> > > > > > is a very bad idea in terms of performance but I cannot
> > > clearly explain
> > > >> > > > > why
> > > >> > > > > > since it seems related with how the
> IntersectTermsEnum#next
> > > >> > method
> > > >> > > is
> > > >> > > > > > implemented.
> > > >> > > > > >
> > > >> > > > > > The Javadoc of the class says that the terms index (the
> > > burst-trie
> > > >> > > > > > datastructure) is not used by this implementation of
> > > TermsEnum.
> > > >> > > > However,
> > > >> > > > > > when I see the implementation of the next method I get the
> > > >> > impression
> > > >> > > > > that
> > > >> > > > > > this is not accurate. Aren't we using the trie structure
> to
> > > skip parts
> > > >> > > > of
> > > >> > > > > > the data when  the automaton states do not match?
> > > >> > > > > >
> > > >> > > > > > Can somebody provide a high-level intutition of what
> > > >> > > > > > IntersectTermsEnum#next does? Initially, I thought that
> it is
> > > >> > > > traversing
> > > >> > > > > > the whole trie structure (skipping some branches when
> > > necessary) but
> > > >> > I
> > > >> > > > > may
> > > >> > > > > > be wrong.
> > > >> > > > > >
> > > >> > > > > > Thanks in advance,
> > > >> > > > > > Stamatis
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >>
> > > >>
> > > >>
> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >>
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Robert Muir <rc...@gmail.com>.

I think instead of using range queries for custom pagination, it would
be best to simply sort by the field normally (with or without
searchAfter)

On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis <za...@gmail.com> wrote:
>
> Thanks for the explanation regarding IntersectTermsEnum.next(). I
> understand better now.
>
> How about indexing a string field containing the firstname and lastname of
> a person or possibly its address.
> Values for these fields have variable length and usually they have more
> than 16 bytes. If my understanding is correct then they should (can) not be
> indexed as points.
> Range queries on this fields appear mostly when users implement custom
> pagination (without using searchAfter) on the result set.
>
> The alternative that I consider and seems to work well for the use-cases I
> mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.
>
>
>
>
>
> Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir <rc...@gmail.com>
> έγραψε:
>
> > Can you explain a little more about your use-case? I think that's the
> > biggest problem here for term range query. Pretty much all range
> > use-cases are converted over to space partitioned data structures
> > (Points) so its unclear why this query would even be used for anything
> > serious.
> >
> > To answer your question, IntersectTermsEnum.next() doesn't iterate
> > over the whole index, just the term dictionary. But the lines are
> > blurred since if there's only one posting for a term, that single
> > posting is inlined directly into the term dictionary.... and often a
> > ton of terms fit into this category, so seeing 80% time in the term
> > dictionary wouldn't surprise me.
> >
> > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> > >
> > > Thanks Robert and Uwe for your feedback!
> > >
> > > I am not sure what you mean that the slowness does not come from the
> > iteration over the term index.
> > > I did a small profiling (screenshot attached) of sending repeatedly a
> > TermRangeQuery that matches almost the whole index and I observe that 80%
> > of the time is spend on IntersectTermsEnum.next() method.
> > > Isn't this the method which iterates over the index?
> > >
> > >
> > >
> > > Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <uw...@thetaphi.de>
> > έγραψε:
> > >>
> > >> Hi again,
> > >>
> > >> > The problem with TermRangeQueries is actually not the iteration over
> > the
> > >> > term index. The slowness comes from the fact that all terms between
> > start
> > >> > and end have to be iterated and their postings be fetched and those
> > postings
> > >> > be merged together. If the "source of terms" for doing this is just a
> > simple
> > >> > linear iteration of all terms from/to or the automaton intersection
> > does not
> > >> > really matter for the query execution. The change to prefer the
> > automaton
> > >> > instead of a simple term iteration is just to allow further
> > optimizations, for
> > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> > >>
> > >> So the above issue brings no further improvements (currently). So as
> > said before, a simple enumeration of all all terms and fetching their
> > postings will be as fast. By using an automaton, the idea here is that the
> > term dictionary may have some "percalculated" postings list for prefix
> > terms. So instead of iterating over all terms and merging their postings,
> > the terms dictionary could return a "virtual term" that contains all
> > documents for a whole prefix and store the merged postings list in the
> > index file. This was not yet implemented but is the place to hook into. You
> > could create an improved BlockTermsDict implementation, that allows to get
> > a PostingsEnum for a whole prefix of terms.
> > >>
> > >> Not sure how much of that was already implemented by LUCENE-5879, but
> > it allows to do this. So here is where you could step in and improve the
> > terms dictionary!
> > >>
> > >> Uwe
> > >>
> > >> > Uwe
> > >> >
> > >> > -----
> > >> > Uwe Schindler
> > >> > Achterdiek 19, D-28357 Bremen
> > >> > http://www.thetaphi.de
> > >> > eMail: uwe@thetaphi.de
> > >> >
> > >> > > -----Original Message-----
> > >> > > From: Robert Muir <rc...@gmail.com>
> > >> > > Sent: Monday, April 1, 2019 6:30 PM
> > >> > > To: java-user <ja...@lucene.apache.org>
> > >> > > Subject: Re: Clarification regarding BlockTree implementation of
> > >> > > IntersectTermsEnum
> > >> > >
> > >> > > The regular TermsEnum is really designed for walking terms in
> > linear order.
> > >> > > it does have some ability to seek/leapfrog. But this means paths in
> > a query
> > >> > > automaton that match no terms result in a wasted seek and cpu,
> > because
> > >> > the
> > >> > > api is designed to return the next term after regardless.
> > >> > >
> > >> > > On the other hand the intersect() is for intersecting two automata:
> > query
> > >> > > and index. Presumably it can also remove more inefficiencies than
> > just the
> > >> > > wasted seeks for complex wildcards and fuzzies and stuff, since it
> > can
> > >> > > "see" the whole input as an automaton. so for example it might be
> > able to
> > >> > > work on blocks of terms at a time and so on.
> > >> > >
> > >> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > >> > <za...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Yes it is used.
> > >> > > >
> > >> > > > I think there are simpler and possibly more efficient ways to
> > implement a
> > >> > > > TermRangeQuery and that is why I am looking into this.
> > >> > > > But I am also curious to understand what IntersectTermsEnum is
> > >> > supposed
> > >> > > to
> > >> > > > do.
> > >> > > >
> > >> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> > rcmuir@gmail.com>
> > >> > > > έγραψε:
> > >> > > >
> > >> > > > > Is this IntersectTermsEnum really being used for term range
> > query?
> > >> > > Seems
> > >> > > > > like using a standard TermsEnum, seeking to the start of the
> > range, then
> > >> > > > > calling next until the end would be easier.
> > >> > > > >
> > >> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > >> > > <za...@gmail.com>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi all,
> > >> > > > > >
> > >> > > > > > I am currently working on improving the performance of range
> > >> > queries
> > >> > > on
> > >> > > > > > strings. I've noticed that using TermRangeQuery with
> > low-selective
> > >> > > > > queries
> > >> > > > > > is a very bad idea in terms of performance but I cannot
> > clearly explain
> > >> > > > > why
> > >> > > > > > since it seems related with how the IntersectTermsEnum#next
> > >> > method
> > >> > > is
> > >> > > > > > implemented.
> > >> > > > > >
> > >> > > > > > The Javadoc of the class says that the terms index (the
> > burst-trie
> > >> > > > > > datastructure) is not used by this implementation of
> > TermsEnum.
> > >> > > > However,
> > >> > > > > > when I see the implementation of the next method I get the
> > >> > impression
> > >> > > > > that
> > >> > > > > > this is not accurate. Aren't we using the trie structure to
> > skip parts
> > >> > > > of
> > >> > > > > > the data when  the automaton states do not match?
> > >> > > > > >
> > >> > > > > > Can somebody provide a high-level intutition of what
> > >> > > > > > IntersectTermsEnum#next does? Initially, I thought that it is
> > >> > > > traversing
> > >> > > > > > the whole trie structure (skipping some branches when
> > necessary) but
> > >> > I
> > >> > > > > may
> > >> > > > > > be wrong.
> > >> > > > > >
> > >> > > > > > Thanks in advance,
> > >> > > > > > Stamatis
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > >> For additional commands, e-mail: java-user-help@lucene.apache.org
> > >>
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Stamatis Zampetakis <za...@gmail.com>.

Thanks for the explanation regarding IntersectTermsEnum.next(). I
understand better now.

How about indexing a string field containing the firstname and lastname of
a person or possibly its address.
Values for these fields have variable length and usually they have more
than 16 bytes. If my understanding is correct then they should (can) not be
indexed as points.
Range queries on this fields appear mostly when users implement custom
pagination (without using searchAfter) on the result set.

The alternative that I consider and seems to work well for the use-cases I
mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.





Στις Τρί, 2 Απρ 2019 στις 3:01 μ.μ., ο/η Robert Muir <rc...@gmail.com>
έγραψε:

> Can you explain a little more about your use-case? I think that's the
> biggest problem here for term range query. Pretty much all range
> use-cases are converted over to space partitioned data structures
> (Points) so its unclear why this query would even be used for anything
> serious.
>
> To answer your question, IntersectTermsEnum.next() doesn't iterate
> over the whole index, just the term dictionary. But the lines are
> blurred since if there's only one posting for a term, that single
> posting is inlined directly into the term dictionary.... and often a
> ton of terms fit into this category, so seeing 80% time in the term
> dictionary wouldn't surprise me.
>
> On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <za...@gmail.com>
> wrote:
> >
> > Thanks Robert and Uwe for your feedback!
> >
> > I am not sure what you mean that the slowness does not come from the
> iteration over the term index.
> > I did a small profiling (screenshot attached) of sending repeatedly a
> TermRangeQuery that matches almost the whole index and I observe that 80%
> of the time is spend on IntersectTermsEnum.next() method.
> > Isn't this the method which iterates over the index?
> >
> >
> >
> > Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <uw...@thetaphi.de>
> έγραψε:
> >>
> >> Hi again,
> >>
> >> > The problem with TermRangeQueries is actually not the iteration over
> the
> >> > term index. The slowness comes from the fact that all terms between
> start
> >> > and end have to be iterated and their postings be fetched and those
> postings
> >> > be merged together. If the "source of terms" for doing this is just a
> simple
> >> > linear iteration of all terms from/to or the automaton intersection
> does not
> >> > really matter for the query execution. The change to prefer the
> automaton
> >> > instead of a simple term iteration is just to allow further
> optimizations, for
> >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> >>
> >> So the above issue brings no further improvements (currently). So as
> said before, a simple enumeration of all all terms and fetching their
> postings will be as fast. By using an automaton, the idea here is that the
> term dictionary may have some "percalculated" postings list for prefix
> terms. So instead of iterating over all terms and merging their postings,
> the terms dictionary could return a "virtual term" that contains all
> documents for a whole prefix and store the merged postings list in the
> index file. This was not yet implemented but is the place to hook into. You
> could create an improved BlockTermsDict implementation, that allows to get
> a PostingsEnum for a whole prefix of terms.
> >>
> >> Not sure how much of that was already implemented by LUCENE-5879, but
> it allows to do this. So here is where you could step in and improve the
> terms dictionary!
> >>
> >> Uwe
> >>
> >> > Uwe
> >> >
> >> > -----
> >> > Uwe Schindler
> >> > Achterdiek 19, D-28357 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: uwe@thetaphi.de
> >> >
> >> > > -----Original Message-----
> >> > > From: Robert Muir <rc...@gmail.com>
> >> > > Sent: Monday, April 1, 2019 6:30 PM
> >> > > To: java-user <ja...@lucene.apache.org>
> >> > > Subject: Re: Clarification regarding BlockTree implementation of
> >> > > IntersectTermsEnum
> >> > >
> >> > > The regular TermsEnum is really designed for walking terms in
> linear order.
> >> > > it does have some ability to seek/leapfrog. But this means paths in
> a query
> >> > > automaton that match no terms result in a wasted seek and cpu,
> because
> >> > the
> >> > > api is designed to return the next term after regardless.
> >> > >
> >> > > On the other hand the intersect() is for intersecting two automata:
> query
> >> > > and index. Presumably it can also remove more inefficiencies than
> just the
> >> > > wasted seeks for complex wildcards and fuzzies and stuff, since it
> can
> >> > > "see" the whole input as an automaton. so for example it might be
> able to
> >> > > work on blocks of terms at a time and so on.
> >> > >
> >> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> >> > <za...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Yes it is used.
> >> > > >
> >> > > > I think there are simpler and possibly more efficient ways to
> implement a
> >> > > > TermRangeQuery and that is why I am looking into this.
> >> > > > But I am also curious to understand what IntersectTermsEnum is
> >> > supposed
> >> > > to
> >> > > > do.
> >> > > >
> >> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> rcmuir@gmail.com>
> >> > > > έγραψε:
> >> > > >
> >> > > > > Is this IntersectTermsEnum really being used for term range
> query?
> >> > > Seems
> >> > > > > like using a standard TermsEnum, seeking to the start of the
> range, then
> >> > > > > calling next until the end would be easier.
> >> > > > >
> >> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> >> > > <za...@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi all,
> >> > > > > >
> >> > > > > > I am currently working on improving the performance of range
> >> > queries
> >> > > on
> >> > > > > > strings. I've noticed that using TermRangeQuery with
> low-selective
> >> > > > > queries
> >> > > > > > is a very bad idea in terms of performance but I cannot
> clearly explain
> >> > > > > why
> >> > > > > > since it seems related with how the IntersectTermsEnum#next
> >> > method
> >> > > is
> >> > > > > > implemented.
> >> > > > > >
> >> > > > > > The Javadoc of the class says that the terms index (the
> burst-trie
> >> > > > > > datastructure) is not used by this implementation of
> TermsEnum.
> >> > > > However,
> >> > > > > > when I see the implementation of the next method I get the
> >> > impression
> >> > > > > that
> >> > > > > > this is not accurate. Aren't we using the trie structure to
> skip parts
> >> > > > of
> >> > > > > > the data when  the automaton states do not match?
> >> > > > > >
> >> > > > > > Can somebody provide a high-level intutition of what
> >> > > > > > IntersectTermsEnum#next does? Initially, I thought that it is
> >> > > > traversing
> >> > > > > > the whole trie structure (skipping some branches when
> necessary) but
> >> > I
> >> > > > > may
> >> > > > > > be wrong.
> >> > > > > >
> >> > > > > > Thanks in advance,
> >> > > > > > Stamatis
> >> > > > > >
> >> > > > >
> >> > > >
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Robert Muir <rc...@gmail.com>.

Can you explain a little more about your use-case? I think that's the
biggest problem here for term range query. Pretty much all range
use-cases are converted over to space partitioned data structures
(Points) so its unclear why this query would even be used for anything
serious.

To answer your question, IntersectTermsEnum.next() doesn't iterate
over the whole index, just the term dictionary. But the lines are
blurred since if there's only one posting for a term, that single
posting is inlined directly into the term dictionary.... and often a
ton of terms fit into this category, so seeing 80% time in the term
dictionary wouldn't surprise me.

On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis <za...@gmail.com> wrote:
>
> Thanks Robert and Uwe for your feedback!
>
> I am not sure what you mean that the slowness does not come from the iteration over the term index.
> I did a small profiling (screenshot attached) of sending repeatedly a TermRangeQuery that matches almost the whole index and I observe that 80% of the time is spend on IntersectTermsEnum.next() method.
> Isn't this the method which iterates over the index?
>
>
>
> Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <uw...@thetaphi.de> έγραψε:
>>
>> Hi again,
>>
>> > The problem with TermRangeQueries is actually not the iteration over the
>> > term index. The slowness comes from the fact that all terms between start
>> > and end have to be iterated and their postings be fetched and those postings
>> > be merged together. If the "source of terms" for doing this is just a simple
>> > linear iteration of all terms from/to or the automaton intersection does not
>> > really matter for the query execution. The change to prefer the automaton
>> > instead of a simple term iteration is just to allow further optimizations, for
>> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
>>
>> So the above issue brings no further improvements (currently). So as said before, a simple enumeration of all all terms and fetching their postings will be as fast. By using an automaton, the idea here is that the term dictionary may have some "percalculated" postings list for prefix terms. So instead of iterating over all terms and merging their postings, the terms dictionary could return a "virtual term" that contains all documents for a whole prefix and store the merged postings list in the index file. This was not yet implemented but is the place to hook into. You could create an improved BlockTermsDict implementation, that allows to get a PostingsEnum for a whole prefix of terms.
>>
>> Not sure how much of that was already implemented by LUCENE-5879, but it allows to do this. So here is where you could step in and improve the terms dictionary!
>>
>> Uwe
>>
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > http://www.thetaphi.de
>> > eMail: uwe@thetaphi.de
>> >
>> > > -----Original Message-----
>> > > From: Robert Muir <rc...@gmail.com>
>> > > Sent: Monday, April 1, 2019 6:30 PM
>> > > To: java-user <ja...@lucene.apache.org>
>> > > Subject: Re: Clarification regarding BlockTree implementation of
>> > > IntersectTermsEnum
>> > >
>> > > The regular TermsEnum is really designed for walking terms in linear order.
>> > > it does have some ability to seek/leapfrog. But this means paths in a query
>> > > automaton that match no terms result in a wasted seek and cpu, because
>> > the
>> > > api is designed to return the next term after regardless.
>> > >
>> > > On the other hand the intersect() is for intersecting two automata: query
>> > > and index. Presumably it can also remove more inefficiencies than just the
>> > > wasted seeks for complex wildcards and fuzzies and stuff, since it can
>> > > "see" the whole input as an automaton. so for example it might be able to
>> > > work on blocks of terms at a time and so on.
>> > >
>> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
>> > <za...@gmail.com>
>> > > wrote:
>> > >
>> > > > Yes it is used.
>> > > >
>> > > > I think there are simpler and possibly more efficient ways to implement a
>> > > > TermRangeQuery and that is why I am looking into this.
>> > > > But I am also curious to understand what IntersectTermsEnum is
>> > supposed
>> > > to
>> > > > do.
>> > > >
>> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <rc...@gmail.com>
>> > > > έγραψε:
>> > > >
>> > > > > Is this IntersectTermsEnum really being used for term range query?
>> > > Seems
>> > > > > like using a standard TermsEnum, seeking to the start of the range, then
>> > > > > calling next until the end would be easier.
>> > > > >
>> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
>> > > <za...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi all,
>> > > > > >
>> > > > > > I am currently working on improving the performance of range
>> > queries
>> > > on
>> > > > > > strings. I've noticed that using TermRangeQuery with low-selective
>> > > > > queries
>> > > > > > is a very bad idea in terms of performance but I cannot clearly explain
>> > > > > why
>> > > > > > since it seems related with how the IntersectTermsEnum#next
>> > method
>> > > is
>> > > > > > implemented.
>> > > > > >
>> > > > > > The Javadoc of the class says that the terms index (the burst-trie
>> > > > > > datastructure) is not used by this implementation of TermsEnum.
>> > > > However,
>> > > > > > when I see the implementation of the next method I get the
>> > impression
>> > > > > that
>> > > > > > this is not accurate. Aren't we using the trie structure to skip parts
>> > > > of
>> > > > > > the data when  the automaton states do not match?
>> > > > > >
>> > > > > > Can somebody provide a high-level intutition of what
>> > > > > > IntersectTermsEnum#next does? Initially, I thought that it is
>> > > > traversing
>> > > > > > the whole trie structure (skipping some branches when necessary) but
>> > I
>> > > > > may
>> > > > > > be wrong.
>> > > > > >
>> > > > > > Thanks in advance,
>> > > > > > Stamatis
>> > > > > >
>> > > > >
>> > > >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Stamatis Zampetakis <za...@gmail.com>.

Thanks Robert and Uwe for your feedback!

I am not sure what you mean that the slowness does not come from the
iteration over the term index.
I did a small profiling (screenshot attached) of sending repeatedly a
TermRangeQuery that matches almost the whole index and I observe that 80%
of the time is spend on IntersectTermsEnum.next() method.
Isn't this the method which iterates over the index?



Στις Δευ, 1 Απρ 2019 στις 6:57 μ.μ., ο/η Uwe Schindler <uw...@thetaphi.de>
έγραψε:

> Hi again,
>
> > The problem with TermRangeQueries is actually not the iteration over the
> > term index. The slowness comes from the fact that all terms between start
> > and end have to be iterated and their postings be fetched and those
> postings
> > be merged together. If the "source of terms" for doing this is just a
> simple
> > linear iteration of all terms from/to or the automaton intersection does
> not
> > really matter for the query execution. The change to prefer the automaton
> > instead of a simple term iteration is just to allow further
> optimizations, for
> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
>
> So the above issue brings no further improvements (currently). So as said
> before, a simple enumeration of all all terms and fetching their postings
> will be as fast. By using an automaton, the idea here is that the term
> dictionary may have some "percalculated" postings list for prefix terms. So
> instead of iterating over all terms and merging their postings, the terms
> dictionary could return a "virtual term" that contains all documents for a
> whole prefix and store the merged postings list in the index file. This was
> not yet implemented but is the place to hook into. You could create an
> improved BlockTermsDict implementation, that allows to get a PostingsEnum
> for a whole prefix of terms.
>
> Not sure how much of that was already implemented by LUCENE-5879, but it
> allows to do this. So here is where you could step in and improve the terms
> dictionary!
>
> Uwe
>
> > Uwe
> >
> > -----
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> > > -----Original Message-----
> > > From: Robert Muir <rc...@gmail.com>
> > > Sent: Monday, April 1, 2019 6:30 PM
> > > To: java-user <ja...@lucene.apache.org>
> > > Subject: Re: Clarification regarding BlockTree implementation of
> > > IntersectTermsEnum
> > >
> > > The regular TermsEnum is really designed for walking terms in linear
> order.
> > > it does have some ability to seek/leapfrog. But this means paths in a
> query
> > > automaton that match no terms result in a wasted seek and cpu, because
> > the
> > > api is designed to return the next term after regardless.
> > >
> > > On the other hand the intersect() is for intersecting two automata:
> query
> > > and index. Presumably it can also remove more inefficiencies than just
> the
> > > wasted seeks for complex wildcards and fuzzies and stuff, since it can
> > > "see" the whole input as an automaton. so for example it might be able
> to
> > > work on blocks of terms at a time and so on.
> > >
> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> > <za...@gmail.com>
> > > wrote:
> > >
> > > > Yes it is used.
> > > >
> > > > I think there are simpler and possibly more efficient ways to
> implement a
> > > > TermRangeQuery and that is why I am looking into this.
> > > > But I am also curious to understand what IntersectTermsEnum is
> > supposed
> > > to
> > > > do.
> > > >
> > > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <
> rcmuir@gmail.com>
> > > > έγραψε:
> > > >
> > > > > Is this IntersectTermsEnum really being used for term range query?
> > > Seems
> > > > > like using a standard TermsEnum, seeking to the start of the
> range, then
> > > > > calling next until the end would be easier.
> > > > >
> > > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > > <za...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I am currently working on improving the performance of range
> > queries
> > > on
> > > > > > strings. I've noticed that using TermRangeQuery with
> low-selective
> > > > > queries
> > > > > > is a very bad idea in terms of performance but I cannot clearly
> explain
> > > > > why
> > > > > > since it seems related with how the IntersectTermsEnum#next
> > method
> > > is
> > > > > > implemented.
> > > > > >
> > > > > > The Javadoc of the class says that the terms index (the
> burst-trie
> > > > > > datastructure) is not used by this implementation of TermsEnum.
> > > > However,
> > > > > > when I see the implementation of the next method I get the
> > impression
> > > > > that
> > > > > > this is not accurate. Aren't we using the trie structure to skip
> parts
> > > > of
> > > > > > the data when  the automaton states do not match?
> > > > > >
> > > > > > Can somebody provide a high-level intutition of what
> > > > > > IntersectTermsEnum#next does? Initially, I thought that it is
> > > > traversing
> > > > > > the whole trie structure (skipping some branches when necessary)
> but
> > I
> > > > > may
> > > > > > be wrong.
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Stamatis
> > > > > >
> > > > >
> > > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi again,

> The problem with TermRangeQueries is actually not the iteration over the
> term index. The slowness comes from the fact that all terms between start
> and end have to be iterated and their postings be fetched and those postings
> be merged together. If the "source of terms" for doing this is just a simple
> linear iteration of all terms from/to or the automaton intersection does not
> really matter for the query execution. The change to prefer the automaton
> instead of a simple term iteration is just to allow further optimizations, for
> more info see https://issues.apache.org/jira/browse/LUCENE-5879

So the above issue brings no further improvements (currently). So as said before, a simple enumeration of all all terms and fetching their postings will be as fast. By using an automaton, the idea here is that the term dictionary may have some "percalculated" postings list for prefix terms. So instead of iterating over all terms and merging their postings, the terms dictionary could return a "virtual term" that contains all documents for a whole prefix and store the merged postings list in the index file. This was not yet implemented but is the place to hook into. You could create an improved BlockTermsDict implementation, that allows to get a PostingsEnum for a whole prefix of terms.

Not sure how much of that was already implemented by LUCENE-5879, but it allows to do this. So here is where you could step in and improve the terms dictionary!

Uwe

> Uwe
> 
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
> 
> > -----Original Message-----
> > From: Robert Muir <rc...@gmail.com>
> > Sent: Monday, April 1, 2019 6:30 PM
> > To: java-user <ja...@lucene.apache.org>
> > Subject: Re: Clarification regarding BlockTree implementation of
> > IntersectTermsEnum
> >
> > The regular TermsEnum is really designed for walking terms in linear order.
> > it does have some ability to seek/leapfrog. But this means paths in a query
> > automaton that match no terms result in a wasted seek and cpu, because
> the
> > api is designed to return the next term after regardless.
> >
> > On the other hand the intersect() is for intersecting two automata: query
> > and index. Presumably it can also remove more inefficiencies than just the
> > wasted seeks for complex wildcards and fuzzies and stuff, since it can
> > "see" the whole input as an automaton. so for example it might be able to
> > work on blocks of terms at a time and so on.
> >
> > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
> <za...@gmail.com>
> > wrote:
> >
> > > Yes it is used.
> > >
> > > I think there are simpler and possibly more efficient ways to implement a
> > > TermRangeQuery and that is why I am looking into this.
> > > But I am also curious to understand what IntersectTermsEnum is
> supposed
> > to
> > > do.
> > >
> > > Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <rc...@gmail.com>
> > > έγραψε:
> > >
> > > > Is this IntersectTermsEnum really being used for term range query?
> > Seems
> > > > like using a standard TermsEnum, seeking to the start of the range, then
> > > > calling next until the end would be easier.
> > > >
> > > > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis
> > <za...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am currently working on improving the performance of range
> queries
> > on
> > > > > strings. I've noticed that using TermRangeQuery with low-selective
> > > > queries
> > > > > is a very bad idea in terms of performance but I cannot clearly explain
> > > > why
> > > > > since it seems related with how the IntersectTermsEnum#next
> method
> > is
> > > > > implemented.
> > > > >
> > > > > The Javadoc of the class says that the terms index (the burst-trie
> > > > > datastructure) is not used by this implementation of TermsEnum.
> > > However,
> > > > > when I see the implementation of the next method I get the
> impression
> > > > that
> > > > > this is not accurate. Aren't we using the trie structure to skip parts
> > > of
> > > > > the data when  the automaton states do not match?
> > > > >
> > > > > Can somebody provide a high-level intutition of what
> > > > > IntersectTermsEnum#next does? Initially, I thought that it is
> > > traversing
> > > > > the whole trie structure (skipping some branches when necessary) but
> I
> > > > may
> > > > > be wrong.
> > > > >
> > > > > Thanks in advance,
> > > > > Stamatis
> > > > >
> > > >
> > >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Robert Muir <rc...@gmail.com>.

The regular TermsEnum is really designed for walking terms in linear order.
it does have some ability to seek/leapfrog. But this means paths in a query
automaton that match no terms result in a wasted seek and cpu, because the
api is designed to return the next term after regardless.

On the other hand the intersect() is for intersecting two automata: query
and index. Presumably it can also remove more inefficiencies than just the
wasted seeks for complex wildcards and fuzzies and stuff, since it can
"see" the whole input as an automaton. so for example it might be able to
work on blocks of terms at a time and so on.

On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis <za...@gmail.com> wrote:

> Yes it is used.
>
> I think there are simpler and possibly more efficient ways to implement a
> TermRangeQuery and that is why I am looking into this.
> But I am also curious to understand what IntersectTermsEnum is supposed to
> do.
>
> Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <rc...@gmail.com>
> έγραψε:
>
> > Is this IntersectTermsEnum really being used for term range query? Seems
> > like using a standard TermsEnum, seeking to the start of the range, then
> > calling next until the end would be easier.
> >
> > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis <za...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I am currently working on improving the performance of range queries on
> > > strings. I've noticed that using TermRangeQuery with low-selective
> > queries
> > > is a very bad idea in terms of performance but I cannot clearly explain
> > why
> > > since it seems related with how the IntersectTermsEnum#next method is
> > > implemented.
> > >
> > > The Javadoc of the class says that the terms index (the burst-trie
> > > datastructure) is not used by this implementation of TermsEnum.
> However,
> > > when I see the implementation of the next method I get the impression
> > that
> > > this is not accurate. Aren't we using the trie structure to skip parts
> of
> > > the data when  the automaton states do not match?
> > >
> > > Can somebody provide a high-level intutition of what
> > > IntersectTermsEnum#next does? Initially, I thought that it is
> traversing
> > > the whole trie structure (skipping some branches when necessary) but I
> > may
> > > be wrong.
> > >
> > > Thanks in advance,
> > > Stamatis
> > >
> >
>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Stamatis Zampetakis <za...@gmail.com>.

Yes it is used.

I think there are simpler and possibly more efficient ways to implement a
TermRangeQuery and that is why I am looking into this.
But I am also curious to understand what IntersectTermsEnum is supposed to
do.

Στις Δευ, 1 Απρ 2019 στις 5:34 μ.μ., ο/η Robert Muir <rc...@gmail.com>
έγραψε:

> Is this IntersectTermsEnum really being used for term range query? Seems
> like using a standard TermsEnum, seeking to the start of the range, then
> calling next until the end would be easier.
>
> On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis <za...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I am currently working on improving the performance of range queries on
> > strings. I've noticed that using TermRangeQuery with low-selective
> queries
> > is a very bad idea in terms of performance but I cannot clearly explain
> why
> > since it seems related with how the IntersectTermsEnum#next method is
> > implemented.
> >
> > The Javadoc of the class says that the terms index (the burst-trie
> > datastructure) is not used by this implementation of TermsEnum. However,
> > when I see the implementation of the next method I get the impression
> that
> > this is not accurate. Aren't we using the trie structure to skip parts of
> > the data when  the automaton states do not match?
> >
> > Can somebody provide a high-level intutition of what
> > IntersectTermsEnum#next does? Initially, I thought that it is traversing
> > the whole trie structure (skipping some branches when necessary) but I
> may
> > be wrong.
> >
> > Thanks in advance,
> > Stamatis
> >
>

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

Posted by Robert Muir <rc...@gmail.com>.

Is this IntersectTermsEnum really being used for term range query? Seems
like using a standard TermsEnum, seeking to the start of the range, then
calling next until the end would be easier.

On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis <za...@gmail.com> wrote:

> Hi all,
>
> I am currently working on improving the performance of range queries on
> strings. I've noticed that using TermRangeQuery with low-selective queries
> is a very bad idea in terms of performance but I cannot clearly explain why
> since it seems related with how the IntersectTermsEnum#next method is
> implemented.
>
> The Javadoc of the class says that the terms index (the burst-trie
> datastructure) is not used by this implementation of TermsEnum. However,
> when I see the implementation of the next method I get the impression that
> this is not accurate. Aren't we using the trie structure to skip parts of
> the data when  the automaton states do not match?
>
> Can somebody provide a high-level intutition of what
> IntersectTermsEnum#next does? Initially, I thought that it is traversing
> the whole trie structure (skipping some branches when necessary) but I may
> be wrong.
>
> Thanks in advance,
> Stamatis
>