You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Arun Kumar K <ar...@gmail.com> on 2013/03/29 10:37:58 UTC

Wild Card Query Performance

Hi Guys,

I have been testing the search time improvement in Lucene 4.0 from Lucene
3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).

For a 2GB size index with 4000000 docs, the following observations were
made:

Around 3X improvement with and without STRING sort on a sortable field.

I guess this improvement is because of the Automation Query by Robert which
is used in WildCard Queries.

As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but these
wildcard queries are not that faster comparatively.

I have used default codecs and postings format.

Did i miss something or is it the max improvement that we can expect
currently for WildCard Queries?


Arun

Re: Wild Card Query Performance

Posted by Arun Kumar K <ar...@gmail.com>.

Thanks Uwe for the detailed & clear response.
I will definitely go further into details of 4.0 improvements.

Thanks,
Arun


On Fri, Mar 29, 2013 at 4:21 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi Arun,
>
> this is hard to explain with one single reason - To conclude:
> - A wildcard query that consists only of a prefix ("ab*") is internally
> implemented by iterating all terms starting with "ab" and ending with the
> one before "ac"). This is not different to Lucene 3.x and there is no room
> for optimization. If you look at
> o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(), you see that prefix
> terms are implemented by PrefixTermsEnum, that uses the above algorithm and
> is mostly a copy of the Lucene 3.x PrefixTermEnum.
> - A more complex wildcard or regex query that does not solely consist of a
> prefix but more restrictions after the prefix is faster in 4.x than in
> Lucene 3.x, because Lucene 3.x was only able to scan all terms starting
> with the common prefix ("ab*xy" would have approx. same speed as a prefix
> "ab*", it just removes all terms not matching *xy in a second step). This
> is also the improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no
> prefix at all (unless you specify one, but that limits the fuzzy use), so
> it has to scan all terms in the term dictionary and filter them against the
> Levenshtein distance. In Lucene 4.x, it can use an automaton, to
> effectively seek inside the term enum to find all terms that match the
> (non-prefix) regex, wildcard or Levensthein distance. In short: Lucene 4 is
> able to step over terms that cannot match the automaton. A cool example is
> the following regex: "(ht|f)tp://.*". In Lucene 3.x this would scan all
> terms because there is no common prefix, in 4.x this expands to something
> similar to the intersection of 2 prefix queries, so the automaton
> intersection between the regex and the term dictionary will seek to term
> "ftp://" first, iterate until the last term of this prefix and then seeks
> to "http://", doing the same.
>
> If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the
> query is executed in the same way like in Lucene 3.x, no automatons are
> involved at all! The reason why you see a speed improvement is not caused
> by the automaton implementation (it's not used at all), but it is caused by
> a complete rewrite of Lucene's algorithms regarding the term dictionary and
> the posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term
> dictionary based on FST, Lucene 4.0 no longer creates Strings all the time
> (all terms are just slices inside a large byte[]),... and many more
> improvements. Also when scanning the posting lists to find all documents
> related to the terms found, Lucene uses different data structures (e.g.,
> block postings in 4.1). So it’s a number of improvements that speed up the
> index (also while indexing!). Just review the numerous presentations on
> conferences or the blog posts by Mike McCandless (
> http://blog.mikemccandless.com/).
>
> > - Can we do something to improve speed for such queries ?
>
> You cannot improve speed of pure prefix queries. Maybe another data
> structure is helpful? E.g. if you scan for dates as string (queries on
> dates like "201301*"), it might be better to use a numeric field and use
> NumericRangeQuery(20130101, 20130131)? Another approach for improving
> prefix queries is indexing additional terms: If you are always searching
> for a 2-char prefix for "ab*", then simply index an additional term in a
> separate field with 2 chars (e.g., "ab") in your documents while indexing
> and just search for "ab"? This is very fast! This also applies to Lucene
> 3.x!
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > Sent: Friday, March 29, 2013 11:13 AM
> > To: java-user
> > Subject: Re: Wild Card Query Performance
> >
> > Hi Uwe,
> >
> > Thanks for the info.
> > You were mentioning about term dictionary and and other index
> > components. I didn't get this.
> > - What could be the other factors that improve the speed of such query ?
> > Can u explain or give some pointers to this ?
> > - Can we do something to improve speed for such queries ?
> >
> > Also other observation is indexing time has increased by around 6% in
> 4.0.
> >
> > Arun
> >
> > On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> >
> > > Hi,
> > >
> > > It depends on the type of wildcard query. If you only have a prefix
> > > (ab*), they rewrite to a simple PrefixQuery and this one is
> > > implemented exactly like in 3.x, so you only see the speed
> > > improvements of Lucene 4.0 in the term dictionary and and other index
> > > components, not related to the query itsself.
> > >
> > > If you have wildcards like ab?xy, then this query will be multiple
> > > times faster than in 3.x, because the "?" wildcard can only expand to
> > > a limited set of terms, while in Lucene 3.x, it still scans all terms
> > > with prefix "ab". The same applies to other wildcard constructs, if
> > > they limit more than just prefix.
> > >
> > > Uwe
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: uwe@thetaphi.de
> > >
> > >
> > > > -----Original Message-----
> > > > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > > > Sent: Friday, March 29, 2013 10:38 AM
> > > > To: java-user
> > > > Subject: Wild Card Query Performance
> > > >
> > > > Hi Guys,
> > > >
> > > > I have been testing the search time improvement in Lucene 4.0 from
> > > > Lucene
> > > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> > > >
> > > > For a 2GB size index with 4000000 docs, the following observations
> > > > were
> > > > made:
> > > >
> > > > Around 3X improvement with and without STRING sort on a sortable
> > field.
> > > >
> > > > I guess this improvement is because of the Automation Query by
> > > > Robert which is used in WildCard Queries.
> > > >
> > > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but
> > > > these wildcard queries are not that faster comparatively.
> > > >
> > > > I have used default codecs and postings format.
> > > >
> > > > Did i miss something or is it the max improvement that we can expect
> > > > currently for WildCard Queries?
> > > >
> > > >
> > > > Arun
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Wild Card Query Performance

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi Arun,

this is hard to explain with one single reason - To conclude:
- A wildcard query that consists only of a prefix ("ab*") is internally implemented by iterating all terms starting with "ab" and ending with the one before "ac"). This is not different to Lucene 3.x and there is no room for optimization. If you look at o.a.l.util.automaton.CompiledAutomaton#getTermsEnum(), you see that prefix terms are implemented by PrefixTermsEnum, that uses the above algorithm and is mostly a copy of the Lucene 3.x PrefixTermEnum.
- A more complex wildcard or regex query that does not solely consist of a prefix but more restrictions after the prefix is faster in 4.x than in Lucene 3.x, because Lucene 3.x was only able to scan all terms starting with the common prefix ("ab*xy" would have approx. same speed as a prefix "ab*", it just removes all terms not matching *xy in a second step). This is also the improvement in FuzzyQuery: In Lucene 3.x a FuzzyQuery has no prefix at all (unless you specify one, but that limits the fuzzy use), so it has to scan all terms in the term dictionary and filter them against the Levenshtein distance. In Lucene 4.x, it can use an automaton, to effectively seek inside the term enum to find all terms that match the (non-prefix) regex, wildcard or Levensthein distance. In short: Lucene 4 is able to step over terms that cannot match the automaton. A cool example is the following regex: "(ht|f)tp://.*". In Lucene 3.x this would scan all terms because there is no common prefix, in 4.x this expands to something similar to the intersection of 2 prefix queries, so the automaton intersection between the regex and the term dictionary will seek to term "ftp://" first, iterate until the last term of this prefix and then seeks to "http://", doing the same.

If you only have a prefix query (wildcard "ab*" or regex "ab.*"), then the query is executed in the same way like in Lucene 3.x, no automatons are involved at all! The reason why you see a speed improvement is not caused by the automaton implementation (it's not used at all), but it is caused by a complete rewrite of Lucene's algorithms regarding the term dictionary and the posting lists (Lucene Codecs). E.g. Lucene 4.0 uses a new term dictionary based on FST, Lucene 4.0 no longer creates Strings all the time (all terms are just slices inside a large byte[]),... and many more improvements. Also when scanning the posting lists to find all documents related to the terms found, Lucene uses different data structures (e.g., block postings in 4.1). So it’s a number of improvements that speed up the index (also while indexing!). Just review the numerous presentations on conferences or the blog posts by Mike McCandless (http://blog.mikemccandless.com/).

> - Can we do something to improve speed for such queries ?

You cannot improve speed of pure prefix queries. Maybe another data structure is helpful? E.g. if you scan for dates as string (queries on dates like "201301*"), it might be better to use a numeric field and use NumericRangeQuery(20130101, 20130131)? Another approach for improving prefix queries is indexing additional terms: If you are always searching for a 2-char prefix for "ab*", then simply index an additional term in a separate field with 2 chars (e.g., "ab") in your documents while indexing and just search for "ab"? This is very fast! This also applies to Lucene 3.x!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Arun Kumar K [mailto:arunk786@gmail.com]
> Sent: Friday, March 29, 2013 11:13 AM
> To: java-user
> Subject: Re: Wild Card Query Performance
> 
> Hi Uwe,
> 
> Thanks for the info.
> You were mentioning about term dictionary and and other index
> components. I didn't get this.
> - What could be the other factors that improve the speed of such query ?
> Can u explain or give some pointers to this ?
> - Can we do something to improve speed for such queries ?
> 
> Also other observation is indexing time has increased by around 6% in 4.0.
> 
> Arun
> 
> On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <uw...@thetaphi.de> wrote:
> 
> > Hi,
> >
> > It depends on the type of wildcard query. If you only have a prefix
> > (ab*), they rewrite to a simple PrefixQuery and this one is
> > implemented exactly like in 3.x, so you only see the speed
> > improvements of Lucene 4.0 in the term dictionary and and other index
> > components, not related to the query itsself.
> >
> > If you have wildcards like ab?xy, then this query will be multiple
> > times faster than in 3.x, because the "?" wildcard can only expand to
> > a limited set of terms, while in Lucene 3.x, it still scans all terms
> > with prefix "ab". The same applies to other wildcard constructs, if
> > they limit more than just prefix.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > > Sent: Friday, March 29, 2013 10:38 AM
> > > To: java-user
> > > Subject: Wild Card Query Performance
> > >
> > > Hi Guys,
> > >
> > > I have been testing the search time improvement in Lucene 4.0 from
> > > Lucene
> > > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> > >
> > > For a 2GB size index with 4000000 docs, the following observations
> > > were
> > > made:
> > >
> > > Around 3X improvement with and without STRING sort on a sortable
> field.
> > >
> > > I guess this improvement is because of the Automation Query by
> > > Robert which is used in WildCard Queries.
> > >
> > > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but
> > > these wildcard queries are not that faster comparatively.
> > >
> > > I have used default codecs and postings format.
> > >
> > > Did i miss something or is it the max improvement that we can expect
> > > currently for WildCard Queries?
> > >
> > >
> > > Arun
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Wild Card Query Performance

Posted by Arun Kumar K <ar...@gmail.com>.

Hi Uwe,

Thanks for the info.
You were mentioning about term dictionary and and other index components. I
didn't get this.
- What could be the other factors that improve the speed of such query ?
Can u explain or give some pointers to this ?
- Can we do something to improve speed for such queries ?

Also other observation is indexing time has increased by around 6% in 4.0.

Arun

On Fri, Mar 29, 2013 at 3:25 PM, Uwe Schindler <uw...@thetaphi.de> wrote:

> Hi,
>
> It depends on the type of wildcard query. If you only have a prefix (ab*),
> they rewrite to a simple PrefixQuery and this one is implemented exactly
> like in 3.x, so you only see the speed improvements of Lucene 4.0 in the
> term dictionary and and other index components, not related to the query
> itsself.
>
> If you have wildcards like ab?xy, then this query will be multiple times
> faster than in 3.x, because the "?" wildcard can only expand to a limited
> set of terms, while in Lucene 3.x, it still scans all terms with prefix
> "ab". The same applies to other wildcard constructs, if they limit more
> than just prefix.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
> > -----Original Message-----
> > From: Arun Kumar K [mailto:arunk786@gmail.com]
> > Sent: Friday, March 29, 2013 10:38 AM
> > To: java-user
> > Subject: Wild Card Query Performance
> >
> > Hi Guys,
> >
> > I have been testing the search time improvement in Lucene 4.0 from Lucene
> > 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> >
> > For a 2GB size index with 4000000 docs, the following observations were
> > made:
> >
> > Around 3X improvement with and without STRING sort on a sortable field.
> >
> > I guess this improvement is because of the Automation Query by Robert
> > which is used in WildCard Queries.
> >
> > As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but these
> > wildcard queries are not that faster comparatively.
> >
> > I have used default codecs and postings format.
> >
> > Did i miss something or is it the max improvement that we can expect
> > currently for WildCard Queries?
> >
> >
> > Arun
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Wild Card Query Performance

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

It depends on the type of wildcard query. If you only have a prefix (ab*), they rewrite to a simple PrefixQuery and this one is implemented exactly like in 3.x, so you only see the speed improvements of Lucene 4.0 in the term dictionary and and other index components, not related to the query itsself.

If you have wildcards like ab?xy, then this query will be multiple times faster than in 3.x, because the "?" wildcard can only expand to a limited set of terms, while in Lucene 3.x, it still scans all terms with prefix "ab". The same applies to other wildcard constructs, if they limit more than just prefix.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Arun Kumar K [mailto:arunk786@gmail.com]
> Sent: Friday, March 29, 2013 10:38 AM
> To: java-user
> Subject: Wild Card Query Performance
> 
> Hi Guys,
> 
> I have been testing the search time improvement in Lucene 4.0 from Lucene
> 3.0.2 version for Wildcard Queries (with atleast say 2 chars Eg.ar*).
> 
> For a 2GB size index with 4000000 docs, the following observations were
> made:
> 
> Around 3X improvement with and without STRING sort on a sortable field.
> 
> I guess this improvement is because of the Automation Query by Robert
> which is used in WildCard Queries.
> 
> As per mike's blog, FuzzyQueries are 100X times faster in 4.0 but these
> wildcard queries are not that faster comparatively.
> 
> I have used default codecs and postings format.
> 
> Did i miss something or is it the max improvement that we can expect
> currently for WildCard Queries?
> 
> 
> Arun


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org