Posted to solr-user@lucene.apache.org by Roman Chyla <ro...@gmail.com> on 2012/11/30 18:13:05 UTC

Regexp and speed

Hi,

Some time ago we measured the performance of regexp queries and found
that they are VERY FAST! We can't be grateful enough; it saves many
days/lives ;)

This was on an old Lenovo X61 laptop (Core 2 Duo, 1.7GHz), with no special
memory allocation and an SSD:


51459ms.  Buiding index of 100000 docs
181175ms.  Verifying data integrity with 100 docs
315ms.  Preparing 1000 random queries

61167ms.  Regex queries - Stopping execution, # queries finished: 150
2795ms.  Regexp queries (new style)
3936ms.  Wildcard queries
777ms.  Boolean queries
893ms.  Boolean queries (truncated)
3596ms.  Span queries
91751ms.  Span queries (truncated)Stopping execution, # queries finished: 100
3937ms.  Payload queries
93726ms.  Payload queries (truncated)Stopping execution, # queries finished: 100
Totals: [4865, 18284, 18286, 18284, 18405, 287934, 44375, 18284, 2489]

Examples of queries:
--------------------
regex:bgiyodjrr, k\w* michael\w* jay\w* .*
regexp:/bgiyodjrr, k\w* michael\w* jay\w* .*/
wildcard:bgiyodjrr, k*1 michael*2 jay*3 *
+n0:bgiyodjrr +n1:k +n2:michael +n3:jay
+n0:bgiyodjrr +n1:k* +n2:m* +n3:j*
spanNear([vectrfield:bgiyodjrr, vectrfield:k, vectrfield:michael,
vectrfield:jay], 0, true)
spanNear([vectrfield:bgiyodjrr,
SpanMultiTermQueryWrapper(vectrfield:k*),
SpanMultiTermQueryWrapper(vectrfield:m*),
SpanMultiTermQueryWrapper(vectrfield:j*)], 0, true)
spanPayCheck(spanNear([vectrfield:bgiyodjrr, vectrfield:k,
vectrfield:michael, vectrfield:jay], 1, true), payloadRef:
b[0]=48;b[0]=49;b[0]=50;b[0]=51;)
spanPayCheck(spanNear([vectrfield:bgiyodjrr,
SpanMultiTermQueryWrapper(vectrfield:k*),
SpanMultiTermQueryWrapper(vectrfield:m*),
SpanMultiTermQueryWrapper(vectrfield:j*)], 1, true), payloadRef:
b[0]=48;b[0]=49;b[0]=50;b[0]=51;)


The code is here:
https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java

The benchmark should probably not be called a 'benchmark'. Do you think it
may be too simplistic? Can we expect some bad surprises somewhere?

Thanks,

  roman

Re: Regexp and speed

Posted by Roman Chyla <ro...@gmail.com>.
I also found the results of a test with 1M docs:


258033ms.  Buiding index of 1000000 docs
29703ms.  Verifying data integrity with 100 docs
1821ms.  Preparing 10000 random queries
2867284ms.  Regex queries
18772ms.  Regexp queries (new style)
29257ms.  Wildcard queries
4920ms.  Boolean queries
Totals: [1749708, 1744494, 1749708, 1744494]
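
To put the 1M-doc numbers on a per-query footing, here is a small sketch that divides each reported time by the query count. It assumes all 10000 prepared queries ran in every category (this run shows no "Stopping execution" line); the class and method names are made up for illustration and are not part of BenchmarkAuthorSearch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerQueryLatency {
    /** Average time per query in milliseconds. */
    public static double perQueryMs(long totalMs, int queries) {
        return (double) totalMs / queries;
    }

    /** How many times slower the first total is than the second. */
    public static double ratio(long slowMs, long fastMs) {
        return (double) slowMs / fastMs;
    }

    public static void main(String[] args) {
        // Assumption: every category executed all 10000 prepared queries.
        int queries = 10000;

        // Totals copied from the 1M-doc run above.
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("Regex queries", 2867284L);
        totals.put("Regexp queries (new style)", 18772L);
        totals.put("Wildcard queries", 29257L);
        totals.put("Boolean queries", 4920L);

        for (Map.Entry<String, Long> e : totals.entrySet()) {
            System.out.printf("%-28s %8.2f ms/query%n",
                    e.getKey(), perQueryMs(e.getValue(), queries));
        }
        System.out.printf("old regex / new regexp: %.1fx%n",
                ratio(2867284L, 18772L));
    }
}
```

Under that assumption the old regex queries average roughly 287 ms each, the new-style regexp queries under 2 ms, a difference of about 150x.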


Re: Regexp and speed

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Nov 30, 2012 at 12:13 PM, Roman Chyla <ro...@gmail.com> wrote:

>
> The code here:
>
> https://github.com/romanchyla/montysolr/blob/solr-trunk/contrib/adsabs/src/test/org/adsabs/lucene/BenchmarkAuthorSearch.java
>
> The benchmark should probably not be called 'benchmark', do you think it
> may be too simplistic? Can we expect some bad surprises somewhere?
>
>
I think there may be a few surprises. Since it extends LuceneTestCase and
uses RandomIndexWriter, newSearcher and so on, all of which randomize the
setup from run to run, the benchmark results can be confusing.

This stuff is fantastic for tests, but for benchmarks it may cause
confusion.

For example, a run might pick the SimpleText codec, wrap the
IndexSearcher in slow things like ParallelReader, or end up with
horrific merge parameters, and so on.
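
One way to act on this is to drive the measurement from a plain timing loop instead of the test framework. The sketch below uses only the JDK and shows the shape of such a loop: a fixed seed for query generation, an untimed warm-up pass, and a time budget similar to the "Stopping execution" lines in the output above. Everything in it is hypothetical, including the class name, the query format and the `search` function, which is a placeholder for a call into an IndexSearcher opened on an index written with a plain IndexWriter and explicitly chosen codec and merge settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToIntFunction;

public class FixedSeedTimer {
    /** Outcome of one timed batch of queries. */
    public static final class Result {
        public final int finished;   // queries completed within the budget
        public final long hits;      // sum of hit counts returned by search
        public final long elapsedMs; // wall-clock time of the timed loop

        Result(int finished, long hits, long elapsedMs) {
            this.finished = finished;
            this.hits = hits;
            this.elapsedMs = elapsedMs;
        }
    }

    /** Same seed gives the same query list on every run. */
    public static List<String> makeQueries(long seed, int n) {
        Random rnd = new Random(seed);
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add("name" + rnd.nextInt(1000) + "*"); // made-up query shape
        }
        return out;
    }

    /** Runs an untimed warm-up, then times queries until done or over budget. */
    public static Result run(List<String> queries, ToIntFunction<String> search,
                             int warmup, long budgetMs) {
        for (int i = 0; i < Math.min(warmup, queries.size()); i++) {
            search.applyAsInt(queries.get(i)); // warm-up, result ignored
        }
        long start = System.nanoTime();
        long hits = 0;
        int finished = 0;
        for (String q : queries) {
            if ((System.nanoTime() - start) / 1_000_000 > budgetMs) {
                break; // stop once the time budget is used up
            }
            hits += search.applyAsInt(q);
            finished++;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new Result(finished, hits, elapsedMs);
    }

    public static void main(String[] args) {
        List<String> queries = makeQueries(42L, 1000);
        // Placeholder search: a real run would query an index here.
        ToIntFunction<String> fakeSearch = q -> q.length();
        Result r = run(queries, fakeSearch, 100, 60_000L);
        System.out.println(r.elapsedMs + "ms.  finished=" + r.finished
                + " hits=" + r.hits);
    }
}
```

Reporting `finished` next to the elapsed time keeps truncated runs comparable, since a batch that hit the budget can still be expressed as time per completed query.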