Posted to dev@lucene.apache.org by Sanjoy Das <sa...@playingwithpointers.com> on 2015/11/23 20:42:59 UTC

Benchmarking Lucene

Hi all,

I work for a JVM vendor, and we're interested in obtaining / creating
a set of Lucene benchmarks for internal use.  We plan to use these for
performance regression testing and general performance analysis
(i.e. to make sure Lucene performs well on our JVM).  I'm especially
interested in benchmarks that demonstrate opportunities for
improvements in our JIT compiler.

While I imagine that the lucene/benchmark/ directory is probably the
right place to start, I have a few high-level questions that are best
answered by people on this mailing list:

- Are there realistic Lucene workloads that are bottle-necked on the
   JVM's performance (JIT, GC etc.) and *not* e.g. disk / network IO?
   If so, what are some examples?

- How relevant are the DaCapo "luindex" and "lusearch" benchmarks
   today?  Will porting them to the latest version of Lucene give me a
   benchmark representative of modern Lucene usage, or have Lucene's
   performance characteristics evolved in fundamental ways since DaCapo
   was published?

- What is the distribution of Lucene versions in production
   deployments?  Do users tend to aggressively upgrade to the "latest
   and greatest" Lucene version, or is there usually a non-trivial lag?

Any other information that you think is useful or relevant is
welcome.

Thanks!
-- Sanjoy

Re: Benchmarking Lucene

Posted by Dawid Weiss <da...@gmail.com>.
> I work for Azul Systems (https://www.azul.com).

Ahem. A bit off topic.

Lucene tests are known to quite frequently crash bleeding-edge HotSpot
releases. Since Zing is not available to us, what would be great is
for Azul to run the Lucene test suite on its own JVM, so that we can
make sure everything works (for all parties involved).

Dawid


Re: Benchmarking Lucene

Posted by Sanjoy Das <sa...@playingwithpointers.com>.

Michael McCandless wrote:
 > Which JVM vendor :)  There are not so many, unfortunately...

I work for Azul Systems (https://www.azul.com).

 > I run nightly benchmarks for Lucene, which are visible at
 > https://people.apache.org/~mikemccand/lucenebench/
 >
 > We use this to catch accidental performance regressions... the sources
 > for all of this are at https://github.com/mikemccand/luceneutil but
 > running them yourself can be tricky.  They index and search
 > Wikipedia's English export.

I was hoping to get hold of benchmarks that are a little more
"lightweight" -- something that I can run from beginning to end in <
30 minutes.  Is there an interesting subset of the nightly tests that
I can run within that sort of timeframe?

 > Lucene is definitely JVM/GC bound in many cases, e.g. when the index
 > is "hot" (fully cached by the OS in free RAM).
 >
 > I'm not familiar with DaCapo...
 >
 > I'm not sure how aggressively users upgrade ... but I believe most
 > users use Lucene via Elasticsearch or Solr.
 >
 > Mike McCandless
 >
 > http://blog.mikemccandless.com

Re: Benchmarking Lucene

Posted by Michael McCandless <lu...@mikemccandless.com>.
Which JVM vendor :)  There are not so many, unfortunately...

I run nightly benchmarks for Lucene, which are visible at
https://people.apache.org/~mikemccand/lucenebench/

We use this to catch accidental performance regressions... the sources
for all of this are at https://github.com/mikemccand/luceneutil but
running them yourself can be tricky.  They index and search
Wikipedia's English export.

Lucene is definitely JVM/GC bound in many cases, e.g. when the index
is "hot" (fully cached by the OS in free RAM).

I'm not familiar with DaCapo...

I'm not sure how aggressively users upgrade ... but I believe most
users use Lucene via Elasticsearch or Solr.

Mike McCandless

http://blog.mikemccandless.com



Re: Benchmarking Lucene

Posted by Sanjoy Das <sa...@playingwithpointers.com>.
Robert Muir wrote:
 > On Mon, Nov 23, 2015 at 2:42 PM, Sanjoy Das
 > <sa...@playingwithpointers.com>  wrote:
 >> Hi all,
 >>
 >> I work for a JVM vendor, and we're interested in obtaining / creating
 >> a set of Lucene benchmarks for internal use.  We plan to use these for
 >> performance regression testing and general performance analysis
 >> (i.e. to make sure Lucene performs well on our JVM).  I'm especially
 >> interested in benchmarks that demonstrate opportunities for
 >> improvements in our JIT compiler.
 >>
 >> While I imagine that the lucene/benchmark/ directory is probably the
 >> right place to start, I have a few high-level questions that are best
 >> answered by people on this mailing list:
 >
 > Actually I think http://people.apache.org/~mikemccand/lucenebench/
 > might be better for your purposes. Code is currently located here:
 > https://github.com/mikemccand/luceneutil

I just replied to Mike about this -- ideally the benchmarks I'm
looking for should run relatively quickly (i.e. < 30 min).

However, if lucenebench is the right thing to run, I'd rather have a
good benchmark that takes a while to finish than a misleading
benchmark that runs quickly. :)

 >> - Are there realistic Lucene workloads that are bottle-necked on the
 >>    JVM's performance (JIT, GC etc.) and *not* e.g. disk / network IO?
 >>    If so, what are some examples?
 >
 > You can see some changes in query graphs when the JVM was upgraded at
 > the above link. In some cases they are not positive. For example, why
 > did indexing throughput drop significantly when upgrading from
 > 1.8.0_25 to 1.8.0_40? (annotation BD in
 > http://people.apache.org/~mikemccand/lucenebench/indexing.html)

I don't work on OpenJDK, so I cannot comment on OpenJDK's performance,
but that is an interesting data point nevertheless.  It certainly
shows that changes in the JVM can move Lucene's performance in either
direction.

 >> - How relevant are the DaCapo "luindex" and "lusearch" benchmarks
 >>    today?  Will porting them to the latest version of Lucene give me a
 >>    benchmark representative of modern Lucene usage, or have Lucene's
 >>    performance characteristics evolved in fundamental ways since DaCapo
 >>    was published?
 >
 > Some things have changed since Lucene 2.4, such as much better
 > concurrency when indexing with multiple threads, the use of bulk
 > integer decompression methods vs vByte compression, and so on. Also,
 > support for new data structures like column-stride fields was added,
 > and the use cases around those (e.g. faceted search) are probably not
 > represented.

Thanks, that is very useful to know.

-- Sanjoy



Re: Benchmarking Lucene

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Nov 23, 2015 at 2:42 PM, Sanjoy Das
<sa...@playingwithpointers.com> wrote:
> Hi all,
>
> I work for a JVM vendor, and we're interested in obtaining / creating
> a set of Lucene benchmarks for internal use.  We plan to use these for
> performance regression testing and general performance analysis
> (i.e. to make sure Lucene performs well on our JVM).  I'm especially
> interested in benchmarks that demonstrate opportunities for
> improvements in our JIT compiler.
>
> While I imagine that the lucene/benchmark/ directory is probably the
> right place to start, I have a few high-level questions that are best
> answered by people on this mailing list:

Actually I think http://people.apache.org/~mikemccand/lucenebench/
might be better for your purposes. Code is currently located here:
https://github.com/mikemccand/luceneutil

>
> - Are there realistic Lucene workloads that are bottle-necked on the
>   JVM's performance (JIT, GC etc.) and *not* e.g. disk / network IO?
>   If so, what are some examples?

At the above link you can see some changes in the query graphs when
the JVM was upgraded. In some cases they are not positive. For
example, why did indexing throughput drop significantly when upgrading
from 1.8.0_25 to 1.8.0_40? (annotation BD in
http://people.apache.org/~mikemccand/lucenebench/indexing.html)

>
> - How relevant are the DaCapo "luindex" and "lusearch" benchmarks
>   today?  Will porting them to the latest version of Lucene give me a
>   benchmark representative of modern Lucene usage, or have Lucene's
>   performance characteristics evolved in fundamental ways since DaCapo
>   was published?

Some things have changed since Lucene 2.4, such as much better
concurrency when indexing with multiple threads, the use of bulk
integer decompression methods vs vByte compression, and so on. Also,
support for new data structures like column-stride fields was added,
and the use cases around those (e.g. faceted search) are probably not
represented.
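
To make the decompression point concrete, here is a deliberately
simplified contrast (my own toy sketch, not Lucene's actual postings
code). vByte decoding takes a data-dependent branch per byte, while a
bulk fixed-width decode is a plain counted loop that the JIT can
unroll or vectorize:

    public class VByteVsBulk {

      // vByte: 7 payload bits per byte; a set high bit means "more
      // bytes follow".  Decoding branches unpredictably on every byte.
      static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
          b = buf[pos[0]++];
          value |= (b & 0x7F) << shift;
        }
        return value;
      }

      // Bulk decode of a block of fixed-width (here 16-bit, big-endian)
      // values: a branch-free counted loop that is friendly to loop
      // unrolling and vectorization.
      static void decode16BitBlock(byte[] buf, int offset,
                                   int[] dest, int count) {
        for (int i = 0; i < count; i++) {
          int p = offset + 2 * i;
          dest[i] = ((buf[p] & 0xFF) << 8) | (buf[p + 1] & 0xFF);
        }
      }

      public static void main(String[] args) {
        byte[] v = {(byte) 0x96, 0x01};           // vByte encoding of 150
        System.out.println(readVInt(v, new int[] {0}));  // prints 150

        byte[] block = {0x00, 0x2A, 0x01, 0x00};  // 42 and 256 as shorts
        int[] out = new int[2];
        decode16BitBlock(block, 0, out, 2);
        System.out.println(out[0] + " " + out[1]);  // prints 42 256
      }
    }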

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org