Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/03/08 15:17:33 UTC

Oak benchmarks (Was: [jr3] Index on randomly distributed data)

Hi,

On Tue, Mar 6, 2012 at 5:01 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Rather than discuss this issue in the abstract, I suggest that we
> define a set of relevant performance benchmarks, and use them for
> evaluating potential alternatives.

In addition to this specific case, I think it's important that we
define and implement a good set of performance and scalability
benchmarks as early as possible. That allows us to get a good picture
of where we are and what areas and potential bottlenecks need more
focus. Such a set of benchmarks should also make it easy to evaluate
alternative designs and produce hard evidence to help resolve
potential disagreements.

So what should we benchmark then? Here's one idea to get us started:

* Large, flat hierarchy (selected pages-articles dump from Wikipedia)
  * Time it takes to load all articles (ideally a single transaction)
  * Amount of disk space used
  * Time it takes to iterate over all articles
  * Number of reads by X clients in Y seconds (power-law distribution)
  * Number of writes by X clients in Y seconds (power-law distribution)
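The power-law access pattern mentioned above could be simulated with a Zipf-like distribution over article indices. Here is a minimal Python sketch (not Oak code; the names `make_zipf_sampler` and the rate constants are illustrative assumptions, and a real run would replace the counting with actual JCR read/write calls):

```python
import random

def make_zipf_sampler(n_items, s=1.0, seed=None):
    # Zipf-like weights: item i gets weight 1/(i+1)^s, so a few
    # "hot" articles receive most of the simulated accesses.
    rng = random.Random(seed)
    weights = [1.0 / (i + 1) ** s for i in range(n_items)]
    def sample(k):
        # Draw k article indices according to the power-law weights.
        return rng.choices(range(n_items), weights=weights, k=k)
    return sample

# X clients issuing operations for Y seconds at some fixed rate;
# here we only generate the access pattern and count operations.
sample = make_zipf_sampler(n_items=1000, seed=42)
accesses = sample(k=5000)            # 5000 simulated operations
hot_share = accesses.count(0) / len(accesses)
```

With the exponent s=1 the hottest article alone should draw a visibly outsized share of the traffic, which is the property the benchmark wants to stress.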

Ideally we'd design the benchmarks so that they can be run against not
just different configurations of Oak, but also Jackrabbit 2.x and
other databases (SQL and NoSQL) like Oracle, PostgreSQL, CouchDB and
MongoDB.

To start with, I'd target the following basic deployment configurations:

* 1 node, MB-range test sets (small embedded or development/testing deployment)
* 4 nodes, GB-range test sets (mid-size non-cloud deployment)
* 16 nodes, TB-range test sets (low-end cloud deployment)

WDYT?

BR,

Jukka Zitting

Re: Oak benchmarks (Was: [jr3] Index on randomly distributed data)

Posted by Michael Dürig <md...@apache.org>.

On 8.3.12 14:17, Jukka Zitting wrote:
> So what should we benchmark then? Here's one idea to get us started:
>
> * Large, flat hierarchy (selected pages-articles dump from Wikipedia)
>    * Time it takes to load all articles (ideally a single transaction)
>    * Amount of disk space used
>    * Time it takes to iterate over all articles
>    * Number of reads by X clients in Y seconds (power-law distribution)
>    * Number of writes by X clients in Y seconds (power-law distribution)

Ack. In addition we should add tests which check that large numbers of 
direct child nodes (millions) work; that is, adding a child node takes 
constant time irrespective of how many child nodes already exist. 
This use case seems to be quite important to us.
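Such a flat-hierarchy check could be sketched roughly like this (an illustrative Python harness against a toy dict-backed node, not the Oak API; a real test would drive `Node.addNode` through JCR and compare the per-batch timings):

```python
import time

class FlatNode:
    # Toy node with dict-backed children: dict insertion is O(1)
    # amortized, so batch timings should stay flat as the count grows.
    def __init__(self):
        self.children = {}

    def add_child(self, name):
        self.children[name] = FlatNode()

def time_adds(node, start, batch):
    # Time one batch of child additions at a given existing size.
    t0 = time.perf_counter()
    for i in range(start, start + batch):
        node.add_child("child-%d" % i)
    return time.perf_counter() - t0

node = FlatNode()
timings = [time_adds(node, step * 100_000, 100_000) for step in range(5)]
```

If adding children were linear in the existing child count, the later batches would take visibly longer than the first; constant-time behaviour shows up as roughly flat batch timings.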

Michael

> Ideally we'd design the benchmarks so that they can be run against not
> just different configurations of Oak, but also Jackrabbit 2.x and
> other databases (SQL and NoSQL) like Oracle, PostgreSQL, CouchDB and
> MongoDB.
>
> To start with, I'd target the following basic deployment configurations:
>
> * 1 node, MB-range test sets (small embedded or development/testing deployment)
> * 4 nodes, GB-range test sets (mid-size non-cloud deployment)
> * 16 nodes, TB-range test sets (low-end cloud deployment)

Sounds like a good idea to me. Having such deployment configurations and 
testing infrastructure ready from the beginning should help a lot during 
further development.

Michael

>
> WDYT?
>
> BR,
>
> Jukka Zitting

Re: Oak benchmarks (Was: [jr3] Index on randomly distributed data)

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>The goals as currently defined are too vague
>(what kind of read access patterns, how much data per node, how big a
>cluster, etc.)

I propose the following use cases:

* initial loading (something all users need to do at some point; either
all at once or in steps)

* iterating over all nodes (indexing, search if there is no index, data
store garbage collection, export, consistency check)

* reading and writing in chunks of 100 nodes (not sure if that's a
realistic pattern)
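The chunked read/write pattern could be modelled with a simple batching helper (a throwaway Python sketch; in a JCR-style benchmark each batch would be written under one save/commit, which is what the chunk size of 100 controls):

```python
def chunks(items, size=100):
    # Yield successive fixed-size batches; the last one may be short.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Here we only count the commits such a run would issue; a real
# benchmark would write each batch and call save() per batch.
commits = sum(1 for _ in chunks(range(1050), size=100))
```

For 1050 nodes in chunks of 100 that is 11 commits (10 full batches plus one partial), which is the kind of number the benchmark would sweep to find a good transaction size.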

As for 'real world data' we could use an Adobe CQ installation, or
simulate a similar structure.

>(creating 10 trillion nodes at a rate of one node per millisecond
>takes over 300 years).

But only 1.2 days with 10'000 nodes/s and 10'000 cluster instances :-)
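The arithmetic behind both figures can be checked directly (a throwaway sketch; the constants come straight from the thread):

```python
NODES = 10 * 10**12                       # 10 trillion nodes (cluster goal)

# Single writer at one node per millisecond (1000 nodes/s):
single_writer_seconds = NODES / 1000
single_writer_years = single_writer_seconds / (365.25 * 24 * 3600)

# 10'000 nodes/s on each of 10'000 cluster instances:
cluster_seconds = NODES / (10_000 * 10_000)
cluster_days = cluster_seconds / (24 * 3600)
```

The single-writer case works out to roughly 317 years, and the clustered case to about 1.16 days, matching the "1.2 days" figure above.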

Regards,
Thomas


Re: Oak benchmarks (Was: [jr3] Index on randomly distributed data)

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Mar 8, 2012 at 5:14 PM, Thomas Mueller <mu...@adobe.com> wrote:
>>To start with, I'd target the following basic deployment configurations:
>>
>>* 1 node, MB-range test sets (small embedded or development/testing
>>deployment)
>>* 4 nodes, GB-range test sets (mid-size non-cloud deployment)
>>* 16 nodes, TB-range test sets (low-end cloud deployment)
>
> I interpret the goals we defined at [1] as:
>
> * read throughput: no degradation from current Jackrabbit 2
> * single repository (without clustering): 100 million nodes
> * cluster: 10 trillion nodes

Yep, a big part in defining actual benchmarks is that such goals can
be made more concrete. The goals as currently defined are too vague
(what kind of read access patterns, how much data per node, how big a
cluster, etc.) and perhaps not too well grounded in actual use cases
(creating 10 trillion nodes at a rate of one node per millisecond
takes over 300 years).

Thus what I'd like to see here are ideas for more specific benchmarks
that model some real world use cases and deployments that we expect
the repository to be able to support.

BR,

Jukka Zitting

Re: Oak benchmarks (Was: [jr3] Index on randomly distributed data)

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>To start with, I'd target the following basic deployment configurations:
>
>* 1 node, MB-range test sets (small embedded or development/testing
>deployment)
>* 4 nodes, GB-range test sets (mid-size non-cloud deployment)
>* 16 nodes, TB-range test sets (low-end cloud deployment)

I interpret the goals we defined at [1] as:

* read throughput: no degradation from current Jackrabbit 2
* single repository (without clustering): 100 million nodes
* cluster: 10 trillion nodes

[1]: 
http://wiki.apache.org/jackrabbit/Goals%20and%20non%20goals%20for%20Jackrabbit%203


Regards,

Thomas