Posted to users@jackrabbit.apache.org by Sriram Narayanan <sr...@gmail.com> on 2007/03/01 10:30:15 UTC

Some questions on Jackrabbit performance with large data sets

Hi list:

We are using Jackrabbit 1.2.1 at present.

Our nodes look like this:
/Product/Customer1/Settings
/Product/Customer1/Configuration
/Product/Customer1/DataA
/Product/Customer1/DataB
/Product/Customer2/Settings
/Product/Customer2/Configuration
/Product/Customer2/DataA
/Product/Customer2/DataB

This way, we go all the way up to 250 customers.

When we load data for all these customers, we see that the Derby
database size is 2.5 GB, and the Lucene index is 470 MB.

We have to provide for the following:
a. Access data for around 20 customers simultaneously.
b. The queries are of the type "All attributes of a given node for a
given customer".
c. Data about one customer should not be accessed by another customer.

At present, we access Jackrabbit using 20 threads and 20 different
sessions. This is to keep each customer's data separate.

We're seeing performance figures such as the following:

Network Derby: 80 seconds for all the threads to receive results
Oracle: 35 seconds for all the threads to receive results

Some questions:
1. What lessons have community members learned from using Derby?
2. Would you recommend Oracle over Derby for such large
amounts of data?
3. Are there ways to speed up Lucene searches?
4. Are Lucene searches affected by such large indexes?
5. Would it be better for us to split the repository into smaller ones
and to then have smaller Lucene indexes?
6. For such large data, would Embedded Derby or Network Derby be
suitable for the task?

-- Sriram

Re: Some questions on Jackrabbit performance with large data sets

Posted by Marcel Reutegger <ma...@gmx.net>.

Sriram Narayanan wrote:
> 1. What lessons have community members learned from using Derby?

what I heard from others playing with different setups is that Derby over a 
network is quite slow. I didn't do any tests myself, but it seems that Derby is 
the best choice if you use it in embedded mode, while you should consider another 
db if you use a standalone db server.

> 2. Would you recommend Oracle over Derby for such large
> amounts of data?

from what I've seen so far, both scale well with large amounts of data.

> 3. Are there ways to speed up Lucene searches?

1) there are configuration parameters that affect the query performance:
	a) respectDocumentOrder
	b) resultFetchSize
    see [1] for some details on those parameters.

2) some query features are more expensive than others, which means you may be 
able to speed up searches by rephrasing your query statements.

> 4. Are Lucene searches affected by such large indexes?

access rights are checked at the very end of the query and will probably affect 
your queries negatively. because your access rights are limited to a certain 
customer, most query results are rejected by access control in the last stage 
of the query execution. if we assume 250 customers and each has access only to 
its own tree, an average of 99.6% of the query result nodes (249 out of 250) 
are rejected by access control.
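
The 99.6% figure follows from simple arithmetic, sketched below in Java. This is an illustration only and not Jackrabbit code: the class and method names are hypothetical, and it assumes that query hits are spread evenly across the customer subtrees and that each session can read exactly one of them.

```java
import java.util.Locale;

// Illustrative arithmetic only; names are hypothetical, not part of Jackrabbit.
final class AccessControlOverhead {

    // Fraction of query hits rejected by access control, assuming hits are
    // spread evenly over `customers` subtrees and a session may read one.
    static double rejectedFraction(int customers) {
        if (customers < 1) {
            throw new IllegalArgumentException("customers must be >= 1");
        }
        return 1.0 - 1.0 / customers;
    }

    public static void main(String[] args) {
        // 250 customers: 249 of every 250 hits are discarded.
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                100 * rejectedFraction(250))); // prints 99.6%
    }
}
```

With a single customer per index the same function returns 0, which is the case where no hits are thrown away for belonging to another customer.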

> 5. Would it be better for us to split the repository into smaller ones
> and to then have smaller Lucene indexes?

if each customer has access only to its own tree I would definitely create one 
workspace per customer. this will result in:

- smaller indexes
- faster queries, because only a small number of intermediate result nodes are 
rejected by access control
- you can configure an idle time which will shut down workspaces that are not in 
use (-> saves resources)
- allows better concurrency because an update in one workspace does not affect 
other workspaces
- allows you to create db backups per customer

> 6. For such large data, would Embedded Derby or Network Derby be
> suitable for the task?

as mentioned before, I think Derby does its job best if it runs embedded.

regards
  marcel

[1] http://svn.apache.org/repos/asf/jackrabbit/tags/1.2.2/jackrabbit-core/src/main/config/repository.xml