Posted to dev@jackrabbit.apache.org by Daniel Hagen <dh...@h1-software.de> on 2005/09/25 15:39:08 UTC

Jackrabbit Performance

Hi,

I apologize if this is the wrong place to ask my questions but I do not know
where else I should ask.

I am currently considering the use of Jackrabbit in a future project.
The (very) rough layout I am thinking about is JBoss as Application Server
and Jackrabbit for content storage (equipped with a custom access manager
and login module for authentication & authorization).

But I am not sure whether Jackrabbit will be able to handle the amount of
data we will have to deal with.
The application might have to handle ~2000 - 5000 new documents/day (size
ranging from 2 KB to 1 MB, I assume an average of ~50 KB).
Each document will have about 5 - 10 simple text properties and the "binary"
content of the documents (plain text/HTML/MS Word/PDF) will have to be
indexed for a fulltext search.
Read access to the contents will not be very frequent: I am assuming 5
requests per minute for the mentioned simple properties of a node, 5
concurrent users, and access to binary contents probably about once every
minute.

In short: The application will have to be able to do a fulltext search on
(worst case) more than 10,000,000 contents and will have to handle creation
of new contents without stalling the server.

What is your opinion, is Jackrabbit the right tool for the task?
Which Persistence Manager would be the best choice?
Are there any special hardware considerations I should think about (e.g.
separating index and storage on separate discs using separate controllers
...)?
Should we have OS preferences for the server (current options are Windows
2003 Server vs. Linux with a strong preference towards Windows 2003 Server)?

I know that not all of my questions are directly related to Jackrabbit
development and some will probably not be answered due to a lack of existing
data, but any clues/hints will be greatly appreciated.

Thank you for your help!

Daniel


Re: Jackrabbit Performance

Posted by Stefan Guggisberg <st...@gmail.com>.
hi daniel,

some remarks/answers follow inline:

On 9/25/05, Daniel Hagen <dh...@h1-software.de> wrote:
> Hi,
>
> I apologize if this is the wrong place to ask my questions but I do not know
> where else I should ask.
>
> I am currently considering the use of Jackrabbit in a future project.
> The (very) rough layout I am thinking about is JBoss as Application Server
> and Jackrabbit for content storage (equipped with a custom access manager
> and login module for authentication & authorization).
>
> But I am not sure whether Jackrabbit will be able to handle the amount of
> data we will have to deal with.
> The application might have to handle ~2000 - 5000 new documents/day (size
> ranging from 2 KB to 1 MB, I assume an average of ~50 KB).
> Each document will have about 5 - 10 simple text properties and the "binary"
> content of the documents (plain text/HTML/MS Word/PDF) will have to be
> indexed for a fulltext search.
> Read access to the contents will not be very frequent: I am assuming 5
> requests per minute for the mentioned simple properties of a node, 5
> concurrent users, and access to binary contents probably about once every
> minute.
>
> In short: The application will have to be able to do a fulltext search on
> (worst case) more than 10,000,000 contents and will have to handle creation
> of new contents without stalling the server.
>
> What is your opinion, is Jackrabbit the right tool for the task?
> Which Persistence Manager would be the best choice?
> Are there any special hardware considerations I should think about (e.g.
> separating index and storage on separate discs using separate controllers
> ...)?
> Should we have OS preferences for the server (current options are Windows
> 2003 Server vs. Linux with a strong preference towards Windows 2003 Server)?

if you're using a filesystem-based pm (e.g. ObjectPersistenceManager on
LocalFileSystem) i'd definitely go for linux. the windows filesystem really
sucks with a large number of small files. with the CQFileSystem
(custom filesystem in-a-file) you can improve the performance on a windows
box considerably but it's not open source and it's only free for
non-commercial use.

ObjectPersistenceManager w/LocalFileSystem on a linux box provides imo
decent performance; its major flaw is that it is non-transactional.

there's also a jdbc-based pm in the contrib directory (contrib/db-persistence).
it is transactional and, depending on the type of database (e.g. mysql),
provides very decent performance.

i suggest you set up your own performance/scalability tests.
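
As a starting point for such a test, here is a minimal sketch that times
writing and reading many small files in a temporary directory. It is only a
plain-Java stand-in for the "large number of small files" workload mentioned
above and does not use Jackrabbit at all; the class name SmallFileBenchmark,
the file naming scheme, and the file count and size are assumptions chosen for
illustration. Running the same program on the candidate Windows and Linux
machines gives a first, very rough comparison of the underlying filesystems.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in workload: it exercises the local filesystem only,
// not a Jackrabbit persistence manager.
class SmallFileBenchmark {

    /** Writes count files of sizeBytes each into dir; returns elapsed nanoseconds. */
    static long writeFiles(Path dir, int count, int sizeBytes) throws IOException {
        byte[] payload = new byte[sizeBytes];
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Files.write(dir.resolve("item-" + i + ".bin"), payload);
        }
        return System.nanoTime() - start;
    }

    /** Reads back the files written by writeFiles; returns elapsed nanoseconds. */
    static long readFiles(Path dir, int count) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Files.readAllBytes(dir.resolve("item-" + i + ".bin"));
        }
        return System.nanoTime() - start;
    }

    /** Deletes dir and everything below it (children before parents). */
    static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        int count = 2_000;          // assumed: roughly one day of new documents
        int sizeBytes = 2 * 1024;   // assumed: the smallest document size mentioned
        Path dir = Files.createTempDirectory("small-files");
        try {
            long writeNanos = writeFiles(dir, count, sizeBytes);
            long readNanos = readFiles(dir, count);
            System.out.printf("wrote %d files in %d ms, read them in %d ms%n",
                    count, writeNanos / 1_000_000, readNanos / 1_000_000);
        } finally {
            deleteRecursively(dir);
        }
    }
}
```

A single run like this is dominated by OS caching, so the numbers are only a
hint. A realistic test would go through the JCR API against the persistence
manager being evaluated.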

cheers
stefan

>
> I know that not all of my questions are directly related to Jackrabbit
> development and some will probably not be answered due to a lack of existing
> data, but any clues/hints will be greatly appreciated.
>
> Thank you for your help!
>
> Daniel
>
>

Re: Jackrabbit Performance

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Daniel,

I'll try to answer some of your questions.

Daniel Hagen wrote:
> The application might have to handle ~2000 - 5000 new documents/day (size
> ranging from 2 KB to 1 MB, I assume an average of ~50 KB).
> Each document will have about 5 - 10 simple text properties and the "binary"
> content of the documents (plain text/HTML/MS Word/PDF) will have to be
> indexed for a fulltext search.
> Read access to the contents will not be very frequent: I am assuming 5
> requests per minute for the mentioned simple properties of a node, 5
> concurrent users, and access to binary contents probably about once every
> minute.
> 
> In short: The application will have to be able to do a fulltext search on
> (worst case) more than 10,000,000 contents and will have to handle creation
> of new contents without stalling the server.

regarding concurrency, this is not a problem. Jackrabbit is able to 
handle queries and workspace modifications concurrently.

regarding the volume of content: this more or less depends on how well 
lucene scales. and it seems that it does quite well. The lucene website 
probably has some information on this topic.
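
To put the volume in perspective, a back-of-envelope sketch using only the
figures from the question (2000 - 5000 documents/day, ~50 KB average,
10,000,000 documents worst case). The class name VolumeEstimate and the use of
decimal units (1 MB = 1000 KB) are assumptions for illustration; the result
covers raw document content only and says nothing about the size of the
lucene index.

```java
// Rough sizing based on the numbers in the original question.
// Decimal units (1 MB = 1000 KB, 1 GB = 1,000,000 KB) are an assumption.
class VolumeEstimate {
    static final long AVG_DOC_KB = 50;
    static final long WORST_CASE_DOCS = 10_000_000;

    /** Raw content added per day, in megabytes. */
    static long dailyMegabytes(long docsPerDay) {
        return docsPerDay * AVG_DOC_KB / 1_000;
    }

    /** Days needed to accumulate totalDocs at a constant intake rate. */
    static long daysToReach(long totalDocs, long docsPerDay) {
        return totalDocs / docsPerDay;
    }

    /** Raw content for totalDocs documents, in gigabytes. */
    static long totalGigabytes(long totalDocs) {
        return totalDocs * AVG_DOC_KB / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("MB/day at 2000 docs/day: " + dailyMegabytes(2_000));
        System.out.println("MB/day at 5000 docs/day: " + dailyMegabytes(5_000));
        System.out.println("days to 10M docs at 5000/day: "
                + daysToReach(WORST_CASE_DOCS, 5_000));
        System.out.println("GB of raw content at 10M docs: "
                + totalGigabytes(WORST_CASE_DOCS));
    }
}
```

Under these assumptions the intake is about 100 - 250 MB of raw content per
day, the worst case of 10,000,000 documents is reached after roughly 2000 days
(about five and a half years) at the highest rate, and the raw content then
amounts to about 500 GB.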

> Are there any special hardware considerations I should think about (e.g.
> separating index and storage on separate discs using separate controllers
> ...)?

this will definitely help increase performance.


regards
  marcel