Posted to dev@jackrabbit.apache.org by Viraf Bankwalla <vi...@yahoo.com> on 2007/04/27 03:54:32 UTC

Jackrabbit Scalability / Performance

Hi, 
  
 I am working on an application in which documents arriving at a mail-room are scanned and placed in a content repository. I need basic functionality to add, locate and retrieve artifacts in the repository. JSR-170 provides these basic services, but the interface looks chatty. To address the chattiness and the performance issues associated with large documents, I am planning to expose coarse-grained business services (which use the local JCR interface). Given that these are scanned images, I do not need the documents to be full-text indexed. I do, however, need to be able to search on the metadata associated with a document. I was wondering:


   Has anyone built an application similar to that described above? What version of Jackrabbit was used, and what issues did you run into? How much metadata did a node carry, what was the average depth of a leaf node, and how many nodes did you have before performance became an issue?
   I am considering building a cluster of servers providing repository services. Can the repository be clustered? (A load balancer in front of the repository would distribute requests to a pool of repository servers.)
   How does the repository scale? Can it handle > 50 million artifacts? (If the artifacts are placed on the file system, does Alfresco manage the directory structure, or are all files placed in a single directory?)
   Is there support for auditing access to documents?
   Is there support for defining archival / retention policies?

   Is there support for backups?
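
To make the coarse-grained idea concrete, the facade I have in mind would look roughly like this. The names are illustrative and the store is an in-memory map here, not a real JCR session; a real implementation would delegate to the local JCR interface:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustrative coarse-grained facade: one call stores the scanned image
// together with its metadata; one call searches on metadata only.
// Backed by in-memory maps for the sketch; the names here are
// hypothetical, not JSR-170 API.
public class DocumentService {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private final Map<String, Map<String, String>> metadata = new HashMap<>();

    /** Store image bytes and metadata in a single coarse-grained call. */
    public String store(byte[] image, Map<String, String> meta) {
        String id = UUID.randomUUID().toString();
        blobs.put(id, image);
        metadata.put(id, new HashMap<>(meta));
        return id;
    }

    /** Find document ids whose metadata contains the given key/value. */
    public List<String> findByMeta(String key, String value) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : metadata.entrySet()) {
            if (value.equals(e.getValue().get(key))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    /** Retrieve the stored image bytes for a document id. */
    public byte[] retrieve(String id) {
        return blobs.get(id);
    }
}
```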

  
 Thanks.

- viraf

       

Re: Jackrabbit Scalability / Performance

Posted by Marcel Reutegger <ma...@gmx.net>.
Christoph Kiehl wrote:
> AFAIK you should back up your index files as well. We've got a fairly
> large workspace with about 3.5GB of data. If we just back up the rdbms
> data and rebuild the index from that data, it takes some hours. This
> is unacceptable if you need to recover a production system. Our current
> solution is to shut down the repository for a short time, start the rdbms
> backup, and copy the index files. When the index file copying is finished we
> start the repository again, while the rdbms backup is still running
> (we use Oracle, which allows write operations that don't affect the
> backup data).
> If you know of a better solution that doesn't require shutting down the
> repository in between, I would like to hear about it.

If you extend the jackrabbit query handler (o.a.j.core.query.lucene.SearchIndex) 
you get access to the IndexReader of the index. The returned index reader gives 
you a snapshot view of the index and will never change, even while other sessions 
continue to write changes to the index. Using the index reader you can then 
create a new index that is written to the backup location. Something along 
these lines:

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Snapshot view of the live index; unaffected by concurrent writes.
IndexReader reader = getIndexReader();
IndexWriter writer = new IndexWriter(
      new File("backup-location"), getTextAnalyzer(), true); // true = create new index
writer.addIndexes(new IndexReader[]{reader});
writer.close();
reader.close();

regards
  marcel

Re: Jackrabbit Scalability / Performance

Posted by Oliver Zeigermann <ol...@zeigermann.de>.
2007/4/28, Christoph Kiehl <ch...@sulu3000.de>:
> If you know about a better solution without shutting down the repository in
> between I would like to hear about it.

I do not think this is supported by Jackrabbit, but generally
switching to some sort of "read-only" run-level might be an idea. If
there are no write processes, you could safely do your backup as well.
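
A minimal sketch of such a run-level switch (my own illustration, not Jackrabbit API): normal writes hold the read side of a read/write lock, so many writers proceed concurrently; entering "read-only" mode takes the write side, which drains in-flight writes and blocks new ones until the backup finishes.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative run-level switch for pausing repository writes during a
// backup. Not Jackrabbit API; just the locking idea.
public class RunLevel {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Perform a repository write; blocked while a backup is running. */
    public void write(Runnable writeOp) {
        lock.readLock().lock();
        try {
            writeOp.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Switch to read-only, run the backup, then resume writes. */
    public void backup(Runnable backupOp) {
        lock.writeLock().lock();
        try {
            backupOp.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```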

Cheers

Oliver

Re: Jackrabbit Scalability / Performance

Posted by Ian Boston <ie...@tfd.co.uk>.
Bertrand Delacretaz wrote:
> On 4/28/07, Christoph Kiehl <ch...@sulu3000.de> wrote:
> 
>> ...Our current solution is to shutdown the
>> repository for a short time start the rdbms backup and copy the index 
>> files.
>> When index file copying is finished we startup the repository again...
> 
> Note that the Lucene-based Solr indexer
> (http://lucene.apache.org/solr/) has a clever way of allowing online
> backups of Lucene indexes, without having to stop anything (or for a
> very short time only).
> 
> In short, it works like this:
> 
> -Solr can be configured to launch a "snapshotter" script at a point in
> time when it's not writing anything to the index.
> 
> -The script takes a snapshot of the index files using hard links
> (won't work on Windows AFAIK), which is very quick on Unixish
> platforms.
> 
> -Solr waits until the script is done (a few milliseconds I guess) and
> resumes indexing.
> 
> -Another asynchronous backup script can then copy the snapshot
> anywhere, from the hard linked files, without disturbing Solr.
> 
> This won't help for the RDBMS part, but implementing something similar
> might help for online backups of index files.
> 
> See http://wiki.apache.org/solr/CollectionDistribution for more
> details - the main goal described there is index replication, but it
> obviously works for backups as well.
> 
> -Bertrand

Slightly off thread, but relevant to index backup

-----
Sakai has been using Lucene to provide search indexes in a cluster. We 
have been using a realtime index distribution mechanism where all nodes 
can take part in the indexing and all nodes can take part in the search 
delivery. With minor modifications it can work as an indexing farm and a 
searching farm.

It uses a shim just below the index open/close that manages updates to 
the clusters local disks just below the IndexReaders and IndexWriters.

I looked at Nutch and the Nutch file system at the time, and 
unfortunately we had to reject it because, like Solr, it required Unix 
setup and system commands, and we needed a 100% Java solution that worked 
out of the box.

It doesn't do Map Reduce, but it does put the indexes locally and all 
the nodes are up and running all the time.

The relevant parts of the code tree can be found at

The index factory

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java

And the distribution management

which puts segments in zipped form in a shared location, either DB or 
filesystem

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java


It is definitely not a perfect solution, and I can see it needs lots of 
improvement, but it works in production.

If Jackrabbit looks really good in a cluster (which I am expecting), we 
may start putting the indexes directly in Jackrabbit and let it manage 
the distribution; they are not that big in most cases, generally < 10G. 
(The total data set being indexed will go up to 1TB at some universities.)


The main point being, the central location provides a convenient place 
for consistent backups of the index (perhaps it is overkill).

Ian


Re: Jackrabbit Scalability / Performance

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 4/28/07, Christoph Kiehl <ch...@sulu3000.de> wrote:

> ...Our current solution is to shutdown the
> repository for a short time start the rdbms backup and copy the index files.
> When index file copying is finished we startup the repository again...

Note that the Lucene-based Solr indexer
(http://lucene.apache.org/solr/) has a clever way of allowing online
backups of Lucene indexes, without having to stop anything (or for a
very short time only).

In short, it works like this:

-Solr can be configured to launch a "snapshotter" script at a point in
time when it's not writing anything to the index.

-The script takes a snapshot of the index files using hard links
(won't work on Windows AFAIK), which is very quick on Unixish
platforms.

-Solr waits until the script is done (a few milliseconds I guess) and
resumes indexing.

-Another asynchronous backup script can then copy the snapshot
anywhere, from the hard linked files, without disturbing Solr.

This won't help for the RDBMS part, but implementing something similar
might help for online backups of index files.

See http://wiki.apache.org/solr/CollectionDistribution for more
details - the main goal described there is index replication, but it
obviously works for backups as well.
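
For what it's worth, the same hard-link trick can be done in plain Java with NIO. This is my own illustration, not Solr's snapshotter script, and it assumes a filesystem that supports hard links:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Snapshot an index directory by hard-linking each file into a snapshot
// directory. Hard links are near-instant and share the underlying data,
// so the live index can resume immediately while a slower backup job
// copies the snapshot elsewhere.
public class IndexSnapshot {
    public static Path snapshot(Path indexDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // Link, don't copy: the snapshot shares the file data.
                    Files.createLink(snapshotDir.resolve(file.getFileName()), file);
                }
            }
        }
        return snapshotDir;
    }
}
```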

-Bertrand

Re: Jackrabbit Scalability / Performance

Posted by Christoph Kiehl <ch...@sulu3000.de>.
David Nuescheler wrote:

>>    Is there support for defining archival / retention policies?
> jackrabbit certainly offers the hooks to build recordsmanagment
> but does not come with ootb archival or retention facilties.
> 
>>    Is there support for backups ?
> for the most convenient backup i would recommend to persist the entire
> content repository in an rdbms and use the rdbms features for backup.

AFAIK you should back up your index files as well. We've got a fairly large 
workspace with about 3.5GB of data. If we just back up the rdbms data and rebuild 
the index from that data, it takes some hours. This is unacceptable if you 
need to recover a production system. Our current solution is to shut down the 
repository for a short time, start the rdbms backup, and copy the index files. 
When the index file copying is finished we start the repository again, while the 
rdbms backup is still running (we use Oracle, which allows write operations 
that don't affect the backup data).
If you know of a better solution that doesn't require shutting down the 
repository in between, I would like to hear about it.

Cheers,
Christoph


Re: Jackrabbit Scalability / Performance

Posted by Viraf Bankwalla <vi...@yahoo.com>.
Thanks, this is great news. Is there any additional information that you could share about your implementation? What was the deployment environment, what model did you use for persistence, how did you handle backups, etc.?

Did you consider Alfresco or other JCR solutions? What did you see as the pros and cons?

Thanks.

- viraf



David Nuescheler <da...@gmail.com> wrote: hi viraf,

thanks for your mail.

>    Has anyone built an application similar to that described above?
> What version of Jackrabbit was used, and what were the issues that you ran into.
> How much meta-data did a node carry, what was the average depth of a leaf
> node, and how many nodes did you have in the implementation before
> performance became an issue.
we built a digital asset management application that sounds very
similar to what you are describing. the meta information varies from
filetype to filetype but ranges on average between 10 and 50 properties
per nt:resource instance. in addition to typical meta information
we also store a number of thumbnail images in the content repository
for every asset.

>    I am considering on building a cluster of servers providing repository
> services. Can the repository be clustered ? (a load balancer in front of the
> repository will distribute requests to a pool of repository servers.).
yes, jackrabbit can be clustered. i would recommend, though, running the
repository with model 1 or model 2 [1] and just using the load balancer
on top of your application. this avoids the overhead of remoting
altogether and still provides you with clustering.

[1] http://jackrabbit.apache.org/doc/deploy.html

>    How does the repository scale? can it handle > 50Million artifacts
> (if the artifacts are placed on the file system does Alfresco manage
> the directory structure or are all files placed in a single directory)
assuming that you mean "jackrabbit"... ;)
we ran tests beyond 50m files and yes jackrabbit manages the filesystem
if the filesystem is chosen as the persistence layer for blobs.

>    Is there support for auditing access to documents ?
this could easily be achieved with a decoration layer.

>    Is there support for defining archival / retention policies?
jackrabbit certainly offers the hooks to build records management
but does not come with ootb archival or retention facilities.

>    Is there support for backups ?
for the most convenient backup i would recommend persisting the entire
content repository in an rdbms and using the rdbms features for backup.

regards,
david


       

Re: Jackrabbit Scalability / Performance

Posted by David Nuescheler <da...@gmail.com>.
hi viraf,

thanks for your mail.

>    Has anyone built an application similar to that described above?
> What version of Jackrabbit was used, and what were the issues that you ran into.
> How much meta-data did a node carry, what was the average depth of a leaf
> node, and how many nodes did you have in the implementation before
> performance became an issue.
we built a digital asset management application that sounds very
similar to what you are describing. the meta information varies from
filetype to filetype but ranges on average between 10 and 50 properties
per nt:resource instance. in addition to typical meta information
we also store a number of thumbnail images in the content repository
for every asset.

>    I am considering on building a cluster of servers providing repository
> services. Can the repository be clustered ? (a load balancer in front of the
> repository will distribute requests to a pool of repository servers.).
yes, jackrabbit can be clustered. i would recommend, though, running the
repository with model 1 or model 2 [1] and just using the load balancer
on top of your application. this avoids the overhead of remoting
altogether and still provides you with clustering.

[1] http://jackrabbit.apache.org/doc/deploy.html

>    How does the repository scale? can it handle > 50Million artifacts
> (if the artifacts are placed on the file system does Alfresco manage
> the directory structure or are all files placed in a single directory)
assuming that you mean "jackrabbit"... ;)
we ran tests beyond 50m files and yes jackrabbit manages the filesystem
if the filesystem is chosen as the persistence layer for blobs.

>    Is there support for auditing access to documents ?
this could easily be achieved with a decoration layer.
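
a minimal sketch of such a decoration layer, using a dynamic proxy around a hypothetical document-store interface (the interface and names here are illustrative, not the JCR API):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

// Illustrative audit decoration: wrap a store behind a dynamic proxy
// that records every access before delegating. DocumentStore is a
// hypothetical interface; the same wrapping would apply to a facade
// over a JCR session.
public class AuditDecorator {
    public interface DocumentStore {
        String get(String id);
    }

    public static DocumentStore audited(DocumentStore target, List<String> auditLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Record the access; a real system would also log user and time.
            auditLog.add(method.getName() + "(" + args[0] + ")");
            return method.invoke(target, args);
        };
        return (DocumentStore) Proxy.newProxyInstance(
                DocumentStore.class.getClassLoader(),
                new Class<?>[]{DocumentStore.class},
                handler);
    }
}
```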

>    Is there support for defining archival / retention policies?
jackrabbit certainly offers the hooks to build records management
but does not come with ootb archival or retention facilities.

>    Is there support for backups ?
for the most convenient backup i would recommend persisting the entire
content repository in an rdbms and using the rdbms features for backup.

regards,
david