Posted to users@jackrabbit.apache.org by Samuel Cox <cr...@gmail.com> on 2013/12/04 00:34:07 UTC

Performance: Search + Lucene + indexes + Amazon cloud

Hi,

My company has been successfully (albeit naively) using Jackrabbit for
several years in an on-prem product.  About the only things we've
customized are some node types, the use of MySQL over Derby, and some
trivial search configuration.

Now, we're trying to leverage this product in Amazon's cloud; however,
we're running into problems with the Lucene index being stored on EC2
instances that go away.  Our current strategy is to compress the index
and ship it off to S3.  However, uploading it to S3 and downloading it
again on each new instance takes too long.
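
To make the compress step concrete, here is a minimal sketch of zipping
an index directory before upload.  It is only an illustration, not our
production code: the class name IndexArchiver and the method
zipDirectory are hypothetical, it assumes Java 8 or later, and the S3
transfer itself is left out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; not part of Jackrabbit or Lucene.
public class IndexArchiver {

    /**
     * Zips every regular file under indexDir into the given archive file
     * and returns the number of files written.
     */
    public static int zipDirectory(Path indexDir, Path archive) throws IOException {
        // Collect the file list first so the walk is closed before we start writing.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(indexDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(archive);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Entry names are relative to the index root and use '/' separators.
                String name = indexDir.relativize(file).toString().replace('\\', '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
        return files.size();
    }
}
```

The index should not be written to while it is being archived;
otherwise the copy may capture an inconsistent set of segment files.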

We are currently using Jackrabbit 1.6.  I will also admit that we have
a very inefficient algorithm for storing 'documents'.  It creates many
more nodes than we actually need.  I'm assuming, without having
verified it, that the Lucene index grows roughly linearly with the
number of nodes.

Currently, we're investigating storing/accessing the index in MySQL,
which would mean we don't have to copy it back and forth as we spin up
machines.

Some questions I have:

Assuming that upgrading Jackrabbit would also upgrade Lucene, would you
expect this to significantly affect indexing performance?
Supposedly Lucene 4 greatly reduces the index size; however, I see
that you guys are suggesting Oak when people ask about Lucene 4 and
Jackrabbit.

Do you have other suggestions/reading material about how to
effectively use Jackrabbit in a cloud environment?

Any tips or pointers to information are appreciated.