Posted to dev@jackrabbit.apache.org by js...@neasys.com on 2005/10/06 17:45:17 UTC

large repository

Hello,

I am evaluating jackrabbit for use in my new project.
Initially there are around 10 million items.
I would like to know
(1) Has anyone used jackrabbit for 10 million items and beyond?
(2) How does it scale?

Thanks in advance.

John

Re: large repository

Posted by David Nuescheler <da...@gmail.com>.
hi john,

> On day.com, crx is listed as being able to handle only 1M items.
> Is that info out of date? By Marcel's test, jackrabbit can handle
> at least 5M items. What is the current number for crx, and with
> versioning?

thanks for the email. first of all, this is probably not the right
place to have such a discussion, but of course this is a licensing
limitation, not a technical one. technically, crx has been tested
far beyond all those licensing limitations.

regards,
david

Re: large repository

Posted by js...@neasys.com.
Hi, Marcel, David and all,

One quick question:

On day.com, crx is listed as being able to handle only 1M items.
Is that info out of date? By Marcel's test, jackrabbit can handle
at least 5M items. What is the current number for crx, and with
versioning?

Thanks a lot.

John

On Mon, Oct 10, 2005 at 09:30:58AM +0200, Marcel Reutegger wrote:
> Hi John,
> 
> I personally did tests with about 5 million items using various property
> types (string, double, long, date) and a DB-based persistence manager.
> 
> Both importing and then querying the data scaled very well. I assume
> that is also the case for 10 million items.
> 
> There was once the idea to run wikipedia on a JCR repository, but I
> assume there is not much progress yet, unfortunately :-/
> 
> http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/2867
> 
> regards
>   marcel
> 
> js@neasys.com wrote:
> >Hello,
> >
> >I am evaluating jackrabbit for use in my new project.
> >Initially there are around 10 million items.
> >I would like to know
> >(1) Has anyone used jackrabbit for 10 million items and beyond?
> >(2) How does it scale?
> >
> >Thanks in advance.
> >
> >John
> >
> >
> 
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!

Re: large repository

Posted by js...@neasys.com.
Hi, Marcel,

Thanks for your reply.

On Mon, Oct 10, 2005 at 09:30:58AM +0200, Marcel Reutegger wrote:
> Hi John,
> 
> I personally did tests with about 5 million items using various property
> types (string, double, long, date) and a DB-based persistence manager.

Did your test include versioning?

> 
> Both importing and then querying the data scaled very well. I assume
> that is also the case for 10 million items.

How long did it take to import the whole data set?
And on what hardware and OS?

Best,

John

> 
> There was once the idea to run wikipedia on a JCR repository, but I
> assume there is not much progress yet, unfortunately :-/
> 
> http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/2867
> 
> regards
>   marcel
> 
> js@neasys.com wrote:
> >Hello,
> >
> >I am evaluating jackrabbit for use in my new project.
> >Initially there are around 10 million items.
> >I would like to know
> >(1) Has anyone used jackrabbit for 10 million items and beyond?
> >(2) How does it scale?
> >
> >Thanks in advance.
> >
> >John
> >
> >
> 
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!

Re: large repository

Posted by Marcel Reutegger <ma...@gmail.com>.
Hi John,

I personally did tests with about 5 million items using various property
types (string, double, long, date) and a DB-based persistence manager.

Both importing and then querying the data scaled very well. I assume
that is also the case for 10 million items.
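
For illustration, here is a minimal sketch of the kind of load and
query I mean, using only the standard JCR 1.0 API. The node and
property names, the batch size and the query are invented for the
example, not taken from my test:

    import java.util.Calendar;

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class BulkLoadSketch {

        // save in batches instead of per item; 1000 is an assumed value
        private static final int BATCH_SIZE = 1000;

        public static void importData(Session session, int count)
                throws RepositoryException {
            // a real load should nest nodes instead of creating
            // millions of siblings under one parent
            Node root = session.getRootNode();
            for (int i = 0; i < count; i++) {
                Node article = root.addNode("article" + i, "nt:unstructured");
                article.setProperty("title", "Article " + i);           // string
                article.setProperty("length", (long) (i % 5000));       // long
                article.setProperty("score", (i % 100) / 10.0);         // double
                article.setProperty("created", Calendar.getInstance()); // date
                if ((i + 1) % BATCH_SIZE == 0) {
                    session.save(); // flush the batch to the persistence manager
                }
            }
            session.save(); // persist the remainder
        }

        public static void queryData(Session session)
                throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // XPath is the primary query language in JCR 1.0
            Query query = qm.createQuery(
                    "//element(*, nt:unstructured)[@length > 1000]",
                    Query.XPATH);
            NodeIterator nodes = query.execute().getNodes();
            while (nodes.hasNext()) {
                System.out.println(nodes.nextNode().getPath());
            }
        }
    }

Saving in batches keeps the transient item state small, which
matters with a small heap like the 256m I used.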

There was once the idea to run wikipedia on a JCR repository, but I
assume there is not much progress yet, unfortunately :-/

http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/2867

regards
   marcel

js@neasys.com wrote:
> Hello,
> 
> I am evaluating jackrabbit for use in my new project.
> Initially there are around 10 million items.
> I would like to know
> (1) Has anyone used jackrabbit for 10 million items and beyond?
> (2) How does it scale?
> 
> Thanks in advance.
> 
> John
> 
> 

Re: large repository

Posted by Alexandru Popescu <th...@gmail.com>.
#: Marcel Reutegger changed the world a bit at a time by saying on  10/26/2005 9:14 AM :#
> Hi John,
> 
> I haven't tried the bdb persistence manager yet.
> 
> but it seems that brian is working with it, maybe he can share his 
> experience?
> 
> regards
>   marcel
> 

How does db persistence (i.e. Derby) store binary content? For example,
are uploaded files stored in the DB as BLOBs, or on the file system, as
BerkeleyDB does?

thanks,

./alex
--
.w( the_mindstorm )p.

> js@neasys.com wrote:
>> Hi, Marcel,
>> 
>> Thanks a lot for your reply. One more question:
>> how does the bdb persistence manager compare with the db persistence manager?
>> Which one will be able to hold more items?
>> 
>> John
>> 
>> On Tue, Oct 25, 2005 at 09:08:00AM +0200, Marcel Reutegger wrote:
>> 
>>>Hi John,
>>>
>>>js@neasys.com wrote:
>>>
>>>>I have tried jcr/jackrabbit and like it.
>>>>Next I would like to push jackrabbit to its limit:
>>>>load in as many items as possible. I would appreciate help on
>>>>a few configuration/tuning issues:
>>>>(1) which persistence manager to use?
>>>
>>>in a recent test I imported over a million wikipedia articles which 
>>>resulted in about 6 million items. no versioning, btw.
>>>
>>>my configuration is:
>>>dell latitude d505
>>>db-persistence using derby
>>>256m heap
>>>
>>>at the beginning the time to add an article was about 5ms.
>>>towards the end of the load the time to add an article was stable at 
>>>about 50ms.
>>>
>>>some other figures:
>>>db size: 2 GB
>>>index size: 300 MB
>>>
>>>
>>>>(2) what parameters to tune?
>>>
>>>I can give you some advice on configuring the index: the default config 
>>>will cause lucene to create segments of 100 nodes, which will be merged 
>>>as soon as 10 segments exist. when doing a bulk load you should set 
>>>the parameter minMergeDocs to a higher value, e.g. 1000. this will create 
>>>segments of 1000 nodes, which is more efficient.
>>>
>>>
>>>>(3) will multiple workspaces help?
>>>
>>>IMO this might help, if you run into scalability issues with the 
>>>persistence manager you are using.
>>>
>>>
>>>>(4) any other things to watch for?
>>>
>>>use separate disks for the index and workspace data.
>>>
>>>
>>>>My host has 4GB RAM and a few TB of disk space.
>>>>
>>>>Also, any doc describing all possible elements in repository.xml?
>>>
>>>the sample repository.xml file in src/conf contains an inline dtd with 
>>>some documentation.
>>>
>>>
>>>>And can SearchIndex be turned off?
>>>
>>>yes, this is possible. you simply omit the SearchIndex element in the 
>>>configuration. though, I would be very interested to see how well the 
>>>index works with your data.
>>>
>>>regards
>>> marcel
>>>
>>>
>> 
>> __________________________________________
>> http://www.neasys.com - A Good Place to Be
>> Come to visit us today!
>> 
>> 
> 


Re: large repository

Posted by Brian Moseley <bc...@osafoundation.org>.
Marcel Reutegger wrote:

> did you consider using the object pm instead of xml? it doesn't come 
> with the overhead of parsing xml and should be faster.

yep, i switched to that one yesterday. i don't have concrete 
proof, but it sure does *feel* zippier.

Re: large repository

Posted by Marcel Reutegger <ma...@gmx.net>.
Brian Moseley wrote:
> i'd love to hear if anybody's done current benchmarks with the various 
> pms. i'm currently using the xml one and would be happy to switch to 
> something else for better performance and manageability.

did you consider using the object pm instead of xml? it doesn't come 
with the overhead of parsing xml and should be faster.
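
for reference, the switch is a one-line change in the workspace
section of repository.xml. a sketch, with the class name as it
appears in the current source tree (double-check it against your
version):

    <PersistenceManager
        class="org.apache.jackrabbit.core.state.obj.ObjectPersistenceManager"/>

the object pm serializes item states in a compact binary format
instead of xml files, which is where the speedup comes from.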

regards
  marcel


Re: large repository

Posted by Brian Moseley <bc...@osafoundation.org>.
Marcel Reutegger wrote:
> Hi John,
> 
> I haven't tried the bdb persistence manager yet.
> 
> but it seems that brian is working with it, maybe he can share his 
> experience?

haven't gotten it working yet, cos of the os x issue i described in 
another thread :/

i'd love to hear if anybody's done current benchmarks with the various 
pms. i'm currently using the xml one and would be happy to switch to 
something else for better performance and manageability.

Re: large repository

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi John,

I haven't tried the bdb persistence manager yet.

but it seems that brian is working with it, maybe he can share his 
experience?

regards
  marcel

js@neasys.com wrote:
> Hi, Marcel,
> 
> Thanks a lot for your reply. One more question:
> how does the bdb persistence manager compare with the db persistence manager?
> Which one will be able to hold more items?
> 
> John
> 
> On Tue, Oct 25, 2005 at 09:08:00AM +0200, Marcel Reutegger wrote:
> 
>>Hi John,
>>
>>js@neasys.com wrote:
>>
>>>I have tried jcr/jackrabbit and like it.
>>>Next I would like to push jackrabbit to its limit:
>>>load in as many items as possible. I would appreciate help on
>>>a few configuration/tuning issues:
>>>(1) which persistence manager to use?
>>
>>in a recent test I imported over a million wikipedia articles which 
>>resulted in about 6 million items. no versioning, btw.
>>
>>my configuration is:
>>dell latitude d505
>>db-persistence using derby
>>256m heap
>>
>>at the beginning the time to add an article was about 5ms.
>>towards the end of the load the time to add an article was stable at 
>>about 50ms.
>>
>>some other figures:
>>db size: 2 GB
>>index size: 300 MB
>>
>>
>>>(2) what parameters to tune?
>>
>>I can give you some advice on configuring the index: the default config 
>>will cause lucene to create segments of 100 nodes, which will be merged 
>>as soon as 10 segments exist. when doing a bulk load you should set 
>>the parameter minMergeDocs to a higher value, e.g. 1000. this will create 
>>segments of 1000 nodes, which is more efficient.
>>
>>
>>>(3) will multiple workspaces help?
>>
>>IMO this might help, if you run into scalability issues with the 
>>persistence manager you are using.
>>
>>
>>>(4) any other things to watch for?
>>
>>use separate disks for the index and workspace data.
>>
>>
>>>My host has 4GB RAM and a few TB of disk space.
>>>
>>>Also, any doc describing all possible elements in repository.xml?
>>
>>the sample repository.xml file in src/conf contains an inline dtd with 
>>some documentation.
>>
>>
>>>And can SearchIndex be turned off?
>>
>>yes, this is possible. you simply omit the SearchIndex element in the 
>>configuration. though, I would be very interested to see how well the 
>>index works with your data.
>>
>>regards
>> marcel
>>
>>
> 
> __________________________________________
> http://www.neasys.com - A Good Place to Be
> Come to visit us today!
> 
> 

Re: large repository

Posted by js...@neasys.com.
Hi, Marcel,

Thanks a lot for your reply. One more question:
how does the bdb persistence manager compare with the db persistence manager?
Which one will be able to hold more items?

John

On Tue, Oct 25, 2005 at 09:08:00AM +0200, Marcel Reutegger wrote:
> Hi John,
> 
> js@neasys.com wrote:
> >I have tried jcr/jackrabbit and like it.
> >Next I would like to push jackrabbit to its limit:
> >load in as many items as possible. I would appreciate help on
> >a few configuration/tuning issues:
> >(1) which persistence manager to use?
> 
> in a recent test I imported over a million wikipedia articles which 
> resulted in about 6 million items. no versioning, btw.
> 
> my configuration is:
> dell latitude d505
> db-persistence using derby
> 256m heap
> 
> at the beginning the time to add an article was about 5ms.
> towards the end of the load the time to add an article was stable at 
> about 50ms.
> 
> some other figures:
> db size: 2 GB
> index size: 300 MB
> 
> >(2) what parameters to tune?
> 
> I can give you some advice on configuring the index: the default config 
> will cause lucene to create segments of 100 nodes, which will be merged 
> as soon as 10 segments exist. when doing a bulk load you should set 
> the parameter minMergeDocs to a higher value, e.g. 1000. this will create 
> segments of 1000 nodes, which is more efficient.
> 
> >(3) will multiple workspaces help?
> 
> IMO this might help, if you run into scalability issues with the 
> persistence manager you are using.
> 
> >(4) any other things to watch for?
> 
> use separate disks for the index and workspace data.
> 
> >My host has 4GB RAM and a few TB of disk space.
> >
> >Also, any doc describing all possible elements in repository.xml?
> 
> the sample repository.xml file in src/conf contains an inline dtd with 
> some documentation.
> 
> >And can SearchIndex be turned off?
> 
> yes, this is possible. you simply omit the SearchIndex element in the 
> configuration. though, I would be very interested to see how well the 
> index works with your data.
> 
> regards
>  marcel
> 
> 
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!

Re: large repository

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi John,

js@neasys.com wrote:
> I have tried jcr/jackrabbit and like it.
> Next I would like to push jackrabbit to its limit:
> load in as many items as possible. I would appreciate help on
> a few configuration/tuning issues:
> (1) which persistence manager to use?

in a recent test I imported over a million wikipedia articles which 
resulted in about 6 million items. no versioning, btw.

my configuration is:
dell latitude d505
db-persistence using derby
256m heap

at the beginning the time to add an article was about 5ms.
towards the end of the load the time to add an article was stable at 
about 50ms.

some other figures:
db size: 2 GB
index size: 300 MB

> (2) what parameters to tune?

I can give you some advice on configuring the index: the default config 
will cause lucene to create segments of 100 nodes, which will be merged 
as soon as 10 segments exist. when doing a bulk load you should set 
the parameter minMergeDocs to a higher value, e.g. 1000. this will create 
segments of 1000 nodes, which is more efficient.
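
for example, the SearchIndex element in the workspace section of
repository.xml could then look like this (class name and path
variable follow the sample config; only minMergeDocs differs from
the defaults):

    <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <!-- default is 100; larger segments mean fewer merges during import -->
      <param name="minMergeDocs" value="1000"/>
    </SearchIndex>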

> (3) will multiple workspaces help?

IMO this might help, if you run into scalability issues with the 
persistence manager you are using.

> (4) any other things to watch for?

use separate disks for the index and workspace data.

> My host has 4GB RAM and a few TB of disk space.
> 
> Also, any doc describing all possible elements in repository.xml?

the sample repository.xml file in src/conf contains an inline dtd with 
some documentation.

> And can SearchIndex be turned off?

yes, this is possible. you simply omit the SearchIndex element in the 
configuration. though, I would be very interested to see how well the 
index works with your data.
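
for completeness, a workspace section without search would then just
contain the file system and the persistence manager. a sketch based
on the sample config (class names and variables may differ in your
version):

    <Workspace name="${wsp.name}">
      <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
        <param name="path" value="${wsp.home}"/>
      </FileSystem>
      <PersistenceManager
          class="org.apache.jackrabbit.core.state.obj.ObjectPersistenceManager"/>
      <!-- no SearchIndex element: no indexing, no queries on this workspace -->
    </Workspace>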

regards
  marcel


Re: large repository

Posted by js...@neasys.com.
Hi,

I have tried jcr/jackrabbit and like it.
Next I would like to push jackrabbit to its limit:
load in as many items as possible. I would appreciate help on
a few configuration/tuning issues:
(1) which persistence manager to use?
(2) what parameters to tune?
(3) will multiple workspaces help?
(4) any other things to watch for?
My host has 4GB RAM and a few TB of disk space.

Also, any doc describing all possible elements in repository.xml?
And can SearchIndex be turned off?

Thanks,

John

On Thu, Oct 06, 2005 at 08:45:17AM -0700, js@neasys.com wrote:
> Hello,
> 
> I am evaluating jackrabbit for use in my new project.
> Initially there are around 10 million items.
> I would like to know
> (1) Has anyone used jackrabbit for 10 million items and beyond?
> (2) How does it scale?
> 
> Thanks in advance.
> 
> John
> 
__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!