Posted to dev@jackrabbit.apache.org by Ajai <aj...@gmail.com> on 2009/07/23 08:10:45 UTC

Performance of Jackrabbit

Hi,

I am in the process of evaluating Jackrabbit. We are running a few
performance tests.
Here we are adding 25,000 folder nodes, each containing 15 documents.

It is taking around 37 hours to complete this process. We also tried
using threads, but the time still hasn't come down.

It also seems that adding 500 folders with 15 docs each takes ~20
mins on an empty repository.

After uploading 25,000 folders, adding the same 500 folders with 15
docs each takes ~5 hrs.

So is there a way to improve the performance of the above-mentioned operations?

Also, could you kindly suggest an alternative approach for bulk upload?

Thanks
Ajai G





-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24619853.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Stefan Guggisberg <st...@gmail.com>.
On Thu, Jul 23, 2009 at 2:05 PM, Bart van der
Schans<b....@onehippo.com> wrote:
> On Thu, Jul 23, 2009 at 1:50 PM, Guo Du<mr...@gmail.com> wrote:
>> On Thu, Jul 23, 2009 at 12:37 PM, Bart van der
>> Schans<b....@onehippo.com> wrote:
>>> Iirc there's a similar problem with multi value properties when you
>>> add a lot of values.
>>>
>>> Is there any room left in the current implementation to improve the
>>> performance of those two use cases? Or did somebody already look at it
>>> thoroughly and squeezed out the last bit of performance gain? If not,
>>> I would be happy to run some tests in a profiler and see what comes
>>> up..
>>>
>>
>> By Alex: the bundle PM stores a node + its properties + its list of
>> child nodes in a compact binary blob.
>>
>> This means reading a node will load all child references/properties
>> into memory; I don't see how the bundle PM could improve this
>> significantly. It may be possible with another PM where the
>> children/properties are stored separately.
> It could be interesting to just add one table with (id,parent_id) and
> remove the child node list from the bundle and separate the
> hierarchical information from the content.

the current implementation is optimized for medium-sized lists of
child node entries (~10-20k) and for same-name sibling support.

separating parent-child relations from node state would allow
supporting 'very' flat hierarchies; OTOH it would probably hurt
performance when traversing hierarchies and when SNS are
involved (apart from making the implementation considerably more
complex).

however, it might well be worth trying. go ahead, if you are up to it.

cheers
stefan

> But changing the storage format is a huge change, and I don't think
> the performance gain would justify it.
>
> Bart
>
>
> --
> Hippo B.V.  -  Amsterdam
> Oosteinde 11, 1017 WT, Amsterdam, +31(0)20-5224466
>
> Hippo USA Inc.  -  San Francisco
> 101 H Street, Suite Q, Petaluma CA, 94952-3329, +1 (707) 773-4646
> -----------------------------------------------------------------
> http://www.onehippo.com   -  info@onehippo.com
> -----------------------------------------------------------------
>

Re: Performance of Jackrabbit

Posted by Bart van der Schans <b....@onehippo.com>.
On Thu, Jul 23, 2009 at 1:50 PM, Guo Du<mr...@gmail.com> wrote:
> On Thu, Jul 23, 2009 at 12:37 PM, Bart van der
> Schans<b....@onehippo.com> wrote:
>> Iirc there's a similar problem with multi value properties when you
>> add a lot of values.
>>
>> Is there any room left in the current implementation to improve the
>> performance of those two use cases? Or did somebody already look at it
>> thoroughly and squeezed out the last bit of performance gain? If not,
>> I would be happy to run some tests in a profiler and see what comes
>> up..
>>
>
> By Alex: the bundle PM stores a node + its properties + its list of
> child nodes in a compact binary blob.
>
> This means reading a node will load all child references/properties
> into memory; I don't see how the bundle PM could improve this
> significantly. It may be possible with another PM where the
> children/properties are stored separately.
It could be interesting to just add one table with (id, parent_id),
remove the child node list from the bundle, and separate the
hierarchical information from the content. But changing the storage
format is a huge change, and I don't think the performance gain would
justify it.

Bart


-- 
Hippo B.V.  -  Amsterdam
Oosteinde 11, 1017 WT, Amsterdam, +31(0)20-5224466

Hippo USA Inc.  -  San Francisco
101 H Street, Suite Q, Petaluma CA, 94952-3329, +1 (707) 773-4646
-----------------------------------------------------------------
http://www.onehippo.com   -  info@onehippo.com
-----------------------------------------------------------------

Re: Performance of Jackrabbit

Posted by Guo Du <mr...@gmail.com>.
On Thu, Jul 23, 2009 at 12:37 PM, Bart van der
Schans<b....@onehippo.com> wrote:
> Iirc there's a similar problem with multi value properties when you
> add a lot of values.
>
> Is there any room left in the current implementation to improve the
> performance of those two use cases? Or did somebody already look at it
> thoroughly and squeezed out the last bit of performance gain? If not,
> I would be happy to run some tests in a profiler and see what comes
> up..
>

By Alex: the bundle PM stores a node + its properties + its list of
child nodes in a compact binary blob.

This means reading a node will load all child references/properties
into memory; I don't see how the bundle PM could improve this
significantly. It may be possible with another PM where the
children/properties are stored separately.

--Guo

Re: Performance of Jackrabbit

Posted by Bart van der Schans <b....@onehippo.com>.
On Thu, Jul 23, 2009 at 1:00 PM, Alexander Klimetschek<ak...@day.com> wrote:
> On Thu, Jul 23, 2009 at 9:31 AM, Ajai<aj...@gmail.com> wrote:
>> http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java
>> http://www.nabble.com/file/p24620741/repository.xml repository.xml
>> http://www.nabble.com/file/p24620741/indexingconfiguration.xml
>> indexingconfiguration.xml
>>
>> Kindly let me know your suggestions.
>
> From a quick look at your code it looks like you create a flat
> hierarchy with all nodes on the same level. You should try to
> distribute the load by creating more subfolders (which should follow
> some useful structure, eg. dates like 2009/07/23 works with most
> content). The limit where Jackrabbit gets a bit slower is at around
> 10k child nodes.
IIRC there's a similar problem with multi-value properties when you
add a lot of values.

Is there any room left in the current implementation to improve the
performance of those two use cases? Or did somebody already look at it
thoroughly and squeezed out the last bit of performance gain? If not,
I would be happy to run some tests in a profiler and see what comes
up..

Regards,
Bart




-- 
Hippo B.V.  -  Amsterdam
Oosteinde 11, 1017 WT, Amsterdam, +31(0)20-5224466

Hippo USA Inc.  -  San Francisco
101 H Street, Suite Q, Petaluma CA, 94952-3329, +1 (707) 773-4646
-----------------------------------------------------------------
http://www.onehippo.com   -  info@onehippo.com
-----------------------------------------------------------------

Re: Performance of Jackrabbit

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Jul 23, 2009 at 9:31 AM, Ajai<aj...@gmail.com> wrote:
> http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java
> http://www.nabble.com/file/p24620741/repository.xml repository.xml
> http://www.nabble.com/file/p24620741/indexingconfiguration.xml
> indexingconfiguration.xml
>
> Kindly let me know your suggestions.

From a quick look at your code, it looks like you create a flat
hierarchy with all nodes on the same level. You should try to
distribute the load by creating more subfolders (which should follow
some useful structure, e.g. dates like 2009/07/23, which works with
most content). The limit where Jackrabbit gets a bit slower is at
around 10k child nodes.
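Alexander's date-folder suggestion can be sketched in plain Java; the `datePath` helper below is purely illustrative (it is not part of the JCR API) and just derives the relative folder path a document would be filed under:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DatePath {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Relative path such as "2009/07/23"; storing each document under a
    // per-day folder keeps every folder far below the ~10k-child mark
    // where Jackrabbit starts to slow down.
    static String datePath(LocalDate date) {
        return date.format(FMT);
    }

    public static void main(String[] args) {
        // A document dated 2009-07-23 would then be added under
        // /docs/2009/07/23/<fileName> instead of one flat /docs folder.
        System.out.println(datePath(LocalDate.of(2009, 7, 23))); // prints 2009/07/23
    }
}
```

In real import code the returned segments would be created (if absent) with Node.addNode() before adding the nt:file node.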

Apart from that, I could imagine the text extraction for the search
index slows things down a bit (although it should happen in the
background if it takes too long) if you throw a lot of documents at
it at once. Disabling the search index would give a measurement of
that effect.

Finally, as already mentioned, an embedded database such as Derby with
the appropriate PM is always faster than a remote DB. The use of the
file data store is already good.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Performance of Jackrabbit

Posted by Stefan Guggisberg <st...@gmail.com>.
On Mon, Jul 27, 2009 at 3:56 PM, Ajai<aj...@gmail.com> wrote:
>
> Hi Guo,
>
> Yes, i am adding a document to the repository.
> Is there multiple ways to do a save?
>
> I am doing it the following way,
>
> fileNode = matterNode.addNode(fileName, "nt:file");
> fileNode.addMixin("mix:versionable");
> fileNode.addMixin("mix:referenceable");

adding mix:referenceable is redundant since it's already
included through mix:versionable (mix:versionable inherits
from mix:referenceable).

> Node resNode = fileNode.addNode("jcr:content", "nt:resource");
> resNode.addMixin("mix:versionable");
> resNode.addMixin("mix:referenceable");

same here.

btw: why are you making the jcr:content node versionable?
you already made the nt:file node versionable.

cheers
stefan

> resNode.setProperty("jcr:mimeType", mimeType);
> resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
> resNode.setProperty("jcr:data", new FileInputStream(file));
> Calendar lastModified = Calendar.getInstance();
> lastModified.setTimeInMillis(file.lastModified());
> resNode.setProperty("jcr:lastModified", lastModified);
> // finally
> session.save();
>
> Please suggest if any changes can be done.
>
>
> Thanks,
> Ajai G
>
>
> Guo Du wrote:
>>
>>> I tried using the Derby database to upload 375000 Documents.
>>>
>>> When i tried to add a document to this setup. It took more than 30 mins
>>> to
>>> do a checkin,
>>> The system CPU utilization was around 90% to 100% and the JVM heap size
>>> also
>>> is around 1.5GB.
>>
>> When did you check in the document? Do you mean adding and saving
>> documents to the repository?
>>
>> I am not sure how you save the documents. The save does the actual
>> persistence to the DB, so you should avoid keeping a big change list
>> in memory before calling save.
>>
>> --Guo
>>
>>
>
> --
> View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24681170.html
> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>
>

Re: Performance of Jackrabbit

Posted by Guo Du <mr...@gmail.com>.
On Mon, Jul 27, 2009 at 3:36 PM, Ajai<aj...@gmail.com> wrote:

> But i do have text extractors and indexes turned on.

Sorry, I don't know how the index affects your result.

Good luck!

-Guo

Re: Performance of Jackrabbit

Posted by Alexander Klimetschek <ak...@day.com>.
On Tue, Jul 28, 2009 at 6:07 PM, Ajai<aj...@gmail.com> wrote:
> I used profiler to look into this issue, It seems PDFbox is taking a lot of
> time.
> Also i had set "indexMergerPoolSize" parameter to 50, "extractorPoolSize"
> parameter to 50.
>
> Can you help me to resolve this problem.

Disable the search index to test import times without indexing and
PDFBox text extraction.

You can define the text extractors to use in the "textFilterClasses"
field of the SearchIndex configuration [1].

[1] http://wiki.apache.org/jackrabbit/Search

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Performance of Jackrabbit

Posted by Ajai <aj...@gmail.com>.
Hi Team,

Thanks for the responses.

I was able to upload 25,000 folders, each with 15 documents, into a
Derby database.

When I tried to add a new document to one of these folders, it took a
lot of time. The document I used is a 2.5 MB PDF.

I used a profiler to look into this issue; it seems PDFBox is taking a
lot of the time.
I had also set the "indexMergerPoolSize" parameter to 50 and the
"extractorPoolSize" parameter to 50.

Can you help me resolve this problem?

Thanks 
Ajai G



Stefan Guggisberg wrote:
> 
> On Mon, Jul 27, 2009 at 4:36 PM, Ajai<aj...@gmail.com> wrote:
>>
>> Actually i am doing the right way as you mentioned, having session.save()
>> after each file.
>> But i do have text extractors and indexes turned on.
>> My Configuration:
>>
>> for searchindex:
>>
>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>                </SearchIndex>
>>
>>
>> My Index config:
>>
>> <?xml version="1.0"?>
>> <!DOCTYPE configuration SYSTEM
>> "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
>> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
>>        xmlns:jcr="http://www.jcp.org/jcr/1.0">
>>        <index-rule nodeType="nt:file">
>>                <property>jcr:content</property>
>>        </index-rule>
>>        <index-rule nodeType="nt:resource">
>>                <property>jcr:data</property>
>>        </index-rule>
>> </configuration>
>>
>> Kindly tell me the optimal way to use them.
> 
> as already suggested in my earlier post:
> 
> 1. disable search index or text extractors and compare results
> 2. remove checkin() call and compare results
> 3. use embedded derby and compare results
> 4. if you provide GenRandom.java, i'll run the test on my own machine.
> 
> cheers
> stefan
> 
>>
>>
>> Thanks
>> Ajai G
>>
>>
>>
>> Guo Du wrote:
>>>
>>> On Mon, Jul 27, 2009 at 2:56 PM, Ajai<aj...@gmail.com> wrote:
>>>>
>>>> Hi Guo,
>>>>
>>>> Yes, i am adding a document to the repository.
>>>> Is there multiple ways to do a save?
>>>>
>>>> I am doing it the following way,
>>>>
>>>> fileNode = matterNode.addNode(fileName, "nt:file");
>>>> fileNode.addMixin("mix:versionable");
>>>> fileNode.addMixin("mix:referenceable");
>>>> Node resNode = fileNode.addNode("jcr:content", "nt:resource");
>>>> resNode.addMixin("mix:versionable");
>>>> resNode.addMixin("mix:referenceable");
>>>> resNode.setProperty("jcr:mimeType", mimeType);
>>>> resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
>>>> resNode.setProperty("jcr:data", new FileInputStream(file));
>>>> Calendar lastModified = Calendar.getInstance();
>>>> lastModified.setTimeInMillis(file.lastModified());
>>>> resNode.setProperty("jcr:lastModified", lastModified);
>>>> // finally
>>>> session.save();
>>>>
>>>> Please suggest if any changes can be done.
>>>>
>>>
>>>
>>> Your code doesn't show details of the loop.
>>>
>>>
>>> WRONG
>>> ==============
>>> loop{ // 375000 times
>>>   addNode(...)
>>> }
>>> session.save();
>>> ==============
>>>
>>>
>>>
>>> CORRECT
>>> ==============
>>> loop{ // 375000 times
>>>   addNode(...)
>>>   session.save();
>>> }
>>> ==============
>>> You may also add multiple documents before call session.save() to take
>>> advantage of batch process more efficiently. But not after add all
>>> 375000 documents.
>>>
>>> --Guo
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24681862.html
>> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24702639.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Stefan Guggisberg <st...@gmail.com>.
On Mon, Jul 27, 2009 at 4:36 PM, Ajai<aj...@gmail.com> wrote:
>
> Actually i am doing the right way as you mentioned, having session.save()
> after each file.
> But i do have text extractors and indexes turned on.
> My Configuration:
>
> for searchindex:
>
> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>                </SearchIndex>
>
>
> My Index config:
>
> <?xml version="1.0"?>
> <!DOCTYPE configuration SYSTEM
> "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
> <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
>        xmlns:jcr="http://www.jcp.org/jcr/1.0">
>        <index-rule nodeType="nt:file">
>                <property>jcr:content</property>
>        </index-rule>
>        <index-rule nodeType="nt:resource">
>                <property>jcr:data</property>
>        </index-rule>
> </configuration>
>
> Kindly tell me the optimal way to use them.

as already suggested in my earlier post:

1. disable search index or text extractors and compare results
2. remove checkin() call and compare results
3. use embedded derby and compare results
4. if you provide GenRandom.java, i'll run the test on my own machine.

cheers
stefan

>
>
> Thanks
> Ajai G
>
>
>
> Guo Du wrote:
>>
>> On Mon, Jul 27, 2009 at 2:56 PM, Ajai<aj...@gmail.com> wrote:
>>>
>>> Hi Guo,
>>>
>>> Yes, i am adding a document to the repository.
>>> Is there multiple ways to do a save?
>>>
>>> I am doing it the following way,
>>>
>>> fileNode = matterNode.addNode(fileName, "nt:file");
>>> fileNode.addMixin("mix:versionable");
>>> fileNode.addMixin("mix:referenceable");
>>> Node resNode = fileNode.addNode("jcr:content", "nt:resource");
>>> resNode.addMixin("mix:versionable");
>>> resNode.addMixin("mix:referenceable");
>>> resNode.setProperty("jcr:mimeType", mimeType);
>>> resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
>>> resNode.setProperty("jcr:data", new FileInputStream(file));
>>> Calendar lastModified = Calendar.getInstance();
>>> lastModified.setTimeInMillis(file.lastModified());
>>> resNode.setProperty("jcr:lastModified", lastModified);
>>> // finally
>>> session.save();
>>>
>>> Please suggest if any changes can be done.
>>>
>>
>>
>> Your code doesn't show details of the loop.
>>
>>
>> WRONG
>> ==============
>> loop{ // 375000 times
>>   addNode(...)
>> }
>> session.save();
>> ==============
>>
>>
>>
>> CORRECT
>> ==============
>> loop{ // 375000 times
>>   addNode(...)
>>   session.save();
>> }
>> ==============
>> You may also add multiple documents before call session.save() to take
>> advantage of batch process more efficiently. But not after add all
>> 375000 documents.
>>
>> --Guo
>>
>>
>
> --
> View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24681862.html
> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>
>

Re: Performance of Jackrabbit

Posted by Ajai <aj...@gmail.com>.
Actually I am doing it the right way, as you mentioned, calling
session.save() after each file.
But I do have text extractors and indexing turned on.
My Configuration:

for searchindex:

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
		</SearchIndex>


My Index config:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM
"http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
	xmlns:jcr="http://www.jcp.org/jcr/1.0">
	<index-rule nodeType="nt:file">
		<property>jcr:content</property>
	</index-rule>
	<index-rule nodeType="nt:resource">
		<property>jcr:data</property>
	</index-rule>
</configuration>

Kindly tell me the optimal way to use them.


Thanks
Ajai G



Guo Du wrote:
> 
> On Mon, Jul 27, 2009 at 2:56 PM, Ajai<aj...@gmail.com> wrote:
>>
>> Hi Guo,
>>
>> Yes, i am adding a document to the repository.
>> Is there multiple ways to do a save?
>>
>> I am doing it the following way,
>>
>> fileNode = matterNode.addNode(fileName, "nt:file");
>> fileNode.addMixin("mix:versionable");
>> fileNode.addMixin("mix:referenceable");
>> Node resNode = fileNode.addNode("jcr:content", "nt:resource");
>> resNode.addMixin("mix:versionable");
>> resNode.addMixin("mix:referenceable");
>> resNode.setProperty("jcr:mimeType", mimeType);
>> resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
>> resNode.setProperty("jcr:data", new FileInputStream(file));
>> Calendar lastModified = Calendar.getInstance();
>> lastModified.setTimeInMillis(file.lastModified());
>> resNode.setProperty("jcr:lastModified", lastModified);
>> // finally
>> session.save();
>>
>> Please suggest if any changes can be done.
>>
> 
> 
> Your code doesn't show details of the loop.
> 
> 
> WRONG
> ==============
> loop{ // 375000 times
>   addNode(...)
> }
> session.save();
> ==============
> 
> 
> 
> CORRECT
> ==============
> loop{ // 375000 times
>   addNode(...)
>   session.save();
> }
> ==============
> You may also add multiple documents before call session.save() to take
> advantage of batch process more efficiently. But not after add all
> 375000 documents.
> 
> --Guo
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24681862.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Guo Du <mr...@gmail.com>.
On Mon, Jul 27, 2009 at 2:56 PM, Ajai<aj...@gmail.com> wrote:
>
> Hi Guo,
>
> Yes, i am adding a document to the repository.
> Is there multiple ways to do a save?
>
> I am doing it the following way,
>
> fileNode = matterNode.addNode(fileName, "nt:file");
> fileNode.addMixin("mix:versionable");
> fileNode.addMixin("mix:referenceable");
> Node resNode = fileNode.addNode("jcr:content", "nt:resource");
> resNode.addMixin("mix:versionable");
> resNode.addMixin("mix:referenceable");
> resNode.setProperty("jcr:mimeType", mimeType);
> resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
> resNode.setProperty("jcr:data", new FileInputStream(file));
> Calendar lastModified = Calendar.getInstance();
> lastModified.setTimeInMillis(file.lastModified());
> resNode.setProperty("jcr:lastModified", lastModified);
> // finally
> session.save();
>
> Please suggest if any changes can be done.
>


Your code doesn't show details of the loop.


WRONG
==============
loop{ // 375000 times
  addNode(...)
}
session.save();
==============



CORRECT
==============
loop{ // 375000 times
  addNode(...)
  session.save();
}
==============
You may also add multiple documents before calling session.save() to
take advantage of batching more efficiently. But do not save only
after adding all 375,000 documents.
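Guo's batching advice can be sketched as follows; the returned counter stands in for real session.save() calls, and the batch size of 100 is an arbitrary illustration to be tuned against the actual repository:

```java
public class BatchImport {
    static final int BATCH_SIZE = 100; // arbitrary; tune for your setup

    // Returns the number of flushes (i.e. session.save() calls) needed
    // to import 'total' nodes in batches: one save per full batch, plus
    // one for a trailing partial batch -- never one save for everything.
    static int importNodes(int total) {
        int saves = 0;
        int pending = 0;
        for (int i = 0; i < total; i++) {
            // addNode(...) / setProperty(...) would go here
            if (++pending == BATCH_SIZE) {
                saves++;      // session.save() in real code
                pending = 0;
            }
        }
        if (pending > 0) {
            saves++;          // persist the final partial batch
        }
        return saves;
    }

    public static void main(String[] args) {
        System.out.println(importNodes(375)); // prints 4
    }
}
```

This keeps the transient change list small (bounded by BATCH_SIZE) while still amortizing the per-save overhead over many nodes.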

--Guo

Re: Performance of Jackrabbit

Posted by Ajai <aj...@gmail.com>.
Hi Guo,

Yes, I am adding a document to the repository.
Are there multiple ways to do a save?

I am doing it the following way,

fileNode = matterNode.addNode(fileName, "nt:file");
fileNode.addMixin("mix:versionable");
fileNode.addMixin("mix:referenceable");
Node resNode = fileNode.addNode("jcr:content", "nt:resource");
resNode.addMixin("mix:versionable");
resNode.addMixin("mix:referenceable");
resNode.setProperty("jcr:mimeType", mimeType);
resNode.setProperty("jcr:encoding", ENCODING_UTF_8);
resNode.setProperty("jcr:data", new FileInputStream(file));
Calendar lastModified = Calendar.getInstance();
lastModified.setTimeInMillis(file.lastModified());
resNode.setProperty("jcr:lastModified", lastModified);
// finally
session.save();

Please suggest any changes that could be made.


Thanks,
Ajai G


Guo Du wrote:
> 
>> I tried using the Derby database to upload 375000 Documents.
>>
>> When i tried to add a document to this setup. It took more than 30 mins
>> to
>> do a checkin,
>> The system CPU utilization was around 90% to 100% and the JVM heap size
>> also
>> is around 1.5GB.
> 
> When did you check in the document? Do you mean adding and saving
> documents to the repository?
>
> I am not sure how you save the documents. The save does the actual
> persistence to the DB, so you should avoid keeping a big change list
> in memory before calling save.
> 
> --Guo
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24681170.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Guo Du <mr...@gmail.com>.
> I tried using the Derby database to upload 375000 Documents.
>
> When i tried to add a document to this setup. It took more than 30 mins to
> do a checkin,
> The system CPU utilization was around 90% to 100% and the JVM heap size also
> is around 1.5GB.

When did you check in the document? Do you mean adding and saving
documents to the repository?

I am not sure how you save the documents. The save does the actual
persistence to the DB, so you should avoid keeping a big change list
in memory before calling save.

--Guo

Re: Performance of Jackrabbit

Posted by Ajai <aj...@gmail.com>.
Hi Stefan,

I tried using the Derby database to upload 375,000 documents.

When I tried to add a document to this setup, it took more than 30 mins
to do a checkin.
The system CPU utilization was around 90% to 100%, and the JVM heap
size is around 1.5 GB.
Is there some way to handle this?

Now i am using the following hierarchical structure:
Folder1
 ------ Folder
    -------- File1
    -------- File2


Also, in this case I am not doing the fileNode.checkin() operation.

Thanks
Ajai G



Stefan Guggisberg wrote:
> 
> hi ajai
> 
> On Thu, Jul 23, 2009 at 9:31 AM, Ajai<aj...@gmail.com> wrote:
>>
>> Hi Stefan,
>>
>> Thanks for the quick response.
>>
>> We are running the tests on a "Core 2 Duo 2.3 GHz, 4 GB  RAM running
>> Windows
>> Server 2003" machine.
>>
>> Please find attached the
>>
>> 1. repository.xml
>> 2. indexconfiguration.xml.
>> 3. source java file for upload (ThreadFeeder.java)
>>
>> http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java
>> http://www.nabble.com/file/p24620741/repository.xml repository.xml
>> http://www.nabble.com/file/p24620741/indexingconfiguration.xml
>> indexingconfiguration.xml
> 
> thanks!
> 
> as far as i can tell, you're not doing anything unreasonable.
> however, you have to be aware that some features come at
> a certain cost.
> 
> 1. fulltext search/text extractors do impact write performance
> significantly.
> 2. versioning: same here, checkin() is a pretty expensive operation on
>     nt:file nodes
> 3. mssql server is not known to be terribly fast (at least not when used
>     as jackrabbit backend).
> 
> in order to identify what's causing the appallingly bad results please
> do the following:
> 1. disable search index or text extractors and compare results
> 2. remove checkin() call and compare results
> 3. use embedded derby and compare results
> 
> if you could provide GenRandom.java, i'll run the test on my own machine.
> 
> cheers
> stefan
> 
>>
>> Kindly let me know your suggestions.
>>
>> Thanks,
>> Ajai G
>>
>>
>>
>>
>> Stefan Guggisberg wrote:
>>>
>>> On Thu, Jul 23, 2009 at 8:10 AM, Ajai<aj...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am in the process of Evaluation of Jackrabbit. We are running few
>>>> performance tests.
>>>> Here we are adding 25,000 Folder nodes with each consisting of 15
>>>> documents.
>>>>
>>>> It is taking around 37 hours to complete this process, we also tried
>>>> using
>>>> thread to achieve this.
>>>> But still the time hasn't come down.
>>>>
>>>> It also seems that, when adding 500 Folders with 15 docs each, takes  ~
>>>> 20
>>>> mins for a empty repository,
>>>>
>>>> After uploading 25000 folders, when trying to add same 500 Folders with
>>>> 15
>>>> docs each, it takes ~ 5 hrs.
>>>>
>>>
>>> all figures are way too high. please provide more information on your
>>> setup/configuration and environment. if possible, please also provide
>>> some code of your tests.
>>>
>>> cheers
>>> stefan
>>>
>>>> So is there a way to improve the performance of above mentioned
>>>> functions
>>>> ?.
>>>>
>>>> Also kindly suggest an alternate solution to perform bulk upload?
>>>>
>>>> Thanks
>>>> Ajai G
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24619853.html
>>>> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24620741.html
>> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24680489.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Stefan Guggisberg <st...@gmail.com>.
hi ajai

On Thu, Jul 23, 2009 at 9:31 AM, Ajai<aj...@gmail.com> wrote:
>
> Hi Stefan,
>
> Thanks for the quick response.
>
> We are running the tests on a "Core 2 Duo 2.3 GHz, 4 GB  RAM running Windows
> Server 2003" machine.
>
> Please find attached the
>
> 1. repository.xml
> 2. indexconfiguration.xml.
> 3. source java file for upload (ThreadFeeder.java)
>
> http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java
> http://www.nabble.com/file/p24620741/repository.xml repository.xml
> http://www.nabble.com/file/p24620741/indexingconfiguration.xml
> indexingconfiguration.xml

thanks!

as far as i can tell, you're not doing anything unreasonable.
however, you have to be aware that some features come at
a certain cost.

1. fulltext search/text extractors do impact write performance significantly.
2. versioning: same here, checkin() is a pretty expensive operation on
    nt:file nodes
3. mssql server is not known to be terribly fast (at least not when used
    as jackrabbit backend).

in order to identify what's causing the appallingly bad results please
do the following:
1. disable search index or text extractors and compare results
2. remove checkin() call and compare results
3. use embedded derby and compare results

if you could provide GenRandom.java, i'll run the test on my own machine.

cheers
stefan

>
> Kindly let me know your suggestions.
>
> Thanks,
> Ajai G
>
>
>
>
> Stefan Guggisberg wrote:
>>
>> On Thu, Jul 23, 2009 at 8:10 AM, Ajai<aj...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am in the process of Evaluation of Jackrabbit. We are running few
>>> performance tests.
>>> Here we are adding 25,000 Folder nodes with each consisting of 15
>>> documents.
>>>
>>> It is taking around 37 hours to complete this process, we also tried
>>> using
>>> thread to achieve this.
>>> But still the time hasn't come down.
>>>
>>> It also seems that, when adding 500 Folders with 15 docs each, takes  ~
>>> 20
>>> mins for a empty repository,
>>>
>>> After uploading 25000 folders, when trying to add same 500 Folders with
>>> 15
>>> docs each, it takes ~ 5 hrs.
>>>
>>
>> all figures are way too high. please provide more information on your
>> setup/configuration and environment. if possible, please also provide
>> some code of your tests.
>>
>> cheers
>> stefan
>>
>>> So is there a way to improve the performance of above mentioned functions
>>> ?.
>>>
>>> Also kindly suggest an alternate solution to perform bulk upload?
>>>
>>> Thanks
>>> Ajai G
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24619853.html
>>> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> --
> View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24620741.html
> Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.
>
>

Re: Performance of Jackrabbit

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Jul 23, 2009 at 12:27 PM, Guo Du<mr...@gmail.com> wrote:
> The size of the uploaded files may affect the result significantly.
>
> I read in some email that uploaded files are stored based on their
> hash value. This means your 15 unique files are only stored/indexed
> once, which may not match the real-world case. I am not sure; can
> anyone confirm? :)

Yes, that's right.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: Performance of Jackrabbit

Posted by Guo Du <mr...@gmail.com>.
The size of the uploaded files may affect the result significantly.

I read in some email that uploaded files are stored based on their hash
value. This means your 15 unique files are only stored/indexed once,
which may not match the real-world case. I am not sure; can anyone
confirm? :)


Just FYI: I ran a test some time ago against version 1.4.4. The embedded
Derby DB took 22 minutes, while MySQL took about 60 minutes.
The test created 1M nodes, each with 6 properties; no files were uploaded.
After finishing, the Derby database was about 1 GB including the index.


--Guo

On Thu, Jul 23, 2009 at 8:31 AM, Ajai<aj...@gmail.com> wrote:
>
> Hi Stefan,
>
> Thanks for the quick response.
>
> We are running the tests on a "Core 2 Duo 2.3 GHz, 4 GB  RAM running Windows
> Server 2003" machine.
>
> Please find attached the
>
> 1. repository.xml
> 2. indexingconfiguration.xml
> 3. source java file for upload (ThreadFeeder.java)
>
> http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java
> http://www.nabble.com/file/p24620741/repository.xml repository.xml
> http://www.nabble.com/file/p24620741/indexingconfiguration.xml
> indexingconfiguration.xml
>
> Kindly let me know your suggestions.
>
> Thanks,
> Ajai G
>
>
>
>
> Stefan Guggisberg wrote:
>>
>> On Thu, Jul 23, 2009 at 8:10 AM, Ajai<aj...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I am in the process of Evaluation of Jackrabbit. We are running few
>>> performance tests.
>>> Here we are adding 25,000 Folder nodes with each consisting of 15
>>> documents.
>>>
>>> It is taking around 37 hours to complete this process, we also tried
>>> using
>>> thread to achieve this.
>>> But still the time hasn't come down.
>>>
>>> It also seems that, when adding 500 Folders with 15 docs each, takes  ~
>>> 20
>>> mins for a empty repository,
>>>
>>> After uploading 25000 folders, when trying to add same 500 Folders with
>>> 15
>>> docs each, it takes ~ 5 hrs.
>>>
>>
>> all figures are way too high. please provide more information on your
>> setup/configuration and environment. if possible, please also provide
>> some code of your tests.
>>
>> cheers
>> stefan
>>
>>> So is there a way to improve the performance of above mentioned functions
>>> ?.
>>>
>>> Also kindly suggest an alternate solution to perform bulk upload?
>>>
>>> Thanks
>>> Ajai G
>>>
>>>
>>>
>>>
>>>
>>
>>
>



-- 
Kind regards,

Du, Guo
__________________________________________________
Phone     : +353-86-176 6186
Email     : online@duguo.com
__________________________________________________
http://duguo.com  - Career Life Balance

Re: Performance of Jackrabbit

Posted by Ajai <aj...@gmail.com>.
Hi Stefan,

Thanks for the quick response.

We are running the tests on a "Core 2 Duo 2.3 GHz, 4 GB  RAM running Windows
Server 2003" machine.

Please find attached the 

1. repository.xml 
2. indexingconfiguration.xml
3. source java file for upload (ThreadFeeder.java)

http://www.nabble.com/file/p24620741/ThreadFeeder.java ThreadFeeder.java 
http://www.nabble.com/file/p24620741/repository.xml repository.xml 
http://www.nabble.com/file/p24620741/indexingconfiguration.xml
indexingconfiguration.xml 

Kindly let me know your suggestions.

Thanks,
Ajai G




Stefan Guggisberg wrote:
> 
> On Thu, Jul 23, 2009 at 8:10 AM, Ajai<aj...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am in the process of Evaluation of Jackrabbit. We are running few
>> performance tests.
>> Here we are adding 25,000 Folder nodes with each consisting of 15
>> documents.
>>
>> It is taking around 37 hours to complete this process, we also tried
>> using
>> thread to achieve this.
>> But still the time hasn't come down.
>>
>> It also seems that, when adding 500 Folders with 15 docs each, takes  ~
>> 20
>> mins for a empty repository,
>>
>> After uploading 25000 folders, when trying to add same 500 Folders with
>> 15
>> docs each, it takes ~ 5 hrs.
>>
> 
> all figures are way too high. please provide more information on your
> setup/configuration and environment. if possible, please also provide
> some code of your tests.
> 
> cheers
> stefan
> 
>> So is there a way to improve the performance of above mentioned functions
>> ?.
>>
>> Also kindly suggest an alternate solution to perform bulk upload?
>>
>> Thanks
>> Ajai G
>>
>>
>>
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/Performance-of-Jackrabbit-tp24619853p24620741.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.


Re: Performance of Jackrabbit

Posted by Stefan Guggisberg <st...@gmail.com>.
On Thu, Jul 23, 2009 at 8:10 AM, Ajai<aj...@gmail.com> wrote:
>
> Hi,
>
> I am in the process of Evaluation of Jackrabbit. We are running few
> performance tests.
> Here we are adding 25,000 Folder nodes with each consisting of 15 documents.
>
> It is taking around 37 hours to complete this process, we also tried using
> thread to achieve this.
> But still the time hasn't come down.
>
> It also seems that, when adding 500 Folders with 15 docs each, takes  ~ 20
> mins for a empty repository,
>
> After uploading 25000 folders, when trying to add same 500 Folders with 15
> docs each, it takes ~ 5 hrs.
>

all figures are way too high. please provide more information on your
setup/configuration and environment. if possible, please also provide
some code of your tests.

cheers
stefan
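One mitigation often suggested for bulk loads like the one described in this thread is to batch Session.save() calls instead of saving after every node. Below is a minimal sketch of the batching logic only; the flush callback stands in for session.save(), and none of this is taken from the poster's ThreadFeeder.java, which is linked above but not reproduced here:

```java
import java.util.function.IntConsumer;

// Sketch of batched persistence for a bulk load: flush every batchSize
// items instead of once per item. In a JCR loader the flush would call
// session.save(); the callback here is a stand-in for illustration only.
public class BatchSaver {
    private final int batchSize;
    private final IntConsumer flush; // receives the number of pending items
    private int pending = 0;
    public int flushCount = 0;       // exposed for the demo below

    public BatchSaver(int batchSize, IntConsumer flush) {
        this.batchSize = batchSize;
        this.flush = flush;
    }

    // Record one created node/document; flush when the batch is full.
    public void add() {
        pending++;
        if (pending >= batchSize) {
            doFlush();
        }
    }

    // Flush any remainder; call once after the load finishes.
    public void close() {
        if (pending > 0) {
            doFlush();
        }
    }

    private void doFlush() {
        flush.accept(pending); // e.g. session.save() in a real loader
        flushCount++;
        pending = 0;
    }

    public static void main(String[] args) {
        // 25,000 folders x 15 documents, saved in batches of 500:
        BatchSaver saver = new BatchSaver(500, n -> { /* session.save() */ });
        for (int i = 0; i < 25_000 * 15; i++) {
            saver.add();
        }
        saver.close();
        System.out.println(saver.flushCount); // prints "750"
    }
}
```

The batch size trades transient memory for fewer persistence and index round-trips; with very small batches the per-save overhead tends to dominate the load time.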

> So is there a way to improve the performance of above mentioned functions ?.
>
> Also kindly suggest an alternate solution to perform bulk upload?
>
> Thanks
> Ajai G
>
>
>
>
>