Posted to users@jackrabbit.apache.org by docxa <de...@docxa.com> on 2011/01/05 10:24:11 UTC

Performance issue when removing high amount of nodes

Hi,

We have to store a large amount of data in our repository, using this kind
of tree:

Project1
|_Stream1
  |__Record1
  |__Record2
  ...
  |__Record120000
...
|_Stream2
  |__Record1
  |__Record2
  ...
  |__Record120000

etc.

It takes some time to add those records, which was expected, but removing
them is even more time-consuming (sometimes it even crashes the VM).
I understand this is because Jackrabbit loads everything into memory to
check for referential integrity violations.

While searching for answers on the mailing list I saw two ways of dealing
with this:
1- Deactivating referential integrity checking. I tried that, but it did not
seem to speed up the process, so I may be doing it wrong. (And I guess it's
quite wrong to do it at all.)
2- Recursively removing nodes in batches ("packs").

I noticed that when using the second method, the more children a node has,
the more time it takes to remove some of them. So I guess it would be best
to split the records across multiple subtrees.
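
Roughly what I mean by removing nodes in batches, as a minimal sketch using
only standard JCR calls - the path and the batch size are just placeholders:

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class BatchRemove {
        // Removes all children of the node at parentPath, saving after every
        // batchSize removals so the transient changes never grow too large.
        static void removeChildrenInBatches(Session session, String parentPath, int batchSize)
                throws RepositoryException {
            Node parent = session.getNode(parentPath);
            while (parent.hasNodes()) {
                NodeIterator it = parent.getNodes();
                int removed = 0;
                while (it.hasNext() && removed < batchSize) {
                    it.nextNode().remove();
                    removed++;
                }
                session.save(); // persist this pack before fetching the next one
            }
        }
    }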

So I'd like to know if there is a better way of organizing my data in order
to improve the add and remove operations. Also, is deactivating referential
integrity checking really risky, and how am I supposed to do it? (I tried
subclassing RepositoryImpl and using setReferentialIntegrityChecking, but it
didn't seem to change anything.)

Thank you for your help.

A. Mariette
DOCXA

Re: AW: Performance issue when removing high amount of nodes

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 05.01.11 13:35, "docxa" <de...@docxa.com> wrote:
>Ok, I was afraid I had to do something like that.

In any case it is very useful to have meaningful naming in your content
structure and not just a numbered list. Note that the common use of IDs
only comes from the technical way an RDBMS works, not necessarily from the
(human) data model.

So use some structure of your record data to build nested folders (the
owner, for example, using access control and containment as drivers; see
also rule #2 of http://wiki.apache.org/jackrabbit/DavidsModel ).

In nearly all cases you have some kind of date or timestamp on your data,
so you could use "2011/01/05" as folders, for example - but only if you
don't have a better structure, of course.
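
A minimal sketch of what that could look like with plain JCR calls - the
node type and the date format are just examples, adapt them to your model:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class DateFolders {
        // Returns (creating it if necessary) a yyyy/MM/dd folder hierarchy below parent.
        static Node getOrCreateDateFolder(Session session, Node parent, Date date)
                throws RepositoryException {
            String[] segments = new SimpleDateFormat("yyyy/MM/dd").format(date).split("/");
            Node current = parent;
            for (String segment : segments) {
                current = current.hasNode(segment)
                        ? current.getNode(segment)
                        : current.addNode(segment, "nt:folder"); // or nt:unstructured, depending on your node types
            }
            session.save();
            return current;
        }
    }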

Regards,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel





Re: AW: Performance issue when removing high amount of nodes

Posted by docxa <de...@docxa.com>.
Ok, I was afraid I had to do something like that.

Thank you very much for your answer.

A. Mariette
DOCXA

Re: AW: Performance issue when removing high amount of nodes

Posted by Robert Oschwald <ro...@symentis.com>.
> Jackrabbit keeps all children of a node in memory, so you should not have more than 10,000 direct child nodes under a single node. (See http://jackrabbit.510166.n4.nabble.com/Suggestions-for-node-hierarchy-td2966757.html)


The JIRA ticket for this problem is https://issues.apache.org/jira/browse/JCR-642





AW: Performance issue when removing high amount of nodes

Posted by "Seidel. Robert" <Ro...@aeb.de>.
Hi,

Jackrabbit keeps all children of a node in memory, so you should not have more than 10,000 direct child nodes under a single node. (See http://jackrabbit.510166.n4.nabble.com/Suggestions-for-node-hierarchy-td2966757.html)

The solution in your case would be to add another level like

Project1
|_Stream1
  |__Records1-10000
     |_Record 1
     |_Record 2
  |__Records10001-20000
  ...
...
|_Stream2
  |__Records1-10000
     |_Record1
     |_Record2
  ...
  |__Records110001-120000
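
A minimal sketch of how such an intermediate level could be created when
adding a record - the names and the bucket size are only illustrative:

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;

    public class RecordBuckets {
        static final int BUCKET_SIZE = 10000;

        // Adds Record<n> below an intermediate "Records<from>-<to>" node, so that
        // no node ends up with more than BUCKET_SIZE direct children.
        static Node addRecord(Node stream, long n) throws RepositoryException {
            long from = ((n - 1) / BUCKET_SIZE) * BUCKET_SIZE + 1;
            long to = from + BUCKET_SIZE - 1;
            String bucketName = "Records" + from + "-" + to;
            Node bucket = stream.hasNode(bucketName)
                    ? stream.getNode(bucketName)
                    : stream.addNode(bucketName);
            return bucket.addNode("Record" + n);
        }
    }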

Regards, Robert
