Posted to dev@jackrabbit.apache.org by Frédéric Esnault <fe...@legisway.com> on 2007/06/21 08:10:37 UTC

About Jackrabbit repository model

Hi there !

I've been thinking about what I've been told here.

It has been said that :

- Jackrabbit doesn't like same name siblings, so it would be better to give nodes specific names;

- Jackrabbit doesn't like too many child nodes under one and the same node (i.e. 1000 or more child nodes for one parent node)

But I don't think my repository structure is unique. My repository has one business root (lgw:root) with (currently) three category child nodes :

- contacts;

- contractors;

- contracts;

And each of them (say contacts) holds all the nodes of this category (all the contacts), and the contact nodes are all called lgw:contact.

This breaks the two "Jackrabbit doesn't like" rules mentioned above, but I don't think this is a strange architecture. I saw someone on this mailing list talking about his repository holding books, authors and so on... I guess his nodes are called book or ns:book (if ns is his namespace), with properties and child nodes, and all have the same parent.

It seems that searching on this kind of architecture takes ages: I ran a search yesterday that I stopped after 25 minutes.

/jcr:root/lgw:root/lgw:contracts/lgw:contract/jcr:deref(@lgw:internalContractor, 'lgw:contractorType')[@lgw:companyName='Legisway']

This is a search on ALL the contracts, and on each contract, it dereferences a UUID on ALL internal contractors (multivalued property) and applies a predicate on the contractor nodes. This is, for us, a very classic search (find contracts signed with a specific company), and this search should be fast. 25 minutes is not acceptable.
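
To make the amount of work described above concrete, here is a minimal sketch in plain Java. It is not Jackrabbit code and says nothing about how Jackrabbit evaluates jcr:deref internally; the class DerefCostDemo, the Contract type and the UUID-to-company map are hypothetical stand-ins. It only models the described traversal: visit every contract, follow every reference, test the predicate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the search: not the Jackrabbit query engine.
class DerefCostDemo {
    // A contract holding UUID references to its internal contractors.
    static final class Contract {
        final String name;
        final List<String> contractorUuids;

        Contract(String name, List<String> contractorUuids) {
            this.name = name;
            this.contractorUuids = contractorUuids;
        }
    }

    // Number of references followed so far.
    int dereferences = 0;

    // Visits every contract, follows every reference, and tests the predicate.
    List<String> contractsWith(List<Contract> contracts,
                               Map<String, String> companyByUuid,
                               String company) {
        List<String> hits = new ArrayList<>();
        for (Contract c : contracts) {
            boolean match = false;
            for (String uuid : c.contractorUuids) {
                dereferences++;
                if (company.equals(companyByUuid.get(uuid))) {
                    match = true;
                }
            }
            if (match) {
                hits.add(c.name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> companies = new HashMap<>();
        companies.put("u1", "Legisway");
        companies.put("u2", "Other");
        List<Contract> contracts = Arrays.asList(
            new Contract("c1", Arrays.asList("u1", "u2")),
            new Contract("c2", Arrays.asList("u2")));
        DerefCostDemo demo = new DerefCostDemo();
        System.out.println(demo.contractsWith(contracts, companies, "Legisway")); // [c1]
        System.out.println(demo.dereferences); // 3
    }
}
```

In this model the number of dereferences equals the total number of references over all contracts, so the work grows with the size of the whole contracts list rather than with the size of the result.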

Is there documentation on how Jackrabbit uses Lucene, how indexes are built and on what, and is Jackrabbit going to deal better with such architectures (which, once again, I think are quite common)?

Frederic Esnault

RE: About Jackrabbit repository model

Posted by Frédéric Esnault <fe...@legisway.com>.


> Well, it is probably not that easy to come up with such an architecture at
> large.

Right ;)

> In your use case, I would suggest to not use same name siblings -
> use a counter or some other id to name the nodes. I assume this would
> already solve many (if not most) of your performance issues.

I was actually thinking of naming my nodes with some kind of "business ID" or something (thinking in progress)

> Regarding hierarchy: There is no single rule to this. Given your contracts
> (and other) lists, you could structure by geographic location, by a number of
> leading characters in the customer name, by contract year, whatever.

Full of good ideas there, thanks, but I must keep things easy for other use cases, like administration, automatic node creation, moves, refactoring and so on. Such categorization may multiply the use cases to handle, easy or not (for example, changing a contract date could require moving the contract from one category to another...), so modifying the architecture needs some brainstorming.

I very much appreciate your proposals, thanks a lot!
 
> Maybe, if you drop same name siblings, you might not even be required to
> introduce hierarchies.

This is also what I've been thinking, and why I'm planning to try another naming pattern and test performance again. I'll definitely inform the community of the results.

Frederic Esnault

Re: About Jackrabbit repository model

Posted by Felix Meschberger <Fe...@day.com>.

Hi Frédéric,

> Can you come up with a more Jackrabbit-friendly architecture to deal with my
> contracts/contractors/contacts architecture, or the book/author/comments we
> read on this mailing list?


Well, it is probably not that easy to come up with such an architecture at
large. In your use case, I would suggest to not use same name siblings -
use a counter or some other id to name the nodes. I assume this would
already solve many (if not most) of your performance issues.
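
As a minimal sketch of the "counter or some other id" idea, the plain-Java helper below hands out a distinct name per node instead of reusing one name for all siblings. The class NodeNamer, the "contract" prefix and the six-digit padding are assumptions made for illustration only; in a real repository the counter would also have to be stored somewhere so that names stay unique across restarts.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: generates distinct sibling names from a counter.
class NodeNamer {
    private final String prefix;
    private final AtomicLong counter;

    NodeNamer(String prefix, long start) {
        this.prefix = prefix;
        this.counter = new AtomicLong(start);
    }

    // Returns names such as "contract-000001"; zero padding keeps them sortable.
    String next() {
        return String.format(Locale.ROOT, "%s-%06d", prefix, counter.getAndIncrement());
    }

    public static void main(String[] args) {
        NodeNamer namer = new NodeNamer("contract", 1);
        System.out.println(namer.next()); // contract-000001
        System.out.println(namer.next()); // contract-000002
    }
}
```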

Regarding hierarchy: There is no single rule to this. Given your contracts
(and other) lists, you could structure by geographic location, by a number of
leading characters in the customer name, by contract year, whatever.
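
Two of these options can be sketched as simple path builders in plain Java. The class BucketPath, the "/contracts" root and the node names are hypothetical; the sketch only shows how a parent path could be derived from the leading characters of a name or from a contract year, so that each parent ends up with fewer children.

```java
import java.util.Locale;

// Hypothetical path builders for spreading nodes over intermediate levels.
class BucketPath {
    // Builds e.g. "/contracts/l/le/Legisway" from the first `depth` leading characters.
    static String byLeadingChars(String root, String name, int depth) {
        String key = name.toLowerCase(Locale.ROOT);
        StringBuilder path = new StringBuilder(root);
        for (int i = 1; i <= depth && i <= key.length(); i++) {
            path.append('/').append(key, 0, i);
        }
        return path.append('/').append(name).toString();
    }

    // Builds e.g. "/contracts/2007/contract-000001" from the contract year.
    static String byYear(String root, int year, String name) {
        return root + "/" + year + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(byLeadingChars("/contracts", "Legisway", 2)); // /contracts/l/le/Legisway
        System.out.println(byYear("/contracts", 2007, "contract-000001")); // /contracts/2007/contract-000001
    }
}
```

Whichever key is chosen, a value that rarely changes is the more convenient one, since a change of key means moving the node to another parent.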

Maybe, if you drop same name siblings, you might not even be required to
introduce hierarchies.

Regards
Felix

RE: About Jackrabbit repository model

Posted by Frédéric Esnault <fe...@legisway.com>.


> For example, consider a node /A with 5 children named B. You would access
> the first as /A/B[1] the second as /A/B[2] and so forth. If you now decide
> to delete the second one (B[2]) the indices of all same name siblings whose
> index is larger than 2 will change. So B[3] becomes B[2], B[4] becomes B[3]
> and B[5] becomes B[4]. IMHO this is problematic and this is why I never use
> same name siblings.

I agree on this, but this is not a problem. The spec (section 4.3.1, if I'm not mistaken) states that the order of same name siblings must be consistent. This means that whatever you do, even if you remove a node, the order remains the same, which is what matters most to me. And basically I don't think you'll access a specific node by its position, but more likely with a predicate on a property value.


> This is purely an issue of the current implementation of node states. But,
> honestly, I do not think the critical limit is already at 1000 child nodes,
> this number should be much higher - of course not considering same name
> siblings.

Of course, 1000 was just a random number, not an evaluation of Jackrabbit's limitations. But my problem here is that I need both: many child nodes, and all of them same name siblings ;)

> Your main issue is using same name siblings. (Did I already say, I don't
> like them ? :-) )
Yes I got that.


> I tend to think, that not everything which is "quite common" is also "good"
> by definition. Don't get me wrong here, I do not want to say, your
> architecture is bad. But it seems that the architecture does not cope well
> with the current implementation of Jackrabbit.

I agree on this: the fact that many people do the same thing does not mean they are not making a mistake, you're perfectly right. But for software requirements, if most users want to use a certain architecture (or whatever), it actually should become a requirement for the software, imo (and not only in my opinion, I guess).

Can you come up with a more Jackrabbit-friendly architecture to deal with my contracts/contractors/contacts architecture, or the book/author/comments we read on this mailing list?

Frederic Esnault

Re: About Jackrabbit repository model

Posted by Felix Meschberger <Fe...@day.com>.

Hi Frédéric,

> - Jackrabbit doesn't like same name siblings, so it would be better
> to give nodes specific names;


The problem with same name siblings is two-fold: One is the implementation
inside Jackrabbit and one is the specification.

For example, consider a node /A with 5 children named B. You would access
the first as /A/B[1] the second as /A/B[2] and so forth. If you now decide
to delete the second one (B[2]) the indices of all same name siblings whose
index is larger than 2 will change. So B[3] becomes B[2], B[4] becomes B[3]
and B[5] becomes B[4]. IMHO this is problematic and this is why I never use
same name siblings.
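
The index shift can be reproduced with a small plain-Java model. This is not Jackrabbit code: the class SnsIndexDemo and the ids n1..n5 are made up for illustration, and the model simply treats the index in a path as the node's current 1-based position among its siblings, as in JCR paths such as /a/b[1].

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of same name siblings under /A, all named B.
class SnsIndexDemo {
    private final List<String> siblings = new ArrayList<>();

    void add(String id) {
        siblings.add(id);
    }

    // Current path of the sibling with the given id, or null if it is gone.
    String pathOf(String id) {
        int pos = siblings.indexOf(id);
        return pos < 0 ? null : "/A/B[" + (pos + 1) + "]";
    }

    // Removes the sibling addressed by a 1-based index.
    void remove(int index) {
        siblings.remove(index - 1);
    }

    public static void main(String[] args) {
        SnsIndexDemo a = new SnsIndexDemo();
        for (String id : new String[] {"n1", "n2", "n3", "n4", "n5"}) {
            a.add(id);
        }
        System.out.println(a.pathOf("n3")); // /A/B[3]
        a.remove(2);                        // delete the second sibling
        System.out.println(a.pathOf("n3")); // /A/B[2]
    }
}
```

The node n3 itself is untouched, yet its path changes, so any path stored before the removal now points at a different node.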

AFAIK, inside Jackrabbit same name siblings result in an additional
indirection to handle the list of nodes with the same name.

> - Jackrabbit doesn't like too many child nodes under one and the same node
> (i.e. 1000 or more child nodes for one parent node)


This is purely an issue of the current implementation of node states. But,
honestly, I do not think the critical limit is already at 1000 child nodes,
this number should be much higher - of course not considering same name
siblings. But it is true that the more child nodes you have on a single
parent node, the more performance will suffer and the more memory
Jackrabbit will consume.

Your main issue is using same name siblings. (Did I already say, I don't
like them ? :-) )

> But I don't think my repository structure is unique.


While the JCR API is intended to also support such flat hierarchies, it is
really the support for higher level hierarchies, which makes the API unique.
And this is also at the bottom of the Jackrabbit implementation: Flat
hierarchies have only been a side goal of the design.

> It seems that searching on this kind of architecture takes ages, I made a
> search yesterday that I stopped after 25 minutes.
>
> /jcr:root/lgw:root/lgw:contracts/lgw:contract/jcr:deref(@lgw:internalContractor,
> 'lgw:contractorType')[@lgw:companyName='Legisway']


I agree, that 25 minutes is not acceptable. I cannot tell for sure, what the
problem here is, but I assume it is all linked together: flat hierarchy and
a query which seems to be expensive to handle - I could assume that the
jcr:deref function might have quite a high cost.

> I think is quite common)?


I tend to think, that not everything which is "quite common" is also "good"
by definition. Don't get me wrong here, I do not want to say, your
architecture is bad. But it seems that the architecture does not cope well
with the current implementation of Jackrabbit.

Regards
Felix

Re: About Jackrabbit repository model

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 6/21/07, Frédéric Esnault <fe...@legisway.com> wrote:
> This breaks the two "jackrabbit doesn't like" rules said before, but I think
> this is not a strange architecture.

Responding to the general issue; as more and more people are starting
to use Jackrabbit (see the growth in user list subscriptions at [1])
we are encountering use cases and application architectures that don't
fit the original design goals of Jackrabbit that well.

The "quick" solution to such cases is to modify the application
architecture or work around the limitations, but it is important that
people raise these issues on the mailing lists and the issue tracker
as it shows that Jackrabbit isn't meeting some use cases that people
would like to use it for.

It is then up for debate whether those cases are something we want to
support and how they would be best implemented, but even having the
feedback in the first place is very important and healthy for the
project. Such feedback is IMHO much more important in driving the
project than architectural visions of how things should be.

[1] blue line in http://people.apache.org/~coar/users_jackrabbit_apache_org.png

BR,

Jukka Zitting

RE: About Jackrabbit repository model

Posted by Frédéric Esnault <fe...@legisway.com>.

> absolutely! your feedback is very appreciated. i guess i didn't
> express myself clearly enough, sorry. with architecture i referred to
> jackrabbit's core design, not your
> repository model.

There was no misunderstanding, don't worry ;) And it's possible that using Jackrabbit for our project is not a good idea; this is exactly what I'm currently trying to evaluate, but for this I have to push Jackrabbit to its limits and see how well (or badly) it reacts to our requirements.


Frédéric Esnault

Re: About Jackrabbit repository model

Posted by Stefan Guggisberg <st...@gmail.com>.

On 6/21/07, Frédéric Esnault <fe...@legisway.com> wrote:
>
>
> Frédéric Esnault - R&D Engineer
>
> > i tested write performance (adding a node to a parent with many child nodes)
> > since this suffers with very large child node sets. using samename siblings has
> > no impact on write performance.
>
> Writing one node, even with thousands of child nodes already on the parent node, does not perform badly. And for us (my company and the project) writing is not critical, but reading/searching IS critical. That's why my testing is - mostly - about reading/searching.

ok, i see.

>
> > i guess jukka meant that we should first gather a "real life"
> > requirements/use cases list from user feedback rather than discussing
> > architectures accommodating theoretical use cases. I agree with jukka.
>
> I agree on this, a debate must be based on enough material. If ever I can help.....
> But the architecture I'm using for tests is not coming from nowhere, it sticks very closely to the project's needs, based on use cases from the history of the 4 previous versions. So it may have value as a requirement proposal.

absolutely! your feedback is very appreciated. i guess i didn't
express myself clearly enough, sorry. with architecture i referred to
jackrabbit's core design, not your
repository model.

cheers
stefan

RE: About Jackrabbit repository model

Posted by Frédéric Esnault <fe...@legisway.com>.


Frédéric Esnault - R&D Engineer

> i tested write performance (adding a node to a parent with many child nodes)
> since this suffers with very large child node sets. using samename siblings has
> no impact on write performance.

Writing one node, even with thousands of child nodes already on the parent node, does not perform badly. And for us (my company and the project) writing is not critical, but reading/searching IS critical. That's why my testing is - mostly - about reading/searching.

> i guess jukka meant that we should first gather a "real life"
> requirements/use cases list from user feedback rather than discussing
> architectures accommodating theoretical use cases. I agree with jukka.

I agree on this, a debate must be based on enough material. If ever I can help.....
But the architecture I'm using for tests is not coming from nowhere, it sticks very closely to the project's needs, based on use cases from the history of the 4 previous versions. So it may have value as a requirement proposal.

Frederic Esnault

Re: About Jackrabbit repository model

Posted by Stefan Guggisberg <st...@gmail.com>.

On 6/21/07, Frédéric Esnault <fe...@legisway.com> wrote:
> > it's not that Jackrabbit "doesn't like it", it's because samename
> > siblings cause a lot of issues from a user perspective (e.g. the path
> > '/a/b[2]' is not stable since it might become
> > '/a/b[1]' if '/a/b[1]' were removed).
>
> Yes, but as I answered before, the spec only says that the order of same name siblings must stay consistent, not their exact path. So let's stick to it...like for SQL queries and joins... ;-)
>
> > jackrabbit's implementation is optimized for small to medium sized
> > child node sets.
> > performance tests showed that up to ~20k child nodes there's no significant
> > negative performance impact.
>
> Your performance tests were not made with same name siblings I guess? Because mine,

i tested write performance (adding a node to a parent with many child nodes)
since this suffers with very large child node sets. using samename siblings has
no impact on write performance.

> with 12K same name siblings child nodes, a simple search is awful.
> Not a search on something like //lgw:contract[5], which is fast, but
> even /jcr:root/lgw:root/lgw:contracts/lgw:contract is an awful
> query....
>
> I agree with Jukka Zitting, a debate on the architectures for which Jackrabbit should be optimized/made would be interesting, and I'd be glad to participate in such a discussion.

i guess jukka meant that we should first gather a "real life"
requirements/use cases list from user feedback rather than discussing
architectures accommodating theoretical use cases. i agree with jukka.

cheers
stefan


Re: About Jackrabbit repository model

Posted by Marcel Reutegger <ma...@gmx.net>.

Frédéric Esnault wrote:
> Your performance tests were not made with same name siblings I guess? Because
> mine, with 12K same name siblings child nodes, a simple search is awful. Not
> a search on something like //lgw:contract[5], which is fast, but even
> /jcr:root/lgw:root/lgw:contracts/lgw:contract is an awful query....

the default configuration in jackrabbit uses document order on the nodes in a
query result. for larger result sets this is known to be slow because it reads
the nodes from the persistence manager. sibling ordering information is not
present in the index.

did you try to disable document order?

you can add the following parameter to the SearchIndex tag in the workspace.xml
file:

      <param name="respectDocumentOrder" value="false"/>
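
For context, here is a minimal sketch in plain Java that embeds a sample SearchIndex element and reads the parameter back with the JDK's XML parser. The class SearchIndexConfigCheck is hypothetical, the sample is only a fragment (in a real workspace.xml the element sits inside the workspace configuration), and the class attribute and path value shown are assumptions based on a typical default setup; check them against your own file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads a <param> value from a SearchIndex fragment.
class SearchIndexConfigCheck {
    // Sample fragment; class name and path value are assumed typical defaults.
    static final String SAMPLE =
        "<SearchIndex class=\"org.apache.jackrabbit.core.query.lucene.SearchIndex\">"
        + "<param name=\"path\" value=\"${wsp.home}/index\"/>"
        + "<param name=\"respectDocumentOrder\" value=\"false\"/>"
        + "</SearchIndex>";

    // Returns the value of the named <param>, or null if it is not set.
    static String param(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList params = doc.getDocumentElement().getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                Element p = (Element) params.item(i);
                if (name.equals(p.getAttribute("name"))) {
                    return p.getAttribute("value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(param(SAMPLE, "respectDocumentOrder")); // false
    }
}
```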

regards
   marcel

RE: About Jackrabbit repository model

Posted by Frédéric Esnault <fe...@legisway.com>.

> it's not that Jackrabbit "doesn't like it", it's because samename
> siblings cause a lot of issues from a user perspective (e.g. the path
> '/a/b[2]' is not stable since it might become
> '/a/b[1]' if '/a/b[1]' were removed).

Yes, but as I answered before, the spec only says that the order of same name siblings must stay consistent, not their exact path. So let's stick to it...like for SQL queries and joins... ;-)

> jackrabbit's implementation is optimized for small to medium sized
> child node sets.
> performance tests showed that up to ~20k child nodes there's no significant
> negative performance impact.

Your performance tests were not made with same name siblings I guess? Because mine, with 12K same name siblings child nodes, a simple search is awful. Not a search on something like //lgw:contract[5], which is fast, but even /jcr:root/lgw:root/lgw:contracts/lgw:contract is an awful query....

I agree with Jukka Zitting, a debate on the architectures for which Jackrabbit should be optimized/made would be interesting, and I'd be glad to participate in such a discussion.

cheers
stefan

Re: About Jackrabbit repository model

Posted by Stefan Guggisberg <st...@gmail.com>.

On 6/21/07, Frédéric Esnault <fe...@legisway.com> wrote:
> Hi there !
>
> I've been thinking about what I've been told here.
>
> It has been said that :
>
> - Jackrabbit doesn't like same name siblings, so it would be better to give nodes specific names;

it's not that Jackrabbit "doesn't like it", it's because samename
siblings cause a lot of issues from a user perspective (e.g. the path
'/a/b[2]' is not stable since it might become
'/a/b[1]' if '/a/b[1]' were removed).

>
> - Jackrabbit doesn't like too many child nodes under one and the same node (i.e. 1000 or more child nodes for one parent node)

jackrabbit's implementation is optimized for small to medium sized
child node sets.
performance tests showed that up to ~20k child nodes there's no significant
negative performance impact.

cheers
stefan
