Posted to dev@sling.apache.org by "Mark Baker (JIRA)" <ji...@apache.org> on 2009/10/06 21:13:31 UTC

[jira] Created: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Support hierarchical child node creation from SlingPostServlet
--------------------------------------------------------------

                 Key: SLING-1137
                 URL: https://issues.apache.org/jira/browse/SLING-1137
             Project: Sling
          Issue Type: Improvement
          Components: Servlets
            Reporter: Mark Baker
            Priority: Minor


The default node creation functionality on "/" terminated paths via the SlingPostServlet doesn't scale very well as it only supports creation of nodes immediately under the targeted path.  So, for example, when using this via a CQ form to capture form responses in the repository, a site can potentially have thousands of child nodes, leading to well known performance problems.

I think it would be useful to offer an option for the servlet to save a hierarchy of nodes, perhaps via the common convention of using the first 4 characters of the would-be node id to create a 2 level hierarchy.
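
A minimal sketch of that convention, assuming the four characters are split two-and-two into the two folder levels (the function name and example paths are hypothetical, not part of the SlingPostServlet):

```python
def two_level_path(parent: str, node_id: str) -> str:
    """Place node_id under parent, using its first four characters as two folder levels."""
    if len(node_id) < 4:
        raise ValueError("node id needs at least four characters to shard on")
    # Characters 0-1 name the first level, characters 2-3 the second (assumed split).
    return "/".join([parent.rstrip("/"), node_id[:2], node_id[2:4], node_id])


example = two_level_path("/content/forms/", "a1b2c3d4")
# example is "/content/forms/a1/b2/a1b2c3d4"
```

This only spreads nodes evenly if the leading characters of the generated id are themselves well distributed.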

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 7 Oct 2009, at 14:07, Mike Müller (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763042 
> #action_12763042 ]
>
> Mike Müller commented on SLING-1137:
> ------------------------------------
>
> I think this would be a really cool feature to work around the
> performance issue of Jackrabbit if you have many nodes under the
> same node.
> What about a service which can be registered under a specific path
> and which addresses this issue transparently for the client:
> 1) the service takes care that the subnode under the registered path
> is saved in a hierarchical tree (e.g. a tree based on hashes)
> 2) the service also acts as a ResourceProvider which returns the
> searched resource
>
> For example:
> 1) You register this new service under /my/path
> 2) If you post a new node under /my/path/newnode, the service takes
> care to save the new node in a non-flat structure, for example in a
> hashed structure under /my/path/b7/newnode
> 3) If you get the resource under /my/path/newnode, the service
> (ResourceProvider) takes care that /my/path/b7/newnode is returned.
>
> WDYT?

IIUC (and this may be off track)

We (Sakai) do this already; it's the only way we could get to billions
of child nodes at URL level. For instance, we have 3 stores that depend
on this.
MessageStore, a Gmail-like store of all messages in the system (chat,
discussion, email, comments etc.). URL space is /user/ieb/messages/xxxx
where xxxx is the message ID; JCR path is hashed at ieb and at xxxx.
Contact, a social network graph where each user may have 1000's of
contacts and there are 1000's of users. URL space is //contacts/ieb/mikedev;
JCR path is hashed at ieb and mikedev.

The final one that really needs to scale for us is a shared pool of  
resources /files/xxxx is the permanent URL, which is referenced from  
many other places in the URL structure. the xxxx is hashed. We have  
chosen a 4x255 hash eg ff/ff/ff/ff which should get us to 4e9 items  
without too much contention or update slowdown.
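
As a rough sketch of such a layout (the use of SHA-1, the function name, and storing the item under the leaf folder are assumptions, not the Sakai code): two hex digits per level give 256 values (00..ff), so four levels give 256^4 leaf folders, roughly 4e9.

```python
import hashlib

LEVELS = 4      # four folder levels, as in ff/ff/ff/ff
FANOUT = 256    # two hex digits per level: 00..ff


def hashed_path(root: str, item_id: str) -> str:
    """Return root/<h0>/<h1>/<h2>/<h3>/item_id, with h0..h3 taken from a hash of item_id."""
    digest = hashlib.sha1(item_id.encode("utf-8")).hexdigest()
    # Consecutive pairs of hex digits become the folder names.
    buckets = [digest[i:i + 2] for i in range(0, 2 * LEVELS, 2)]
    return "/".join([root.rstrip("/"), *buckets, item_id])


# Number of distinct leaf folders: 256 ** 4 = 4_294_967_296.
LEAF_FOLDERS = FANOUT ** LEVELS
```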

The core code is in an AbstractVirtualServlet that when extended  
allows mapping to a parent resource type to indicate the location in  
the JCR is hashed, but that one needs a modification to the  
JcrResourceResolver2 and the API to work.

I am currently rewriting the ResourceProviderEntry to accommodate the
modification without changing the API, but it's complex and slow work
to get it right.

Ian




[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Mike Müller (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763042#action_12763042 ] 

Mike Müller commented on SLING-1137:
------------------------------------

I think this would be a really cool feature to work around the performance issue of Jackrabbit if you have many nodes under the same node.
What about a service which can be registered under a specific path and which addresses this issue transparently for the client:
1) the service takes care that the subnode under the registered path is saved in a hierarchical tree (e.g. a tree based on hashes)
2) the service also acts as a ResourceProvider which returns the searched resource

For example:
1) You register this new service under /my/path
2) If you post a new node under /my/path/newnode, the service takes care to save the new node in a non-flat structure, for example in a hashed structure under /my/path/b7/newnode
3) If you get the resource under /my/path/newnode, the service (ResourceProvider) takes care that /my/path/b7/newnode is returned.

WDYT?
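
The mapping in the example above can be sketched roughly as follows. This is an in-memory illustration only, not a Sling ResourceProvider; the class and function names and the choice of SHA-1 for the bucket are assumptions, so the bucket will generally not be "b7".

```python
import hashlib


def storage_path(registered_root: str, url_path: str) -> str:
    """Map /my/path/newnode to /my/path/<bucket>/newnode."""
    root = registered_root.rstrip("/")
    if not url_path.startswith(root + "/"):
        raise ValueError(f"{url_path} is not below {root}")
    name = url_path[len(root) + 1:]
    if not name or "/" in name:
        raise ValueError("only direct children of the registered path are bucketed")
    # First two hex digits of the hash pick one of 256 buckets (assumed scheme).
    bucket = hashlib.sha1(name.encode("utf-8")).hexdigest()[:2]
    return f"{root}/{bucket}/{name}"


class HashedStore:
    """Toy stand-in for the proposed service: clients only ever see the flat URL path."""

    def __init__(self, registered_root: str):
        self.root = registered_root
        self._nodes = {}  # storage path -> properties

    def post(self, url_path: str, properties: dict) -> None:
        self._nodes[storage_path(self.root, url_path)] = dict(properties)

    def get(self, url_path: str):
        return self._nodes.get(storage_path(self.root, url_path))
```

Because the bucket is derived from the node name alone, both the write and the read can compute the storage path without any lookup table.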



[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762904#action_12762904 ] 

Bertrand Delacretaz commented on SLING-1137:
--------------------------------------------

Mark, how would you differentiate between a) "client wants specific path" and b) "client wants path to be generated" cases?

a)
new/jcr:primaryType = nt:folder
new/path/jcr:primaryType = nt:folder 

b)
*/jcr:primaryType = nt:folder
*/path/jcr:primaryType = nt:folder 

or something like that?



[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Mark Baker (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762775#action_12762775 ] 

Mark Baker commented on SLING-1137:
-----------------------------------

Hi Alex.  Yes, that's true, but the client won't always be able to easily specify the path - for example, CQ can't do that without a custom form action.

I suppose the case could be made that this should be a higher level feature, but I personally think it's generally valuable enough to belong in Sling.  YMMV of course 8-)



[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Ian Boston (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763077#action_12763077 ] 

Ian Boston commented on SLING-1137:
-----------------------------------

I agree with the plugin approach, since there are plenty of situations where every available semantic structure generates scalability and concurrency issues (until JCR-642 has been fixed). For those situations you have no option but to use something that generates a flat distribution of node paths throughout the chosen taxonomy.

It might also be helpful to distinguish between JCR path and URL path, since IMHO they don't have to be the same, and it would be completely wrong to expose a JCR path structured for scalability and concurrency to the user.

eg 
http://myresearch.cam.ac.uk/~ieb

where ieb is one of 25K users.

Putting that in JCR as /users/_ieb won't work, but giving URLs out like

http://myresearch.cam.ac.uk/i/e/~ieb

will just be embarrassing (cf. AFS, e.g. https://www.sit.auckland.ac.nz/Mapping_a_network_drive_to_an_AFS_path_%28Windows%29)



[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Alexander Klimetschek (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762760#action_12762760 ] 

Alexander Klimetschek commented on SLING-1137:
----------------------------------------------

You can create a hierarchy with a single post request; it's only the client that has to know what it should look like.

For example, a post to

/content/new/path/to/something

with jcr:primaryType=my:nodetype and other properties set will create the intermediate path /content/new/path/to (assuming only /content exists) with nt:unstructured as the node type.

To define nodetypes and properties for the intermediary nodes, you can adjust the post to go to

/content

and set

new/jcr:primaryType = nt:folder
new/path/jcr:primaryType = nt:folder
etc.

If you are stuck with the post to the full path, you can also use absolute paths in the fields (AFAIK):

/content/new/jcr:primaryType = nt:folder
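
For illustration, a request of the second kind (post to /content with relative field names) could be assembled like this. It is a sketch only: the host is an assumed local instance, the helper name is hypothetical, the third field extends the "etc." above, and nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(base_url: str, target: str, fields: dict) -> Request:
    """Build (but do not send) a form-encoded POST for the given target path."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        base_url.rstrip("/") + target,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_post(
    "http://localhost:8080",  # assumed local Sling instance
    "/content",
    {
        "new/jcr:primaryType": "nt:folder",
        "new/path/jcr:primaryType": "nt:folder",
        # Assumption: the leaf node is typed the same way as the intermediate ones.
        "new/path/to/something/jcr:primaryType": "my:nodetype",
    },
)
# Passing `request` to urllib.request.urlopen would send it; credentials are not handled here.
```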



Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Oct 8, 2009 at 00:24, John Norman <jo...@caret.cam.ac.uk> wrote:
> Does it make any difference that different users might have a different
> logical tree for organising the same content? I have seen quite a few
> hierarchical information organisation models that make sense to one human
> being but are completely unhelpful to another. I quite like the concept of
> an amorphous information pool that can have multiple apparent organisations
> according to viewer and context.

I agree: one just has to look at the different folder structures on
people's computers. The best one I have seen (and he is a developer!)
was to put everything flat on the desktop and after a while, when it
becomes full and old, he simply moved everything into an "old" folder
- on the desktop. Over time, this gives quite a few nested old
folders. Unusual but very simple... oh, I am getting off-topic ;-)

You could achieve multiple views with shareable nodes in JCR 2.0 plus
hiding the other nodes that point to the same structure depending on
the current user. But I don't think it's worth the effort. When you
have shared content you have to come up with a structure that is
useable for everyone (at least a bit). Here at Day we define the base
structure of the repository for our CQ5 product, a structure that
emerged through the product's history and follows many conventions
from the unix file system hierarchy (eg. short names are important).
Only at the lower levels customers are free to use a different
structure (for code and other non-site content stuff).

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by John Norman <jo...@caret.cam.ac.uk>.
Non-developer warning:

On 7 Oct 2009, at 22:47, Alexander Klimetschek wrote:
> [...]
> But even if Jackrabbit scales with hundred thousands of child nodes
> per node, you still have the problem of an unbalanced tree: it will be
> hard or not to say impossible to browse that tree for a human - you'd
> need a very advanced paging tree view to be able to go through that)
> and just doesn't "feel" right. Well, at least to me ;-)

Does it make any difference that different users might have a  
different logical tree for organising the same content? I have seen  
quite a few hierarchical information organisation models that make  
sense to one human being but are completely unhelpful to another. I  
quite like the concept of an amorphous information pool that can have  
multiple apparent organisations according to viewer and context.



RE: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Mike Müller <mi...@mysign.ch>.
> > IMHO the content structure is a critical part of the system (at least
> > on the technical side), so I would involve a developer with knowledge
> > and experience about the "underlying" repository and content modeling
> > whenever a new URL space is created by some application. Or teach the
> > UI developers about content modeling and provide them with simple APIs
> > to avoid flat or hash-based hierarchies.
>
> I agree, I have been trying to do that just for the past 18 months.
> Sadly the non technical view is that the URL space is *user* space,
> which has been the motivation to provide some virtualization between
> jcr URI's and http URI's.


IMHO, it's not important whether the URL space is UI space or developer space.
It's a simple fact that there are use cases where we have thousands of
resources under the same node which cannot be structured in a reasonable
manner, like users under a node /user/. But it's also a fact that we
need a solution here only because Jackrabbit does not scale very well if
the structure is flat. So the real solution to that problem would be
to solve https://issues.apache.org/jira/browse/JCR-642. But if you have
a look at that issue, it doesn't seem likely to be solved any time soon.
That's why I think we should provide such a service in Sling, which (there
I totally agree) is in fact only a workaround, but a much needed one.

best regards
mike

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 8 Oct 2009, at 09:24, Alexander Klimetschek wrote:

> IMHO the content structure is a critical part of the system (at least
> on the technical side), so I would involve a developer with knowledge
> and experience about the "underlying" repository and content modeling
> whenever a new URL space is created by some application. Or teach the
> UI developers about content modeling and provide them with simple APIs
> to avoid flat or hash-based hierarchies.

I agree, I have been trying to do that just for the past 18 months.  
Sadly the non technical view is that the URL space is *user* space,  
which has been the motivation to provide some virtualization between  
jcr URI's and http URI's.

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Alexander Klimetschek <ak...@day.com>.
On Thu, Oct 8, 2009 at 10:11, Ian Boston <ie...@tfd.co.uk> wrote:
> 1. The URL space is part of the UI and "owned" by the User, UX Designer, UI
> developer.
> 2. Imposing a convention on that URL space for the affordances of the back
> end causes just the problem that you are concerned about. Now the UI
> developer needs to know the internals of how to structure those URLs to
> achieve scalability.
> ...
> BTW, a UI developer does not write Java code. They use the REST interfaces,
> they might write some py, esp or rb.
> ...
> On 1, our Users, UX Designers and UI developers are demanding URLs like
> /xxxx/yy where for all instances of yy, yy is unique, and yy might be on of
> 1-200K and in some instances I know of upto 4M (the 16G is an edge case but
> if I break out of the Higher Ed use case there are plenty of examples of
> URLs where yy is one of billions).   There are two solid examples /user/eid
> where eid is the institutional ID and /site/siteid where site ID the name of
> the Site, eg physics101.
> ...
> On 2. If we have to communicate how to structure the URL to UI developers
> for storage, then it hardly matters what the scheme is, we have to
> communicate it. An algorithm that says formatTime(now,"/{YYYY}/{MM}/{DD}/")
> is almost as simple as formatSha1(pathInfo,"/{01}/{23}/{45}/{67}") but I
> cant ask the UI developer to to do either. This is not to say that they
> might not decide to structure the URL in a semantic form, and I would
> encourage them to do so, but they always come back to the case where there
> is a user generated URL space that will have > 10K items at yy.

IMHO the content structure is a critical part of the system (at least
on the technical side), so I would involve a developer with knowledge
and experience about the "underlying" repository and content modeling
whenever a new URL space is created by some application. Or teach the
UI developers about content modeling and provide them with simple APIs
to avoid flat or hash-based hierarchies.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 7 Oct 2009, at 22:47, Alexander Klimetschek wrote:

> On Wed, Oct 7, 2009 at 20:34, Ian Boston <ie...@tfd.co.uk> wrote:
>> I agree, I would like to adopt sensible naming, but we keep on hitting
>> situations where even with the most reasonable domain prefix we end up
>> with > 2K items in a folder and then the update rates go through the
>> floor, and contention and un mergable changes fall over. (usually just
>> at the worst time possible... when load is highest )
>>
>> In our case we often run out of things to slice before we reach a
>> position where the store works. eg ieb i/ie/ieb  gives 64 at level 1
>> which generates huge amounts of collision at level2 which again only
>> has 64 making the maximum scale of somewhere around 4096*1024 items
>> assuming a perfect distribution before the bottom level folders breach
>> 1024 children. For messaging for instance, I need a store that does
>> about > 255^3 before colliding, ie 16M *1024.  Am I wrong to be
>> choosing jcr as a message store to support this use case ?
>
> I think you are really at the edge of scaling here. How many messages
> are added per day? I'd think that date + maybe time (if there are more
> than 2K per day) should balance it enough, for example. Organizing
> messages by date is probably the best way anyway. And I guess they
> won't change at all, only new ones are added, which also should reduce
> contention to the node with the current time.


I agree that all of these structures help avoid the scaling issues but  
IMO they miss two points that have been highlighted in our use of Sling.

I am only talking about URL's here *not* the path in JCR unless we are  
forced to have a 1:1 mapping.

1. The URL space is part of the UI and "owned" by the User, UX  
Designer, UI developer.
2. Imposing a convention on that URL space for the affordances of the  
back end causes just the problem that you are concerned about. Now the  
UI developer needs to know the internals of how to structure those  
URLs to achieve scalability.

BTW, a UI developer does not write Java code. They use the REST  
interfaces, they might write some py, esp or rb.

On 1, our Users, UX Designers and UI developers are demanding URLs
like /xxxx/yy where for all instances of yy, yy is unique, and yy
might be one of 1-200K and in some instances I know of up to 4M (the 16G
is an edge case, but if I break out of the Higher Ed use case there are
plenty of examples of URLs where yy is one of billions). There are
two solid examples: /user/eid, where eid is the institutional ID, and
/site/siteid, where siteid is the name of the Site, e.g. physics101.

These URLs *must* be speakable human to human, so
/site/e4f3-de45-f345-efe4 is not acceptable, and /user/i/ie/ieb, although
just speakable, will remind our community of their institutional
deployments of the Andrew File System, IMHO *not* a good thing as for
many institutions it has not been synonymous with scalability.

On 2. If we have to communicate how to structure the URL to UI
developers for storage, then it hardly matters what the scheme is; we
have to communicate it. An algorithm that says
formatTime(now,"/{YYYY}/{MM}/{DD}/") is almost as simple as
formatSha1(pathInfo,"/{01}/{23}/{45}/{67}"), but I can't ask the UI
developer to do either. This is not to say that they might not decide
to structure the URL in a semantic form, and I would encourage them to
do so, but they always come back to the case where there is a
user-generated URL space that will have > 10K items at yy.
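
To make the comparison concrete, here is a rough sketch of both helpers. The names mirror the pseudo-calls above, but the template semantics are an assumption: "{YYYY}", "{MM}" and "{DD}" are taken as zero-padded date parts, and "{01}" as hex digits 0 and 1 of the SHA-1 digest.

```python
import hashlib
import re
from datetime import datetime


def format_time(when: datetime, template: str) -> str:
    """Fill {YYYY}, {MM} and {DD} with zero-padded parts of the date."""
    return (template.replace("{YYYY}", f"{when.year:04d}")
                    .replace("{MM}", f"{when.month:02d}")
                    .replace("{DD}", f"{when.day:02d}"))


def format_sha1(path_info: str, template: str) -> str:
    """Fill {01}, {23}, ... with the hex digits of sha1(path_info) at those positions."""
    digest = hashlib.sha1(path_info.encode("utf-8")).hexdigest()
    # Each digit inside the braces is read as a single index into the digest,
    # so only positions 0-9 are addressable in this sketch.
    return re.sub(
        r"\{(\d+)\}",
        lambda m: "".join(digest[int(c)] for c in m.group(1)),
        template,
    )


by_date = format_time(datetime(2009, 10, 7), "/{YYYY}/{MM}/{DD}/")  # "/2009/10/07/"
by_hash = format_sha1("physics101", "/{01}/{23}/{45}/{67}")
```

Either helper is only a few lines; the point made above is that someone still has to tell every client which one to call.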


e.g. "What! You mean I can't just put it at /site/xxx, I have to
structure the URL? But that's not what the users are saying they want;
they want to be able to decide what the URL to their site is and, btw,
they don't like using /site, they want /xxx, you know, like
http://www.bbc.co.uk/radio4" (I paraphrase a discussion of a few months ago)




>
> If there is some other categorization of messages, eg. like the
> project or group or whatever they belong to, you can put them in the
> project's folder and then do the substructure via the dates. If you
> give the messages a nodetype + other metadata as properties, you can
> search them across projects or months/years.
>
>> Sounds like if JCR-642 was fixed, none of this would be an issue?
>
> Not really. First of all it's not just a "fix", it requires a complete
> rewrite of the internal persistence architecture in Jackrabbit.
> Something for a 3.0 maybe (and there are various ideas how to do that
> and also improve other bottlenecks).
>
> But even if Jackrabbit scales with hundred thousands of child nodes
> per node, you still have the problem of an unbalanced tree: it will be
> hard or not to say impossible to browse that tree for a human - you'd
> need a very advanced paging tree view to be able to go through that)
> and just doesn't "feel" right. Well, at least to me ;-)


Agreed, a list of all nodes at yy is explicitly not supported; we use
search to provide a number of different hierarchies into that space

eg
date organized http://host/messages/yyyy/mm/dd.json
tag organized http://host/tags/sling-dev.json

with a default paging enforced just as any search engine does.

One point here is there are *multiple* views into the information set.


Sorry the message is so long, this is a real, possibly blocking issue  
for us.

Ian

>
> Regards,
> Alex
>
> -- 
> Alexander Klimetschek
> alexander.klimetschek@day.com


Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Alexander Klimetschek <ak...@day.com>.
On Wed, Oct 7, 2009 at 20:34, Ian Boston <ie...@tfd.co.uk> wrote:
> I agree, I would like to adopt sensible naming, but we keep on hitting
> situations where even with the most reasonable domain prefix we end up with
> > 2K items in a folder and then the update rates go through the floor, and
> contention and un mergable changes fall over. (usually just at the worst
> time possible... when load is highest )
>
> In our case we often run out of things to slice before we reach a position
> where the store works. eg ieb i/ie/ieb  gives 64 at level 1 which generates
> huge amounts of collision at level2 which again only has 64 making the
> maximum scale of somewhere around 4096*1024 items assuming a perfect
> distribution before the bottom level folders breach 1024 children. For
> messaging for instance, I need a store that does about > 255^3 before
> colliding, ie 16M *1024.  Am I wrong to be choosing jcr as a message store
> to support this use case ?

I think you are really at the edge of scaling here. How many messages
are added per day? I'd think that date + maybe time (if there are more
than 2K per day) should balance it enough, for example. Organizing
messages by date is probably the best way anyway. And I guess they
won't change at all, only new ones are added, which also should reduce
contention to the node with the current time.

If there is some other categorization of messages, eg. like the
project or group or whatever they belong to, you can put them in the
project's folder and then do the substructure via the dates. If you
give the messages a nodetype + other metadata as properties, you can
search them across projects or months/years.

> Sounds like if JCR-642 was fixed, none of this would be an issue?

Not really. First of all it's not just a "fix", it requires a complete
rewrite of the internal persistence architecture in Jackrabbit.
Something for a 3.0 maybe (and there are various ideas how to do that
and also improve other bottlenecks).

But even if Jackrabbit scales with hundred thousands of child nodes
per node, you still have the problem of an unbalanced tree: it will be
hard or not to say impossible to browse that tree for a human - you'd
need a very advanced paging tree view to be able to go through that)
and just doesn't "feel" right. Well, at least to me ;-)

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 7 Oct 2009, at 18:48, Alexander Klimetschek wrote:

> On Wed, Oct 7, 2009 at 18:10, Ian Boston <ie...@tfd.co.uk> wrote:
>> so if the abstraction and isolation is perfect and the hashed and ugly jcr
>> path never exposed to a developer or user above the layer of service or api,
>> then using them in the JCR itself is unfortunate but acceptable ?
>
> I'd say the "layer of service or api" is the JCR API and that's
> exposed to developers. With Sling, this is certainly the case.

Ahh,
Many of the developers we work with only see the URLs: they are UI
developers working in HTML/JavaScript. Java developers accessing
this area work through service APIs that abstract it away.


>
>> It would obviously be desirable not to have to resort to hashing, but there
>> are cases as soon as a system has more than a few 10K users.
>
> Yes, but I am just saying that reasonably meaningful naming is preferred
> over arbitrary hashes. Dates like 2009/12/01 or nodename prefixes
> "a/ad/admin" or more domain specific categorization.


I agree, I would like to adopt sensible naming, but we keep hitting
situations where even with the most reasonable domain prefix we end up
with > 2K items in a folder. Then the update rates go through the
floor, and things fall over under contention and unmergeable changes
(usually just at the worst possible time, when load is highest).

In our case we often run out of things to slice before we reach a
position where the store works. E.g. ieb -> i/ie/ieb gives 64 folders at
level 1, which generates huge amounts of collision at level 2, which
again only has 64, making the maximum scale somewhere around 4096*1024
items, assuming a perfect distribution, before the bottom-level folders
breach 1024 children. For messaging, for instance, I need a store that
does more than about 255^3 folders before colliding, i.e. 16M*1024
items. Am I wrong to be choosing JCR as a message store to support this
use case?
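
The arithmetic above can be checked with a small Java sketch. It assumes 64 possible folders per prefix level and a perfectly even spread, as described; the class and method names are hypothetical and only serve the illustration.

```java
// Illustration only: the class and method names are hypothetical.
final class PrefixSharding {
    // "ieb" with depth 2 -> "i/ie/ieb"; names shorter than the depth get fewer levels.
    static String prefixPath(String name, int depth) {
        if (name.isEmpty()) {
            throw new IllegalArgumentException("name must not be empty");
        }
        StringBuilder sb = new StringBuilder();
        int levels = Math.min(depth, name.length() - 1);
        for (int i = 1; i <= levels; i++) {
            sb.append(name, 0, i).append('/');
        }
        return sb.append(name).toString();
    }

    // Items that fit before any leaf folder exceeds maxChildren, assuming a
    // perfectly even spread over fanout^levels leaf folders.
    static long capacity(long fanout, int levels, long maxChildren) {
        long folders = 1;
        for (int i = 0; i < levels; i++) {
            folders = Math.multiplyExact(folders, fanout);
        }
        return Math.multiplyExact(folders, maxChildren);
    }

    public static void main(String[] args) {
        System.out.println(prefixPath("ieb", 2));    // i/ie/ieb
        System.out.println(capacity(64, 2, 1024));   // 4194304     (4096 * 1024)
        System.out.println(capacity(255, 3, 1024));  // 16979328000 (about 16.6M * 1024)
    }
}
```

Real names are far from evenly distributed over their first characters, so the first figure is an upper bound that a prefix scheme will not reach in practice.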


>
>> Longer term, looking at the storage of child nodes relative to parents in
>> Jackrabbit itself *might* address this. You mention the Persistence Manager.
>> Are there PMs that don't have the problem or is it above the PM layer ?
>
> I was just using this as an analogy; it affects most persistence
> managers, but especially the optimized bundle pms. They store nodes by
> uuid and as a binary bundle in the database, so accessing the database
> (for doing JCR workarounds for migration, large-scale copying or
> whatever) is not anything that really works because you cannot browse
> it without additional programming help. But every now and then people
> on the Jackrabbit list who are new to JR ask for that: how can I modify
> the nodes in the db, etc. That's because they want to reuse their
> experience with databases and all the admin tools available.
>
> Now with JCR, if you have a JCR-level browser and admin tool, you
> don't need it. And the PM is just an implementation detail. So IMO
> this is a good thing - one that gives you the unstructuredness. But
> above that level you don't want to introduce such a complex mapping so
> that people have no way to use the repository as a fundamental
> infrastructure.


Sounds like if JCR-642 was fixed, none of this would be an issue?

Ian

>
> Regards,
> Alex
>
> -- 
> Alexander Klimetschek
> alexander.klimetschek@day.com


Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Alexander Klimetschek <ak...@day.com>.
On Wed, Oct 7, 2009 at 18:10, Ian Boston <ie...@tfd.co.uk> wrote:
> so if the abstraction and isolation is perfect and the hashed and ugly jcr
> path is never exposed to a developer or user above the layer of service or api,
> then using them in the JCR itself is unfortunate but acceptable ?

I'd say the "layer of service or api" is the JCR API and that's
exposed to developers. With Sling, this is certainly the case.

> It would obviously be desirable not to have to resort to hashing, but there
> are cases as soon as a system has more than a few 10K users.

Yes, but I am just saying that reasonably meaningful naming is preferred
over arbitrary hashes. Dates like 2009/12/01 or nodename prefixes
"a/ad/admin" or more domain specific categorization.

> Longer term, looking at the storage of child nodes relative to parents in
> Jackrabbit itself *might* address this. You mention the Persistence Manager.
> Are there PMs that don't have the problem or is it above the PM layer ?

I was just using this as an analogy; it affects most persistence
managers, but especially the optimized bundle pms. They store nodes by
uuid and as a binary bundle in the database, so accessing the database
(for doing JCR workarounds for migration, large-scale copying or
whatever) is not anything that really works because you cannot browse
it without additional programming help. But every now and then people
on the Jackrabbit list who are new to JR ask for that: how can I modify
the nodes in the db, etc. That's because they want to reuse their
experience with databases and all the admin tools available.

Now with JCR, if you have a JCR-level browser and admin tool, you
don't need it. And the PM is just an implementation detail. So IMO
this is a good thing - one that gives you the unstructuredness. But
above that level you don't want to introduce such a complex mapping so
that people have no way to use the repository as a fundamental
infrastructure.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 7 Oct 2009, at 16:38, Alexander Klimetschek wrote:

> On Wed, Oct 7, 2009 at 16:48, Ian Boston <ie...@tfd.co.uk> wrote:
>> On 7 Oct 2009, at 14:50, Alexander Klimetschek (JIRA) wrote:
>>> I would refrain from building in an automatic mechanism that creates
>>> hash-based paths because they are bad ;-)
>>
>> Could you elaborate on why they are bad in the JCR ?
>> (not talking about URLs exposed to users here, only the JCR)
>
> I think the answer already lies in a) having a URL structure for the
> public + hashes as jcr paths and b) mapping between them. Why
> shouldn't the nice/intuitive/readable structure that the web users see
> (and expect) be used in the JCR?

agreed

>
> A custom or special mapping should only be the exception (eg. the
> ~user directory you mentioned).

agreed

>
> A common and understandable content structure makes everything simpler
> - any kind of mapping unnecessarily complicates the life for
> developers, administrators and other technical users: they have to
> handle two different models, and there's often application logic involved
> to find the items in question, which is hard to achieve manually. (The
> same goes for looking up JCR nodes in a typical db bundle persistence
> manager - it's not intended for the user, but here it's an independent
> implementation detail).

agreed,

so if the abstraction and isolation is perfect and the hashed and ugly
jcr path is never exposed to a developer or user above the layer of
service or api, then using them in the JCR itself is unfortunate but
acceptable ?

It would obviously be desirable not to have to resort to hashing, but
there are cases where we need it as soon as a system has more than a
few 10K users.
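
For comparison, a hash-based layout of the kind being debated could look like the Java sketch below. It assumes SHA-1 over the id and two levels of 256 folders each, built from the first four hex digits of the digest; the class and method names are hypothetical, and nothing here is taken from Sling.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
final class HashPaths {
    // Maps an id to "<h0h1>/<h2h3>/<id>", where h0..h3 are the first four
    // hex digits of SHA-1(id). Two levels of 256 folders give 65536 leaves.
    static String hashedPath(String id) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            String level1 = String.format("%02x", digest[0] & 0xff);
            String level2 = String.format("%02x", digest[1] & 0xff);
            return level1 + "/" + level2 + "/" + id;
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashedPath("ieb"));
    }
}
```

The folder names carry no meaning for a human browsing the tree, which is the trade-off under discussion: an even distribution in exchange for readability.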

Longer term, looking at the storage of child nodes relative to parents
in Jackrabbit itself *might* address this. You mention the Persistence
Manager. Are there PMs that don't have the problem or is it above the
PM layer ?

Ian

>
> Regards,
> Alex
>
> -- 
> Alexander Klimetschek
> alexander.klimetschek@day.com


Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Alexander Klimetschek <ak...@day.com>.
On Wed, Oct 7, 2009 at 16:48, Ian Boston <ie...@tfd.co.uk> wrote:
> On 7 Oct 2009, at 14:50, Alexander Klimetschek (JIRA) wrote:
>> I would refrain from building in an automatic mechanism that creates
>> hash-based paths because they are bad ;-)
>
> Could you elaborate on why they are bad in the JCR ?
> (not talking about URLs exposed to users here, only the JCR)

I think the answer already lies in a) having a URL structure for the
public + hashes as jcr paths and b) mapping between them. Why
shouldn't the nice/intuitive/readable structure that the web users see
(and expect) be used in the JCR?

A custom or special mapping should only be the exception (eg. the
~user directory you mentioned).

A common and understandable content structure makes everything simpler
- any kind of mapping unnecessarily complicates the life for
developers, administrators and other technical users: they have to
handle two different models, and there's often application logic involved
to find the items in question, which is hard to achieve manually. (The
same goes for looking up JCR nodes in a typical db bundle persistence
manager - it's not intended for the user, but here it's an independent
implementation detail).

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by Ian Boston <ie...@tfd.co.uk>.
On 7 Oct 2009, at 14:50, Alexander Klimetschek (JIRA) wrote:

> I would refrain from building in an automatic mechanism that creates  
> hash-based paths because they are bad ;-)


Could you elaborate on why they are bad in the JCR ?
(not talking about URLs exposed to users here, only the JCR)

Thanks
Ian



[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Alexander Klimetschek (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763058#action_12763058 ] 

Alexander Klimetschek commented on SLING-1137:
----------------------------------------------

I would refrain from building in an automatic mechanism that creates hash-based paths because they are bad ;-) It's better to find a proper semantic structure, and the most general such structure is date-based (e.g. 2009/10/07/*), as most content has a date.

I wonder if we could make the NodeNameGenerator in the SlingPostServlet a service and thus extensible. For example, a new node name generator for this would listen to a new parameter, e.g. :hierarchicalNaming = date, and create (or reuse) the appropriate date structure for any new node that has to be generated. (The date may be extracted from the content or given in another parameter.)

Thinking further: as you'd probably want multiple node name generators active at the same time, there should be a proper selection mechanism. Maybe through a param (:nameGenerator)!?
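
A minimal Java sketch of that selection idea follows. To be clear, the interface, the registry and the ":date" parameter are assumptions made for illustration and are not the actual Sling API; only the ":nameGenerator" parameter name is taken from the suggestion above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// Hypothetical types: NOT the Sling API, just a sketch of picking one of
// several registered generators by request parameter.
interface NameGenerator {
    // Returns the relative path (possibly with intermediate levels) for a new node.
    String generate(Map<String, String> params, String defaultName);
}

final class NameGeneratorRegistry {
    private final Map<String, NameGenerator> generators = new HashMap<>();

    void register(String id, NameGenerator generator) {
        generators.put(id, generator);
    }

    // Uses the generator named by ":nameGenerator"; if the parameter is absent
    // or unknown, falls back to the flat default name.
    String nameFor(Map<String, String> params, String defaultName) {
        NameGenerator generator = generators.get(params.get(":nameGenerator"));
        return generator == null ? defaultName : generator.generate(params, defaultName);
    }
}

final class DateNameGenerator implements NameGenerator {
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    @Override
    public String generate(Map<String, String> params, String defaultName) {
        // ":date" is an assumed parameter carrying an ISO date such as 2009-10-07;
        // without it, today's date is used.
        String raw = params.get(":date");
        LocalDate date = raw == null ? LocalDate.now() : LocalDate.parse(raw);
        return date.format(FORMAT) + "/" + defaultName;
    }
}
```

With a generator registered under "date", a request carrying :nameGenerator=date and :date=2009-10-07 would yield 2009/10/07/<name>, while requests without the parameter keep the current flat behaviour.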




[jira] Commented: (SLING-1137) Support hierarchical child node creation from SlingPostServlet

Posted by "Justin Edelson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SLING-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763069#action_12763069 ] 

Justin Edelson commented on SLING-1137:
---------------------------------------

Well, it'd be best if there was a resolution to JCR-642 so as to avoid needing to work around the flat hierarchy problem. Although I firmly believe that a good structure can be found in most cases, most != all.

+1 to the idea of making NodeNameGenerator pluggable.

