Posted to users@jackrabbit.apache.org by lujie <lj...@126.com> on 2008/11/02 12:15:47 UTC

Some questions when large data sets exists

Hi,
   We are currently using Jackrabbit 1.4.1.
   The content repository looks roughly like this:
   /root
      /product
          /component
              /unit
                  /type
   There are millions of nodes in this structure, with fine-grained ACL
control by user, group and role.
   As we can see, browsing or searching the structure will eventually
cause performance problems, because every node has to be checked against the
AccessManager.
   We noticed that some experienced Jackrabbit users suggest loading the ACL
information at startup, but given the number of nodes that is
not possible for us.
   So we have to put the ACLs into a separate database. When a user browses
the product directory, we can query the ACL database for the IDs or paths of
the actual nodes.
   Is this approach reasonable?
   --lujie
   
-- 
View this message in context: http://www.nabble.com/Some-questions-when-large-data-sets-exists-tp20288831p20288831.html
Sent from the Jackrabbit - Users mailing list archive at Nabble.com.

Re: Some questions when large data sets exists

Posted by lujie <lj...@126.com>.

Hi,
   Thanks for your reply.
   So in my opinion, the access manager problem ultimately has to be solved
at the application level. You can implement your own access policy, such as
loading ACLs at startup, caching ACL results, or loading them from an external
database. I think Ivan is right: XPath queries or node filtering can be
optimized.
   As Stefan mentioned, Jackrabbit resolves a node path by traversing
the path and accessing every intermediate node along it. I also
observed that when accessing a node or property, even via XPath, the node
must be loaded. So you have to use JR in a traversal-based way.
   If I want to access the n-th descendant of a given node without
accessing the n-1 intermediate nodes, then I must implement that myself. Maybe
this is application-specific. An external database would help here. That is,
when a node with n levels of descendants is saved, some properties of the
last child node are also saved in the external database. Then you can look up
the ancestor node together with its descendant's properties in a relational
database.
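
   The idea above could be sketched roughly as follows. This is only an
illustration: the class DescendantPropertyIndex and its methods are
hypothetical, and an in-memory map stands in for the external relational
table. Nothing here is Jackrabbit API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the external relational table. */
final class DescendantPropertyIndex {
    // ancestor path -> (descendant path -> copy of selected properties)
    private final Map<String, Map<String, Map<String, String>>> rows = new HashMap<>();

    /** Called when the subtree is saved: stores a copy of the descendant's properties. */
    void onSave(String ancestorPath, String descendantPath, Map<String, String> props) {
        rows.computeIfAbsent(ancestorPath, k -> new HashMap<>())
            .put(descendantPath, new HashMap<>(props));
    }

    /** Returns the copied descendant properties without touching the repository. */
    Map<String, Map<String, String>> descendantsOf(String ancestorPath) {
        return rows.getOrDefault(ancestorPath, Collections.emptyMap());
    }
}
```

   The cost of this design is that the copy must be kept in sync whenever the
descendant node is changed, moved or removed.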
   --lujie
   
  

Re: Some questions when large data sets exists

Posted by Ivan Latysh <iv...@gmail.com>.

lujie wrote:

>     1. The AccessManager's performance. Say a query matches 10000 nodes and I
> know there are only fifty nodes that I have read access to. But in the worst
> case, those fifty nodes sit at positions 8000 to 8050. So I have to
> iterate past the previous 7999 nodes to get the actual result, and skipping
> is not simply passing over nodes without loading them, because the
> AccessManager must check the access rights.
>         What I'm doing is using XPath such as @jcr:contains("reader
> access","admin") to avoid loading so many nodes. But as you can see, this is
> not a usual way.
   I have implemented an access manager for our application that reads
permissions from the repository, a similar set-up, and experienced very
similar problems.
   First, I want to make sure you are aware that an access manager's lifespan
is the same as its session's, so each session will have its own manager.
   Second, be prepared to receive 1-2 million requests to fetch a few hundred nodes.
   So the best approach is to cache access rights. We do it by XPath, which
allows for some optimization: for instance, if `/jcr:root/node1` is read-only,
then any path that starts with `/jcr:root/node1` is read-only, and the decision
can be made without fetching the node again.
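
   A minimal sketch of such a prefix-based cache, assuming decisions are keyed
by absolute path. The class PrefixAccessCache and the Access enum are
hypothetical names for illustration, not our actual code and not part of
Jackrabbit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-session cache of access decisions keyed by subtree root. */
final class PrefixAccessCache {
    enum Access { READ_ONLY, READ_WRITE, DENIED }

    // Decisions recorded for subtree roots, e.g. "/jcr:root/node1".
    private final Map<String, Access> decisions = new HashMap<>();

    /** Records that everything at or below subtreeRoot has the given access. */
    void put(String subtreeRoot, Access access) {
        decisions.put(subtreeRoot, access);
    }

    /**
     * Walks from the full path up towards the root and returns the decision of
     * the nearest cached ancestor, or null if nothing is cached (the caller
     * would then have to perform the real permission check).
     */
    Access lookup(String path) {
        String current = path;
        while (!current.isEmpty()) {
            Access hit = decisions.get(current);
            if (hit != null) {
                return hit;
            }
            int slash = current.lastIndexOf('/');
            current = slash <= 0 ? "" : current.substring(0, slash);
        }
        return null;
    }
}
```

   Walking up one path segment at a time, rather than using a plain string
prefix test, keeps `/jcr:root/node10` from matching a decision cached for
`/jcr:root/node1`.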

-- 
Ivan Latysh
IvanLatysh@gmail.com

Re: Some questions when large data sets exists

Posted by Stefan Guggisberg <st...@day.com>.

On Wed, Nov 5, 2008 at 1:47 PM, lujie <lj...@126.com> wrote:
>
> Hi Stefan,
>    Thanks for your kind reply.
>    Actually I have two main questions about JR.
>    1. The AccessManager's performance. Say a query matches 10000 nodes and I
> know there are only fifty nodes that I have read access to. But in the worst
> case, those fifty nodes sit at positions 8000 to 8050. So I have to
> iterate past the previous 7999 nodes to get the actual result, and skipping
> is not simply passing over nodes without loading them, because the
> AccessManager must check the access rights.
>        What I'm doing is using XPath such as @jcr:contains("reader
> access","admin") to avoid loading so many nodes. But as you can see, this is
> not a usual way.

as i already mentioned, i don't feel competent in access control questions.
maybe somebody else can answer your question...

>
>    2. The property loading performance. Even once I have these 50 nodes, when
> I want to load their grandchild nodes to get some properties, the
> PersistenceManager has to load another 50*2 nodes to reach those properties.
> So I say Jackrabbit lacks the ability to list child nodes and properties for
> a given node in an efficient way.

sorry, i am not sure i can follow you here. are you saying that you want to
be able to access the n-th descendant of a given node without accessing
the n-1 intermediary nodes?  jackrabbit does resolve a node path by traversing
the path and accessing every intermediary node along the path. however,
jackrabbit provides a very effective and efficient internal cache in order to
minimize access to the persistence layer.

if you want to directly access a node you can use its uuid (assuming it is
a referenceable node).
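
to illustrate the difference, here is a toy in-memory model (hypothetical
classes, not jackrabbit code and not how jackrabbit is implemented) that
counts how many nodes are touched by a path lookup versus an id lookup. in
the JCR 1.0 API the id-based call is Session.getNodeByUUID(String), which
requires the node to be mix:referenceable.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy repository for illustration only. */
final class ToyRepository {
    static final class ToyNode {
        final String id;
        final Map<String, ToyNode> children = new LinkedHashMap<>();
        ToyNode(String id) { this.id = id; }
    }

    final ToyNode root = new ToyNode("root-id");
    private final Map<String, ToyNode> byId = new HashMap<>();
    int nodesTouched = 0; // counts nodes visited by the lookups below

    ToyRepository() { byId.put(root.id, root); }

    ToyNode addChild(ToyNode parent, String name, String id) {
        ToyNode child = new ToyNode(id);
        parent.children.put(name, child);
        byId.put(id, child);
        return child;
    }

    /** Resolves an absolute path by visiting every intermediate node. */
    ToyNode getByPath(String path) {
        ToyNode current = root;
        nodesTouched++;
        for (String name : path.split("/")) {
            if (name.isEmpty()) continue; // skip the empty segment before the leading '/'
            current = current.children.get(name);
            if (current == null) return null;
            nodesTouched++;
        }
        return current;
    }

    /** Looks a node up by id in a single step. */
    ToyNode getById(String id) {
        ToyNode node = byId.get(id);
        if (node != null) nodesTouched++;
        return node;
    }
}
```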

cheers
stefan
>
>    JR is wonderful work: data is stored in tree nodes, with full-text search
> using Lucene. But if, from another angle, we could load some nodes and their
> children's properties more directly, that would be exciting.
>
>    any suggestions?
>
>    --lujie

Re: Some questions when large data sets exists

Posted by lujie <lj...@126.com>.

Hi Stefan,
    Thanks for your kind reply.
    Actually I have two main questions about JR.
    1. The AccessManager's performance. Say a query matches 10000 nodes and I
know there are only fifty nodes that I have read access to. But in the worst
case, those fifty nodes sit at positions 8000 to 8050. So I have to
iterate past the previous 7999 nodes to get the actual result, and skipping
is not simply passing over nodes without loading them, because the
AccessManager must check the access rights.
        What I'm doing is using XPath such as @jcr:contains("reader
access","admin") to avoid loading so many nodes. But as you can see, this is
not a usual way.
 
    2. The property loading performance. Even once I have these 50 nodes, when
I want to load their grandchild nodes to get some properties, the
PersistenceManager has to load another 50*2 nodes to reach those properties.
So I say Jackrabbit lacks the ability to list child nodes and properties for
a given node in an efficient way.

    JR is wonderful work: data is stored in tree nodes, with full-text search
using Lucene. But if, from another angle, we could load some nodes and their
children's properties more directly, that would be exciting.

    any suggestions?

    --lujie


Re: Some questions when large data sets exists

Posted by Angela Schreiber <an...@day.com>.

> from the top of my head:
> every node stores a list of child node entries (i.e. name/id pairs). the
> child node entries are filtered at runtime while iterating over child nodes,
> i.e. the id's are passed to the AccessManager to check the permissions
> *before* the child node is loaded. at least that's how it used to be.

currently that is removed, for i wanted to avoid the duplicate
permission check as it used to happen (first upon building the 
id-iterator within ItemManager#getChild... and later on again
upon accessing the Item).

however, if this causes problems we may add the permission check again
within ItemManager#getChildNodes and ItemManager#getChildProperties.

regards
angela

Re: Some questions when large data sets exists

Posted by Stefan Guggisberg <st...@day.com>.

hi lujie

On Tue, Nov 4, 2008 at 1:36 AM, lujie <lj...@126.com> wrote:
>
> Hi,
>   Maybe this question is somewhat stupid.
>   But in business use this situation is very common. In JSR 283, ACL
> control is optional, but in Jackrabbit ACL control is ultimately
> enforced by the AccessManager. So is it mandatory to use Jackrabbit's
> AccessManager?
>   Also, Jackrabbit lacks a "view" that lists the child nodes and some
> properties of a given node in an efficient way. We must iterate over the
> nodes, and with access checks, if we load a node and then find that it is
> denied by the AccessManager, that work is simply wasted.
>   So my opinion is that Jackrabbit lacks the ability to list child nodes and
> properties for a given node in an efficient way.

not sure i correctly understand your problem.

from the top of my head:
every node stores a list of child node entries (i.e. name/id pairs). the
child node entries are filtered at runtime while iterating over child nodes,
i.e. the id's are passed to the AccessManager to check the permissions
*before* the child node is loaded. at least that's how it used to be.
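
a rough sketch of that check-before-load filtering, using hypothetical
names (ChildEntryFilter, ChildEntry, canRead, loader). it is not the actual
ItemManager code; canRead merely stands in for the AccessManager and loader
for the persistence-layer load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class ChildEntryFilter {
    /** A child node entry: just a name/id pair, no node content. */
    static final class ChildEntry {
        final String name;
        final String id;
        ChildEntry(String name, String id) { this.name = name; this.id = id; }
    }

    /**
     * Loads only those children whose id passes the permission check, so that
     * denied children never reach the (expensive) loader.
     */
    static <N> List<N> loadReadable(List<ChildEntry> entries,
                                    Predicate<String> canRead,
                                    Function<String, N> loader) {
        List<N> result = new ArrayList<>();
        for (ChildEntry entry : entries) {
            if (canRead.test(entry.id)) {           // permission check on the id first
                result.add(loader.apply(entry.id)); // load only if access is granted
            }
        }
        return result;
    }
}
```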

maybe angela can provide more information on the current implementation.

cheers
stefan


>   any suggestions?
>   --lujie

Re: Some questions when large data sets exists

Posted by lujie <lj...@126.com>.

Hi,
   Maybe this question is somewhat stupid.
   But in business use this situation is very common. In JSR 283, ACL
control is optional, but in Jackrabbit ACL control is ultimately
enforced by the AccessManager. So is it mandatory to use Jackrabbit's
AccessManager?
   Also, Jackrabbit lacks a "view" that lists the child nodes and some
properties of a given node in an efficient way. We must iterate over the
nodes, and with access checks, if we load a node and then find that it is
denied by the AccessManager, that work is simply wasted.
   So my opinion is that Jackrabbit lacks the ability to list child nodes and
properties for a given node in an efficient way.
   any suggestions?
   --lujie