Posted to dev@sling.apache.org by Jos Snellings <Jo...@pandora.be> on 2010/11/09 14:11:57 UTC

ACL evaluation with scattered permissions

You are right, Ian,

This question deserves a new thread.
Currently I am drawing up an architecture for a file-handling system for
e-government. Permissions are scattered across several levels:
- the citizen: one active file per citizen (a folder, an info holder in
XML, attachments)
- the community: visibility and handling for the citizens of one community
- the regional authority: regional indicators

This worries me, because it is a typical case where you would run into
scalability problems.
Think of 50,000 open applications in that system. With 10 documents per
application you would have 500,000 documents.

Is that a no-go for Sling? That would be a pity; I wanted to come up with an
elegant solution :-)

Jos
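
To make the scattered-permissions idea concrete, here is a minimal sketch of granting one citizen read access to their own file folder, using the standard JCR 2.0 access-control API together with Jackrabbit's PrincipalManager. The folder path, principal name and class/method names are illustrative assumptions, not part of the proposal above.

    import java.security.Principal;
    import javax.jcr.Session;
    import javax.jcr.security.AccessControlList;
    import javax.jcr.security.AccessControlManager;
    import javax.jcr.security.AccessControlPolicy;
    import javax.jcr.security.AccessControlPolicyIterator;
    import javax.jcr.security.Privilege;
    import org.apache.jackrabbit.api.JackrabbitSession;

    public class CitizenAclSketch {

        // Grants jcr:read on the citizen's file folder to the citizen's own principal.
        // folderPath and principalName are illustrative, e.g. "/files/stockholm/234987488".
        static void grantCitizenRead(Session session, String folderPath, String principalName)
                throws Exception {
            AccessControlManager acm = session.getAccessControlManager();
            Principal principal = ((JackrabbitSession) session)
                    .getPrincipalManager().getPrincipal(principalName);
            Privilege[] read = { acm.privilegeFromName(Privilege.JCR_READ) };

            // Reuse an ACL already bound to the folder, or take the first applicable one.
            AccessControlList acl = null;
            for (AccessControlPolicy policy : acm.getPolicies(folderPath)) {
                if (policy instanceof AccessControlList) {
                    acl = (AccessControlList) policy;
                    break;
                }
            }
            if (acl == null) {
                AccessControlPolicyIterator it = acm.getApplicablePolicies(folderPath);
                while (acl == null && it.hasNext()) {
                    AccessControlPolicy policy = it.nextAccessControlPolicy();
                    if (policy instanceof AccessControlList) {
                        acl = (AccessControlList) policy;
                    }
                }
            }
            if (acl == null) {
                throw new IllegalStateException("No modifiable ACL available at " + folderPath);
            }
            acl.addAccessControlEntry(principal, read);
            acm.setPolicy(folderPath, acl);
            session.save();
        }
    }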




On 11/09/2010 09:22 AM, Ian Boston wrote:
> Jos,
> If by result you mean a search result, then that's a separate issue from the dynamic ACL itself, and not the direct subject of this thread. When I said performance I was referring to the atomic act of determining whether the ACE is active for any attempt to access an item, not just to search results.
>
>
> However,
> that's the way Jackrabbit works.
> JCR searches are "compiled" into Lucene queries that generate Lucene hits, where each Lucene document contains a node ID; the corresponding item is then loaded in the normal manner from JCR (IIRC). If the current user can't read the item, it is discarded.
>
> This is fine for dense searches, where most items can be read by the user, but problematic for sparse searches.
> It's also problematic for sorts that can't be performed inside Lucene, as this results in all the items being loaded into memory before they are sorted.
> One way to avoid sorts of this form is to ban "order by" clauses that reference anything other than properties of the nodes found.
>
>
> BTW, problematic == not scalable, either vertically or horizontally.
> Ian
>    
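
As an illustration of the post-filtering described above, a hedged sketch of a JCR-SQL2 query whose "order by" touches only a property of the selected node, run through the standard javax.jcr query API. The node type and property names (egov:application, egov:submitted) are invented for the example.

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class SparseSearchSketch {

        // Runs a query whose ORDER BY uses only a property of the selected node,
        // so the sort can stay inside Lucene instead of pulling all hits into memory.
        static void listApplications(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery(
                    "SELECT * FROM [egov:application] AS a "
                    + "WHERE ISDESCENDANTNODE(a, '/applications') "
                    + "ORDER BY a.[egov:submitted]",
                    Query.JCR_SQL2);

            // The iterator is already access-filtered: hits the current session cannot
            // read are discarded, which is cheap for dense result sets and expensive
            // for sparse ones.
            NodeIterator hits = query.execute().getNodes();
            while (hits.hasNext()) {
                Node application = hits.nextNode();
                System.out.println(application.getPath());
            }
        }
    }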


Re: ACL evaluation with scattered permissions

Posted by Jos Snellings <Jo...@pandora.be>.
Thank you, Ian!
I am writing the proposal with these warnings in mind.

Jos

On 11/10/2010 10:05 AM, Ian Boston wrote:
> On 10 Nov 2010, at 00:09, Jos Snellings wrote:
>
>
>> Thank you for your prompt answer, Ian.
>> You mean "the natural way".
>> That would be true for a citizen.
>> That would be true for a community, so a path could be Stockholm/234987488.
>> But consider extracting a regional indicator, like 'how many applications were handled on time during the first half of 2014'. This is not requested at the moment,
>> but I know it *will* come up.  ==> the user performing this query would then need read access to all files. Would the query scale better?
>>
>
>> 'how many applications were handled on time during the first half of 2014'
>>
> implies a date range.
> IIRC, date ranges are problematic in Lucene, and although the query might be OK from a sparse-search point of view, the date range might cause a problem. Again, experimentation before committing to an implementation is going to remove more of the risk.
> Ian


Re: ACL evaluation with scattered permissions

Posted by Ian Boston <ie...@tfd.co.uk>.
On 10 Nov 2010, at 00:09, Jos Snellings wrote:

> Thank you for your prompt answer, Ian.
> You mean "the natural way".
> That would be true for a citizen.
> That would be true for a community, so a path could be Stockholm/234987488.
> But consider extracting a regional indicator, like 'how many applications were handled on time during the first half of 2014'. This is not requested at the moment,
> but I know it *will* come up.  ==> the user performing this query would then need read access to all files. Would the query scale better?

> 'how many applications were handled on time during the first half of 2014' 
implies a date range.
IIRC, date ranges are problematic in Lucene, and although the query might be OK from a sparse-search point of view, the date range might cause a problem. Again, experimentation before committing to an implementation is going to remove more of the risk.
Ian
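
For reference, a sketch of the date-range constraint such a regional query implies, expressed in JCR-SQL2. The node type and property name (egov:application, egov:handledOn) are assumptions for illustration; whether the range is answered efficiently by the index is exactly what the suggested experiment would check.

    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class RegionalIndicatorSketch {

        // Counts applications handled during the first half of 2014; the date-range
        // predicate is the part that may stress the Lucene index.
        static long countHandledInFirstHalf2014(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery(
                    "SELECT * FROM [egov:application] AS a "
                    + "WHERE a.[egov:handledOn] >= CAST('2014-01-01T00:00:00.000Z' AS DATE) "
                    + "AND a.[egov:handledOn] < CAST('2014-07-01T00:00:00.000Z' AS DATE)",
                    Query.JCR_SQL2);
            // getSize() may return -1 if the repository cannot determine it up front.
            return query.execute().getNodes().getSize();
        }
    }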




Re: ACL evaluation with scattered permissions

Posted by Jos Snellings <Jo...@pandora.be>.
Thank you for your prompt answer, Ian.
You mean "the natural way".
That would be true for a citizen.
That would be true for a community, so a path could be Stockholm/234987488.
But consider extracting a regional indicator, like 'how many applications
were handled on time during the first half of 2014'. This is not requested
at the moment,
but I know it *will* come up.  ==> the user performing this query would
then need read access to all files. Would the query scale better?

Thanks,
Jos


Re: ACL evaluation with scattered permissions

Posted by Ian Boston <ie...@tfd.co.uk>.

On 9 Nov 2010, at 13:11, Jos Snellings wrote:

> You are right, Ian,
> 
> This question deserves a new thread.
> Currently I am drawing up an architecture for a file-handling system for e-government.
> Permissions are scattered across several levels:
> - the citizen: one active file per citizen (a folder, an info holder in XML, attachments)
> - the community: visibility and handling for the citizens of one community
> - the regional authority: regional indicators
> 
> This worries me, because it is a typical case where you would run into scalability problems.
> Think of 50,000 open applications in that system. With 10 documents per application
> you would have 500,000 documents.

If a user only has access to 10 applications, then doing a search that finds 500,000 applications only to return 10 readable ones would not scale, just as a table scan on an RDBMS table containing 0.5M rows with no index would not scale.



> 
> Is that a no-go for Sling? That would be a pity; I wanted to come up with an elegant solution :-)


Sling is not the issue here, it's Jackrabbit, and knowing that the above situation does not scale you would do two things.
Never use that type of search.

Access all data via pointers and paths into the data, based on something that is not a search. E.g. if the application ID was 2919100291,
you might find the application and all the information in 
/applications/29/19/10/2919100291

and if the user had an ID of e31231231432
they might have a folder 
/users/e3/12/31/23/1432
     with a sub folder 
         2919100291 

containing a property
              egov:application-path : /applications/29/19/10/2919100291



I.e. you have to model your data to avoid searches and non-direct access paths.
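
A hedged sketch of what that direct, search-free access could look like against the JCR API: the application ID is turned into the sharded path from the example above and resolved with a single getNode() call. Class and method names are illustrative.

    import javax.jcr.Node;
    import javax.jcr.Session;

    public class DirectPathAccessSketch {

        // Builds /applications/29/19/10/2919100291 from the application ID and
        // resolves it directly: no query involved, only the ACL check on read.
        static Node getApplication(Session session, String applicationId) throws Exception {
            String path = "/applications/"
                    + applicationId.substring(0, 2) + "/"
                    + applicationId.substring(2, 4) + "/"
                    + applicationId.substring(4, 6) + "/"
                    + applicationId;
            return session.getNode(path);
        }

        // Follows the pointer property from the user's folder, as in the example above:
        // /users/e3/12/31/23/1432/2919100291 carries egov:application-path.
        static Node getApplicationForUser(Session session, String userFolderPath,
                String applicationId) throws Exception {
            Node link = session.getNode(userFolderPath + "/" + applicationId);
            return session.getNode(link.getProperty("egov:application-path").getString());
        }
    }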

but......

Please
ask on users@jackrabbit.a.o, as the committers there will be able to give you a complete and honest answer on whether Jackrabbit is a no-go,
and
do some tests to prove to yourself that it will work at the scale that you want.

(bash + curl + Sling is a good way of doing these sorts of tests.)
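
In the same spirit, a rough sketch of driving such a test from Java instead of bash: it POSTs a node with an egov:application-path property to a local Sling instance via the default Sling POST servlet. The URL, admin credentials and content paths are assumptions about a local test setup, not part of the suggestion above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class SlingPostSketch {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String auth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

            // Create /users/e3/12/31/23/1432/2919100291 with a pointer property,
            // URL-encoded as a form body for the Sling POST servlet.
            String body = "egov%3Aapplication-path="
                    + "%2Fapplications%2F29%2F19%2F10%2F2919100291";
            HttpRequest request = HttpRequest.newBuilder(
                            URI.create("http://localhost:8080/users/e3/12/31/23/1432/2919100291"))
                    .header("Authorization", "Basic " + auth)
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }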



> 
> Jos