Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@gmail.com> on 2008/12/04 00:25:21 UTC

Implementing the PagedSearchControl

Hi guys,

I'm trying to implement this control for 1.5.5, and I'm a bit stuck on how 
we should handle it in the core server. Right now, the control is decoded 
(the ASN.1 decoder has been written and tested, and it works fine). The 
SearchHandler has been modified so that we can implement the control in a 
single place: the doSimpleSearch() method.

The problem I have is the following: we have to remember a pointer to the 
last entry we sent back to the client.

How should we do this? My first approach was pretty naive: since we are 
using a cursor, it's easy. We simply store the cursor in the session, and 
the next request just has to fetch this cursor back from the session and 
read the next N elements from it.
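
In code, that first approach would be something like this minimal sketch 
(the EntryCursor interface and the session map are illustrative stand-ins, 
not the real ApacheDS API; lookup failure handling is omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative stand-in for the server's entry cursor.
    interface EntryCursor {
        boolean next() throws Exception;   // advance; false when exhausted
        Object get() throws Exception;     // the current entry
        void close() throws Exception;
    }

    class PagedSearchState {
        // cookie -> open cursor, kept alive across the paged requests
        private final Map<String, EntryCursor> cursors = new ConcurrentHashMap<>();

        // first request: remember the freshly created cursor under its cookie
        void register(String cookie, EntryCursor cursor) {
            cursors.put(cookie, cursor);
        }

        // every following request: read the next page from the stored cursor
        List<Object> nextPage(String cookie, int pageSize) throws Exception {
            EntryCursor cursor = cursors.get(cookie);
            List<Object> page = new ArrayList<>(pageSize);
            while (page.size() < pageSize && cursor.next()) {
                page.add(cursor.get());
            }
            if (page.size() < pageSize) {  // exhausted: close and evict eagerly
                cursor.close();
                cursors.remove(cookie);
            }
            return page;
        }
    }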

This has the advantage of being simple, but there are some very 
important cons:
- it's memory consuming, as we may keep those cursors in the session for 
a very long time
- we will have to close all the cursors when the session is closed (for 
whatever reason)
- if some data has been modified since the cursor was created, it may 
contain invalid data
- if the user doesn't send an abandon request, those cursors will remain 
in the session until it's closed (and this is very likely to happen)

So I'm considering an alternative, though more expensive and less 
performant, approach (sketched below):
 - we build a new cursor for each request,
 - we move forward to the Nth entry in the newly created cursor, and 
return the M requested elements,
 - and when done, we discard the cursor.
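
Sketched in the same illustrative style as above (createCursor() stands in 
for re-running the search; again, not the real API):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    abstract class RebuildPerRequest {
        // stands in for re-running the search from scratch
        abstract EntryCursor createCursor() throws Exception;

        List<Object> fetchPage(int alreadySent, int pageSize) throws Exception {
            EntryCursor cursor = createCursor();
            try {
                // skip everything we already sent in previous pages
                for (int i = 0; i < alreadySent; i++) {
                    if (!cursor.next()) {
                        return Collections.emptyList();  // result set shrank
                    }
                }
                List<Object> page = new ArrayList<>(pageSize);
                while (page.size() < pageSize && cursor.next()) {
                    page.add(cursor.get());
                }
                // the session only has to remember alreadySent + pageSize
                return page;
            } finally {
                cursor.close();   // the cursor is always discarded
            }
        }
    }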

The pros are:
- we don't have to keep N cursors in memory forever,
- from the client's POV, it respects the PagedSearch contract,
- it's easier to implement, as we have less information to keep in the 
session and to restore.

The cons are:
- it's time consuming: if we have N entries to return with a page size 
of P, we will construct N/P cursors.

Do you see a better way to implement this control, and if not, which one 
do you think is the best?

Thanks!

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: Implementing the PagedSearchControl

Posted by Emmanuel Lecharny <el...@gmail.com>.
Norval Hope wrote:
> Hi,
>
> I don't know the core code like you, but given my current
> understanding (pardon me if my understanding is woefully inadequate
> :-) I'm not sure your option 2 is really viable. After the first page
> is returned, how are you going to initialize the cursor for the next
> page?
You have to store some session information in any case. So, assuming we 
implement the second option, we just store the number of entries we have 
already transmitted.
>  Wouldn't you have to remember where you were up to (using some
> sort of unique per-result key) and rerun the search, skipping results
> up to this key value (if so, then it's unworkable, no?). 
Yes, this is exactly what I have in mind :)
> BTW, is there
> guaranteed to be a stable sorting order for search results internally,
> and is the sort control supported by AD so that it may be posed
> externally, etc.?
>   
There is no such guarantee, with either option 1 or option 2, as stated 
by RFC 2696:
"A client may safely assume that all entries that satisfy a given search 
query are returned once and only once during the set of paged Search 
requests/responses necessary to enumerate the entire result set, unless 
the result set for that query has changed since the searchRequest 
starting the request/response sequence was processed."
> At any rate, I don't think keeping a cursor open is that big a deal,
> although there probably needs to be a server-side timeout so that if
> the cursor is not accessed for a long time (say an hour, for argument's
> sake) it is automatically closed. Well-behaved clients should explicitly
> close their NamingEnumerations, which I would hope would allow the
> cursor to be discarded immediately.
>   
The problem is certainly not well-behaved clients, but there are many 
cases where even a well-behaved client might be troublesome. Typically, 
Apache Studio (a well-behaved client, that's for sure :), when it comes 
to loading thousands of entries, uses this control to get the entries 
little by little. It presents the entries in a list of N sub-entries, and 
when you open a block, a new search is sent to the server. Another 
possibility is to design a cursored presentation of data with, say, 20 
entries per page. The client clicks on 'next' or 'prev' to get the next 
20 entries. As it's interactive, we may have a long period of time 
between two interactions.

So, sure, we will have to implement a timeout (we already have a session 
timeout), which is a bit painful, as you need a thread to manage it.
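
That said, a single shared scheduler could probably do the sweeping for 
every session, so we would not need one thread per paged search. A rough 
sketch (all names illustrative, reusing the EntryCursor stand-in from the 
first sketch in this thread):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class CursorReaper {
        static final long TIMEOUT_MS = 60 * 60 * 1000L;  // an hour, for argument

        static class TimedCursor {
            final EntryCursor cursor;
            volatile long lastAccess = System.currentTimeMillis();
            TimedCursor(EntryCursor cursor) { this.cursor = cursor; }
        }

        // cookie -> cursor with its last-access timestamp, for all sessions
        final Map<String, TimedCursor> open = new ConcurrentHashMap<>();

        final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        CursorReaper() {
            // one shared sweep for every session, once a minute
            scheduler.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.MINUTES);
        }

        void sweep() {
            long now = System.currentTimeMillis();
            open.entrySet().removeIf(entry -> {
                if (now - entry.getValue().lastAccess < TIMEOUT_MS) {
                    return false;                 // still fresh, keep it
                }
                try {
                    entry.getValue().cursor.close();
                } catch (Exception ignored) {
                    // best effort: the cursor is being abandoned anyway
                }
                return true;                      // evict the timed-out cursor
            });
        }
    }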

But anyway, I think this is the way to go too... It's just that it's 
painful! :-]

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: Implementing the PagedSearchControl

Posted by Norval Hope <nr...@gmail.com>.
Hi,

I don't know the core code like you, but given my current
understanding (pardon me if my understanding is woefully inadequate
:-) I'm not sure your option 2 is really viable. After the first page
is returned, how are you going to initialize the cursor for the next
page? Wouldn't you have to remember where you were up to (using some
sort of unique per-result key) and rerun the search, skipping results
up to this key value (if so, then it's unworkable, no?). BTW, is there
guaranteed to be a stable sorting order for search results internally,
and is the sort control supported by AD so that it may be posed
externally, etc.?

At any rate, I don't think keeping a cursor open is that big a deal,
although there probably needs to be a server-side timeout so that if
the cursor is not accessed for a long time (say an hour, for argument's
sake) it is automatically closed. Well-behaved clients should explicitly
close their NamingEnumerations, which I would hope would allow the
cursor to be discarded immediately.
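
For reference, the client side is just the usual JNDI pattern of closing 
the enumeration in a finally block (standard javax.naming API; the base DN 
below is made up):

    import javax.naming.NamingEnumeration;
    import javax.naming.directory.DirContext;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;

    class WellBehavedClient {
        static void search(DirContext ctx) throws Exception {
            NamingEnumeration<SearchResult> results = ctx.search(
                    "ou=people,dc=example,dc=com",   // made-up base DN
                    "(objectClass=*)",
                    new SearchControls());
            try {
                while (results.hasMore()) {
                    SearchResult entry = results.next();
                    // ... use the entry ...
                }
            } finally {
                // lets the server discard its cursor immediately
                results.close();
            }
        }
    }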

Cheers

On Thu, Dec 4, 2008 at 10:25 AM, Emmanuel Lecharny <el...@gmail.com> wrote:
> Hi guys,
>
> I'm trying to implement this control for 1.5.5, and I'm a bit stuck on how
> we should handle it in the core server. Right now, the control is decoded
> (the ASN.1 decoder has been written and tested, and it works fine). The
> SearchHandler has been modified so that we can implement the control in a
> single place: the doSimpleSearch() method.
>
> The problem I have is the following: we have to remember a pointer to the
> last entry we sent back to the client.
>
> How should we do this? My first approach was pretty naive: since we are
> using a cursor, it's easy. We simply store the cursor in the session, and
> the next request just has to fetch this cursor back from the session and
> read the next N elements from it.
>
> This has the advantage of being simple, but there are some very important
> cons:
> - it's memory consuming, as we may keep those cursors in the session for a
> very long time
> - we will have to close all the cursors when the session is closed (for
> whatever reason)
> - if some data has been modified since the cursor was created, it may
> contain invalid data
> - if the user doesn't send an abandon request, those cursors will remain
> in the session until it's closed (and this is very likely to happen)
>
> So I'm considering an alternative, though more expensive and less
> performant, approach:
> - we build a new cursor for each request,
> - we move forward to the Nth entry in the newly created cursor, and return
> the M requested elements,
> - and when done, we discard the cursor.
>
> The pros are:
> - we don't have to keep N cursors in memory forever,
> - from the client's POV, it respects the PagedSearch contract,
> - it's easier to implement, as we have less information to keep in the
> session and to restore.
>
> The cons are:
> - it's time consuming: if we have N entries to return with a page size of
> P, we will construct N/P cursors.
>
> Do you see a better way to implement this control, and if not, which one
> do you think is the best?
>
> Thanks!
>
> --
> --
> cordialement, regards,
> Emmanuel Lécharny
> www.iktek.com
> directory.apache.org
>
>
>

Re: Implementing the PagedSearchControl

Posted by Norval Hope <nr...@gmail.com>.
>
> We have to compare the SearchRequest, not the SearchResult. This is dictated
> by the RFC, and it makes sense: if you change the filter, the baseDN or
> something else, then it's another request.

Sorry, my bad, I didn't read carefully enough. This sounds more like a
sanity check on the server side, then, rather than a critical part of
the implementation. But if it's in the RFC, there's no arguing. I
understand the dilemma now, but part of me is thinking that any time the
server normalizes anything (including filters) it should still keep
the original around for comparison (just as LdapDN does), in which case
there wouldn't be a problem.
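
Something like this tiny holder, in the LdapDN spirit (purely 
illustrative, not an existing class):

    // Keep what the client sent next to the canonical form, the way LdapDN
    // keeps the user-provided DN next to the normalized one.
    final class Normalized<T> {
        private final T original;    // what the client actually sent
        private final T normalized;  // canonical form used for comparisons

        Normalized(T original, T normalized) {
            this.original = original;
            this.normalized = normalized;
        }

        T original()   { return original; }
        T normalized() { return normalized; }

        // equality is decided on the canonical form only
        @Override
        public boolean equals(Object o) {
            return o instanceof Normalized
                    && normalized.equals(((Normalized<?>) o).normalized);
        }

        @Override
        public int hashCode() { return normalized.hashCode(); }
    }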

>> From my perspective (in part influenced by my VD work, but also
>> driven by encapsulation considerations) I think things are cleaner
>> when the codec (or at least some well defined layer of the codec) does
>> *only* the coding and decoding job, without any expectation of
>> normalization or access to schema information. IMO for the VD usecase
>> the AD server shouldn't expect schema information to be available, as
>> the "real" validation is going to be performed on the ultimate
>> concrete endpoints and the AD server is just acting as a transport /
>> container.
>>
>
> Maybe, but from the server's POV, this is really a problem. Right now, we
> already do a DN normalization because we need it to manage referrals before
> the request goes through the interceptor chain.
>
> We discussed a lot about VD architectures with Alex, and my personal vision
> is that it's almost impossible to correctly manage a VD without complete
> schema knowledge of all the targeted servers. But until I implement a VD
> myself and face all the associated problems, I may be missing something.

I suspect this has a lot to do with whether the VD is actually trying
to persist any data on behalf of the target system (in which case
schema knowledge is definitely required) or simply acting as an
aggregation point (in which case I think the more hands-off it is, the
better). But anyway, the major point I'm making is that IMO there
should be a well-defined layer in the codec where the data has been
simply encoded and decoded without any manipulation; if the server
then needs a layer above it which normalizes based on access to schema
information, that's fine, as long as there is an unencumbered layer
below. I haven't got right down into this code in the current trunk
yet, but I think the previous code needed access to the schema from the
codec (for LdapDN / RDNs etc.). However, as we've discussed a number of
times, my requirements are distinct from the primary ones for the AD
server, and therefore it's fine when I need to make custom mods. I'll
now shut up about VD stuff for good :-)

> No, the way it works is simple: we use the request's thread to handle the
> cursor. Currently, I'm storing the cursor in the session and restoring it
> when a new paged request arrives. It works pretty well. I have not yet
> committed the code, because I have some issues with size limit management,
> but I'm almost done (and I was busy these last few days too ...)

Cool, sounds good.
>
> The pair I store in the session is <cookie, pagedSearchCookie>, where the
> pagedSearchCookie object contains all the information needed to manage the
> paged search and the current cursor.

Also sounds good.

Thanks

Re: Implementing the PagedSearchControl

Posted by Emmanuel Lecharny <el...@gmail.com>.
Norval Hope wrote:
> Hi Guys,
>
>   
>> There are vicious issues, though. Some of them are related to the way we
>> have designed the server. For instance, when comparing the previous
>> searchRequest with the current one, you have to compare attributes, DN and
>> filters. That's not complicated, except that those elements might not be
>> equal, simply because they have not yet been normalized at this point (in
>> the SearchHandler).
>>     
>
> I'm missing something here - why is comparing search results required?
>   
We have to compare the SearchRequest, not the SearchResult. This is 
dictated by the RFC, and it makes sense: if you change the filter, the 
baseDN or something else, then it's another request.
> I would have thought that all that would be stored on the server side
> for the paged search would be a key that could be used to look up a
> live cursor, in which case there would never be a need to compare
> search result entries (just call Cursor.next() N times until a full
> page of results can be returned, and when the next request comes, use
> the key stored in the session to look up the cursor and repeat).
>
>   
>> This is a big issue. At this point, we can manage to normalize the DN and
>> attributes, but for the filter, this is another story. This makes me think
>> that the Normalize interceptor is not needed as an interceptor, and that
>> normalization should be moved up in the stack (into the codec, in fact).
>>     
>
> From my perspective (in part influenced by my VD work, but also
> driven by encapsulation considerations) I think things are cleaner
> when the codec (or at least some well defined layer of the codec) does
> *only* the coding and decoding job, without any expectation of
> normalization or access to schema information. IMO for the VD usecase
> the AD server shouldn't expect schema information to be available, as
> the "real" validation is going to be performed on the ultimate
> concrete endpoints and the AD server is just acting as a transport /
> container.
>   
Maybe, but from the server's POV, this is really a problem. Right now, we 
already do a DN normalization, because we need it to manage referrals 
before the request goes through the interceptor chain.

We discussed a lot about VD architectures with Alex, and my personal 
vision is that it's almost impossible to correctly manage a VD without 
complete schema knowledge of all the targeted servers. But until I 
implement a VD myself and face all the associated problems, I may be 
missing something.
>   
>> Otherwise, the other problem we have is the Cursor closure. When we are done
>> with them, we should close those guys. This is easy if the client behaves
>> correctly (i.e., sends a last request with 0 as the number of elements to
>> return, or we reach the end of the entries to return), but if the client
>> doesn't do that, we will end up with potentially thousands of open cursors
>> in memory.
>>
>> So we need to add a cleanup thread associated with each session, closing the
>> cursor after a timeout has occurred.
>>     
>
> I'd expect a single thread (or a singleton Executor) could do
> this job for all sessions.
Yep, this is the basic idea.
>  And also, if the searches are being paged,
> I'd imagine retrieving a single page of search results could also be
> done in the MINA / AD worker thread rather than requiring a separate
> thread, unless the Cursor implementation itself mandates that a specific
> thread be kept around.
No, the way it works is simple: we use the request's thread to handle 
the cursor. Currently, I'm storing the cursor in the session and 
restoring it when a new paged request arrives. It works pretty well. I 
have not yet committed the code, because I have some issues with size 
limit management, but I'm almost done (and I was busy these last few 
days too ...)
>  Hence the only resources needed for each search
> request would be an entry in a <String, Cursor> hashmap (presuming the
> key stored is a string) and the Cursor itself.
>   
The pair I store in the session is <cookie, pagedSearchCookie>, where 
the pagedSearchCookie object contains all the information needed to 
manage the paged search and the current cursor.
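
For illustration only, the stored object could look roughly like this; 
the field names are my guesses at the "needed information", not the 
committed code (EntryCursor is the same stand-in as in the earlier 
sketches):

    // Per-search state kept in the session, keyed by the opaque cookie
    // that is echoed back to the client on every page.
    class PagedSearchCookie {
        final byte[] cookie;           // opaque value echoed to the client
        final EntryCursor cursor;      // the live cursor for this paged search
        final Object originalRequest;  // kept to detect a changed SearchRequest
        int entriesSent;               // running count against the size limit
        volatile long lastAccess;      // feeds the timeout cleanup

        PagedSearchCookie(byte[] cookie, EntryCursor cursor, Object request) {
            this.cookie = cookie;
            this.cursor = cursor;
            this.originalRequest = request;
            this.lastAccess = System.currentTimeMillis();
        }
    }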

-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: Implementing the PagedSearchControl

Posted by Norval Hope <nr...@gmail.com>.
Hi Guys,

> There are vicious issues, though. Some of them are related to the way we
> have designed the server. For instance, when comparing the previous
> searchRequest with the current one, you have to compare attributes, DN and
> filters. That's not complicated, except that those elements might not be
> equal, simply because they have not yet been normalized at this point (in
> the SearchHandler).

I'm missing something here - why is comparing search results required?
I would have thought that all that would be stored on the server side
for the paged search would be a key that could be used to look up a
live cursor, in which case there would never be a need to compare
search result entries (just call Cursor.next() N times until a full
page of results can be returned, and when the next request comes, use
the key stored in the session to look up the cursor and repeat).

>
> This is a big issue. At this point, we can manage to normalize the DN and
> attributes, but for the filter, this is another story. This makes me think
> that the Normalize interceptor is not needed as an interceptor, and that
> normalization should be moved up in the stack (into the codec, in fact).

From my perspective (in part influenced by my VD work, but also
driven by encapsulation considerations) I think things are cleaner
when the codec (or at least some well defined layer of the codec) does
*only* the coding and decoding job, without any expectation of
normalization or access to schema information. IMO, for the VD usecase
the AD server shouldn't expect schema information to be available, as
the "real" validation is going to be performed on the ultimate
concrete endpoints and the AD server is just acting as a transport /
container.

>
> Otherwise, the other problem we have is the Cursor closure. When we are done
> with them, we should close those guys. This is easy if the client behaves
> correctly (i.e., sends a last request with 0 as the number of elements to
> return, or we reach the end of the entries to return), but if the client
> doesn't do that, we will end up with potentially thousands of open cursors
> in memory.
>
> So we need to add a cleanup thread associated with each session, closing the
> cursor after a timeout has occurred.

I'd expect a single thread (or a singleton Executor) could do
this job for all sessions. And also, if the searches are being paged,
I'd imagine retrieving a single page of search results could also be
done in the MINA / AD worker thread rather than requiring a separate
thread, unless the Cursor implementation itself mandates that a specific
thread be kept around. Hence the only resources needed for each search
request would be an entry in a <String, Cursor> hashmap (presuming the
key stored is a string) and the Cursor itself.

Cheers,
Norval

Re: Implementing the PagedSearchControl

Posted by Emmanuel Lecharny <el...@gmail.com>.
Alex Karasulu wrote:
> Hi Emmanuel,
>
> On Wed, Dec 3, 2008 at 6:25 PM, Emmanuel Lecharny <el...@gmail.com>wrote:
>
>   
>> The problem I have is the following: we have to remember a pointer to
>> the last entry we sent back to the client.
>>
>> How should we do this? My first approach was pretty naive: since we are
>> using a cursor, it's easy. We simply store the cursor in the session, and
>> the next request just has to fetch this cursor back from the session and
>> read the next N elements from it.
>>
>> This has the advantage of being simple, but there are some very important
>> cons:
>> - it's memory consuming, as we may keep those cursors in the session for a
>> very long time
>> - we will have to close all the cursors when the session is closed (for
>> whatever reason)
>> - if some data has been modified since the cursor was created, it may
>> contain invalid data
>> - if the user doesn't send an abandon request, those cursors will remain
>> in the session until it's closed (and this is very likely to happen)
>>
>> So I'm considering an alternative, though more expensive and less
>> performant, approach:
>> - we build a new cursor for each request,
>> - we move forward to the Nth entry in the newly created cursor, and return
>> the M requested elements,
>> - and when done, we discard the cursor.
>>
>>     
>
> I would avoid this approach.  The problem is that it requires a quadratic
> amount of computation, as you have to scan back to the point you were at
> before in order to advance the cursor.  Say you have 100 entries and you
> advance through the first 10.  Then you create a new cursor and ask for
> elements 11-20.  This means you'll scan through the first 10 elements,
> checking whether each one matches the filter, and as you know this shifts
> a nested structure of cursors built to reflect the logic of the filter.
> So you're doing a search over 10, then 20, 30, 40, 50, 60 and so on elements.
>   
Yes, I'm aware of that. And I will certainly not go this way ...
>
>   
>> The pros are:
>> - we don't have to keep N cursors in memory forever,
>>     
>
>
> The whole point of this feature is to maintain state so the search continues
> where it left off.  But this should be cheap both for the server and for the
> client. This approach is a brute-force approach, and it's going to mix up a
> lot of code in complicated places.
>
> It's OK to hold off on this until we see a better approach.  I'd rather wait
> until we see that eureka light bulb go off.
>
>
>   
>> - from the client's POV, it respects the PagedSearch contract,
>> - it's easier to implement, as we have less information to keep in the
>> session and to restore.
>>
>> The cons are:
>> - it's time consuming: if we have N entries to return with a page size of
>> P, we will construct N/P cursors.
>>
>>     
>
> Yes, and there will be costs to the advances.  Both are going to make this
> approach limiting.
>   
I'm currently moving forward in the other direction (i.e., storing 
the cursor in the session).

There are vicious issues, though. Some of them are related to the way we 
have designed the server. For instance, when comparing the previous 
searchRequest with the current one, you have to compare attributes, DN 
and filters. That's not complicated, except that those elements might 
not be equal, simply because they have not yet been normalized at this 
point (in the SearchHandler).
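
Once everything is normalized, the check itself is conceptually simple; 
something like this sketch, where SearchSpec and its fields are stand-ins 
for the normalized, result-defining pieces of a SearchRequest:

    import java.util.Set;

    // Stand-in for the parts of a SearchRequest that define its result set.
    final class SearchSpec {
        final String normalizedBaseDn;
        final String normalizedFilter;
        final int scope;
        final Set<String> requestedAttributes;

        SearchSpec(String baseDn, String filter, int scope, Set<String> attrs) {
            this.normalizedBaseDn = baseDn;
            this.normalizedFilter = filter;
            this.scope = scope;
            this.requestedAttributes = attrs;
        }

        // same paged search iff every result-defining component matches
        boolean sameSearchAs(SearchSpec other) {
            return normalizedBaseDn.equals(other.normalizedBaseDn)
                    && normalizedFilter.equals(other.normalizedFilter)
                    && scope == other.scope
                    && requestedAttributes.equals(other.requestedAttributes);
        }
    }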

This is a big issue. At this point, we can manage to normalize the DN 
and attributes, but for the filter, this is another story. This makes me 
think that the Normalize interceptor is not needed as an interceptor, 
and that normalization should be moved up in the stack (into the codec, 
in fact).

Otherwise, the other problem we have is the Cursor closure. When we are 
done with them, we should close those guys. This is easy if the client 
behaves correctly (i.e., sends a last request with 0 as the number of 
elements to return, or we reach the end of the entries to return), but 
if the client doesn't do that, we will end up with potentially thousands 
of open cursors in memory.

So we need to add a cleanup thread associated with each session, closing 
the cursor after a timeout has occurred.

Those are the two problems I'm currently facing...

Otherwise, the implementation itself is pretty straightforward (well, 
not quite, but it's just simple code).

Any idea about how to handle those two problems?
> Alex
>
>   


-- 
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org



Re: Implementing the PagedSearchControl

Posted by Alex Karasulu <ak...@apache.org>.
Hi Emmanuel,

On Wed, Dec 3, 2008 at 6:25 PM, Emmanuel Lecharny <el...@gmail.com>wrote:

> The problem I have is the following: we have to remember a pointer to
> the last entry we sent back to the client.
>
> How should we do this? My first approach was pretty naive: since we are
> using a cursor, it's easy. We simply store the cursor in the session, and
> the next request just has to fetch this cursor back from the session and
> read the next N elements from it.
>
> This has the advantage of being simple, but there are some very important
> cons:
> - it's memory consuming, as we may keep those cursors in the session for a
> very long time
> - we will have to close all the cursors when the session is closed (for
> whatever reason)
> - if some data has been modified since the cursor was created, it may
> contain invalid data
> - if the user doesn't send an abandon request, those cursors will remain
> in the session until it's closed (and this is very likely to happen)
>
> So I'm considering an alternative, though more expensive and less
> performant, approach:
> - we build a new cursor for each request,
> - we move forward to the Nth entry in the newly created cursor, and return
> the M requested elements,
> - and when done, we discard the cursor.
>

I would avoid this approach.  The problem is that it requires a quadratic
amount of computation, as you have to scan back to the point you were at
before in order to advance the cursor.  Say you have 100 entries and you
advance through the first 10.  Then you create a new cursor and ask for
elements 11-20.  This means you'll scan through the first 10 elements,
checking whether each one matches the filter, and as you know this shifts
a nested structure of cursors built to reflect the logic of the filter.
So you're doing a search over 10, then 20, 30, 40, 50, 60 and so on elements.
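
To put rough numbers on it: with N entries and a page size of P, page k
re-scans k * P entries, so the total number of entries visited is

    P + 2P + 3P + ... + N  =  (N / P) * (N + P) / 2,  about N^2 / (2P)

For the example above (N = 100, P = 10), that is 550 entry visits instead
of 100, and the ratio keeps growing linearly with N.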


>
> The pros are:
> - we don't have to keep N cursors in memory forever,


The whole point of this feature is to maintain state so the search continues
where it left off.  But this should be cheap both for the server and for the
client. This approach is a brute-force approach, and it's going to mix up a
lot of code in complicated places.

It's OK to hold off on this until we see a better approach.  I'd rather wait
until we see that eureka light bulb go off.


>
> - from the client's POV, it respects the PagedSearch contract,
> - it's easier to implement, as we have less information to keep in the
> session and to restore.
>
> The cons are:
> - it's time consuming: if we have N entries to return with a page size of
> P, we will construct N/P cursors.
>

Yes, and there will be costs to the advances.  Both are going to make this
approach limiting.

Alex