You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/09/11 12:08:11 UTC

The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Hi,

[moving this to oak-dev@ for a broader discussion]

On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) <ji...@apache.org> wrote:
> [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good to
> have a clearly defined way to get the size of the result. [...]

I've always found the -1 return value from getSize() incredibly
annoying as it forces client code to use extra conditionals and go
through extra hoops if the size turns out not to be available. There
are basically three potential scenarios:

1. The client doesn't need to know the size, so it never calls getSize().
2. The client does need to know the size, so it calls getSize() and
has to iterate through all results if getSize() returns -1.
3. The client could use the size (for UI, optimization, etc.), so it
calls getSize() and ignores the result if its -1.

The main problem I have with the -1 return value is that case 2
becomes really annoying to handle.

Instead I'd propose the following design:

* The getSize() method always returns the size, by buffering all
results in memory if necessary.
* A separate hasSize() method can be used to check if the size is
quickly available (i.e. if getSize() will complete in O(1) time).

With such a design the above cases become easier to handle:

1. The client doesn't need to know the size, so it never calls getSize().
2. The client does need to know the size, so it calls getSize().
3. The client could use the size (for UI, optimization, etc.), so it
calls hasSize() and possibly follows up with getSize().

PS. Note that implementing an "estimated size" feature like seen in
many public search engines ("results 1-10 of thousands") is really
difficult to implement in a manner that's both efficient and secure.
Public search engines can make such estimates efficiently since all
their content is public and they thus don't need to worry about
accidentally leaking sensitive information.

BR,

Jukka Zitting

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>1) Next to getSize() iterator we also added getTotalSize(). I don't
>like the name because it is actually more something like:
>getTotalSizeWithoutCheckingACLs().

Hm, wouldn't that be a security problem?

Couldn't it be better (from a security perspective) if you can only get
this number if you use an admin session or similar?

The query engine currently doesn't have a way to retrieve the size,
without reading the nodes - even for the admin session. Could you create
an (OAK-) issue about this? It could be implemented, but I wonder if we
shouldn't rather implement aggregate functions in the query language:

    select count(*) from [nt:base] where ...

That way the query engine knows in advance it doesn't have to read the
nodes at all.

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Ard Schrijvers <a....@onehippo.com>.
Hello,

Sorry to chime in so late in this thread, hope my remarks are still
welcome. I did read the entire thread, and won't reply in line, but
just try to recap and explain how we got around it in Hippo
repository. The problem is obvious:

*** How to get efficiently a correct count of total hits  when
finegrained authorization is involved ***

Regarding the remark in this thread : 'For example, it doesn't make
any sense to display "1045430 hits" if calculating this number takes
1.5 hours' , I wholeheartedly agree, but our customers *never* agree.
They want the exact hit count, no matter what

So, we tackled this at Hippo as follows.

1) Next to getSize() iterator we also added getTotalSize(). I don't
like the name because it is actually more something like:
getTotalSizeWithoutCheckingACLs(). This method give you back directly
the number of hits from the backing search index. That one is fast of
course. What is slow, is authorizing potentially 1.000.000 hits
because all those nodes need to be fetched from backing storage, etc
etc. However, most of our customers have an application that show the
results for some siteuser below some folder : The siteuser has read
access for the entire folder. We just show the
getTotalSizeWithoutCheckingACLs() as total hits. Worst case, the
number is higher than the actual number the siteuser is allowed to
read

2) We have our ACLs based on node properties. Hence, we have been able
to create an AuthorizationQuery, mapped directly to a cached Lucene
bitset. When a jcr session searches in our repository, we combine his
cached authorization bitset. After changes in the repository we need
to reload these bitsets (on request) but they are shared between all
users that have the same authorization. It is blistering fast that
way, resulting in correct authorized counts

Now, I don't think (2) can be part of oak, as it implies a certain ACL
model which is not generic enough. Quite some ACL mappings of course
cannot be translated to a Lucene query. However, (1) should be very
issue, and already is a lot better. It is up to the developer then to
use getTotalSizeWithoutCheckingACLs() (and then a decent name :-) or
not

My 2 cents

Regards Ard

On Tue, Sep 11, 2012 at 12:08 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> [moving this to oak-dev@ for a broader discussion]
>
> On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) <ji...@apache.org> wrote:
>> [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good to
>> have a clearly defined way to get the size of the result. [...]
>
> I've always found the -1 return value from getSize() incredibly
> annoying as it forces client code to use extra conditionals and go
> through extra hoops if the size turns out not to be available. There
> are basically three potential scenarios:
>
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize() and
> has to iterate through all results if getSize() returns -1.
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls getSize() and ignores the result if its -1.
>
> The main problem I have with the -1 return value is that case 2
> becomes really annoying to handle.
>
> Instead I'd propose the following design:
>
> * The getSize() method always returns the size, by buffering all
> results in memory if necessary.
> * A separate hasSize() method can be used to check if the size is
> quickly available (i.e. if getSize() will complete in O(1) time).
>
> With such a design the above cases become easier to handle:
>
> 1. The client doesn't need to know the size, so it never calls getSize().
> 2. The client does need to know the size, so it calls getSize().
> 3. The client could use the size (for UI, optimization, etc.), so it
> calls hasSize() and possibly follows up with getSize().
>
> PS. Note that implementing an "estimated size" feature like seen in
> many public search engines ("results 1-10 of thousands") is really
> difficult to implement in a manner that's both efficient and secure.
> Public search engines can make such estimates efficiently since all
> their content is public and they thus don't need to worry about
> accidentally leaking sensitive information.
>
> BR,
>
> Jukka Zitting



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Sep 13, 2012 at 10:40 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Sounds good. I guess we could do both: always prefetch 20 nodes, and if
> there is still time, fetch more up to 0.1 seconds or so, or at most 200
> nodes. I guess 200 should be enough to for a GUI to decide what to display
> (10 pages for 20 nodes per page).

+1

BR,

Jukka Zitting

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>How do I know it's for sure more than 20

Because the PrefetchIterator will try to prefetch 20 nodes.

>or whatever my page size happens to be?

If you have a higher page size then you need getSize(max).

>>Please note if you use offset and limit, getSize() will return the size
>>of
>> the result minus the offset (it will give you the number of rows you can
>> read from the result).
>
>Really? Is that according to the JCR spec?

Is it not?

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 14.09.2012, at 07:41, Thomas Mueller <mu...@adobe.com> wrote:

> If getSize() returns -1 then you know for sure there are more than 20
> results, so you know you have to display 'next page'.

How do I know it's for sure more than 20, or whatever my page size happens to be? -1 simply means: "no idea". That the implementation does a timeout handling cannot really be assumed on a JCR application level.

>> In that case the oak getSize(max) is probably a better fit.
> 
> If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that
> call might take hours. The only thing where it helps is if you have a
> different page size than 20.

I was not referring to Integer.MAX_VALUE, but simply to the proposed new method.

> Please note if you use offset and limit, getSize() will return the size of
> the result minus the offset (it will give you the number of rows you can
> read from the result).

Really? Is that according to the JCR spec?

Cheers,
Alex

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>The idea with the timeout sounds good, but what should we recommend an
>application to do if getSize() takes too long and returns -1?
>
>Imagine while paging search results, the first page query is fast enough
>(getSize() returns something), but the second is too long and now returns
>-1: should the application give up and "loose" the page navigation or
>should it count itself, now taking ages again....

If getSize() returns -1 then you know for sure there are more than 20
results, so you know you have to display 'next page'. If it returns 0 you
know there are no rows. If it returns a value between 1 and 20 you know
that there are this many rows. Does this answer your question?

>In that case the oak getSize(max) is probably a better fit.

If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that
call might take hours. The only thing where it helps is if you have a
different page size than 20.

Please note if you use offset and limit, getSize() will return the size of
the result minus the offset (it will give you the number of rows you can
read from the result).

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 13.09.2012, at 01:40, Thomas Mueller <mu...@adobe.com> wrote:

> Sounds good. I guess we could do both: always prefetch 20 nodes, and if
> there is still time, fetch more up to 0.1 seconds or so, or at most 200
> nodes. I guess 200 should be enough to for a GUI to decide what to display
> (10 pages for 20 nodes per page).

The idea with the timeout sounds good, but what should we recommend an application to do if getSize() takes too long and returns -1? 

Imagine while paging search results, the first page query is fast enough (getSize() returns something), but the second is too long and now returns -1: should the application give up and "loose" the page navigation or should it count itself, now taking ages again....

In that case the oak getSize(max) is probably a better fit. (BTW, there should probably be a way to indicate the difference between search results == max and search results > max, for example by returning max + 1 in the latter case).

Cheers,
Alex


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

Sounds good. I guess we could do both: always prefetch 20 nodes, and if
there is still time, fetch more up to 0.1 seconds or so, or at most 200
nodes. I guess 200 should be enough to for a GUI to decide what to display
(10 pages for 20 nodes per page).

Regards,
Thomas


On 9/13/12 10:27 AM, "Jukka Zitting" <ju...@gmail.com> wrote:

>Hi,
>
>On Thu, Sep 13, 2012 at 10:22 AM, Thomas Mueller <mu...@adobe.com>
>wrote:
>> Yes. Let's discuss the value now! 1000 sounds OK in general, however
>>there
>> is a potential performance problem.
>
>How about instead of a size limit we set a time limit on getSize()?
>I.e. we limit the method to use no more than the mentioned 0.2 seconds
>for prefetching nodes. If the total size is not available by that
>time, it'll return -1.
>
>BR,
>
>Jukka Zitting


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Sep 13, 2012 at 10:22 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Yes. Let's discuss the value now! 1000 sounds OK in general, however there
> is a potential performance problem.

How about instead of a size limit we set a time limit on getSize()?
I.e. we limit the method to use no more than the mentioned 0.2 seconds
for prefetching nodes. If the total size is not available by that
time, it'll return -1.

BR,

Jukka Zitting

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>return the correct size if the result set has fewer than
>something like 1000 entries. That should cover most practical cases

Yes. Let's discuss the value now! 1000 sounds OK in general, however there
is a potential performance problem. For Jackrabbit 2.x, if there are more
than a few million nodes in the repository, you can only load about 100
nodes per second (when using a regular hard disk). Now if we always try to
prefetch 1000 nodes (which is what you have to do in this case), then that
could result in getSize() to take 10 seconds. I think that's not
acceptable. Therefore I argue the value should be about 20, which means
worst case getSize() would take 0.2 seconds. Of course we can still change
the value later, but I would start with 20 (or even lower) just so we
detect problems with that early on.

>> If getSize(int max) is just in the Oak API but not
>> available for the end user, then we didn't really gain much :-)
>
>If we come up with a better API design in oak-core, that'll be a good
>candidate for inclusion in JCR 2.1.

Yes, that makes sense of course.

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Sep 13, 2012 at 9:56 AM, Thomas Mueller <mu...@adobe.com> wrote:
> [...] When I opened OAK-300, it was for the JCR API implementation. [...]

Ah, I see, sorry for confusing the matter.

For the JCR API I'd simply start by ensuring that getSize() will
always return the correct size if the result set has fewer than
something like 1000 entries. That should cover most practical cases
and existing client code without requiring new query syntax or other
API extensions.

> If getSize(int max) is just in the Oak API but not
> available for the end user, then we didn't really gain much :-)

If we come up with a better API design in oak-core, that'll be a good
candidate for inclusion in JCR 2.1.

BR,

Jukka Zitting

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>There's no need for the Oak API to reflect JCR in all its details.

Sure. First we need to define how the JCR API implementation is supposed
to behave. Based on that we can then still decide what the Oak API should
look like. The Oak API is (more or less) an implementation detail. Of
course the Oak API support more features, but that's not really relevant
for the user of Oak. When I opened OAK-300, it was for the JCR API
implementation. If getSize(int max) is just in the Oak API but not
available for the end user, then we didn't really gain much :-)

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Sep 12, 2012 at 10:37 AM, Thomas Mueller <mu...@adobe.com> wrote:
> Yes, that would work as well. I have added that to OAK-300. The
> disadvantage is that it's a new API (not part of the JCR specification).
> The two options I proposed don't require a new API.

There's no need for the Oak API to reflect JCR in all its details. If
there's a better way to design something, we should do so as long as
it's still possible for oak-jcr to implement the relevant parts of JCR
based on the underlying Oak API.

BR,

Jukka Zitting

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 12.09.2012, at 01:23, Michael Dürig <md...@apache.org> wrote:

> As an alternative: we could use a separate method getSize(int max) which
> 
> * if called with max == -1 returns the exact size if quickly available,
> * returns -1 otherwise, and
> * returns the exact size but not more then max when called with max >= 0.

+1 clean and obvious (points the app developer to the issue without having to know another method like hasSize())

Cheers,
Alex

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>(and getSize() would still return -1 I guess?)
>
>For backward compatibility I'd leave the behaviour "as unchanged as
>possible". That is, return -1 if the size is not quickly available.

Yes, that what I had in mind. The question is now, what exactly *is*
quickly available. I suggest that the implementation (by default) tries to
prefetch 20 rows, so it can give an accurate result for up to that many
rows. If there are more than 20 rows, getSize() will return -1. We need to
document that the constant '20' may change in the future. Internally,
getSize() would call getSize(20).

About a new method getSize(int max): I guess we would need to add at least
one new interface, OakRangeIterator extends RangeIterator, and probably
also OakNodeIterator and OakRowIterator. The application would then call
getSize(int max) as follows:

Session session = ..

QueryManager qm = session.getWorkspace().getQueryManager();
Query q = qm.createQuery("select ..", Query.JCR_SQL2);
QueryResult r = q.execute();
RowIterator it = (OakRowIterator) r.getRows();
int size = it.getSize(100);





So there is a cast needed. I personally think that adding a new interface,
and then casting to this interface in the application isn't very nice, but
well what can you do.

Regards,
Thomas




Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Michael Dürig <md...@apache.org>.

On 12.9.12 19:06, Michael Marth wrote:
>> As an alternative: we could use a separate method getSize(int max) which
>>
>> * if called with max == -1 returns the exact size if quickly available,
>> * returns -1 otherwise, and
>> * returns the exact size but not more then max when called with max >= 0.
>>
>> This allows for estimates but leaves the caller in control.
>
> +1
>
> (and getSize() would still return -1 I guess?)

For backward compatibility I'd leave the behaviour "as unchanged as 
possible". That is, return -1 if the size is not quickly available.

Michael

>

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

The main question is still what the JCR API method getSize() should return.


A new method getSize(int max) is nice, and of course we can do that. But I
guess people will not use it in the near future because it's not part of
the JCR API.

Regards,
Thomas




On 9/12/12 8:06 PM, "Michael Marth" <mm...@adobe.com> wrote:

>> As an alternative: we could use a separate method getSize(int max) which
>> 
>> * if called with max == -1 returns the exact size if quickly available,
>> * returns -1 otherwise, and
>> * returns the exact size but not more then max when called with max >=
>>0.
>> 
>> This allows for estimates but leaves the caller in control.
>
>+1
>
>(and getSize() would still return -1 I guess?)


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Michael Marth <mm...@adobe.com>.
> As an alternative: we could use a separate method getSize(int max) which
> 
> * if called with max == -1 returns the exact size if quickly available,
> * returns -1 otherwise, and
> * returns the exact size but not more then max when called with max >= 0.
> 
> This allows for estimates but leaves the caller in control.

+1

(and getSize() would still return -1 I guess?)

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>As an alternative: we could use a separate method getSize(int max) which

Yes, that would work as well. I have added that to OAK-300. The
disadvantage is that it's a new API (not part of the JCR specification).
The two options I proposed don't require a new API.

Regards,
Thomas


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Michael Dürig <md...@apache.org>.
Hi,

On 11.9.12 11:08, Jukka Zitting wrote:

>
> Instead I'd propose the following design:
>
> * The getSize() method always returns the size, by buffering all
> results in memory if necessary.
> * A separate hasSize() method can be used to check if the size is
> quickly available (i.e. if getSize() will complete in O(1) time).
>

As an alternative: we could use a separate method getSize(int max) which

* if called with max == -1 returns the exact size if quickly available,
* returns -1 otherwise, and
* returns the exact size but not more then max when called with max >= 0.

This allows for estimates but leaves the caller in control.

Michael

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 12.09.2012, at 00:33, Thomas Mueller <mu...@adobe.com> wrote:

>> Display the actual number of search results to the user?
> 
> Do you want to risk that the method getSize() takes 1.5 hours just to
> display the actual number of search results to the user?

Well, the application can decide if there are other options such as the proposed hasSize() or getSize(max int) (which I like).

>> If they show "next page", they usually also show the number of results
>> and the number of pages.
> 
> Yes, I understand it is a nice feature. But it is a very dangerous feature
> given the problems described above.

Developers are used to get fast counts in RDBMS with dedicated indexes.

Cheers,
Alex

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

>>>2. The client does need to know the size, so it calls getSize() and
>> 
>> I currently can't come up with a convincing use case - what is your use
>> case?
>
>Display the actual number of search results to the user?

Do you want to risk that the method getSize() takes 1.5 hours just to
display the actual number of search results to the user?

In this case, I suggest the application doesn't always display the exact
size, he could display one of the following instead:


* 0 hits
* 2 hits
* 13 hits
* 23 hits
* More than 30 hits

The reason for the "More than 30 hits" versus the exact number of hits is:
calculating the exact number of hits is very inefficient if there are more
than just a few hits. For example, it doesn't make any sense to display
"1045430 hits" if calculating this number takes 1.5 hours.

>If they show "next page", they usually also show the number of results
>and the number of pages.

Yes, I understand it is a nice feature. But it is a very dangerous feature
given the problems described above.

>In many cases the paging etc. doesn't have to be exact, so if there is a
>way that Oak could provide a fairly good estimate that would be really
>cool (like caching the numbers).

Estimating the result size is sometimes possible, that's true, but it is
potentially a security problem. Do you want me to describe the attack
scenario? It is easier for Google and so on where everybody may see
everything.

Regards,
Thomas
 


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 11.09.2012, at 07:18, Thomas Mueller <mu...@adobe.com> wrote:

>> 2. The client does need to know the size, so it calls getSize() and
> 
> I currently can't come up with a convincing use case - what is your use
> case?

Display the actual number of search results to the user?

>> has to iterate through all results if getSize() returns -1.
> 
> I would add: clients that do call getSize() even thought they don't really
> need the result, or don't care so much about the result, and only use it
> to display a 'next page' button / adjust the scrollbar.

If they show "next page", they usually also show the number of results and the number of pages.

>> PS. Note that implementing an "estimated size" feature like seen in
>> many public search engines ("results 1-10 of thousands") is really
>> difficult to implement in a manner that's both efficient and secure.
> 
> I agree.

In many cases the paging etc. doesn't have to be exact, so if there is a way that Oak could provide a fairly good estimate that would be really cool (like caching the numbers).

Cheers,
Alex


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

I'm worried about queries that return a huge number of rows, for example 1
million nodes. If getSize() is supposed to return the correct result, it
could potentially take hours (when reading 100 nodes per second). I'm more
in favour of returning -1 if there are more than just a few rows (for
example 20, or 100). Where 'few' is configurable (see OAK-300 for
details). And I think it should be configurable so that a client can
decide how much time he is willing to wait for the answer.

Specially I'm worried about existing clients that do call getSize() every
time.

>2. The client does need to know the size, so it calls getSize() and

I currently can't come up with a convincing use case - what is your use
case?


>has to iterate through all results if getSize() returns -1.

I would add: clients that do call getSize() even thought they don't really
need the result, or don't care so much about the result, and only use it
to display a 'next page' button / adjust the scrollbar.

>The main problem I have with the -1 return value is that case 2
>becomes really annoying to handle.
>
>Instead I'd propose the following design:
>
>* The getSize() method always returns the size, by buffering all
>results in memory if necessary.

Buffering will only work up to some point. Past that point, the query has
to be re-executed.

>* A separate hasSize() method can be used to check if the size is
>quickly available (i.e. if getSize() will complete in O(1) time).

O(1), without reading all rows, is hard to achieve, first because of
access rights checks and second because regular indexes will not be able
to provide this information within O(1) time (a counted b-tree can do it,
but we don't use one of those).

>PS. Note that implementing an "estimated size" feature like seen in
>many public search engines ("results 1-10 of thousands") is really
>difficult to implement in a manner that's both efficient and secure.

I agree.

Regards,
Thomas