You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@cayenne.apache.org by Aristedes Maniatis <ar...@ish.com.au> on 2007/05/25 07:11:31 UTC

Paging and SQL queries

We have recently been testing our 3 tier Cayenne app with a larger  
number of records and finding some performance problems with scale.  
In particular we have a test database with 60,000 student records  
(the production db will have 120,000). We are using paging (of 25  
records) to ensure that when a user brings up the list of students on  
the client we don't load them all across the network. However we are  
still seeing load times of around 40 seconds. At the moment we are  
using objects, not DataRows on the client.

One part of the problem appears to occur on the server. The SQL  
Cayenne generates looks like this:

SELECT * FROM Student t0 WHERE (t0.isDeleted IS NULL) OR  
(t0.isDeleted = 0) ORDER BY t0.firstName

although the * is actually a list of every field in the table. We  
timed this query against the db and even directly without Cayenne it  
takes 16 seconds to return all the data. I presume the next 24  
seconds Cayenne spends parsing the results into DataRows or Objects.  
Instead, if we executed:

SELECT t0.id FROM Student t0 WHERE (t0.isDeleted IS NULL) OR  
(t0.isDeleted = 0) ORDER BY t0.firstName

Then it returns in about 2 seconds (when performing this query  
directly against the db and not through Cayenne).


Q1: So, my first question is why is Cayenne doing this? If paging is  
switched on, shouldn't Cayenne be fetching only the primary keys and  
then faulting in the particular records it needs in full?

We can see paging working with the query which Cayenne executes next:

SELECT * t0 WHERE ((t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR  
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id  
= ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR  
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id  
= ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR  
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?)) AND  
((t0.isDeleted IS NULL) OR (t0.isDeleted = ?)) [bind: 18569, 18521,  
18253, 18060, 17932, 16451, 16395, 15967, 15819, 14773, 14492, 14317,  
14280, 13088, 12464, 11183, 10957, 10378, 10219, 9686, 9554, 9435,  
9155, 8611, 8038, 0]

25 records are paged into Cayenne as expected. But Cayenne has  
already fetched into memory (server-side) all the records in the  
original query.

Q2: Would it more more efficient from a db perspective to use the  
"t0.id IS IN (?, ?, ?...)" style syntax here instead or does it make  
no difference? I don't know enough about SQL optimisation to know  
either way, but at least the SQL is shorter (which might be good!).



Cheers
Ari Maniatis


-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Paging and SQL queries

Posted by Andrus Adamchik <an...@objectstyle.org>.

On May 28, 2007, at 9:49 AM, Andrus Adamchik wrote:

>
>>
>> I don't quite understand what 'resolvesFirstPage()' refers to.  
>> Does it mean 'firstPageResolved()'? Should it be negated here:
>
> No. "resolvesFirstPage" really a shorthand for "is it possible to  
> reliably resolve objects on the first page using the existing  
> algorithm".

I just clarified that with a JavaDoc comment. Sorry for mostly  
skipping JavaDocs on the private methods...

     /**
      * Returns true if it is possible to read the first page of  
inflated objects from the
      * ResultSet returned from the main query. False is returned for  
queries with
      * prefetches as resolving prefetches is not possible in this  
situation.
      */
     private boolean resolvesFirstPage() {
         return internalQuery.getPrefetchTree() == null;
     }

Andrus

Re: Paging and SQL queries

Posted by Andrus Adamchik <an...@objectstyle.org>.

On May 28, 2007, at 1:50 PM, Aristedes Maniatis wrote:

>
> On 28/05/2007, at 4:49 PM, Andrus Adamchik wrote:
>
>> No. "resolvesFirstPage" really a shorthand for "is it possible to  
>> reliably resolve objects on the first page using the existing  
>> algorithm".
>
> OK. I guess I would have named it isFirstPageResolvable() or  
> canResolveFirstPage(). But from your description it appears that it  
> isn't specific to the first page and that either it is resolvable  
> for any page, or not. So does it mean canResolveData()?

Actually no. This is specific to the first page that most of the time  
can be resolved from the "select *" query result set, while the  
second and following pages are resolved using separate queries that  
select a group of id's.

>> Just to confirm, is this a JDBC test or a test using DB client  
>> tools (on Derby this may be the same thing, still worth doing a  
>> JDBC run with your own code, as there can be unrelated bottlenecks  
>> in the tools).  E.g. create two PreparedStatements, one with  
>> "select *", another with "select id", and only read "id" column in  
>> both resultsets. See how much faster the second case is.
>
> OK. I just tried it a different way and the results are the same.  
> In this case I used the Derby ij tool to make a JDBC connection to  
> a Derby server on localhost. That server has real data from one of  
> our clients with 120,000 students.
>
> select * from student;  -> 53 seconds
> select id from student; -> 4 seconds
>
> I performed several runs to try and exclude cache warming and the  
> variance was less than 10%. The output was redirected to a file on  
> disk. There are around 30 columns in the student table with a range  
> of data types. The output file on disk with all the fields was  
> 120Mb when finished.

Could you also exclude "ij" from the picture? It can add the overhead  
of the terminal I/O, and the overhead of actually reading those extra  
29 columns - something Cayenne paginated query wouldn't do. What I  
think of as a *clean test* in this situation is a small Java program  
that opens a Connection, creates a PreparedStatement, and reads a  
ResultSet, discarding the data. In case of "select *" it would still  
only read the "id" column.

Thanks
Andrus

Re: Paging and SQL queries

Posted by Aristedes Maniatis <ar...@maniatis.org>.

On 28/05/2007, at 4:49 PM, Andrus Adamchik wrote:

> No. "resolvesFirstPage" really a shorthand for "is it possible to  
> reliably resolve objects on the first page using the existing  
> algorithm".

OK. I guess I would have named it isFirstPageResolvable() or  
canResolveFirstPage(). But from your description it appears that it  
isn't specific to the first page and that either it is resolvable for  
any page, or not. So does it mean canResolveData()? But I'm still  
working through the code to understand when prefetches are relevant.

> Just to confirm, is this a JDBC test or a test using DB client  
> tools (on Derby this may be the same thing, still worth doing a  
> JDBC run with your own code, as there can be unrelated bottlenecks  
> in the tools).  E.g. create two PreparedStatements, one with  
> "select *", another with "select id", and only read "id" column in  
> both resultsets. See how much faster the second case is.

OK. I just tried it a different way and the results are the same. In  
this case I used the Derby ij tool to make a JDBC connection to a  
Derby server on localhost. That server has real data from one of our  
clients with 120,000 students.

select * from student;  -> 53 seconds
select id from student; -> 4 seconds

I performed several runs to try and exclude cache warming and the  
variance was less than 10%. The output was redirected to a file on  
disk. There are around 30 columns in the student table with a range  
of data types. The output file on disk with all the fields was 120Mb  
when finished.

So even when we exclude any overhead of Cayenne, the database itself  
takes that much longer to simply read all that data off the disk and  
return it. To be honest, I'm a bit disappointed by even that 4  
seconds on a 2Gb G5 iMac since that will result in some serious UI  
lag by the time that makes it through Cayenne, to the client over a  
network (Jetty and Hessian) and then a jTable is drawn.

Ari

-------------------------->
Aristedes Maniatis
phone +61 2 9660 9700
PGP fingerprint 08 57 20 4B 80 69 59 E2  A9 BF 2D 48 C2 20 0C C8

Re: Paging and SQL queries

Posted by Andrus Adamchik <an...@objectstyle.org>.

On May 27, 2007, at 2:09 PM, Aristedes Maniatis wrote:

>
> I don't quite understand what 'resolvesFirstPage()' refers to. Does  
> it mean 'firstPageResolved()'? Should it be negated here:

No. "resolvesFirstPage" really a shorthand for "is it possible to  
reliably resolve objects on the first page using the existing  
algorithm".

>> In some circumstances this is true, in some (like yours) it is  
>> clearly not. I am leaning towards making a second strategy the  
>> default, as paginated queries are really intended for the huge  
>> result sets... Anybody has other thoughts on that?
>
> Without a LIMIT on the records which are fetched as a 'fat query'  
> there is little point in paging at all I think.

Not true. Compare the timing of the paginated vs. normal query on  
your data set - you should see a big difference. (Also do not use the  
time printed in Cayenne SQL logs, as this is for the SQL part only;  
do the timing yourself, around "performQuery" method). In fact our  
original performance testing back when the paginated query was  
implemented showed very little difference between "select *" and  
"select id", so this bottleneck may actually be DB-specific.

Again, I am for updating the default algorithm, but I still think  
this is an edge case, and we need to do some JDBC performance testing  
to prove that such change can fix it.

> We timed this query against the db and even directly without  
> Cayenne it takes 16 seconds to return all the data. I presume the  
> next 24 seconds Cayenne spends parsing the results into DataRows or  
> Objects. Instead, if we executed:
>
> SELECT t0.id FROM Student t0 WHERE (t0.isDeleted IS NULL) OR  
> (t0.isDeleted = 0) ORDER BY t0.firstName
>
> Then it returns in about 2 seconds

Just to confirm, is this a JDBC test or a test using DB client tools  
(on Derby this may be the same thing, still worth doing a JDBC run  
with your own code, as there can be unrelated bottlenecks in the  
tools).  E.g. create two PreparedStatements, one with "select *",  
another with "select id", and only read "id" column in both  
resultsets. See how much faster the second case is.

Andrus

Re: Paging and SQL queries

Posted by Aristedes Maniatis <ar...@maniatis.org>.

On 25/05/2007, at 7:45 PM, Andrus Adamchik wrote:

> The default strategy, as implemented by  
> org.apache.cayenne.access.IncrementalFaultList, is to run a full  
> query to fully resolve page #1, but only read id columns from the  
> result set for pages 2, 3, etc... This strategy bets that a single  
> fat SQL query with full ResultSet fetching, but without reading  
> unneeded columns is faster than a PK-only query followed by a  
> second query to resolve page #1.

I don't quite understand what 'resolvesFirstPage()' refers to. Does  
it mean 'firstPageResolved()'? Should it be negated here:

                    // resolve first page if we can
                     if (resolvesFirstPage()) {
                         // read first page completely, the rest as  
ObjectIds
                         for (int i = 0; i < pageSize && it.hasNextRow 
(); i++) {
                             elements.add(it.nextDataRow());
                             lastResolved++;
                         }

                         // defer DataRows -> Objects conversion till  
we are completely done.
                     }

                     // continue reading ids
                     DbEntity entity = rootEntity.getDbEntity();
                     while (it.hasNextRow()) {
                         elements.add(it.nextObjectId(entity));
                     }



>
> In some circumstances this is true, in some (like yours) it is  
> clearly not. I am leaning towards making a second strategy the  
> default, as paginated queries are really intended for the huge  
> result sets... Anybody has other thoughts on that?

Without a LIMIT on the records which are fetched as a 'fat query'  
there is little point in paging at all I think. At any rate it is  
certainly a problem in our workload of only 60,000 records.

Ari



-------------------------->
Aristedes Maniatis
phone +61 2 9660 9700
PGP fingerprint 08 57 20 4B 80 69 59 E2  A9 BF 2D 48 C2 20 0C C8

Re: Paging and SQL queries

Posted by Michael Gentry <bl...@gmail.com>.

I had never used the paginated queries, but I would've bet money that it
fetched only the PKs initially and then fully resolved objects on a per-page
basis as needed.  On large datasets, it is a big win (in time and memory) to
do it that way.  On smaller datasets, it would be about a wash either way --
not significant.  Seems to me that would be the logical default setting and
maybe an option to do it the other way if the developer knows his dataset
would fetch faster that way.

/dev/mrg

On 5/25/07, Andrus Adamchik <an...@objectstyle.org> wrote:
>
>
> The default strategy, as implemented by
> org.apache.cayenne.access.IncrementalFaultList, is to run a full
> query to fully resolve page #1, but only read id columns from the
> result set for pages 2, 3, etc... This strategy bets that a single
> fat SQL query with full ResultSet fetching, but without reading
> unneeded columns is faster than a PK-only query followed by a second
> query to resolve page #1.
>
> In some circumstances this is true, in some (like yours) it is
> clearly not. I am leaning towards making a second strategy the
> default, as paginated queries are really intended for the huge result
> sets... Anybody has other thoughts on that?
>

Re: Paging and SQL queries

Posted by Andrus Adamchik <an...@objectstyle.org>.

Hi Ari,

On May 25, 2007, at 8:11 AM, Aristedes Maniatis wrote:
> Q1: So, my first question is why is Cayenne doing this? If paging  
> is switched on, shouldn't Cayenne be fetching only the primary keys  
> and then faulting in the particular records it needs in full?

The default strategy, as implemented by  
org.apache.cayenne.access.IncrementalFaultList, is to run a full  
query to fully resolve page #1, but only read id columns from the  
result set for pages 2, 3, etc... This strategy bets that a single  
fat SQL query with full ResultSet fetching, but without reading  
unneeded columns is faster than a PK-only query followed by a second  
query to resolve page #1.

In some circumstances this is true, in some (like yours) it is  
clearly not. I am leaning towards making a second strategy the  
default, as paginated queries are really intended for the huge result  
sets... Anybody has other thoughts on that?

> Q2: Would it more more efficient from a db perspective to use the  
> "t0.id IS IN (?, ?, ?...)" style syntax here instead or does it  
> make no difference? I don't know enough about SQL optimisation to  
> know either way, but at least the SQL is shorter (which might be  
> good!).

Can't comment on the performance - this has to be checked against a  
specific DB, but the OR syntax is more generic, as it handles  
compound primary keys, while IN would only work with a single column PK.

Andrus