You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cayenne.apache.org by Aristedes Maniatis <ar...@ish.com.au> on 2007/05/25 07:11:31 UTC
Paging and SQL queries
We have recently been testing our 3 tier Cayenne app with a larger
number of records and finding some performance problems with scale.
In particular we have a test database with 60,000 student records
(the production db will have 120,000). We are using paging (of 25
records) to ensure that when a user brings up the list of students on
the client we don't load them all across the network. However we are
still seeing load times of around 40 seconds. At the moment we are
using objects, not DataRows on the client.
One part of the problem appears to occur on the server. The SQL
Cayenne generates looks like this:
SELECT * FROM Student t0 WHERE (t0.isDeleted IS NULL) OR
(t0.isDeleted = 0) ORDER BY t0.firstName
although the * is actually a list of every field in the table. We
timed this query against the db and even directly without Cayenne it
takes 16 seconds to return all the data. I presume the next 24
seconds Cayenne spends parsing the results into DataRows or Objects.
Instead, if we executed:
SELECT t0.id FROM Student t0 WHERE (t0.isDeleted IS NULL) OR
(t0.isDeleted = 0) ORDER BY t0.firstName
Then it returns in about 2 seconds (when performing this query
directly against the db and not through Cayenne).
Q1: So, my first question is why is Cayenne doing this? If paging is
switched on, shouldn't Cayenne be fetching only the primary keys and
then faulting in the particular records it needs in full?
We can see paging working with the query which Cayenne executes next:
SELECT * t0 WHERE ((t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id
= ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id
= ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR
(t0.id = ?) OR (t0.id = ?) OR (t0.id = ?) OR (t0.id = ?)) AND
((t0.isDeleted IS NULL) OR (t0.isDeleted = ?)) [bind: 18569, 18521,
18253, 18060, 17932, 16451, 16395, 15967, 15819, 14773, 14492, 14317,
14280, 13088, 12464, 11183, 10957, 10378, 10219, 9686, 9554, 9435,
9155, 8611, 8038, 0]
25 records are paged into Cayenne as expected. But Cayenne has
already fetched into memory (server-side) all the records in the
original query.
Q2: Would it more more efficient from a db perspective to use the
"t0.id IS IN (?, ?, ?...)" style syntax here instead or does it make
no difference? I don't know enough about SQL optimisation to know
either way, but at least the SQL is shorter (which might be good!).
Cheers
Ari Maniatis
-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001 fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C 5EFA EF6A 7D2E 3E49 102A
Re: Paging and SQL queries
Posted by Andrus Adamchik <an...@objectstyle.org>.
On May 28, 2007, at 9:49 AM, Andrus Adamchik wrote:
>
>>
>> I don't quite understand what 'resolvesFirstPage()' refers to.
>> Does it mean 'firstPageResolved()'? Should it be negated here:
>
> No. "resolvesFirstPage" really a shorthand for "is it possible to
> reliably resolve objects on the first page using the existing
> algorithm".
I just clarified that with a JavaDoc comment. Sorry for mostly
skipping JavaDocs on the private methods...
/**
* Returns true if it is possible to read the first page of
inflated objects from the
* ResultSet returned from the main query. False is returned for
queries with
* prefetches as resolving prefetches is not possible in this
situation.
*/
private boolean resolvesFirstPage() {
return internalQuery.getPrefetchTree() == null;
}
Andrus
Re: Paging and SQL queries
Posted by Andrus Adamchik <an...@objectstyle.org>.
On May 28, 2007, at 1:50 PM, Aristedes Maniatis wrote:
>
> On 28/05/2007, at 4:49 PM, Andrus Adamchik wrote:
>
>> No. "resolvesFirstPage" really a shorthand for "is it possible to
>> reliably resolve objects on the first page using the existing
>> algorithm".
>
> OK. I guess I would have named it isFirstPageResolvable() or
> canResolveFirstPage(). But from your description it appears that it
> isn't specific to the first page and that either it is resolvable
> for any page, or not. So does it mean canResolveData()?
Actually no. This is specific to the first page that most of the time
can be resolved from the "select *" query result set, while the
second and following pages are resolved using separate queries that
select a group of id's.
>> Just to confirm, is this a JDBC test or a test using DB client
>> tools (on Derby this may be the same thing, still worth doing a
>> JDBC run with your own code, as there can be unrelated bottlenecks
>> in the tools). E.g. create two PreparedStatements, one with
>> "select *", another with "select id", and only read "id" column in
>> both resultsets. See how much faster the second case is.
>
> OK. I just tried it a different way and the results are the same.
> In this case I used the Derby ij tool to make a JDBC connection to
> a Derby server on localhost. That server has real data from one of
> our clients with 120,000 students.
>
> select * from student; -> 53 seconds
> select id from student; -> 4 seconds
>
> I performed several runs to try and exclude cache warming and the
> variance was less than 10%. The output was redirected to a file on
> disk. There are around 30 columns in the student table with a range
> of data types. The output file on disk with all the fields was
> 120Mb when finished.
Could you also exclude "ij" from the picture? It can add the overhead
of the terminal I/O, and the overhead of actually reading those extra
29 columns - something Cayenne paginated query wouldn't do. What I
think of as a *clean test* in this situation is a small Java program
that opens a Connection, creates a PreparedStatement, and reads a
ResultSet, discarding the data. In case of "select *" it would still
only read the "id" column.
Thanks
Andrus
Re: Paging and SQL queries
Posted by Aristedes Maniatis <ar...@maniatis.org>.
On 28/05/2007, at 4:49 PM, Andrus Adamchik wrote:
> No. "resolvesFirstPage" really a shorthand for "is it possible to
> reliably resolve objects on the first page using the existing
> algorithm".
OK. I guess I would have named it isFirstPageResolvable() or
canResolveFirstPage(). But from your description it appears that it
isn't specific to the first page and that either it is resolvable for
any page, or not. So does it mean canResolveData()? But I'm still
working through the code to understand when prefetches are relevant.
> Just to confirm, is this a JDBC test or a test using DB client
> tools (on Derby this may be the same thing, still worth doing a
> JDBC run with your own code, as there can be unrelated bottlenecks
> in the tools). E.g. create two PreparedStatements, one with
> "select *", another with "select id", and only read "id" column in
> both resultsets. See how much faster the second case is.
OK. I just tried it a different way and the results are the same. In
this case I used the Derby ij tool to make a JDBC connection to a
Derby server on localhost. That server has real data from one of our
clients with 120,000 students.
select * from student; -> 53 seconds
select id from student; -> 4 seconds
I performed several runs to try and exclude cache warming and the
variance was less than 10%. The output was redirected to a file on
disk. There are around 30 columns in the student table with a range
of data types. The output file on disk with all the fields was 120Mb
when finished.
So even when we exclude any overhead of Cayenne, the database itself
takes that much longer to simply read all that data off the disk and
return it. To be honest, I'm a bit disappointed by even that 4
seconds on a 2Gb G5 iMac since that will result in some serious UI
lag by the time that makes it through Cayenne, to the client over a
network (Jetty and Hessian) and then a jTable is drawn.
Ari
-------------------------->
Aristedes Maniatis
phone +61 2 9660 9700
PGP fingerprint 08 57 20 4B 80 69 59 E2 A9 BF 2D 48 C2 20 0C C8
Re: Paging and SQL queries
Posted by Andrus Adamchik <an...@objectstyle.org>.
On May 27, 2007, at 2:09 PM, Aristedes Maniatis wrote:
>
> I don't quite understand what 'resolvesFirstPage()' refers to. Does
> it mean 'firstPageResolved()'? Should it be negated here:
No. "resolvesFirstPage" really a shorthand for "is it possible to
reliably resolve objects on the first page using the existing
algorithm".
>> In some circumstances this is true, in some (like yours) it is
>> clearly not. I am leaning towards making a second strategy the
>> default, as paginated queries are really intended for the huge
>> result sets... Anybody has other thoughts on that?
>
> Without a LIMIT on the records which are fetched as a 'fat query'
> there is little point in paging at all I think.
Not true. Compare the timing of the paginated vs. normal query on
your data set - you should see a big difference. (Also do not use the
time printed in Cayenne SQL logs, as this is for the SQL part only;
do the timing yourself, around "performQuery" method). In fact our
original performance testing back when the paginated query was
implemented showed very little difference between "select *" and
"select id", so this bottleneck may actually be DB-specific.
Again, I am for updating the default algorithm, but I still think
this is an edge case, and we need to do some JDBC performance testing
to prove that such change can fix it.
> We timed this query against the db and even directly without
> Cayenne it takes 16 seconds to return all the data. I presume the
> next 24 seconds Cayenne spends parsing the results into DataRows or
> Objects. Instead, if we executed:
>
> SELECT t0.id FROM Student t0 WHERE (t0.isDeleted IS NULL) OR
> (t0.isDeleted = 0) ORDER BY t0.firstName
>
> Then it returns in about 2 seconds
Just to confirm, is this a JDBC test or a test using DB client tools
(on Derby this may be the same thing, still worth doing a JDBC run
with your own code, as there can be unrelated bottlenecks in the
tools). E.g. create two PreparedStatements, one with "select *",
another with "select id", and only read "id" column in both
resultsets. See how much faster the second case is.
Andrus
Re: Paging and SQL queries
Posted by Aristedes Maniatis <ar...@maniatis.org>.
On 25/05/2007, at 7:45 PM, Andrus Adamchik wrote:
> The default strategy, as implemented by
> org.apache.cayenne.access.IncrementalFaultList, is to run a full
> query to fully resolve page #1, but only read id columns from the
> result set for pages 2, 3, etc... This strategy bets that a single
> fat SQL query with full ResultSet fetching, but without reading
> unneeded columns is faster than a PK-only query followed by a
> second query to resolve page #1.
I don't quite understand what 'resolvesFirstPage()' refers to. Does
it mean 'firstPageResolved()'? Should it be negated here:
// resolve first page if we can
if (resolvesFirstPage()) {
// read first page completely, the rest as
ObjectIds
for (int i = 0; i < pageSize && it.hasNextRow
(); i++) {
elements.add(it.nextDataRow());
lastResolved++;
}
// defer DataRows -> Objects conversion till
we are completely done.
}
// continue reading ids
DbEntity entity = rootEntity.getDbEntity();
while (it.hasNextRow()) {
elements.add(it.nextObjectId(entity));
}
>
> In some circumstances this is true, in some (like yours) it is
> clearly not. I am leaning towards making a second strategy the
> default, as paginated queries are really intended for the huge
> result sets... Anybody has other thoughts on that?
Without a LIMIT on the records which are fetched as a 'fat query'
there is little point in paging at all I think. At any rate it is
certainly a problem in our workload of only 60,000 records.
Ari
-------------------------->
Aristedes Maniatis
phone +61 2 9660 9700
PGP fingerprint 08 57 20 4B 80 69 59 E2 A9 BF 2D 48 C2 20 0C C8
Re: Paging and SQL queries
Posted by Michael Gentry <bl...@gmail.com>.
I had never used the paginated queries, but I would've bet money that it
fetched only the PKs initially and then fully resolved objects on a per-page
basis as needed. On large datasets, it is a big win (in time and memory) to
do it that way. On smaller datasets, it would be about a wash either way --
not significant. Seems to me that would be the logical default setting and
maybe an option to do it the other way if the developer knows his dataset
would fetch faster that way.
/dev/mrg
On 5/25/07, Andrus Adamchik <an...@objectstyle.org> wrote:
>
>
> The default strategy, as implemented by
> org.apache.cayenne.access.IncrementalFaultList, is to run a full
> query to fully resolve page #1, but only read id columns from the
> result set for pages 2, 3, etc... This strategy bets that a single
> fat SQL query with full ResultSet fetching, but without reading
> unneeded columns is faster than a PK-only query followed by a second
> query to resolve page #1.
>
> In some circumstances this is true, in some (like yours) it is
> clearly not. I am leaning towards making a second strategy the
> default, as paginated queries are really intended for the huge result
> sets... Anybody has other thoughts on that?
>
Re: Paging and SQL queries
Posted by Andrus Adamchik <an...@objectstyle.org>.
Hi Ari,
On May 25, 2007, at 8:11 AM, Aristedes Maniatis wrote:
> Q1: So, my first question is why is Cayenne doing this? If paging
> is switched on, shouldn't Cayenne be fetching only the primary keys
> and then faulting in the particular records it needs in full?
The default strategy, as implemented by
org.apache.cayenne.access.IncrementalFaultList, is to run a full
query to fully resolve page #1, but only read id columns from the
result set for pages 2, 3, etc... This strategy bets that a single
fat SQL query with full ResultSet fetching, but without reading
unneeded columns is faster than a PK-only query followed by a second
query to resolve page #1.
In some circumstances this is true, in some (like yours) it is
clearly not. I am leaning towards making a second strategy the
default, as paginated queries are really intended for the huge result
sets... Anybody has other thoughts on that?
> Q2: Would it more more efficient from a db perspective to use the
> "t0.id IS IN (?, ?, ?...)" style syntax here instead or does it
> make no difference? I don't know enough about SQL optimisation to
> know either way, but at least the SQL is shorter (which might be
> good!).
Can't comment on the performance - this has to be checked against a
specific DB, but the OR syntax is more generic, as it handles
compound primary keys, while IN would only work with a single column PK.
Andrus