You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Scott Ogden <sc...@mitsonline.com> on 2005/09/17 01:42:14 UTC

derby performance and 'order by'

I have observed some interesting query performance behavior and am
hoping someone here can explain.  

 

In my scenario, it appears that an existing index is not being used for
the 'order by' part of the operation and as a result the performance of
certain queries is suffering.  Can someone explain if this is supposed
to be what is happening and why?  Please see below for the specific
queries and their performance characteristics.  

 

 

 

Here are the particulars:

---------------------------------

 

 

create table orders(

order_id varchar(50) NOT NULL

CONSTRAINT ORDERS_PK PRIMARY KEY,

amount numeric(31,2), 

time date,

inv_num varchar(50),

line_num varchar(50),

phone varchar(50),

prod_num varchar(50));

 

 

--Load a large amount of data (720,000 records) into the 'orders' table

 

 

--Create an index on the time column as that will be used in the 'where'
clause. 

 

create index IX_ORDERS_TIME on orders(time);

 

 

--When I run a query against this table returning top 1,000 records,
this query returns very quickly, consistently less than .010 seconds.

 

 

select * from orders 

where time > '10/01/2002' and time < '11/30/2002'

order by time;

 

 

--Now run a similarly query against same table, returning the top 1,000
records. 

--The difference is that the results are now sorted by the primary key
('order_id') rather than 'time'.  

--This query returns slowly, approximately 15 seconds.  Why??

 

 

select * from orders 

where time > '10/01/2002' and time < '11/30/2002'

order by order_id;

 

 

--Now run a third query against the same 'orders' table, removing the
where clause

--This query returns quickly, around .010 seconds.  

 

select * from orders 

order by order_id;

 

---------------------------------------------

Re: derby performance and 'order by'

Posted by Oy...@Sun.COM.

Rick Hillegas wrote:
> Thanks for the pointer to this presentation, Oyvind. It's a pretty 
> startling observation though I'm not sure how to use it. I'd be 
> interested in hearing your thoughts about this some time.

The question raised in this presentation is whether the optimizers in 
the databases examined (Oracle, DB2 and SQL Server) are actually too 
clever, evaluating (in some cases) 100+ query plans for a single query 
when simply changing the selectivity on two relations. The assumption is 
that they could evaluate far less plans and still perform well.

Now, in Derby, I think we're still on the opposite side - we should 
probably investigate more optimization techniques and execution plans, 
as this thread has shown.

-- 
Oyvind Bakksjo
Sun Microsystems, Database Technology Group
Trondheim, Norway
http://weblogs.java.net/blog/bakksjo/

Re: derby performance and 'order by'

Posted by Rick Hillegas <Ri...@Sun.COM>.

Thanks for the pointer to this presentation, Oyvind. It's a pretty 
startling observation though I'm not sure how to use it. I'd be 
interested in hearing your thoughts about this some time.

Cheers,
-Rick

>
> That reminds me of a very entertaining presentation which was held at 
> VLDB this year:
>
> Slides: http://www.vldb2005.org/program/slides/fri/s1228-reddy.ppt
> Paper: http://www.vldb2005.org/program/paper/fri/p1228-reddy.pdf
>
> Have a look, it's worth the time.
>
> We should definitely consider more execution plans in Derby, so that 
> we, too, could draw such interesting pictures. ;o)
>

Re: derby performance and 'order by'

Posted by Oy...@Sun.COM.

Rick Hillegas wrote:
> Hi Oyvind,
> 
> I agree that this is inelegant. As you note, this approach step by step 
> forces a plan which the current Derby optimizer is capable of 
> considering--with or without the covering index. Regardless of whether 
> we teach the optimizer some better tricks, I think it's worth beefing up 
> our support for in-memory temporary tables:
> 
> o There are always deceptively simple queries which the optimizer 
> misjudges. It's good to give the customer tools for getting unstuck 
> while they wait for the bugfix release.
> 
> o Often the customer knows facts about the data which the optimizer 
> can't plausibly learn.

Yes, I agree with you.

> o The current Derby optimizer is capable of considering only a very 
> limited subset of the useful plans.

That reminds me of a very entertaining presentation which was held at 
VLDB this year:

Slides: http://www.vldb2005.org/program/slides/fri/s1228-reddy.ppt
Paper: http://www.vldb2005.org/program/paper/fri/p1228-reddy.pdf

Have a look, it's worth the time.

We should definitely consider more execution plans in Derby, so that we, 
too, could draw such interesting pictures. ;o)

-- 
Oyvind Bakksjo
Sun Microsystems, Database Technology Group
Trondheim, Norway
http://weblogs.java.net/blog/bakksjo/

Re: derby performance and 'order by'

Posted by Rick Hillegas <Ri...@Sun.COM>.

Hi Oyvind,

I agree that this is inelegant. As you note, this approach step by step 
forces a plan which the current Derby optimizer is capable of 
considering--with or without the covering index. Regardless of whether 
we teach the optimizer some better tricks, I think it's worth beefing up 
our support for in-memory temporary tables:

o There are always deceptively simple queries which the optimizer 
misjudges. It's good to give the customer tools for getting unstuck 
while they wait for the bugfix release.

o Often the customer knows facts about the data which the optimizer 
can't plausibly learn.

o The current Derby optimizer is capable of considering only a very 
limited subset of the useful plans.

Cheers,
-Rick

Oyvind.Bakksjo@Sun.COM wrote:

> Rick Hillegas wrote:
>
>> It might help to add another column to the index so that it covers 
>> both the restriction and the ordering information. And if we could 
>> add a primary key to a temporary table, then something like the 
>> following might take us in the right direction:
>>
>> create index time_index on orders( time, orderID );
>>
>> declare global temporary table session.ztemp
>> ( orderID varchar( 50 ) primary key )
>> not logged;
>>
>> -- all the information we need is in the index so there's no need
>> -- to probe the base table
>> insert into session.ztemp
>>  select orderID
>>  from orders
>>  where time between '10/01/2002' and '11/30/2002'
>> ;
>>
>> -- hopefully the temporary table didn't have to spill to disk.
>> -- no sort should be needed for this query and you can just
>> -- stop siphoning out rows after you get the first 1000.
>> select l.*
>> from orders l, session.ztemp r
>> where l.orderID = r.orderID
>> order by orderID;
>
>
> If I understand this correctly, you're here relying on the fact that 
> the primary key constraint on the temporary table creates an 
> underlying index, so the records inserted can be read out in sorted 
> order.
>
> This is also a form of sorting. Shouldn't the engine be able to use 
> similar techniques in the execution of the original query, without 
> relying on the user splitting up the query, creating temporary tables 
> etc.?
>

Re: derby performance and 'order by'

Posted by Oy...@Sun.COM.

Rick Hillegas wrote:
> It might help to add another column to the index so that it covers both 
> the restriction and the ordering information. And if we could add a 
> primary key to a temporary table, then something like the following 
> might take us in the right direction:
> 
> create index time_index on orders( time, orderID );
> 
> declare global temporary table session.ztemp
> ( orderID varchar( 50 ) primary key )
> not logged;
> 
> -- all the information we need is in the index so there's no need
> -- to probe the base table
> insert into session.ztemp
>  select orderID
>  from orders
>  where time between '10/01/2002' and '11/30/2002'
> ;
> 
> -- hopefully the temporary table didn't have to spill to disk.
> -- no sort should be needed for this query and you can just
> -- stop siphoning out rows after you get the first 1000.
> select l.*
> from orders l, session.ztemp r
> where l.orderID = r.orderID
> order by orderID;

If I understand this correctly, you're here relying on the fact that the 
primary key constraint on the temporary table creates an underlying 
index, so the records inserted can be read out in sorted order.

This is also a form of sorting. Shouldn't the engine be able to use 
similar techniques in the execution of the original query, without 
relying on the user splitting up the query, creating temporary tables etc.?

-- 
Oyvind Bakksjo
Sun Microsystems, Database Technology Group
Haakon VII gt. 7b, N-7485 Trondheim, Norway
Tel: x43419 / +47 73842119, Fax: +47 73842101

Re: derby performance and 'order by'

Posted by Rick Hillegas <Ri...@Sun.COM>.

It might help to add another column to the index so that it covers both 
the restriction and the ordering information. And if we could add a 
primary key to a temporary table, then something like the following 
might take us in the right direction:

create index time_index on orders( time, orderID );

declare global temporary table session.ztemp
( orderID varchar( 50 ) primary key )
not logged;

-- all the information we need is in the index so there's no need
-- to probe the base table
insert into session.ztemp
  select orderID
  from orders
  where time between '10/01/2002' and '11/30/2002'
;

-- hopefully the temporary table didn't have to spill to disk.
-- no sort should be needed for this query and you can just
-- stop siphoning out rows after you get the first 1000.
select l.*
from orders l, session.ztemp r
where l.orderID = r.orderID
order by orderID;

The lighter weight our temporary tables are, the better these kinds of 
solutions will perform. Making temporary tables lighter weight could 
boost the performance of large families of problem queries.

Re: derby performance and 'order by'

Posted by Suavi Ali Demir <de...@yahoo.com>.

On the network server, does Derby have facilities to communicate current "max rows" choice to the backend at every execute, or is it a client side setting only?
Regards,
Suavi


Daniel John Debrunner <dj...@debrunners.com> wrote:
I agree with Øystein, given that the standard JDBC api for the maximum
rows is Statement.setMaxRows then Derby should be able to take advantage
of that regardless of when it is set.

This doesn't imply that a plan gets reprepared, though that could be a
solution, it could be implemented as a plan that is flexible enough to
handle an optional max rows.

Dan.


Suavi Ali Demir wrote:
> Yes, good idea, but if execution plan is going to be different it beats
> the purpose of being "prepared". When it's prepared it gets compiled
> into bytecode, and different bytecode could be generated for different
> cases. Some databases do have the sql syntax to "fetch first n rows
> only" or "limit n" for example (i don't know if they can be parametric).
> Since they have incorporated this functionality, perhaps they have
> thought these details. I would even imagine that facilities which make
> this kind of optimization may be available or feasible to use at prepare
> time but not execute time. 
> Ali
> 
> 
> */Øystein Grøvlen /* wrote:
> 
> >>>>> "SAD" == Suavi Ali Demir writes:
> 
> SAD> Another little detail about optimization is that
> SAD> Statement.setMaxRows() kind of functions on the JDBC side may
> SAD> not be sufficient since it is called after SQL statement is
> SAD> prepared and returned as an object (after query plan is
> SAD> built). Therefore, it may be necessary to have language
> SAD> syntax to indicate the intention to fetch first 1000 rows
> SAD> only, so that when the query is prepared, this intention can
> SAD> be taken into account.
> 
> It would be much better if this could be changed at execute-time for
> an already prepared statement. That is, the same prepared statement
> could be used regardless of how many rows one is going to fetch.
> 
> -- 
> Øystein
>

Re: derby performance and 'order by'

Posted by Daniel John Debrunner <dj...@debrunners.com>.

I agree with Øystein, given that the standard JDBC api for the maximum
rows is Statement.setMaxRows then Derby should be able to take advantage
of that regardless of when it is set.

This doesn't imply that a plan gets reprepared, though that could be a
solution, it could be implemented as a plan that is flexible enough to
handle an optional max rows.

Dan.


Suavi Ali Demir wrote:
> Yes, good idea, but if execution plan is going to be different it beats
> the purpose of being "prepared". When it's prepared it gets compiled
> into bytecode, and different bytecode could be generated for different
> cases. Some databases do have the sql syntax to "fetch first n rows
> only" or "limit n" for example (i don't know if they can be parametric).
> Since they have incorporated this functionality, perhaps they have
> thought these details. I would even imagine that facilities which make
> this kind of optimization may be available or feasible to use at prepare
> time but not execute time. 
> Ali
> 
> 
> */Øystein Grøvlen <Oy...@Sun.COM>/* wrote:
> 
>     >>>>> "SAD" == Suavi Ali Demir writes:
> 
>     SAD> Another little detail about optimization is that
>     SAD> Statement.setMaxRows() kind of functions on the JDBC side may
>     SAD> not be sufficient since it is called after SQL statement is
>     SAD> prepared and returned as an object (after query plan is
>     SAD> built). Therefore, it may be necessary to have language
>     SAD> syntax to indicate the intention to fetch first 1000 rows
>     SAD> only, so that when the query is prepared, this intention can
>     SAD> be taken into account.
> 
>     It would be much better if this could be changed at execute-time for
>     an already prepared statement. That is, the same prepared statement
>     could be used regardless of how many rows one is going to fetch.
> 
>     -- 
>     Øystein
>

Re: derby performance and 'order by'

Posted by Suavi Ali Demir <de...@yahoo.com>.

Yes, good idea, but if execution plan is going to be different it beats the purpose of being "prepared". When it's prepared it gets compiled into bytecode, and different bytecode could be generated for different cases. Some databases do have the sql syntax to "fetch first n rows only" or "limit n" for example (i don't know if they can be parametric). Since they have incorporated this functionality, perhaps they have thought these details. I would even imagine that facilities which make this kind of optimization may be available or feasible to use at prepare time but not execute time. 
Ali


Øystein Grøvlen <Oy...@Sun.COM> wrote:>>>>> "SAD" == Suavi Ali Demir writes:

SAD> Another little detail about optimization is that
SAD> Statement.setMaxRows() kind of functions on the JDBC side may
SAD> not be sufficient since it is called after SQL statement is
SAD> prepared and returned as an object (after query plan is
SAD> built). Therefore, it may be necessary to have language
SAD> syntax to indicate the intention to fetch first 1000 rows
SAD> only, so that when the query is prepared, this intention can
SAD> be taken into account.

It would be much better if this could be changed at execute-time for
an already prepared statement. That is, the same prepared statement
could be used regardless of how many rows one is going to fetch.

-- 
Øystein

Re: derby performance and 'order by'

Posted by Øystein Grøvlen <Oy...@Sun.COM>.

>>>>> "SAD" == Suavi Ali Demir <de...@yahoo.com> writes:

    SAD> Another little detail about optimization is that
    SAD> Statement.setMaxRows() kind of functions on the JDBC side may
    SAD> not be sufficient since it is called after SQL statement is
    SAD> prepared and returned as an object (after query plan is
    SAD> built). Therefore, it may be necessary to have language
    SAD> syntax to indicate the intention to fetch first 1000 rows
    SAD> only, so that when the query is prepared, this intention can
    SAD> be taken into account.

It would be much better if this could be changed at execute-time for
an already prepared statement.  That is, the same prepared statement
could be used regardless of how many rows one is going to fetch.

-- 
Øystein

Re: derby performance and 'order by'

Posted by Craig Russell <Cr...@Sun.COM>.

Hi Ali,

I've filed a Minor Feature Request in JIRA: DERBY-581.

I agree that the information should be expressed in SQL so that the  
query optimized and execution strategy can know what the user needs  
in terms of cardinality.

I'd also like to ask that when we consider extending the SQL in this  
manner we consider skipping the first N rows and returning the next M  
rows.

Craig

On Sep 20, 2005, at 10:19 AM, Suavi Ali Demir wrote:

> Another little detail about optimization is that  
> Statement.setMaxRows() kind of functions on the JDBC side may not  
> be sufficient since it is called after SQL statement is prepared  
> and returned as an object (after query plan is built). Therefore,  
> it may be necessary to have language syntax to indicate the  
> intention to fetch first 1000 rows only, so that when the query is  
> prepared, this intention can be taken into account.
> Regards,
> Ali
>
>
> Mike Matrigali <mi...@sbcglobal.net> wrote:
> As craig points out it is important in performance testing to say
> exactly what you are measuring. In general Derby will try to
> stream rows to the user before it has finished looking at all rows.
> So often looking at the first row will and stopping will mean that
> many rows have not been processed. BUT when an order by is involved
> and the query plan either has no appropriate matching index, or  
> decides
> to use a different index then all the rows are processed, then they  
> are
> sent to the sorter and finally after all rows are processed they are
> streamed to the client.
>
> So as you have seen reading the first 1000 rows of a much larger data
> set can happen very quickly.
>
> As subsequent mail threads have pointed out, returning the top 1000
> sorted rows is an interesting problem which could be costed and  
> executed
> differently if that information was pushed into the optimizer and the
> sorter (and medium level projects were done in those areas).
>
>
> scotto wrote:
> > The test was set up and run using the SQuirreL client, not ij.  
> All 3 of
> > the queries return the top 1000 rows and the times I reported are to
> > return these top 1000 rows, not just the first row.
> >
> >
> >
> >  
> ---------------------------------------------------------------------- 
> --
> >
> > *From:* Craig Russell [mailto:Craig.Russell@Sun.COM]
> > *Sent:* Saturday, September 17, 2005 2:35 PM
> > *To:* Derby Discussion
> > *Subject:* Re: derby performance and 'order by'
> >
> >
> >
> > Hi Scott,
> >
> >
> >
> > How have you set up the test? Are you using ij and displaying all  
> of the
> > data or using jdbc to access the data?
> >
> >
> >
> > What do you do in 0.010 seconds? Do you read all of the rows into
> > memory, or just record the ti me until you get the first row? Are  
> you
> > measuring the time taken to return all the rows or just the first  
> row?
> >
> > Another reader has already commented on the fact that the second  
> query
> > is doing a lot more work than the first. The second query must  
> sort the
> > results after filtering the data, whereas the first and third  
> queries
> > can simply use the indexes and filter on the fly.
> >
> >
> >
> > I'm a little suspicious of the third query returning 720,000  
> results in
> > 0.010 seconds.
> >
> >
> >
> > Craig
> >
> >
> >
> > On Sep 16, 2005, at 4:42 PM, Scott Ogden wrote:
> >
> >
> >
> > I have observed some interesting query performance behavior and am
> > hoping someone here can explain.
> >
> >
> >
> > In my scenario, it appears that an existing index is not being  
> used for
> > the ‘order by’ part of the operation and as a result the perfo  
> rmance of
> > certain queries is suffering. Can someone explain if this is  
> supposed
> > to be what is happening and why? Please see below for the specific
> > queries and their performance characteristics.
> >
> >
> >
> >
> >
> >
> >
> > Here are the particulars:
> >
> > ---------------------------------
> >
> >
> >
> >
> >
> > create table orders(
> >
> > order_id varchar(50) NOT NULL
> >
> > CONSTRAINT ORDERS_PK PRIMARY KEY,
> >
> > amount numeric(31,2),
> >
> > time date,
> >
> > inv_num varchar(50),
> >
> > line_num varchar(50),
> >
> > phone varchar(50),
> >
> > prod_num varchar(50));
> >
> >
> >
> >
> >
> > --Load a large amount of data (720,000 records) into the ‘orders’  
> table
> >
> >
> >
> >
> >
> > --Create an index on the time column as that will be used i n the  
> ‘where’
> > clause.
> >
> >
> >
> > create index IX_ORDERS_TIME on orders(time);
> >
> >
> >
> >
> >
> > --When I run a query against this table returning top 1,000 records,
> > this query returns very quickly, consistently less than .010  
> seconds.
> >
> >>
> >>
> >>
> >>
> >> select * from orders
> >>
> >> where time > '10/01/2002' and time < '11/30/2002'
> >>
> >> order by time;
> >>
> >>
> >>
> >>
> >>
> >> --Now run a similarly query against same table, returning the top
> >> 1,000 records.
> >>
> >> --The difference is that the results are now sorted by the  
> primary key
> >> (‘order_id’) rather than ‘time’.
> >>
> >> --This query returns slowly, approximately 15 seconds. Why??
> >>
> >>
> >>
> >>
> >>
> >> select * from orders
> >>
> >> where time > '10/01/2002' and time < '11/30/2002'
> >>
> >> order by order_id;
> >>
> >>
> >>
> >>
> >>
> >> --Now run a third query against the same ‘orders’ table,  
> removing the
> >> where clause
> >>
> >> --This query returns quickly, around .010 seconds.
> >>
> >>
> >>
> >> select * from orders
> >>
> >> order by order_id;
> >>
> >>
> >>
> >> ---------------------------------------------
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
> >
> > Craig Russell
> >
> > Architect, Sun Java Enterprise System http://java.sun.com/ 
> products/jdo
> >
> > 408 276-5638 mailto:Craig.Russell@sun.com
> >
> > P.S. A good JDO? O, Gasp!
> >
> >
> >

Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:Craig.Russell@sun.com
P.S. A good JDO? O, Gasp!

Re: derby performance and 'order by'

Posted by Suavi Ali Demir <de...@yahoo.com>.

Another little detail about optimization is that Statement.setMaxRows() kind of functions on the JDBC side may not be sufficient since it is called after SQL statement is prepared and returned as an object (after query plan is built). Therefore, it may be necessary to have language syntax to indicate the intention to fetch first 1000 rows only, so that when the query is prepared, this intention can be taken into account.
Regards,
Ali


Mike Matrigali <mi...@sbcglobal.net> wrote:
As craig points out it is important in performance testing to say
exactly what you are measuring. In general Derby will try to
stream rows to the user before it has finished looking at all rows.
So often looking at the first row will and stopping will mean that
many rows have not been processed. BUT when an order by is involved
and the query plan either has no appropriate matching index, or decides
to use a different index then all the rows are processed, then they are
sent to the sorter and finally after all rows are processed they are
streamed to the client.

So as you have seen reading the first 1000 rows of a much larger data
set can happen very quickly.

As subsequent mail threads have pointed out, returning the top 1000
sorted rows is an interesting problem which could be costed and executed
differently if that information was pushed into the optimizer and the
sorter (and medium level projects were done in those areas).


scotto wrote:
> The test was set up and run using the SQuirreL client, not ij. All 3 of
> the queries return the top 1000 rows and the times I reported are to
> return these top 1000 rows, not just the first row.
> 
> 
> 
> ------------------------------------------------------------------------
> 
> *From:* Craig Russell [mailto:Craig.Russell@Sun.COM]
> *Sent:* Saturday, September 17, 2005 2:35 PM
> *To:* Derby Discussion
> *Subject:* Re: derby performance and 'order by'
> 
> 
> 
> Hi Scott,
> 
> 
> 
> How have you set up the test? Are you using ij and displaying all of the
> data or using jdbc to access the data?
> 
> 
> 
> What do you do in 0.010 seconds? Do you read all of the rows into
> memory, or just record the time until you get the first row? Are you
> measuring the time taken to return all the rows or just the first row?
> 
> Another reader has already commented on the fact that the second query
> is doing a lot more work than the first. The second query must sort the
> results after filtering the data, whereas the first and third queries
> can simply use the indexes and filter on the fly.
> 
> 
> 
> I'm a little suspicious of the third query returning 720,000 results in
> 0.010 seconds.
> 
> 
> 
> Craig
> 
> 
> 
> On Sep 16, 2005, at 4:42 PM, Scott Ogden wrote:
> 
> 
> 
> I have observed some interesting query performance behavior and am
> hoping someone here can explain. 
> 
> 
> 
> In my scenario, it appears that an existing index is not being used for
> the order by part of the operation and as a result the performance of
> certain queries is suffering. Can someone explain if this is supposed
> to be what is happening and why? Please see below for the specific
> queries and their performance characteristics. 
> 
> 
> 
> 
> 
> 
> 
> Here are the particulars:
> 
> ---------------------------------
> 
> 
> 
> 
> 
> create table orders(
> 
> order_id varchar(50) NOT NULL
> 
> CONSTRAINT ORDERS_PK PRIMARY KEY,
> 
> amount numeric(31,2),
> 
> time date,
> 
> inv_num varchar(50),
> 
> line_num varchar(50),
> 
> phone varchar(50),
> 
> prod_num varchar(50));
> 
> 
> 
> 
> 
> --Load a large amount of data (720,000 records) into the orders table
> 
> 
> 
> 
> 
> --Create an index on the time column as that will be used in the where
> clause.
> 
> 
> 
> create index IX_ORDERS_TIME on orders(time);
> 
> 
> 
> 
> 
> --When I run a query against this table returning top 1,000 records,
> this query returns very quickly, consistently less than .010 seconds.
> 
>> 
>>
>> 
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by time;
>>
>> 
>>
>> 
>>
>> --Now run a similarly query against same table, returning the top
>> 1,000 records.
>>
>> --The difference is that the results are now sorted by the primary key
>> (order_id) rather than time. 
>>
>> --This query returns slowly, approximately 15 seconds. Why??
>>
>> 
>>
>> 
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by order_id;
>>
>> 
>>
>> 
>>
>> --Now run a third query against the same orders table, removing the
>> where clause
>>
>> --This query returns quickly, around .010 seconds. 
>>
>> 
>>
>> select * from orders
>>
>> order by order_id;
>>
>> 
>>
>> ---------------------------------------------
>>
>> 
>>
>> 
>>
>> 
>>
>> 
>>
>>
>>
> 
> 
> Craig Russell
> 
> Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
> 
> 408 276-5638 mailto:Craig.Russell@sun.com
> 
> P.S. A good JDO? O, Gasp!
> 
> 
>

Re: derby performance and 'order by'

Posted by Mike Matrigali <mi...@sbcglobal.net>.

As craig points out it is important in performance testing to say
exactly what you are measuring.  In general Derby will try to
stream rows to the user before it has finished looking at all rows.
So often looking at the first row will and stopping will mean that
many rows have not been processed.  BUT when an order by is involved
and the query plan either has no appropriate matching index, or decides
to use a different index then all the rows are processed, then they are
sent to the sorter and finally after all rows are processed they are
streamed to the client.

So as you have seen reading the first 1000 rows of a much larger data
set can happen very quickly.

As subsequent mail threads have pointed out, returning the top 1000
sorted rows is an interesting problem which could be costed and executed
differently if that information was pushed into the optimizer and the
sorter (and medium level projects were done in those areas).


scotto wrote:
> The test was set up and run using the SQuirreL client, not ij.  All 3 of
> the queries return the top 1000 rows and the times I reported are to
> return these top 1000 rows, not just the first row.
> 
>  
> 
> ------------------------------------------------------------------------
> 
> *From:* Craig Russell [mailto:Craig.Russell@Sun.COM]
> *Sent:* Saturday, September 17, 2005 2:35 PM
> *To:* Derby Discussion
> *Subject:* Re: derby performance and 'order by'
> 
>  
> 
> Hi Scott,
> 
>  
> 
> How have you set up the test? Are you using ij and displaying all of the
> data or using jdbc to access the data?
> 
>  
> 
> What do you do in 0.010 seconds? Do you read all of the rows into
> memory, or just record the time until you get the first row? Are you
> measuring the time taken to return all the rows or just the first row?
> 
> Another reader has already commented on the fact that the second query
> is doing a lot more work than the first. The second query must sort the
> results after filtering the data, whereas the first and third queries
> can simply use the indexes and filter on the fly.
> 
>  
> 
> I'm a little suspicious of the third query returning 720,000 results in
> 0.010 seconds.
> 
>  
> 
> Craig
> 
>  
> 
> On Sep 16, 2005, at 4:42 PM, Scott Ogden wrote:
> 
> 
> 
> I have observed some interesting query performance behavior and am
> hoping someone here can explain. 
> 
>  
> 
> In my scenario, it appears that an existing index is not being used for
> the ‘order by’ part of the operation and as a result the performance of
> certain queries is suffering.  Can someone explain if this is supposed
> to be what is happening and why?  Please see below for the specific
> queries and their performance characteristics. 
> 
>  
> 
>  
> 
>  
> 
> Here are the particulars:
> 
> ---------------------------------
> 
>  
> 
>  
> 
> create table orders(
> 
> order_id varchar(50) NOT NULL
> 
> CONSTRAINT ORDERS_PK PRIMARY KEY,
> 
> amount numeric(31,2),
> 
> time date,
> 
> inv_num varchar(50),
> 
> line_num varchar(50),
> 
> phone varchar(50),
> 
> prod_num varchar(50));
> 
>  
> 
>  
> 
> --Load a large amount of data (720,000 records) into the ‘orders’ table
> 
>  
> 
>  
> 
> --Create an index on the time column as that will be used in the ‘where’
> clause.
> 
>  
> 
> create index IX_ORDERS_TIME on orders(time);
> 
>  
> 
>  
> 
> --When I run a query against this table returning top 1,000 records,
> this query returns very quickly, consistently less than .010 seconds.
> 
>>  
>>
>>  
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by time;
>>
>>  
>>
>>  
>>
>> --Now run a similarly query against same table, returning the top
>> 1,000 records.
>>
>> --The difference is that the results are now sorted by the primary key
>> (‘order_id’) rather than ‘time’. 
>>
>> --This query returns slowly, approximately 15 seconds.  Why??
>>
>>  
>>
>>  
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by order_id;
>>
>>  
>>
>>  
>>
>> --Now run a third query against the same ‘orders’ table, removing the
>> where clause
>>
>> --This query returns quickly, around .010 seconds. 
>>
>>  
>>
>> select * from orders
>>
>> order by order_id;
>>
>>  
>>
>> ---------------------------------------------
>>
>>  
>>
>>  
>>
>>  
>>
>>  
>>
>>
>>
>  
> 
> Craig Russell
> 
> Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
> 
> 408 276-5638 mailto:Craig.Russell@sun.com
> 
> P.S. A good JDO? O, Gasp!
> 
>  
>

RE: derby performance and 'order by'

Posted by scotto <sc...@mitsonline.com>.

The test was set up and run using the SQuirreL client, not ij.  All 3 of the
queries return the top 1000 rows and the times I reported are to return
these top 1000 rows, not just the first row. 

 

  _____  

From: Craig Russell [mailto:Craig.Russell@Sun.COM] 
Sent: Saturday, September 17, 2005 2:35 PM
To: Derby Discussion
Subject: Re: derby performance and 'order by'

 

Hi Scott,

 

How have you set up the test? Are you using ij and displaying all of the
data or using jdbc to access the data?

 

What do you do in 0.010 seconds? Do you read all of the rows into memory, or
just record the time until you get the first row? Are you measuring the time
taken to return all the rows or just the first row?



Another reader has already commented on the fact that the second query is
doing a lot more work than the first. The second query must sort the results
after filtering the data, whereas the first and third queries can simply use
the indexes and filter on the fly.

 

I'm a little suspicious of the third query returning 720,000 results in
0.010 seconds.

 

Craig

 

On Sep 16, 2005, at 4:42 PM, Scott Ogden wrote:





I have observed some interesting query performance behavior and am hoping
someone here can explain. 

 

In my scenario, it appears that an existing index is not being used for the
'order by' part of the operation and as a result the performance of certain
queries is suffering.  Can someone explain if this is supposed to be what is
happening and why?  Please see below for the specific queries and their
performance characteristics. 

 

 

 

Here are the particulars:

---------------------------------

 

 

create table orders(

order_id varchar(50) NOT NULL

CONSTRAINT ORDERS_PK PRIMARY KEY,

amount numeric(31,2),

time date,

inv_num varchar(50),

line_num varchar(50),

phone varchar(50),

prod_num varchar(50));

 

 

--Load a large amount of data (720,000 records) into the 'orders' table

 

 

--Create an index on the time column as that will be used in the 'where'
clause.

 

create index IX_ORDERS_TIME on orders(time);

 

 

--When I run a query against this table returning top 1,000 records, this
query returns very quickly, consistently less than .010 seconds.



 

 

select * from orders

where time > '10/01/2002' and time < '11/30/2002'

order by time;

 

 

--Now run a similarly query against same table, returning the top 1,000
records.

--The difference is that the results are now sorted by the primary key
('order_id') rather than 'time'. 

--This query returns slowly, approximately 15 seconds.  Why??

 

 

select * from orders

where time > '10/01/2002' and time < '11/30/2002'

order by order_id;

 

 

--Now run a third query against the same 'orders' table, removing the where
clause

--This query returns quickly, around .010 seconds. 

 

select * from orders

order by order_id;

 

---------------------------------------------

 

 

 

 





 

Craig Russell

Architect, Sun Java Enterprise System http://java.sun.com/products/jdo

408 276-5638 mailto:Craig.Russell@sun.com

P.S. A good JDO? O, Gasp!

Re: derby performance and 'order by'

Posted by Craig Russell <Cr...@Sun.COM>.

Hi Scott,

How have you set up the test? Are you using ij and displaying all of  
the data or using jdbc to access the data?

What do you do in 0.010 seconds? Do you read all of the rows into  
memory, or just record the time until you get the first row? Are you  
measuring the time taken to return all the rows or just the first row?
Another reader has already commented on the fact that the second  
query is doing a lot more work than the first. The second query must  
sort the results after filtering the data, whereas the first and  
third queries can simply use the indexes and filter on the fly.

I'm a little suspicious of the third query returning 720,000 results  
in 0.010 seconds.

Craig

On Sep 16, 2005, at 4:42 PM, Scott Ogden wrote:

> I have observed some interesting query performance behavior and am  
> hoping someone here can explain.
>
>
>
> In my scenario, it appears that an existing index is not being used  
> for the ‘order by’ part of the operation and as a result the  
> performance of certain queries is suffering.  Can someone explain  
> if this is supposed to be what is happening and why?  Please see  
> below for the specific queries and their performance characteristics.
>
>
>
>
>
>
>
> Here are the particulars:
>
> ---------------------------------
>
>
>
>
>
> create table orders(
>
> order_id varchar(50) NOT NULL
>
> CONSTRAINT ORDERS_PK PRIMARY KEY,
>
> amount numeric(31,2),
>
> time date,
>
> inv_num varchar(50),
>
> line_num varchar(50),
>
> phone varchar(50),
>
> prod_num varchar(50));
>
>
>
>
>
> --Load a large amount of data (720,000 records) into the ‘orders’  
> table
>
>
>
>
>
> --Create an index on the time column as that will be used in the  
> ‘where’ clause.
>
>
>
> create index IX_ORDERS_TIME on orders(time);
>
>
>
>
>
> --When I run a query against this table returning top 1,000  
> records, this query returns very quickly, consistently less than . 
> 010 seconds.
>
>
>
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by time;
>
>
>
>
>
> --Now run a similarly query against same table, returning the top  
> 1,000 records.
>
> --The difference is that the results are now sorted by the primary  
> key (‘order_id’) rather than ‘time’.
>
> --This query returns slowly, approximately 15 seconds.  Why??
>
>
>
>
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by order_id;
>
>
>
>
>
> --Now run a third query against the same ‘orders’ table, removing  
> the where clause
>
> --This query returns quickly, around .010 seconds.
>
>
>
> select * from orders
>
> order by order_id;
>
>
>
> ---------------------------------------------
>
>
>
>
>
>
>
>
>
>

Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:Craig.Russell@sun.com
P.S. A good JDO? O, Gasp!

Re: derby performance and 'order by'

Posted by Craig Russell <Cr...@Sun.COM>.

Hi Ali,

On Sep 19, 2005, at 10:26 AM, Suavi Ali Demir wrote:

> Actually, it sounds like the problem of finding top 1000 rows out  
> of  166333 rows is different than sorting 166333 rows and maybe it  
> could be optimized.

Indeed.

> There is no need to sort all 166333 but the information that we are  
> only looking 1000 rows would have to be passed all the way down to  
> the point where Derby decides to sort. I have not thought through  
> the details of an algorithm but when nRows we want is substantially  
> smaller than TotalRows then i just feel there should be a better  
> way to pick those nRows. For example, if nRows were 1, then all we  
> had to do would be 1 single pass on 166333 rows to find the max.  
> That is quite different than sorting all and this idea should be  
> possible to generalize on 1<=nRows<TotalRows.

I agree that this would be a useful improvement. Now, how do we tell  
the back end that we want only the first  1000 rows?

Or more generally (as found in competitive products):

How do we tell the back end that we want to skip the first N and  
return the next M rows?

Craig

> Ali
>
>
> Craig Russell <Cr...@Sun.COM> wrote:
> Hi Scott,
>
> From the query plan it appears that your filter selects 166,333  
> rows, of which you want the first 1000 according to the ordering of  
> the order_id column. You can see that this is an effective strategy  
> because             Number of rows qualified=166333 Number of rows  
> visited=166333. There's no time lost visiting rows that don't qualify.
>
> The database has to sort the 166,333 rows because the results are  
> ordered according to the index scan column "time" not according to  
> the order_id column. All of the rows need to be sorted even though  
> you only want the first 1000 rows. I'd guess that the sorting of  
> the 166,333 rows is what accounts for the 15 second delay you are  
> experiencing.
>
> The index on order_id doesn't do you any good because you have a  
> result that isn't indexed on order_id. If this isn't obvious, try  
> to think of an algorithm that would use the order_id index on the  
> result set.
>
> Craig
>
> On Sep 19, 2005, at 9:29 AM, scotto wrote:
>
>> So for the second query:
>>
>> select * from orders
>> where time > '10/01/2002' and time < '11/30/2002'
>> order by order_id;
>>
>> the query plan shows that the index IX_ORDERS_TIME is used to  
>> filter the
>> result set by time.  The order by step does not use the primary  
>> key index to
>> sort the results after the filter step.  My questions:
>>
>> --Is it correct that the sort step not use the primary key index  
>> in this
>> case?
>>
>> --Why is it not possible to use the index on order_id to sort  
>> after the
>> filter has happened?
>>
>> Here is the query plan:
>>
>> Statement Name:
>>     null
>> Statement Text:
>>     select * from orders
>> where time > '10/01/2002' and time < '11/30/2002'
>> order by order_id
>> Parse Time: 0
>> Bind Time: 0
>> Optimize Time: 0
>> Generate Time: 0
>> Compile Time: 0
>> Execute Time: 14329
>> Begin Compilation Timestamp : null
>> End Compilation Timestamp : null
>> Begin Execution Timestamp : 2005-09-19 09:20:06.171
>> End Execution Timestamp : 2005-09-19 09:20:20.5
>> Statement Execution Plan Text:
>> Sort ResultSet:
>> Number of opens = 1
>> Rows input = 166333
>> Rows returned = 1000
>> Eliminate duplicates = false
>> In sorted order = false
>> Sort information:
>>     Number of merge runs=1
>>     Number of rows input=166333
>>     Number of rows output=166333
>>     Size of merge runs=[93695]
>>     Sort type=external
>>     constructor time (milliseconds) = 0
>>     open time (milliseconds) = 14297
>>     next time (milliseconds) = 32
>>     close time (milliseconds) = 0
>>     optimizer estimated row count:        78377.51
>>     optimizer estimated cost:       166745.12
>>
>> Source result set:
>>     Index Row to Base Row ResultSet for ORDERS:
>>     Number of opens = 1
>>     Rows seen = 166333
>>     Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 0
>>         next time (milliseconds) = 10488
>>         close time (milliseconds) = 0
>>         optimizer estimated row count:        78377.51
>>         optimizer estimated cost:       166745.12
>>
>>         Index Scan ResultSet for ORDERS using index IX_ORDERS_TIME
>> at read committed isolation level using instantaneous share row  
>> locking
>> chosen by the optimizer
>>         Number of opens = 1
>>         Rows seen = 166333
>>         Rows filtered = 0
>>         Fetch Size = 16
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 0
>>             next time (milliseconds) = 3438
>>             close time (milliseconds) = 0
>>             next time in milliseconds/row = 0
>>
>>         scan information:
>>             Bit set of columns fetched=All
>>             Number of columns fetched=2
>>             Number of deleted rows visited=0
>>             Number of pages visited=887
>>             Number of rows qualified=166333
>>             Number of rows visited=166333
>>             Scan type=btree
>>             Tree height=3
>>             start position:
>>     > on first 1 column(s).
>>     Ordered null semantics on the following columns:
>>
>>             stop position:
>>     >= on first 1 column(s).
>>     Ordered null semantics on the following columns:
>>
>>             qualifiers:
>> None
>>             optimizer estimated row count:        78377.51
>>             optimizer estimated cost:       166745.12
>>
>>
>>
>>
>> --scott
>>
>>
>>
>> -----Original Message-----
>> From: Sunitha Kambhampati [mailto:ksunithaghm@gmail.com]
>> Sent: Friday, September 16, 2005 5:55 PM
>> To: Derby Discussion
>> Subject: Re: derby performance and 'order by'
>>
>> Scott Ogden wrote:
>>
>>
>>> I have observed some interesting query performance behavior and am
>>> hoping someone here can explain.
>>>
>>> In my scenario, it appears that an existing index is not being used
>>> for the 'order by' part of the operation and as a result the
>>> performance of certain queries is suffering.
>>>
>>> Can someone explain if this is supposed to be what is happening and
>>> why? Please see below for the specific queries and their performance
>>> characteristics.
>>>
>>> Here are the particulars:
>>>
>>> ---------------------------------
>>>
>>> create table orders(
>>>
>>> order_id varchar(50) NOT NULL
>>>
>>> CONSTRAINT ORDERS_PK PRIMARY KEY,
>>>
>>> amount numeric(31,2),
>>>
>>> time date,
>>>
>>> inv_num varchar(50),
>>>
>>> line_num varchar(50),
>>>
>>> phone varchar(50),
>>>
>>> prod_num varchar(50));
>>>
>>> --Load a large amount of data (720,000 records) into the 'orders'  
>>> table
>>>
>>> --Create an index on the time column as that will be used in the
>>> 'where' clause.
>>>
>>> create index IX_ORDERS_TIME on orders(time);
>>>
>>> --When I run a query against this table returning top 1,000 records,
>>> this query returns very quickly, consistently less than .010  
>>> seconds.
>>>
>>> select * from orders
>>>
>>> where time > '10/01/2002' and time < '11/30/2002'
>>>
>>> order by time;
>>>
>>> --Now run a similarly query against same table, returning the top
>>> 1,000 records.
>>>
>>> --The difference is that the results are now sorted by the  
>>> primary key
>>> ('order_id') rather than 'time'.
>>>
>>> --This query returns slowly, approximately 15 seconds. Why??
>>>
>>> select * from orders
>>>
>>> where time > '10/01/2002' and time < '11/30/2002'
>>>
>>> order by order_id;
>>>
>>> --Now run a third query against the same 'orders' table, removing  
>>> the
>>> where clause
>>>
>>> --This query returns quickly, around .010 seconds.
>>>
>>> select * from orders
>>>
>>> order by order_id;
>>>
>>> ---------------------------------------------
>>>
>>>
>> If you run with derby.language.logQueryPlan=true, the actual query  
>> plans
>> used for the following queries will be written to derby.log. This  
>> will
>> show what indexes was used by the optimizer. Also see
>> http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .
>>
>> Query with 'order by' will require sorting. Usually, sorting  
>> requires an
>> extra step to put the data into the right order. This extra step  
>> can be
>> avoided for data that are already in the right order. For example,  
>> if a
>> single-table query has an ORDER BY on a single column, and there  
>> is an
>> index on that column, sorting can be avoided if Derby uses the  
>> index as
>> the access path.
>>
>> I think in case of your first and third query the optimizer will pick
>> the available index thus probably avoiding requiring the sort step.
>>
>> Your second query involves more work than the first query, since  
>> it has
>> a search condition on time, and an order by order_id. Thus if the
>> optimizer picks the index on time, that will involve a sort step on
>> order_id.
>> ____________
>>
>> Thanks,
>> Sunitha.
>>
>>
>>
>
> Craig Russell
> Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
> 408 276-5638 mailto:Craig.Russell@sun.com
> P.S. A good JDO? O, Gasp!
>

Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:Craig.Russell@sun.com
P.S. A good JDO? O, Gasp!

Re: derby performance and 'order by'

Posted by Suavi Ali Demir <de...@yahoo.com>.

How about this: 
Make 1 pass through the big chunk which is 166333 rows (or could be millions): For each row, decide whether or not it belongs to the final 1000 chunk. To do this efficiently, the tricky part needs to be on the 1000 chunk side. 
Steps:
1. Keep and maintain max-min values for this chunk (the 1000 rows result) at all times. 
2. If value is above max, accept right away, add to the top. If required (when chunk is already full): drop last value from bottom.
3. If value is below min and we already have 1000 in hand, reject right away. If we have less than 1000: add this value to the bottom.
4. If value falls in between max and min: Interpolate (or simple binary search - or some kind of a hashtable that keeps values sorted) to find exact location for this value in our 1000 chunk. Insert value in proper location. If required: drop last value from bottom and adjust min-max.
5. When we are done scanning through 166333 rows will have our 1000 chunk in our hand ready to return.

This looks more scalable than sorting 100s of thousands of rows when Derby returns small chunks out of big big tables. It may be slower when chunk size is bigger. If hash sort etc (constant time) is used to decide position of a value in the final chunk, then even if slower, it will still be scalable (it should not be grossly bad even if optimizer picks this algorithm over normal sort when it should not have done so). 

For parallelism, it gives the opportunity (simple enough to outline here on the fly) to divide the whole task (divide 166333 into chunks of 16K rows for 10 threads to work on for example) into multiple threads where min-max, insert, drop from bottom kinda stuff needs to be protected and modifications on the final chunk structure need to be communicated to other threads. Assuming reading a row from 166333 result needs IO, when one thread is doing IO, another thread will be trying to decide where to insert it's newly found value and it *might* bring performance gains. Since final chunk needs to be synchronized at various points, two threads cannot insert at the same time and one thread cannot read the chunk structure while another is inserting a value into it... Come to think of it, it might run faster in single thread version when synchronization is not involved.
Regards,
Ali

Daniel John Debrunner <dj...@debrunners.com> wrote:
Suavi Ali Demir wrote:

> Actually, it sounds like the problem of finding top 1000 rows out of 
> 166333 rows is different than sorting 166333 rows and maybe it could be
> optimized. There is no need to sort all 166333 but the information that
> we are only looking 1000 rows would have to be passed all the way down
> to the point where Derby decides to sort. I have not thought through the
> details of an algorithm but when nRows we want is substantially smaller
> than TotalRows then i just feel there should be a better way to pick
> those nRows. For example, if nRows were 1, then all we had to do would
> be 1 single pass on 166333 rows to find the max. That is quite different
> than sorting all and this idea should be possible to generalize on
> 1<=nRows
One optimization would be to pass the 1,000 down from
Statement.setMaxRows to the sorter. Then the sorter could keep the sort
set at 1000 rows, discarding any rows moved to or inserted at the end.
This would most likely enable the sorted set to remain in memory, rather
than spilling to disk. Even more likley if the application sets a max
rows to a reasonable number to be views in a single on-screen page (e.g.
20-50).

Or an even wilder idea would be to have a predicate that is modified on
the fly, to represent the maximum once the sorted set reaches 1,000.
E.g. an additional predicate of <= MAX_INT for an INT column.

Dan.

Re: derby performance and 'order by'

Posted by Daniel John Debrunner <dj...@debrunners.com>.

Suavi Ali Demir wrote:

> Actually, it sounds like the problem of finding top 1000 rows out of 
> 166333 rows is different than sorting 166333 rows and maybe it could be
> optimized. There is no need to sort all 166333 but the information that
> we are only looking 1000 rows would have to be passed all the way down
> to the point where Derby decides to sort. I have not thought through the
> details of an algorithm but when nRows we want is substantially smaller
> than TotalRows then i just feel there should be a better way to pick
> those nRows. For example, if nRows were 1, then all we had to do would
> be 1 single pass on 166333 rows to find the max. That is quite different
> than sorting all and this idea should be possible to generalize on
> 1<=nRows<TotalRows.

One optimization would be to pass the 1,000 down from
Statement.setMaxRows to the sorter. Then the sorter could keep the sort
set at 1000 rows, discarding any rows moved to or inserted at the end.
This would most likely enable the sorted set to remain in memory, rather
than spilling to disk. Even more likley if the application sets a max
rows to a reasonable number to be views in a single on-screen page (e.g.
20-50).

Or an even wilder idea would be to have a predicate that is modified on
the fly, to represent the maximum once the sorted set reaches 1,000.
E.g. an additional predicate of <= MAX_INT for an INT column.

Dan.

Re: derby performance and 'order by'

Posted by Suavi Ali Demir <de...@yahoo.com>.

Actually, it sounds like the problem of finding top 1000 rows out of  166333 rows is different than sorting 166333 rows and maybe it could be optimized. There is no need to sort all 166333 but the information that we are only looking 1000 rows would have to be passed all the way down to the point where Derby decides to sort. I have not thought through the details of an algorithm but when nRows we want is substantially smaller than TotalRows then i just feel there should be a better way to pick those nRows. For example, if nRows were 1, then all we had to do would be 1 single pass on 166333 rows to find the max. That is quite different than sorting all and this idea should be possible to generalize on 1<=nRows<TotalRows.
Ali

Craig Russell <Cr...@Sun.COM> wrote:
Hi Scott,

>From the query plan it appears that your filter selects 166,333 rows, of which you want the first 1000 according to the ordering of the order_id column. You can see that this is an effective strategy because             Number of rows qualified=166333 Number of rows visited=166333. There's no time lost visiting rows that don't qualify.

The database has to sort the 166,333 rows because the results are ordered according to the index scan column "time" not according to the order_id column. All of the rows need to be sorted even though you only want the first 1000 rows. I'd guess that the sorting of the 166,333 rows is what accounts for the 15 second delay you are experiencing.

The index on order_id doesn't do you any good because you have a result that isn't indexed on order_id. If this isn't obvious, try to think of an algorithm that would use the order_id index on the result set.

Craig

On Sep 19, 2005, at 9:29 AM, scotto wrote:

So for the second query: 

select * from orders
where time > '10/01/2002' and time < '11/30/2002'
order by order_id;

the query plan shows that the index IX_ORDERS_TIME is used to filter the
result set by time.  The order by step does not use the primary key index to
sort the results after the filter step.  My questions: 

--Is it correct that the sort step not use the primary key index in this
case?  

--Why is it not possible to use the index on order_id to sort after the
filter has happened? 

Here is the query plan: 

Statement Name: 
    null
Statement Text: 
    select * from orders 
where time > '10/01/2002' and time < '11/30/2002'
order by order_id
Parse Time: 0
Bind Time: 0
Optimize Time: 0
Generate Time: 0
Compile Time: 0
Execute Time: 14329
Begin Compilation Timestamp : null
End Compilation Timestamp : null
Begin Execution Timestamp : 2005-09-19 09:20:06.171
End Execution Timestamp : 2005-09-19 09:20:20.5
Statement Execution Plan Text: 
Sort ResultSet:
Number of opens = 1
Rows input = 166333
Rows returned = 1000
Eliminate duplicates = false
In sorted order = false
Sort information: 
    Number of merge runs=1
    Number of rows input=166333
    Number of rows output=166333
    Size of merge runs=[93695]
    Sort type=external
    constructor time (milliseconds) = 0
    open time (milliseconds) = 14297
    next time (milliseconds) = 32
    close time (milliseconds) = 0
    optimizer estimated row count:        78377.51
    optimizer estimated cost:       166745.12

Source result set:
    Index Row to Base Row ResultSet for ORDERS:
    Number of opens = 1
    Rows seen = 166333
    Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}
        constructor time (milliseconds) = 0
        open time (milliseconds) = 0
        next time (milliseconds) = 10488
        close time (milliseconds) = 0
        optimizer estimated row count:        78377.51
        optimizer estimated cost:       166745.12

        Index Scan ResultSet for ORDERS using index IX_ORDERS_TIME
at read committed isolation level using instantaneous share row locking
chosen by the optimizer
        Number of opens = 1
        Rows seen = 166333
        Rows filtered = 0
        Fetch Size = 16
            constructor time (milliseconds) = 0
            open time (milliseconds) = 0
            next time (milliseconds) = 3438
            close time (milliseconds) = 0
            next time in milliseconds/row = 0

        scan information: 
            Bit set of columns fetched=All
            Number of columns fetched=2
            Number of deleted rows visited=0
            Number of pages visited=887
            Number of rows qualified=166333
            Number of rows visited=166333
            Scan type=btree
            Tree height=3
            start position: 
    > on first 1 column(s).
    Ordered null semantics on the following columns: 

            stop position: 
    >= on first 1 column(s).
    Ordered null semantics on the following columns: 

            qualifiers:
None
            optimizer estimated row count:        78377.51
            optimizer estimated cost:       166745.12 

--scott

-----Original Message-----
From: Sunitha Kambhampati [mailto:ksunithaghm@gmail.com] 
Sent: Friday, September 16, 2005 5:55 PM
To: Derby Discussion
Subject: Re: derby performance and 'order by'

Scott Ogden wrote:

I have observed some interesting query performance behavior and am 
hoping someone here can explain.

In my scenario, it appears that an existing index is not being used 
for the 'order by' part of the operation and as a result the 
performance of certain queries is suffering.

Can someone explain if this is supposed to be what is happening and 
why? Please see below for the specific queries and their performance 
characteristics.

Here are the particulars:

---------------------------------

create table orders(

order_id varchar(50) NOT NULL

CONSTRAINT ORDERS_PK PRIMARY KEY,

amount numeric(31,2),

time date,

inv_num varchar(50),

line_num varchar(50),

phone varchar(50),

prod_num varchar(50));

--Load a large amount of data (720,000 records) into the 'orders' table

--Create an index on the time column as that will be used in the 
'where' clause.

create index IX_ORDERS_TIME on orders(time);

--When I run a query against this table returning top 1,000 records, 
this query returns very quickly, consistently less than .010 seconds.

select * from orders

where time > '10/01/2002' and time < '11/30/2002'

order by time;

--Now run a similarly query against same table, returning the top 
1,000 records.

--The difference is that the results are now sorted by the primary key 
('order_id') rather than 'time'.

--This query returns slowly, approximately 15 seconds. Why??

select * from orders

where time > '10/01/2002' and time < '11/30/2002'

order by order_id;

--Now run a third query against the same 'orders' table, removing the 
where clause

--This query returns quickly, around .010 seconds.

select * from orders

order by order_id;

---------------------------------------------

If you run with derby.language.logQueryPlan=true, the actual query plans 
used for the following queries will be written to derby.log. This will 
show what indexes was used by the optimizer. Also see 
http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .

Query with 'order by' will require sorting. Usually, sorting requires an 
extra step to put the data into the right order. This extra step can be 
avoided for data that are already in the right order. For example, if a 
single-table query has an ORDER BY on a single column, and there is an 
index on that column, sorting can be avoided if Derby uses the index as 
the access path.

I think in case of your first and third query the optimizer will pick 
the available index thus probably avoiding requiring the sort step.

Your second query involves more work than the first query, since it has 
a search condition on time, and an order by order_id. Thus if the 
optimizer picks the index on time, that will involve a sort step on 
order_id.
____________

Thanks,
Sunitha.

Craig Russell

Architect, Sun Java Enterprise System http://java.sun.com/products/jdo

408 276-5638 mailto:Craig.Russell@sun.com

P.S. A good JDO? O, Gasp!

Re: derby performance and 'order by'

Posted by Craig Russell <Cr...@Sun.COM>.

Hi Scott,

 From the query plan it appears that your filter selects 166,333  
rows, of which you want the first 1000 according to the ordering of  
the order_id column. You can see that this is an effective strategy  
because             Number of rows qualified=166333 Number of rows  
visited=166333. There's no time lost visiting rows that don't qualify.

The database has to sort the 166,333 rows because the results are  
ordered according to the index scan column "time" not according to  
the order_id column. All of the rows need to be sorted even though  
you only want the first 1000 rows. I'd guess that the sorting of the  
166,333 rows is what accounts for the 15 second delay you are  
experiencing.

The index on order_id doesn't do you any good because you have a  
result that isn't indexed on order_id. If this isn't obvious, try to  
think of an algorithm that would use the order_id index on the result  
set.

Craig

On Sep 19, 2005, at 9:29 AM, scotto wrote:

> So for the second query:
>
> select * from orders
> where time > '10/01/2002' and time < '11/30/2002'
> order by order_id;
>
> the query plan shows that the index IX_ORDERS_TIME is used to  
> filter the
> result set by time.  The order by step does not use the primary key  
> index to
> sort the results after the filter step.  My questions:
>
> --Is it correct that the sort step not use the primary key index in  
> this
> case?
>
> --Why is it not possible to use the index on order_id to sort after  
> the
> filter has happened?
>
> Here is the query plan:
>
> Statement Name:
>     null
> Statement Text:
>     select * from orders
> where time > '10/01/2002' and time < '11/30/2002'
> order by order_id
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 0
> Generate Time: 0
> Compile Time: 0
> Execute Time: 14329
> Begin Compilation Timestamp : null
> End Compilation Timestamp : null
> Begin Execution Timestamp : 2005-09-19 09:20:06.171
> End Execution Timestamp : 2005-09-19 09:20:20.5
> Statement Execution Plan Text:
> Sort ResultSet:
> Number of opens = 1
> Rows input = 166333
> Rows returned = 1000
> Eliminate duplicates = false
> In sorted order = false
> Sort information:
>     Number of merge runs=1
>     Number of rows input=166333
>     Number of rows output=166333
>     Size of merge runs=[93695]
>     Sort type=external
>     constructor time (milliseconds) = 0
>     open time (milliseconds) = 14297
>     next time (milliseconds) = 32
>     close time (milliseconds) = 0
>     optimizer estimated row count:        78377.51
>     optimizer estimated cost:       166745.12
>
> Source result set:
>     Index Row to Base Row ResultSet for ORDERS:
>     Number of opens = 1
>     Rows seen = 166333
>     Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 0
>         next time (milliseconds) = 10488
>         close time (milliseconds) = 0
>         optimizer estimated row count:        78377.51
>         optimizer estimated cost:       166745.12
>
>         Index Scan ResultSet for ORDERS using index IX_ORDERS_TIME
> at read committed isolation level using instantaneous share row  
> locking
> chosen by the optimizer
>         Number of opens = 1
>         Rows seen = 166333
>         Rows filtered = 0
>         Fetch Size = 16
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 3438
>             close time (milliseconds) = 0
>             next time in milliseconds/row = 0
>
>         scan information:
>             Bit set of columns fetched=All
>             Number of columns fetched=2
>             Number of deleted rows visited=0
>             Number of pages visited=887
>             Number of rows qualified=166333
>             Number of rows visited=166333
>             Scan type=btree
>             Tree height=3
>             start position:
>     > on first 1 column(s).
>     Ordered null semantics on the following columns:
>
>             stop position:
>     >= on first 1 column(s).
>     Ordered null semantics on the following columns:
>
>             qualifiers:
> None
>             optimizer estimated row count:        78377.51
>             optimizer estimated cost:       166745.12
>
>
>
>
> --scott
>
>
>
> -----Original Message-----
> From: Sunitha Kambhampati [mailto:ksunithaghm@gmail.com]
> Sent: Friday, September 16, 2005 5:55 PM
> To: Derby Discussion
> Subject: Re: derby performance and 'order by'
>
> Scott Ogden wrote:
>
>
>> I have observed some interesting query performance behavior and am
>> hoping someone here can explain.
>>
>> In my scenario, it appears that an existing index is not being used
>> for the 'order by' part of the operation and as a result the
>> performance of certain queries is suffering.
>>
>> Can someone explain if this is supposed to be what is happening and
>> why? Please see below for the specific queries and their performance
>> characteristics.
>>
>> Here are the particulars:
>>
>> ---------------------------------
>>
>> create table orders(
>>
>> order_id varchar(50) NOT NULL
>>
>> CONSTRAINT ORDERS_PK PRIMARY KEY,
>>
>> amount numeric(31,2),
>>
>> time date,
>>
>> inv_num varchar(50),
>>
>> line_num varchar(50),
>>
>> phone varchar(50),
>>
>> prod_num varchar(50));
>>
>> --Load a large amount of data (720,000 records) into the 'orders'  
>> table
>>
>> --Create an index on the time column as that will be used in the
>> 'where' clause.
>>
>> create index IX_ORDERS_TIME on orders(time);
>>
>> --When I run a query against this table returning top 1,000 records,
>> this query returns very quickly, consistently less than .010 seconds.
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by time;
>>
>> --Now run a similarly query against same table, returning the top
>> 1,000 records.
>>
>> --The difference is that the results are now sorted by the primary  
>> key
>> ('order_id') rather than 'time'.
>>
>> --This query returns slowly, approximately 15 seconds. Why??
>>
>> select * from orders
>>
>> where time > '10/01/2002' and time < '11/30/2002'
>>
>> order by order_id;
>>
>> --Now run a third query against the same 'orders' table, removing the
>> where clause
>>
>> --This query returns quickly, around .010 seconds.
>>
>> select * from orders
>>
>> order by order_id;
>>
>> ---------------------------------------------
>>
>>
> If you run with derby.language.logQueryPlan=true, the actual query  
> plans
> used for the following queries will be written to derby.log. This will
> show what indexes was used by the optimizer. Also see
> http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .
>
> Query with 'order by' will require sorting. Usually, sorting  
> requires an
> extra step to put the data into the right order. This extra step  
> can be
> avoided for data that are already in the right order. For example,  
> if a
> single-table query has an ORDER BY on a single column, and there is an
> index on that column, sorting can be avoided if Derby uses the  
> index as
> the access path.
>
> I think in case of your first and third query the optimizer will pick
> the available index thus probably avoiding requiring the sort step.
>
> Your second query involves more work than the first query, since it  
> has
> a search condition on time, and an order by order_id. Thus if the
> optimizer picks the index on time, that will involve a sort step on
> order_id.
> ____________
>
> Thanks,
> Sunitha.
>
>
>

Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:Craig.Russell@sun.com
P.S. A good JDO? O, Gasp!

RE: derby performance and 'order by'

Posted by scotto <sc...@mitsonline.com>.

So for the second query: 

select * from orders
where time > '10/01/2002' and time < '11/30/2002'
order by order_id;

the query plan shows that the index IX_ORDERS_TIME is used to filter the
result set by time.  The order by step does not use the primary key index to
sort the results after the filter step.  My questions: 

--Is it correct that the sort step not use the primary key index in this
case?  

--Why is it not possible to use the index on order_id to sort after the
filter has happened? 

Here is the query plan: 

Statement Name: 
	null
Statement Text: 
	select * from orders 
where time > '10/01/2002' and time < '11/30/2002'
order by order_id
Parse Time: 0
Bind Time: 0
Optimize Time: 0
Generate Time: 0
Compile Time: 0
Execute Time: 14329
Begin Compilation Timestamp : null
End Compilation Timestamp : null
Begin Execution Timestamp : 2005-09-19 09:20:06.171
End Execution Timestamp : 2005-09-19 09:20:20.5
Statement Execution Plan Text: 
Sort ResultSet:
Number of opens = 1
Rows input = 166333
Rows returned = 1000
Eliminate duplicates = false
In sorted order = false
Sort information: 
	Number of merge runs=1
	Number of rows input=166333
	Number of rows output=166333
	Size of merge runs=[93695]
	Sort type=external
	constructor time (milliseconds) = 0
	open time (milliseconds) = 14297
	next time (milliseconds) = 32
	close time (milliseconds) = 0
	optimizer estimated row count:        78377.51
	optimizer estimated cost:       166745.12

Source result set:
	Index Row to Base Row ResultSet for ORDERS:
	Number of opens = 1
	Rows seen = 166333
	Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}
		constructor time (milliseconds) = 0
		open time (milliseconds) = 0
		next time (milliseconds) = 10488
		close time (milliseconds) = 0
		optimizer estimated row count:        78377.51
		optimizer estimated cost:       166745.12

		Index Scan ResultSet for ORDERS using index IX_ORDERS_TIME
at read committed isolation level using instantaneous share row locking
chosen by the optimizer
		Number of opens = 1
		Rows seen = 166333
		Rows filtered = 0
		Fetch Size = 16
			constructor time (milliseconds) = 0
			open time (milliseconds) = 0
			next time (milliseconds) = 3438
			close time (milliseconds) = 0
			next time in milliseconds/row = 0

		scan information: 
			Bit set of columns fetched=All
			Number of columns fetched=2
			Number of deleted rows visited=0
			Number of pages visited=887
			Number of rows qualified=166333
			Number of rows visited=166333
			Scan type=btree
			Tree height=3
			start position: 
	> on first 1 column(s).
	Ordered null semantics on the following columns: 

			stop position: 
	>= on first 1 column(s).
	Ordered null semantics on the following columns: 

			qualifiers:
None
			optimizer estimated row count:        78377.51
			optimizer estimated cost:       166745.12 




--scott



-----Original Message-----
From: Sunitha Kambhampati [mailto:ksunithaghm@gmail.com] 
Sent: Friday, September 16, 2005 5:55 PM
To: Derby Discussion
Subject: Re: derby performance and 'order by'

Scott Ogden wrote:

> I have observed some interesting query performance behavior and am 
> hoping someone here can explain.
>
> In my scenario, it appears that an existing index is not being used 
> for the 'order by' part of the operation and as a result the 
> performance of certain queries is suffering.
>
> Can someone explain if this is supposed to be what is happening and 
> why? Please see below for the specific queries and their performance 
> characteristics.
>
> Here are the particulars:
>
> ---------------------------------
>
> create table orders(
>
> order_id varchar(50) NOT NULL
>
> CONSTRAINT ORDERS_PK PRIMARY KEY,
>
> amount numeric(31,2),
>
> time date,
>
> inv_num varchar(50),
>
> line_num varchar(50),
>
> phone varchar(50),
>
> prod_num varchar(50));
>
> --Load a large amount of data (720,000 records) into the 'orders' table
>
> --Create an index on the time column as that will be used in the 
> 'where' clause.
>
> create index IX_ORDERS_TIME on orders(time);
>
> --When I run a query against this table returning top 1,000 records, 
> this query returns very quickly, consistently less than .010 seconds.
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by time;
>
> --Now run a similarly query against same table, returning the top 
> 1,000 records.
>
> --The difference is that the results are now sorted by the primary key 
> ('order_id') rather than 'time'.
>
> --This query returns slowly, approximately 15 seconds. Why??
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by order_id;
>
> --Now run a third query against the same 'orders' table, removing the 
> where clause
>
> --This query returns quickly, around .010 seconds.
>
> select * from orders
>
> order by order_id;
>
> ---------------------------------------------
>
If you run with derby.language.logQueryPlan=true, the actual query plans 
used for the following queries will be written to derby.log. This will 
show what indexes was used by the optimizer. Also see 
http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .

Query with 'order by' will require sorting. Usually, sorting requires an 
extra step to put the data into the right order. This extra step can be 
avoided for data that are already in the right order. For example, if a 
single-table query has an ORDER BY on a single column, and there is an 
index on that column, sorting can be avoided if Derby uses the index as 
the access path.

I think in case of your first and third query the optimizer will pick 
the available index thus probably avoiding requiring the sort step.

Your second query involves more work than the first query, since it has 
a search condition on time, and an order by order_id. Thus if the 
optimizer picks the index on time, that will involve a sort step on 
order_id.
____________

Thanks,
Sunitha.

Re: derby performance and 'order by'

Posted by Sunitha Kambhampati <ks...@gmail.com>.

Scott Ogden wrote:

> I have observed some interesting query performance behavior and am 
> hoping someone here can explain.
>
> In my scenario, it appears that an existing index is not being used 
> for the ‘order by’ part of the operation and as a result the 
> performance of certain queries is suffering.
>
> Can someone explain if this is supposed to be what is happening and 
> why? Please see below for the specific queries and their performance 
> characteristics.
>
> Here are the particulars:
>
> ---------------------------------
>
> create table orders(
>
> order_id varchar(50) NOT NULL
>
> CONSTRAINT ORDERS_PK PRIMARY KEY,
>
> amount numeric(31,2),
>
> time date,
>
> inv_num varchar(50),
>
> line_num varchar(50),
>
> phone varchar(50),
>
> prod_num varchar(50));
>
> --Load a large amount of data (720,000 records) into the ‘orders’ table
>
> --Create an index on the time column as that will be used in the 
> ‘where’ clause.
>
> create index IX_ORDERS_TIME on orders(time);
>
> --When I run a query against this table returning top 1,000 records, 
> this query returns very quickly, consistently less than .010 seconds.
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by time;
>
> --Now run a similarly query against same table, returning the top 
> 1,000 records.
>
> --The difference is that the results are now sorted by the primary key 
> (‘order_id’) rather than ‘time’.
>
> --This query returns slowly, approximately 15 seconds. Why??
>
> select * from orders
>
> where time > '10/01/2002' and time < '11/30/2002'
>
> order by order_id;
>
> --Now run a third query against the same ‘orders’ table, removing the 
> where clause
>
> --This query returns quickly, around .010 seconds.
>
> select * from orders
>
> order by order_id;
>
> ---------------------------------------------
>
If you run with derby.language.logQueryPlan=true, the actual query plans 
used for the following queries will be written to derby.log. This will 
show what indexes was used by the optimizer. Also see 
http://db.apache.org/derby/docs/10.1/tuning/rtunproper43414.html .

Query with 'order by' will require sorting. Usually, sorting requires an 
extra step to put the data into the right order. This extra step can be 
avoided for data that are already in the right order. For example, if a 
single-table query has an ORDER BY on a single column, and there is an 
index on that column, sorting can be avoided if Derby uses the index as 
the access path.

I think in case of your first and third query the optimizer will pick 
the available index thus probably avoiding requiring the sort step.

Your second query involves more work than the first query, since it has 
a search condition on time, and an order by order_id. Thus if the 
optimizer picks the index on time, that will involve a sort step on 
order_id.
____________

Thanks,
Sunitha.