Posted to derby-user@db.apache.org by Jim Newsham <jn...@referentia.com> on 2006/11/10 02:46:26 UTC

slow subqueries (was: StackOverflowError)

Hi everyone,

The stack overflow problem went away after I upgraded to 10.2.1.6.  There's
no longer an error running the query, however, the query is very slow.  To
be honest, I haven't run the query to completion, as my computer makes a
prohibitively annoying sound when the CPU is pegged and I have to kill derby
after a number of minutes.  But I know the query runs for many, many minutes
without completing.  Why do queries with subqueries perform so poorly?

The following query completes almost instantly.  I suppose Derby simplifies
the subquery internally to "band.id = 11":

ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
time.id from time, double_sample, band where band.id =
double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
in (11));

This query "does not complete":

ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
time.id from time, double_sample, band where band.id =
double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
in (11, 12));

I've been searching/reading around and trying to understand a little about
how Derby works and how to check the statement plan.  I've read through the
tuning document.  Unfortunately if a query doesn't complete, I can't get the
statement plan.  

Is the entire inner query being re-evaluated for every candidate record in
the outer query?  I read somewhere that inner queries may not be
materialized if they are larger than a certain threshold.  But if the outer
query is also very large, then this seems a poor strategy.  If the inner
result set is too large for memory, couldn't it be cached to disk?  

But a much better optimization would be for Derby to notice that the outer
variable (time.id) and the result of the inner query (time.id) are both
indexed/unique (already sorted?), and then to internally generate two
sub-resultsets which are processed together in parallel, instead of using a
very expensive nested loop.
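The idea in plain Java (a toy sketch over two already-sorted id lists with made-up values; this is not Derby's internals) would look something like a single linear pass over each side:

```java
import java.util.Iterator;
import java.util.List;

public class SortedMerge {
    // Counts ids present in both streams, assuming each iterator yields
    // ids in strictly ascending order (e.g. a primary key index scan).
    // Cost is O(n + m) comparisons, versus O(n * m) for a nested loop.
    static int countMatches(Iterator<Long> outer, Iterator<Long> inner) {
        int count = 0;
        if (!outer.hasNext() || !inner.hasNext()) return count;
        long a = outer.next(), b = inner.next();
        while (true) {
            if (a == b) {
                count++;
                if (!outer.hasNext() || !inner.hasNext()) break;
                a = outer.next();
                b = inner.next();
            } else if (a < b) {            // outer side is behind; advance it
                if (!outer.hasNext()) break;
                a = outer.next();
            } else {                       // inner side is behind; advance it
                if (!inner.hasNext()) break;
                b = inner.next();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Long> outer = List.of(1L, 2L, 3L, 5L, 8L);
        List<Long> inner = List.of(2L, 3L, 4L, 8L);
        System.out.println(countMatches(outer.iterator(), inner.iterator())); // prints 3
    }
}
```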

The above queries are only representative of my problem.  I have a generic
API which uses a sql database under the covers.  Sql queries are generated
dynamically.  The queries only involve a small number of tables, but will
typically involve subqueries, unions, and intersects.  While the above sample
query can be easily modified to run well, I'm not sure how to do it in the
general case.

Any feedback or responses to the above comments are welcome.

Regards,
Jim

> -----Original Message-----
> From: Rajesh Kartha [mailto:kartha02@gmail.com]
> Sent: Monday, November 06, 2006 6:04 PM
> To: Derby Discussion
> Subject: Re: StackOverflowError
> 
> 
> Hello Jim,
> 
> Is it possible to try this query out with the latest GA release 10.2.1.6?
> 
> http://db.apache.org/derby/derby_downloads.html#Latest+Official+Release
> 
> There have been some fixes in the related code generation areas for 10.2:
> DERBY-766, DERBY-1714 etc.
> 
> Also, can you post the schema that can reproduce this issue?
> 
> Regards,
> Rajesh
> 
> 
> 
> 
> 
> Jim Newsham wrote:
> 
> >Hi everyone,
> >
> >I thought the problem would go away if I gave Derby a better written
> query,
> >so I fixed my query generator to be a bit smarter.  Unfortunately, I
> still
> >get a stack overflow error.  Here's the new query:
> >
> >ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> time.id
> >from time, double_sample, band where band.id = double_sample.fk_band_id
> and
> >double_sample.fk_time_id = time.id and (band.id = 39 or band.id = 55));
> >1
> >-----------
> >ERROR 38000: The exception 'java.lang.StackOverflowError' was thrown
> while
> >evaluating an expression. SQLSTATE: XJ001: Java exception: ':
> >java.lang.StackOverflowError'.
> >
> >
> >This looks like an ugly bug for such a simple query.  I didn't find any
> bug
> >in jira which seemed to relate to this.  Is this a known bug?
> >
> >Any advice on how to work around the problem is appreciated.
> >
> >I'm using Derby 10.1.3.1.
> >
> >Thanks,
> >Jim
> >
> >P.S.  I just included the outer query ("select count(*) from..") to
> >reproduce the problem in ij.  My program actually uses jdbc, executes the
> >inner query, and calls ResultSet.last().  The result is the same, a
> >StackOverflowError.  Here's the stack trace I get in my app:
> >
> >Caused by: org.apache.derby.client.am.SqlException: The exception
> >'java.lang.StackOverflowError' was thrown while evaluating an expression.
> >SQLSTATE: XJ001: Java exception: ': java.lang.StackOverflowError'.
> >        at org.apache.derby.client.am.ResultSet.completeSqlca(Unknown
> >Source)
> >        at
> >org.apache.derby.client.net.NetResultSetReply.parseFetchError(Unknown
> > Source)
> >        at
> >org.apache.derby.client.net.NetResultSetReply.parseCNTQRYreply(Unknown
> >Source)
> >        at
> >org.apache.derby.client.net.NetResultSetReply.readPositioningFetch(Unknown
> >Source)
> >        at
> >org.apache.derby.client.net.ResultSetReply.readPositioningFetch(Unknown
> >Source)
> >        at
> >org.apache.derby.client.net.NetResultSet.readPositioningFetch_(Unknown
> >Source)
> >        at org.apache.derby.client.am.ResultSet.getRowCount(Unknown
> Source)
> >        at org.apache.derby.client.am.ResultSet.lastX(Unknown Source)
> >        at org.apache.derby.client.am.ResultSet.last(Unknown Source)
> >        at
> >com.referentia.sdf.monitor.samplebase.derby.QueryDataSet.getSize(QueryDataSet.java:139)
> >
> >
> >
> >
> >>-----Original Message-----
> >>From: Jim Newsham [mailto:jnewsham@referentia.com]
> >>Sent: Monday, November 06, 2006 11:21 AM
> >>To: 'Derby Discussion'
> >>Subject: StackOverflowError
> >>
> >>
> >>Any reason why I should get a stack overflow error with the following
> >>query?
> >>
> >>Yes, I know the query is a bit odd... it's not hand-written.  The query
> >>generator could be optimized.  Nevertheless... is the stack overflow
> here
> >>considered a bug or a limitation?  If limitation, what specifically is
> the
> >>limitation?
> >>
> >>
> >>ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> time.id
> >>from time, double_sample, band where band.id = double_sample.fk_band_id
> >>and
> >>double_sample.fk_time_id = time.id and band.id = 57 union select id from
> >>time where 1 = 0);
> >>1
> >>-----------
> >>ERROR 38000: The exception 'java.lang.StackOverflowError' was thrown
> while
> >>evaluating an expression. SQLSTATE: XJ001: Java exception: ':
> >>java.lang.StackOverflowError'.
> >>
> >>Thanks,
> >>Jim
> >>
> >>
> >>
> >>
> >
> >
> >
> >
> >
> >
> 




Re: slow subqueries

Posted by "Bernt M. Johnsen" <Be...@Sun.COM>.
>>>>>>>>>>>> Jim Newsham wrote (2006-11-11 10:07:18):
> 
> 
> > > This is why a nested loop is not going to work here... 20,000 squared
> > > operations is very expensive, let alone millions squared.  For a query
> > with
> > > this profile, the inner query should only be executed once.
> > 
> > Perhaps you can get the behavior you desire by explicitly creating
> > a temporary table, selecting the data from your inner query into
> > the temporary table, then re-writing your main query to join against
> > the temporary table?
> 
> I'd be willing to try that.  Some questions come to mind:
> 
> - Is there a way in sql to select the results of a query into a table?

insert into someTable (select * from someOtherTable);

> - When I'm done with the temp table, can I just drop it as for a normal
> table?  

Yes. DROP table works on temporary tables.

> - I think I read somewhere that I can't add an index to temporary tables.
> Is this true?  I can't even have a primary key?  

Yes, that's true. Temporary tables support neither indices nor primary keys.

> If so, I think the performance will still be intractable for
> thousands or millions of results in the temp table.

Why not let your temporary table be a normal table in your schema?
> 
> Thanks,
> Jim
> 
> 

-- 
Bernt Marius Johnsen, Database Technology Group, 
Staff Engineer, Technical Lead Derby/Java DB
Sun Microsystems, Trondheim, Norway

RE: slow subqueries

Posted by de...@segel.com.
This is an "ugly" workaround for an existing bug.

AFAIK, the problem was when you used an IN clause in the inner select.
This is a known Derby bug.

Going to a temp table, you probably won't see any improvement, and you don't
have the ability to put an index on the temp table. Unless you're returning
over 10K rows on the inner query, a simple table scan would be more
efficient. (At least for IDS et al. YMMV for Derby.)


> -----Original Message-----
> From: Jim Newsham [mailto:jnewsham@referentia.com]
> Sent: Saturday, November 11, 2006 2:07 PM
> To: 'Derby Discussion'
> Subject: RE: slow subqueries
> 
> 
> 
> > > This is why a nested loop is not going to work here... 20,000 squared
> > > operations is very expensive, let alone millions squared.  For a query
> > with
> > > this profile, the inner query should only be executed once.
> >
> > Perhaps you can get the behavior you desire by explicitly creating
> > a temporary table, selecting the data from your inner query into
> > the temporary table, then re-writing your main query to join against
> > the temporary table?
> 
> I'd be willing to try that.  Some questions come to mind:
> 
> - Is there a way in sql to select the results of a query into a table?
> - When I'm done with the temp table, can I just drop it as for a normal
> table?
> - I think I read somewhere that I can't add an index to temporary tables.
> Is this true?  I can't even have a primary key?  If so, I think the
> performance will still be intractable for thousands or millions of results
> in the temp table.
> 
> Thanks,
> Jim
> 




RE: slow subqueries

Posted by Jim Newsham <jn...@referentia.com>.

> > This is why a nested loop is not going to work here... 20,000 squared
> > operations is very expensive, let alone millions squared.  For a query
> with
> > this profile, the inner query should only be executed once.
> 
> Perhaps you can get the behavior you desire by explicitly creating
> a temporary table, selecting the data from your inner query into
> the temporary table, then re-writing your main query to join against
> the temporary table?

I'd be willing to try that.  Some questions come to mind:

- Is there a way in sql to select the results of a query into a table?
- When I'm done with the temp table, can I just drop it as for a normal
table?  
- I think I read somewhere that I can't add an index to temporary tables.
Is this true?  I can't even have a primary key?  If so, I think the
performance will still be intractable for thousands or millions of results
in the temp table.

Thanks,
Jim



Re: slow subqueries

Posted by Bryan Pendleton <bp...@amberpoint.com>.
> This is why a nested loop is not going to work here... 20,000 squared
> operations is very expensive, let alone millions squared.  For a query with
> this profile, the inner query should only be executed once. 

Perhaps you can get the behavior you desire by explicitly creating
a temporary table, selecting the data from your inner query into
the temporary table, then re-writing your main query to join against
the temporary table?

thanks,

bryan



Re: slow subqueries

Posted by Army <qo...@gmail.com>.
Jim Newsham wrote:
> 
> You previously mentioned a Derby property called
> derby.language.maxMemoryPerTable.  I don't see it mentioned in the current
> documentation, but I found a jira issue which contains some proposed
> documentation (http://issues.apache.org/jira/browse/DERBY-1397).  But it's
> still not entirely clear to me.  Is this property only used when deciding
> whether to perform a hash join or nested loop join?  Or does it also
> influence how much of the hash join is kept in memory, or when that hash
> join is spilled to disk?

Great question.  My first thought was that the property *just* determined 
whether or not to perform a hash join.  But I did some looking around in the 
code and was a bit surprised to see that the answer depends on whether or not 
the inner result set of the hash join is a base table.  If the inner result set 
*is* a base table then the maxMemoryPerTable property does in fact determine 
when we will spill to disk.  For any other hash join, though, the determination 
of when to spill over is based on the JVM memory size at the time the in-memory 
hash table is created.

> Essentially, I'd like to know, if using the above query with optimizer 
> overrides which force a hash join, what is the threshold for spilling a 
> hash to disk, and is this a tunable parameter?

Since your queries involve hash joins with subqueries as the inner table, I 
think this means that spill-over to disk occurs when the size of the 
in-memory result set reaches or exceeds one hundredth of the *current* memory in 
the JVM.  In the code this is coded as:

   Runtime.getRuntime().totalMemory()/100;

Thus for your case, there does not appear to be a way to tune the threshold for 
spilling the materialized subquery rows to disk. At least, no direct way; you 
can of course change the start and max heap size for the JVM, which indirectly 
affects the spill-over threshold.  But I'm not sure that's what you were looking 
for...
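For illustration, that figure can be computed from any Java program (plain Runtime arithmetic; the class and method names here are made up, nothing Derby-specific):

```java
public class SpillThreshold {
    // Mirrors the expression quoted above: one hundredth of the JVM's
    // current total memory, in bytes.  Because totalMemory() changes as
    // the heap grows, the threshold is a moving target.
    static long spillThresholdBytes() {
        return Runtime.getRuntime().totalMemory() / 100;
    }

    public static void main(String[] args) {
        System.out.println("spill threshold (bytes): " + spillThresholdBytes());
    }
}
```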

Army


RE: slow subqueries

Posted by Jim Newsham <jn...@referentia.com>.

> Army wrote:
> 
> > In order to get Derby to recognize the equijoin predicate, the subquery
> > has to appear in the FROM list.  So I created a view for the subquery:
> 
> <snip view>
> 
> > and then I changed the query to select and join with that view:
> 
> <snip>
> 
> > Working off the modified query above I used "optimizer overrides", which
> > were added as part of Derby 10.2, to 1) fix the join order so that the
> > subquery becomes the "inner" table, and 2) to force ("guarantee") a hash
> > join using the equijoin predicate, thereby allowing materialization of
> > the inner table (i.e. of the subquery).  The query to do that is as
> > follows:
> >
> > select count(*) from --DERBY-PROPERTIES joinOrder=FIXED
> >   time, v_hash --DERBY-PROPERTIES joinStrategy=HASH
> >   where time.id = v_hash.id;
> 
> Just one thing to mention as an afterthought: you do not have to create a
> separate view for this to work.  I did that for the sake of clarity, but
> you
> could just as easily write:
> 
> select count(*) from --DERBY-PROPERTIES joinOrder=FIXED
>    time,
>    (select distinct
>      time.id from time, double_sample, band
>     where band.id = double_sample.fk_band_id
>       and double_sample.fk_time_id = time.id and
>       band.id in (11, 12)
>    ) v_hash --DERBY-PROPERTIES joinStrategy=HASH
>    where time.id = v_hash.id;
> 
> Of course after writing all of that I realized that you said your queries
> are
> generated (as opposed to hand-written) and that they "are only
> representative of
> my problem. [...] While the above sample query can be easily modified to
> run
> well, I'm not sure how to do it in the general case."
> 
> So maybe this isn't going to work for you, after all...

Hi Army,

Thanks for the equijoin/subquery example, it makes sense to me now.
 
Regarding the optimizer override, I should have no problem getting those
hints into my sql statement as I'm generating the sql statements myself, and
they all follow a similar structure.  So on the contrary, your above idea
seems very feasible.  

I have three possible workaround ideas, but I'm going to pursue this one
first, as it seems likely to work reasonably well while not taking too much
development time, which is a critical issue now.  (The other ideas are to
simulate a subquery by iterating over two sorted queries in parallel; and to
use a temporary table for inner query results, using an equijoin with the
outer query).

You previously mentioned a Derby property called
derby.language.maxMemoryPerTable.  I don't see it mentioned in the current
documentation, but I found a jira issue which contains some proposed
documentation (http://issues.apache.org/jira/browse/DERBY-1397).  But it's
still not entirely clear to me.  Is this property only used when deciding
whether to perform a hash join or nested loop join?  Or does it also
influence how much of the hash join is kept in memory, or when that hash
join is spilled to disk?  Essentially, I'd like to know, if using the above
query with optimizer overrides which force a hash join, what is the
threshold for spilling a hash to disk, and is this a tunable parameter?

Thanks,

Jim





Re: slow subqueries

Posted by Army <qo...@gmail.com>.
Army wrote:

> In order to get Derby to recognize the equijoin predicate, the subquery 
> has to appear in the FROM list.  So I created a view for the subquery:

<snip view>

> and then I changed the query to select and join with that view:

<snip>

> Working off the modified query above I used "optimizer overrides", which 
> were added as part of Derby 10.2, to 1) fix the join order so that the 
> subquery becomes the "inner" table, and 2) to force ("guarantee") a hash 
> join using the equijoin predicate, thereby allowing materialization of 
> the inner table (i.e. of the subquery).  The query to do that is as 
> follows:
> 
> select count(*) from --DERBY-PROPERTIES joinOrder=FIXED
>   time, v_hash --DERBY-PROPERTIES joinStrategy=HASH
>   where time.id = v_hash.id;

Just one thing to mention as an afterthought: you do not have to create a 
separate view for this to work.  I did that for the sake of clarity, but you 
could just as easily write:

select count(*) from --DERBY-PROPERTIES joinOrder=FIXED
   time,
   (select distinct
     time.id from time, double_sample, band
    where band.id = double_sample.fk_band_id
      and double_sample.fk_time_id = time.id and
      band.id in (11, 12)
   ) v_hash --DERBY-PROPERTIES joinStrategy=HASH
   where time.id = v_hash.id;

Of course after writing all of that I realized that you said your queries are 
generated (as opposed to hand-written) and that they "are only representative of 
my problem. [...] While the above sample query can be easily modified to run 
well, I'm not sure how to do it in the general case."

So maybe this isn't going to work for you, after all...

Army


Re: slow subqueries

Posted by Army <qo...@gmail.com>.
Jim Newsham wrote:
> Thanks for the tips... both on getting compilation time, and getting the
> plan without running the full query.  Does the query plan really show up in
> the derby log?

It did for me, and I've always thought that was where it was supposed to go. 
Maybe I'm missing something about your environment, though...

> To turn my query into an equijoin subquery, would I just change "where x in
> (...subquery...)" to "where x = (...subquery...)"?  I believe I could do
> that above, if I change the outer query to return distinct results.

In Derby an equijoin predicate has to exist between two result sets in the FROM 
list of the query.  So I don't think changing "where x in " to "where x = " will 
be enough.
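For context, the hash join being discussed is conceptually a build/probe over the equijoin column: a toy sketch in plain Java with made-up id lists (not Derby's actual implementation):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashJoinSketch {
    // Build an in-memory hash table on the inner side's join column,
    // then probe it once per outer row: O(n + m) work overall, versus
    // the O(n * m) nested-loop alternative.
    static int countJoinRows(List<Long> outerIds, List<Long> innerIds) {
        Set<Long> built = new HashSet<>(innerIds); // build phase (the materialized side)
        int count = 0;
        for (long id : outerIds) {                 // probe phase
            if (built.contains(id)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // ids present on both sides: 2 and 8
        System.out.println(countJoinRows(List.of(1L, 2L, 3L, 5L, 8L),
                                         List.of(2L, 4L, 8L))); // prints 2
    }
}
```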

I did some (very quick) playing with the following query that you posted in an 
earlier thread:

select count(*) from time where time.id =
   (select distinct
     time.id from time, double_sample, band
    where band.id = double_sample.fk_band_id
      and double_sample.fk_time_id = time.id and
      band.id in (11, 12));

In order to get Derby to recognize the equijoin predicate, the subquery has to 
appear in the FROM list.  So I created a view for the subquery:

create view v_hash as
   select distinct
     time.id from time, double_sample, band
    where band.id = double_sample.fk_band_id
      and double_sample.fk_time_id = time.id and
      band.id in (11, 12);

and then I changed the query to select and join with that view:

select count(*) from time, v_hash where time.id = v_hash.id;

When I did that, Derby chose a query plan that put "time" as the inner table and 
then did a nested loop join with a table scan on "time"--but that's not what you 
wanted.  Which brings us to your next question:

> But if I understand what you wrote below correctly, this is a required
> condition but not a guarantee that a hash join will be used.  If the above
> transformation makes sense (including for performance) and there is a way to
> guarantee hash join materialization including caching to disk, I would love
> to hear about it.

Working off the modified query above I used "optimizer overrides", which were 
added as part of Derby 10.2, to 1) fix the join order so that the subquery 
becomes the "inner" table, and 2) to force ("guarantee") a hash join using the 
equijoin predicate, thereby allowing materialization of the inner table (i.e. of 
the subquery).  The query to do that is as follows:

select count(*) from --DERBY-PROPERTIES joinOrder=FIXED
   time, v_hash --DERBY-PROPERTIES joinStrategy=HASH
   where time.id = v_hash.id;

Note that a "return" is required after each Derby override.

At that point the query plan showed a hash join in which the view "v_hash" was 
materialized into memory--which is, I think, what you are looking for.

Are you working with 10.2, and if so would something like this work for you?  I 
did all of this with zero rows, so I have no idea how or if this will work for 
tables with *millions* of rows in them.  Might be worth a shot, though...

> In the end, we'd like to stick with Derby if we can, because it is pure Java
> and embedded, and this is a real big plus for deployment within a desktop
> application such as ours.  I'm hoping that I can come up with a feasible
> workaround for the subquery performance issue we have, and hope that in the
> near future Derby can make my hack obsolete.

If we're lucky the above workaround will make things more bearable from a 
performance standpoint.  It's a bit of a pain in that you would have to rewrite 
all of your queries, but I'll leave it up to you decide whether or not it's 
worth it...

Army


RE: slow subqueries

Posted by Jim Newsham <jn...@referentia.com>.
Hi everyone, and thanks Army for your detailed response.  My responses are
interspersed below...

> -----Original Message-----
> From: Army [mailto:qozinx@gmail.com]
> Sent: Friday, November 10, 2006 6:53 AM
> To: Derby Discussion
> Subject: Re: slow subqueries
> 
> Jim Newsham wrote:
> > Why do queries with subqueries perform so poorly?
> 
> This is one of the places where Derby optimization could definitely use
> some
> improvement.  There are several known issues with the optimization of
> subqueries
> that can potentially lead to sub-optimal plans, including unreasonably
> high cost
> estimates (DERBY-1905, DERBY-1260), lack of an effective pruning algorithm
> (DERBY-1907), and a possibly insufficient "timeout" mechanism (DERBY-
> 1906).
> There has been work in this area to hopefully improve the situation for
> 10.2
> (DERBY-805, DERBY-781, DERBY-1007), but as you yourself have noticed, we
> have a
> long way to go.
> 
> > The following query completes almost instantly.  I suppose Derby
> simplifies
> > the subquery internally to "band.id = 11":
> >
> > ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> distinct
> > time.id from time, double_sample, band where band.id =
> > double_sample.fk_band_id and double_sample.fk_time_id = time.id and
> band.id
> > in (11));
> >
> > This query "does not complete":
> >
> > ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> distinct
> > time.id from time, double_sample, band where band.id =
> > double_sample.fk_band_id and double_sample.fk_time_id = time.id and
> band.id
> > in (11, 12));
> 
> Using the DDL that you posted in an earlier email I ran these queries and
> they
> both completed in less than a second.  Note, though, that I didn't have
> any data
> in the tables :)  That may sound like useless information to you, but it
> tells
> me that the query compilation is at least completing in a reasonable
> amount of
> time, which is a first step.
> 
> See http://wiki.apache.org/db-derby/PerformanceDiagnosisTips
> 
> In your case do you know how much time is being spent in compilation vs
> execution?  The easiest way to tell this in ij is to prepare the query
> first,
> then execute it:
> 
> ij> elapsedtime on;
> ij> prepare ps1 as 'select count(*) from time where time.id in (select
> distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11))';
> ELAPSED TIME = 2344 milliseconds
> ij> execute ps1;
> 1
> -----------
> 0
> 
> 1 row selected
> ELAPSED TIME = 60 milliseconds
> ij> prepare ps2 as 'select count(*) from time where time.id in (select
> distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11, 12))';
> ELAPSED TIME = 301 milliseconds
> ij> execute ps2;
> 1
> -----------
> 0
> 
> 1 row selected
> ELAPSED TIME = 10 milliseconds
> 
> If you have a script or a Java program that loads data into your tables,
> and if
> you are willing and able to post that script/program to the list, then
> those who
> are following this discussion might be more easily able to reproduce the
> problem, which would help investigation.  If you cannot or do not want to
> post
> such a script/program, then any description about the kind and amount of
> data in
> the relevant tables would be helpful, as well.  For example, what value do
> you
> expect "count(*)" to return in these queries?  How many rows does the
> subquery
> in the IN clause return?  How many rows are in the various tables?

Unfortunately I don't have a data load script that I can release at the
moment.  My biggest problem is that I'm working on something that needs to
have been done yesterday.  I'd be happy to see performance enhancements in
Derby which apply to the cases I am using, but they won't come soon enough.
I need to find an interim solution.  Though if I do have some time later
I'll try to put together a test schema, data set, and collection of queries
that I can release.

Both the inner query and the outer query may return a very large number of
records.  In my test data set, this is approximately 20,000 each, although
in practice it can be a much larger number... perhaps millions or more.
This is why a nested loop is not going to work here... 20,000 squared
operations is very expensive, let alone millions squared.  For a query with
this profile, the inner query should only be executed once.  If it needs to
be cached, fine... but...

In my previous email I suggested the idea of Derby generating two internal
result sets (one for the outer, and one for the inner query) and then
scanning these two result sets in parallel to produce the external result
set.  I know I don't really understand the internals of Derby... does this
even make sense?  Is it feasible?  It seems like the most efficient solution
if both the inner and outer queries have a very large number of results.
For this to work, the internal result sets would need to be sorted on the
column that they're related on.  For the cases I'm working with, this would
always be a primary key column.

In fact, as a workaround to the performance problem I have, I'm very
seriously considering simulating my subqueries outside of Derby, by writing
a ResultSet implementation which merges the results of two other (Derby)
result sets iterated over in parallel as described above.  Integrating this
into my other code is going to be a pretty ugly hack, and obviously it would
be much more efficient if Derby could do this internally.  

Obviously, I'd like to hear about any flaws you see with this approach (both
within Derby and for my hack/workaround).

> 
> > I've been searching/reading around and trying to understand a little
> about
> > how Derby works and how to check the statement plan.  I've read through
> the
> > tuning document.  Unfortunately if a query doesn't complete, I can't get
> the
> > statement plan.
> 
> Actually, I think you only need to retrieve the first row in order to get
> the
> statement plan.  For example, if you set
> "derby.language.logQueryPlan=true" and
> then run the following:
> 
> ij> get cursor c1 as 'select * from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11, 12))';
> ij> next c1;
> ij> close c1;
> ij> exit;
> 
> then look at derby.log, you should (hopefully) see the query plan.
> 
> NOTE: I removed the "count(*)" from the query so that we can fetch one row
> at a
> time.  If you do this, you should see the query plan in the log file after
> the
> first row has been retrieved.  Of course, whether or not this will help
> depends
> on how long it takes to retrieve the first row--but it's worth a try.

Thanks for the tips... both on getting compilation time, and getting the
plan without running the full query.  Does the query plan really show up in
the derby log?  I was getting the plan in ij, which is not so convenient.  I
suppose this is because I configured logging using a database-wide property
instead of a global property.

> 
> > Is the entire inner query being re-evaluated for every candidate record
> in
> > the outer query?
> 
> That's possible, yes.  DERBY-781 made some changes to materialize a
> subquery by
> use of a hash join, but a hash join requires an equijoin predicate.  For
> your
> query there is no explicit equijoin between "time" and the subquery, so I
> do not
> think DERBY-781 will help.  Therefore it is quite possible that the
> subquery is
> in fact being re-evaluated multiple times, as you say.

To turn my query into an equijoin subquery, would I just change "where x in
(...subquery...)" to "where x = (...subquery...)"?  I believe I could do
that above, if I change the outer query to return distinct results.

But if I understand what you wrote below correctly, this is a required
condition but not a guarantee that a hash join will be used.  If the above
transformation makes sense (including for performance) and there is a way to
guarantee hash join materialization including caching to disk, I would love
to hear about it.

> 
> > I read somewhere that inner queries may not be materialized if they are
> > larger than a certain threshold.
> 
> When talking about hash joins, the estimated in-memory size of the inner
> table/query has to be less than derby.language.maxMemoryPerTable, where
> that
> property is specified in kilobytes and defaults to 1024 (i.e. default max
> mem
> per table is 1 meg).  You can try setting that property to a larger value
> (within the constraints of JVM heap), but I do not know if that will help.
> It
> may make hash joins more feasible *within* the subquery, but I do not
> think it
> will allow materialization of the subquery itself since, as mentioned
> above,
> there is no explicit equijoin with the subquery.
> 
> > If the inner result set is too large for memory, couldn't it be cached
> > to disk?
> 
> Yes, it can.  And if, for example, you set the
> derby.language.maxMemoryPerTable property to a value that is too large
> for memory, this "spill-over" to disk will in fact occur.  I think
> there is an improvement filed somewhere to make it so that instead of
> "aborting" a hash join when the inner result set is too large, the
> optimizer should just make the join more expensive to account for the
> spill-to-disk i/o.  But I don't think anyone has made that improvement
> yet...
> 
> So in the end, there's probably not much information in this email that
> will tell you how to make your query run more quickly.  As you can see,
> though, there is plenty of room for improvement in this particular area
> of Derby.  If this is something that is exciting and/or essential for
> you, you should definitely feel free to join the development community
> and to help address the problem.  We'd be thrilled to have you!
> 
> Army

I appreciate your feedback, and there was plenty of useful information in
your response.  I really wish I had the time to dig into Derby's source code
to see what it's doing, let alone spend time contributing improvements.
Unfortunately I'm working on borrowed time and have absolutely no time to
deviate.

In the end, we'd like to stick with Derby if we can, because it is pure Java
and embedded, and this is a real big plus for deployment within a desktop
application such as ours.  I'm hoping that I can come up with a feasible
workaround for the subquery performance issue we have, and hope that in the
near future Derby can make my hack obsolete.
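For what it's worth, the workaround I have in mind is to materialize the subquery result into a scratch table manually, so the expensive inner query runs exactly once.  This is only a sketch, reusing the table and column names from the queries above; the scratch table name tmp_time_ids is made up:

```sql
-- One-off scratch table to hold the subquery result
-- (the name tmp_time_ids is hypothetical).
create table tmp_time_ids (id integer);

-- Run the expensive inner query exactly once.
insert into tmp_time_ids
  select distinct time.id
  from time, double_sample, band
  where band.id = double_sample.fk_band_id
    and double_sample.fk_time_id = time.id
    and band.id in (11, 12);

-- The outer query now probes a small, already-materialized table.
select count(*) from time
  where time.id in (select id from tmp_time_ids);

drop table tmp_time_ids;
```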

Thanks,
Jim Newsham

Re: slow subqueries

Posted by Army <qo...@gmail.com>.
Jim Newsham wrote:
> Why do queries with subqueries perform so poorly?

This is one of the places where Derby optimization could definitely use some 
improvement.  There are several known issues with the optimization of subqueries 
that can potentially lead to sub-optimal plans, including unreasonably high cost 
estimates (DERBY-1905, DERBY-1260), lack of an effective pruning algorithm 
(DERBY-1907), and a possibly insufficient "timeout" mechanism (DERBY-1906). 
There has been work in this area to hopefully improve the situation for 10.2 
(DERBY-805, DERBY-781, DERBY-1007), but as you yourself have noticed, we have a 
long way to go.

> The following query completes almost instantly.  I suppose Derby simplifies
> the subquery internally to "band.id = 11":
> 
> ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
> in (11));
> 
> This query "does not complete":
> 
> ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
> in (11, 12));

Using the DDL that you posted in an earlier email I ran these queries and they 
both completed in less than a second.  Note, though, that I didn't have any data 
in the tables :)  That may sound like useless information to you, but it tells 
me that the query compilation is at least completing in a reasonable amount of 
time, which is a first step.

See http://wiki.apache.org/db-derby/PerformanceDiagnosisTips

In your case, do you know how much time is being spent in compilation vs 
execution?  The easiest way to tell this in ij is to prepare the query first, 
then execute it:

ij> elapsedtime on;
ij> prepare ps1 as 'select count(*) from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11))';
ELAPSED TIME = 2344 milliseconds
ij> execute ps1;
1
-----------
0

1 row selected
ELAPSED TIME = 60 milliseconds
ij> prepare ps2 as 'select count(*) from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11, 12))';
ELAPSED TIME = 301 milliseconds
ij> execute ps2;
1
-----------
0

1 row selected
ELAPSED TIME = 10 milliseconds

If you have a script or a Java program that loads data into your tables, and if 
you are willing and able to post that script/program to the list, then those who 
are following this discussion might be more easily able to reproduce the 
problem, which would help investigation.  If you cannot or do not want to post 
such a script/program, then any description about the kind and amount of data in 
the relevant tables would be helpful, as well.  For example, what value do you 
expect "count(*)" to return in these queries?  How many rows does the subquery 
in the IN clause return?  How many rows are in the various tables?

> I've been searching/reading around and trying to understand a little about
> how Derby works and how to check the statement plan.  I've read through the
> tuning document.  Unfortunately if a query doesn't complete, I can't get the
> statement plan.

Actually, I think you only need to retrieve the first row in order to get the 
statement plan.  For example, if you set "derby.language.logQueryPlan=true" and 
then run the following:

ij> get cursor c1 as 'select * from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11, 12))';
ij> next c1;
ij> close c1;
ij> exit;

then look at derby.log; you should (hopefully) see the query plan.

NOTE: I removed the "count(*)" from the query so that we can fetch one row at a 
time.  If you do this, you should see the query plan in the log file after the 
first row has been retrieved.  Of course, whether or not this will help depends 
on how long it takes to retrieve the first row--but it's worth a try.
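In case it saves you a trip to the docs: derby.language.logQueryPlan can be set system-wide in derby.properties, or for a single database via the built-in property procedure.  A sketch from ij (double-check against your docs version; it affects statements compiled after the call):

```sql
-- Enable query-plan logging for the current database only.
call SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY(
    'derby.language.logQueryPlan', 'true');
```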

> Is the entire inner query being re-evaluated for every candidate record in
> the outer query?

That's possible, yes.  DERBY-781 made some changes to materialize a subquery by 
use of a hash join, but a hash join requires an equijoin predicate.  For your 
query there is no explicit equijoin between "time" and the subquery, so I do not 
think DERBY-781 will help.  Therefore it is quite possible that the subquery is 
in fact being re-evaluated multiple times, as you say.
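One way to hand the optimizer an explicit equijoin to work with would be to flatten the IN predicate into a join against the subquery as a derived table.  This is an untested sketch using the column names from your queries; it should return the same count as the IN form provided time.id is unique:

```sql
-- Rewrite the IN-subquery as a join with an explicit equality
-- predicate, which is the shape a hash join requires.
select count(*)
from time t,
     (select distinct ds.fk_time_id as tid
      from double_sample ds, band b
      where b.id = ds.fk_band_id
        and b.id in (11, 12)) as s
where t.id = s.tid;
```

Whether Derby will actually pick a hash join for this plan is another question, but at least the equijoin predicate is there for it to consider.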

> I read somewhere that inner queries may not be materialized if they are 
> larger than a certain threshold.

When talking about hash joins, the estimated in-memory size of the inner 
table/query has to be less than derby.language.maxMemoryPerTable, where that 
property is specified in kilobytes and defaults to 1024 (i.e. default max mem 
per table is 1 meg).  You can try setting that property to a larger value 
(within the constraints of JVM heap), but I do not know if that will help.  It 
may make hash joins more feasible *within* the subquery, but I do not think it 
will allow materialization of the subquery itself since, as mentioned above, 
there is no explicit equijoin with the subquery.
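For reference, derby.language.maxMemoryPerTable is a system-wide property, so it has to be in place before the engine boots: either in derby.properties under derby.system.home, or as a JVM system property.  The value 10240 below (10 MB) is purely illustrative:

```
# derby.properties (value in kilobytes)
derby.language.maxMemoryPerTable=10240

# or equivalently on the JVM command line:
#   java -Dderby.language.maxMemoryPerTable=10240 ...
```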

> If the inner result set is too large for memory, couldn't it be cached 
> to disk?

Yes, it can.  And if, for example, you set the derby.language.maxMemoryPerTable 
property to a value that is too large for memory, this "spill-over" to disk will 
in fact occur.  I think there is an improvement filed somewhere to make it so 
that instead of "aborting" a hash join when the inner result set is too large, 
the optimizer should just make the join more expensive to account for the 
spill-to-disk i/o.  But I don't think anyone has made that improvement yet...

So in the end, there's probably not much information in this email that will 
tell you how to make your query run more quickly.  As you can see, though, there 
is plenty of room for improvement in this particular area of Derby.  If this is 
something that is exciting and/or essential for you, you should definitely feel 
free to join the development community and to help address the problem.  We'd be 
thrilled to have you!

Army