You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Harshad <ha...@gmail.com> on 2008/12/15 14:33:40 UTC

Performance issue of derby using JDBC

Hi all,

I am new to Derby and database programming in general. This is my first real-world setup, and I am getting horrifying performance from my application. (The code is available at http://code.google.com/p/jamun/ )

It is one particular query that is causing me major worry. It is a simple query that searches for a string in a "character varying" column.

I have tried these following optimizations already:
  * Using prepared statements
  * Creating an index on the searched column (and ensuring that it indeed
    gets used)
  * Used a non-scrollable, forward only result set
  * Avoided row-level locking

These seem to have paid off; the actual query time (statement.executeQuery) is pretty fast : about 1 or 2 milliseconds.

But reading from the resultSet using APIs such as getString, getLong, takes about 250 ms or more!!

A couple of questions:
1. Are these numbers typical?
2. Are there alternatives to the JDBC api (for derby)?
3. Is the time consumed by executeQuery misleading, because it gets evaluated lazily on actual access of resultSet?

thanks in advance,
Harshad

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Rick Hillegas wrote:
>> I would have thought that the DISTINCT is not redundant and actually
>> allows for the transformation into a normal join.
>>   
> I think that, for instance, a problem arises when there are M tuples in
> PROVIDES with name='Fred' but there are < M distinct ids in those
> tuples. In that case, you can imagine data sets on which the original
> query returns < M tuples but the transformed query returns M tuples.

You are right. I goofed up.

> It may be that Derby did apply the EXISTS transformation--and that
> degraded rather than improved the performance. If you are interested in
> seeing the plan which Derby selected, please consult the following wiki
> pages:
> 
>   http://wiki.apache.org/db-derby/PerformanceDiagnosisTips
>   http://wiki.apache.org/db-derby/StmtExecutionPlan

I tried looking at the optimiser query plan, but it is beyond my understanding right now. Thanks for all your help and the links. I will use them as reference later.

cheers,
Harshad

Re: Performance issue of derby using JDBC

Posted by Rick Hillegas <Ri...@Sun.COM>.

Hi Harshad,

Some comments inline...

Harshad wrote:
> Rick Hillegas wrote:
>   
>> Depending on your data, the two queries could return different results,
>> so Bryan's faster query is not a transformation which Derby would
>> perform automatically. 
>>     
>
> Oh yes! Although my data ensures that the subquery would return distinct results, Derby wouldn't know about it. Is there a way to say that the combination of two columns is unique?
>
> When I tried to specify DISTINCT explicitly like this:
> select name,version,release,time from PKG where PKG.id in (select DISTINCT PROVIDES.id from PROVIDES where (PROVIDES.name = ?))
>
> ...I didn't get any change in the performance figures. That is probably because of the "DISTINCT elimination in IN sub-queries" mentioned in "Tuning derby".
>
> I would have thought that the DISTINCT is not redundant and actually allows for the transformation into a normal join.
>   
I think that, for instance, a problem arises when there are M tuples in 
PROVIDES with name='Fred' but there are < M distinct ids in those 
tuples. In that case, you can imagine data sets on which the original 
query returns < M tuples but the transformed query returns M tuples.
>   
>> Nevertheless, it looks to me as though your
>> original query almost qualifies for the EXISTS transformation documented
>> in the Derby Tuning Guide. 
>>     
>
> Why do you say "almost qualifies"; where is it not qualifying? And, will the EXISTS transformation speed up the query as much as the transformation to a normal join?
>   
I haven't looked under the hood here. The Tuning Guide doesn't 
completely describe the optimizer's reasoning. A smart optimizer won't 
unconditionally flatten subqueries.

Without having looked at the code (but knowing who wrote it), I suspect 
that the EXISTS transformation may be a normal join combined with a 
tuple filter which throws away duplicates on a sorted leg of the join.

It may be that Derby did apply the EXISTS transformation--and that 
degraded rather than improved the performance. If you are interested in 
seeing the plan which Derby selected, please consult the following wiki 
pages:

  http://wiki.apache.org/db-derby/PerformanceDiagnosisTips
  http://wiki.apache.org/db-derby/StmtExecutionPlan

I realize that I'm raising more questions than I'm answering.

Hope this is a little useful,
-Rick
> Thanks to all who have responded on this thread.
>
> I have learnt not to trust the execution time of Statement.executeQuery(). It looks like the actual table access is made lazily (as and when the result set is accessed).
>
>

Re: Performance issue of derby using JDBC

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.

Harshad <ha...@gmail.com> writes:

> Thanks Mike & Kristian for the tips on indexes.
>
> Creating an index on multiple columns in my DB does indeed speed up
> the original query, so Yay!
>
> It would have been even cooler if I could just declare that
> combination of two columns will be unique, rather than create an index
> (with its associated overheads). (I am assuming that in this instance
> the actual index is not useful to the optimiser)

Actually you can declare that the combination of two columns is unique,
like this:

  CREATE TABLE T (X INT, Y INT, Z INT, UNIQUE(X,Y))

Derby will still create an index, though, since it needs an efficient
way to check that no duplicates are introduced by INSERT or UPDATE
statements. But you don't need to issue a CREATE INDEX statement
yourself.

-- 
Knut Anders

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Thanks Mike & Kristian for the tips on indexes.

Creating an index on multiple columns in my DB does indeed speed up the original query, so Yay!

It would have been even cooler if I could just declare that combination of two columns will be unique, rather than create an index (with its associated overheads). (I am assuming that in this instance the actual index is not useful to the optimiser)

Kristian Waagan wrote:

> derby@segel.com wrote:
>> Hi,
>>   
> [ snip ]
>> You can create as many indexes on a table as you want, provided that the
>> names of the indexes are unique. One major caveat... Too many indexes may
>> confuse the optimizer and when your query is processed, the engine may
>> choose a different index.
>>   
> 
> In addition to what Mike is saying, you should also consider a few more
> issues regarding indexes.
> Indexes have to be maintained, which can hurt your performance for
> certain types of load (i.e. inserts and deletes).
> Another issue is that they take up storage space. In Derby each index is
> represented by a file on disk.
> 
> Finally, I believe Derby still has an issue with keeping the index
> statistics updated (but an automatic solution is being worked on). If
> you're loading up your tables with lots of data, consider creating the
> indexes after the load or compress the tables to update the statistics.
> 
> 
> Regards,

Re: Performance issue of derby using JDBC

Posted by Kristian Waagan <Kr...@Sun.COM>.

derby@segel.com wrote:
> Hi,
>   
[ snip ]
> You can create as many indexes on a table as you want, provided that the
> names of the indexes are unique. One major caveat... Too many indexes may
> confuse the optimizer and when your query is processed, the engine may
> choose a different index.
>   

In addition to what Mike is saying, you should also consider a few more 
issues regarding indexes.
Indexes have to be maintained, which can hurt your performance for 
certain types of load (i.e. inserts and deletes).
Another issue is that they take up storage space. In Derby each index is 
represented by a file on disk.

Finally, I believe Derby still has an issue with keeping the index 
statistics updated (but an automatic solution is being worked on). If 
you're loading up your tables with lots of data, consider creating the 
indexes after the load or compress the tables to update the statistics.

Regards,
-- 
Kristian

> HTH
>
> -Mike
>
>
>   
>> -----Original Message-----
>> From: news [mailto:news@ger.gmane.org] On Behalf Of Harshad
>> Sent: Tuesday, December 16, 2008 11:20 PM
>> To: derby-user@db.apache.org
>> Subject: Re: Performance issue of derby using JDBC
>>
>>     
[ snip ]

RE: Performance issue of derby using JDBC

Posted by de...@segel.com.

Hi,

Sorry for top posting.
What you're looking for is an index. :-)

Ok, I know you know about indexes on individual columns but you can create
an index on multiple columns too! 

It's the same statement that you use to create your other indexes but you
can identify multiple columns where the order of the column is the order of
precedence. (foo, bar) would first index on foo then on bar. 

I'm not sure what online references exist, or what is today a good book on
databases. I don't know if they have anything from CJ Date ??? in print
these days. (Sorry, its either alzhimers, or a lack of caffeine, or both.)
Date was the golden reference for relational databases when I was in school
some 20+ years ago. 

You can create as many indexes on a table as you want, provided that the
names of the indexes are unique. One major caveat... Too many indexes may
confuse the optimizer and when your query is processed, the engine may
choose a different index.

HTH

-Mike

> -----Original Message-----
> From: news [mailto:news@ger.gmane.org] On Behalf Of Harshad
> Sent: Tuesday, December 16, 2008 11:20 PM
> To: derby-user@db.apache.org
> Subject: Re: Performance issue of derby using JDBC
> 
> Rick Hillegas wrote:
> > Depending on your data, the two queries could return different results,
> > so Bryan's faster query is not a transformation which Derby would
> > perform automatically.
> 
> Oh yes! Although my data ensures that the subquery would return distinct
> results, Derby wouldn't know about it. Is there a way to say that the
> combination of two columns is unique?
> 
> When I tried to specify DISTINCT explicitly like this:
> select name,version,release,time from PKG where PKG.id in (select DISTINCT
> PROVIDES.id from PROVIDES where (PROVIDES.name = ?))
> 
> ...I didn't get any change in the performance figures. That is probably
> because of the "DISTINCT elimination in IN sub-queries" mentioned in
> "Tuning derby".
> 
> I would have thought that the DISTINCT is not redundant and actually
> allows for the transformation into a normal join.
> 
> > Nevertheless, it looks to me as though your
> > original query almost qualifies for the EXISTS transformation documented
> > in the Derby Tuning Guide.
> 
> Why do you say "almost qualifies"; where is it not qualifying? And, will
> the EXISTS transformation speed up the query as much as the transformation
> to a normal join?
> 
> Thanks to all who have responded on this thread.
> 
> I have learnt not to trust the execution time of Statement.executeQuery().
> It looks like the actual table access is made lazily (as and when the
> result set is accessed).

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Rick Hillegas wrote:
> Depending on your data, the two queries could return different results,
> so Bryan's faster query is not a transformation which Derby would
> perform automatically. 

Oh yes! Although my data ensures that the subquery would return distinct results, Derby wouldn't know about it. Is there a way to say that the combination of two columns is unique?

When I tried to specify DISTINCT explicitly like this:
select name,version,release,time from PKG where PKG.id in (select DISTINCT PROVIDES.id from PROVIDES where (PROVIDES.name = ?))

...I didn't get any change in the performance figures. That is probably because of the "DISTINCT elimination in IN sub-queries" mentioned in "Tuning derby".

I would have thought that the DISTINCT is not redundant and actually allows for the transformation into a normal join.

> Nevertheless, it looks to me as though your
> original query almost qualifies for the EXISTS transformation documented
> in the Derby Tuning Guide. 

Why do you say "almost qualifies"; where is it not qualifying? And, will the EXISTS transformation speed up the query as much as the transformation to a normal join?

Thanks to all who have responded on this thread.

I have learnt not to trust the execution time of Statement.executeQuery(). It looks like the actual table access is made lazily (as and when the result set is accessed).

Re: Performance issue of derby using JDBC

Posted by Rick Hillegas <Ri...@Sun.COM>.

Harshad wrote:
> Bryan Pendleton wrote:
>
>   
>> Harshad wrote:
>>     
>>>   "select name,version,release,time from PKG where PKG.id in (select id
>>>   from PROVIDES where name = ?)",
>>>       
>> What happens if you run this statement instead:
>>
>> select pkg.name,pkg.version,pkg.release,pkg.time from pkg, provides
>>         where pkg.id = provides.id and provides.name = ?
>>     
>
>
> Whew! That solves it! No need for numbers, they are sub milli-second now!
>
> Can't the SQL optimiser do this automatically? Is this a derby specific problem?  (Sorry for amateurish questions; this is my first real db application)
>
> I ask because I have another application where I programmatically generate queries, and the former type of query is easier to generate than the one you suggested. I always assumed that the SQL optimiser will "flatten" the query...
>
> Thanks,
> Harshad
>
>   
Hi Harshad,

Depending on your data, the two queries could return different results, 
so Bryan's faster query is not a transformation which Derby would 
perform automatically. Nevertheless, it looks to me as though your 
original query almost qualifies for the EXISTS transformation documented 
in the Derby Tuning Guide. For more information, please see the 
"Subquery processing and transformations" section of the Derby Tuning 
Guide: http://db.apache.org/derby/docs/10.4/tuning/

Hope this helps,
-Rick

Regards,
-Rick

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Bryan Pendleton wrote:

> Harshad wrote:
>>   "select name,version,release,time from PKG where PKG.id in (select id
>>   from PROVIDES where name = ?)",
> 
> What happens if you run this statement instead:
> 
> select pkg.name,pkg.version,pkg.release,pkg.time from pkg, provides
>         where pkg.id = provides.id and provides.name = ?

Whew! That solves it! No need for numbers, they are sub milli-second now!

Can't the SQL optimiser do this automatically? Is this a derby specific problem?  (Sorry for amateurish questions; this is my first real db application)

I ask because I have another application where I programmatically generate queries, and the former type of query is easier to generate than the one you suggested. I always assumed that the SQL optimiser will "flatten" the query...

Thanks,
Harshad

Re: Performance issue of derby using JDBC

Posted by Bryan Pendleton <bp...@amberpoint.com>.

Harshad wrote:
>   "select name,version,release,time from PKG where PKG.id in (select id from PROVIDES where name = ?)",

What happens if you run this statement instead:

select pkg.name,pkg.version,pkg.release,pkg.time from pkg, provides
        where pkg.id = provides.id and provides.name = ?

thanks,

bryan

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Hi Kristian,

The entire code is available at http://code.google.com/p/jamun/.
It is written in scala, which might not be familiar to all. I am posting below some snippets that might help.

Kristian Waagan wrote:

> I think you'll have a better chance of getting any answers if you
> provide more information, for instance;
>   o data volume; number of rows in table(s)

Number of rows in the concerned table = ~ 50k

>   o the SQL query, and possibly the Java code executing it (a little
> later, the query plan might come in handy too)

This is the prepared statement:

conn.prepareStatement ( 
  "select name,version,release,time from PKG where PKG.id in (select id from PROVIDES where name = ?)",
  ResultSet.FETCH_FORWARD,ResultSet.TYPE_FORWARD_ONLY )


This is the collector function (it is in scala, but I hope it's logic would be apparent):

def extractPkgResult(repo:Repo, resultSet:ResultSet) = {
     if (resultSet.next) {
       var pkgList:List[MngdPackage] = Nil
       var keepGoing = true
       while (keepGoing) {
         pkgList ::= MngdPackage(resultSet.getString(1), resultSet.getString(2), resultSet.getString(3), "s umm", "", Nil, Nil, Some(repo), resultSet.getLong(4))

         keepGoing = resultSet.next
       }
       pkgList
     } else Nil
   }


>   o database schema (including indexes)
create table PKG  (
                         id integer primary key,
                         name character varying (60),
                         version character varying (40),
                         release character varying (40),
                         time bigint)
create table PROVIDES (
                         id integer, foreign key (id) references PKG(id),
                         name character varying (256),
                         version character varying (40))

create index PROVIDES_NAME_IDX on PROVIDES (name)

> 
> Since you're searching for a string inside a VARCHAR, Derby has to check
> all rows in the table. Are you using LIKE in the query?

No, it's a "where x = y"

> Is there a way that you can reduce the number of rows that qualify?
No. It is already down to 1 (99% of the time)

> Are you seeing the long wait on the first rs.next() ?

Yes! I tried more detailed profiling now and this is indeed true.

> What about the subsequent rs.next() calls (if more than one row in the
> result)?

In case of empty result, first rs.next consumes big time.
In case of non-empty result, the rs.next at the end of the result (when it returns false) consumes little time; less than 1 ms. All other rs.next calls consume big time (150 to 200 ms)

> 
> 
> There are also some tips here:
> http://wiki.apache.org/db-derby/PerformanceDiagnosisTips

I am attaching the query plan.

thanks,
Harshad

Re: Performance issue of derby using JDBC

Posted by Kristian Waagan <Kr...@Sun.COM>.

Harshad wrote:
> Harshad wrote:
>> These seem to have paid off; the actual query time
>> (statement.executeQuery) is pretty fast : about 1 or 2 milliseconds.
>>
>> But reading from the resultSet using APIs such as getString, getLong,
>> takes about 250 ms or more!!
> 
> Forgot to mention; the query usually (99%) results in just a singe row. The 250ms is consumed in just reading the data from that single row!
> 

Hello Harshad,

I think you'll have a better chance of getting any answers if you 
provide more information, for instance;
  o data volume; number of rows in table(s)
  o the SQL query, and possibly the Java code executing it (a little 
later, the query plan might come in handy too)
  o database schema (including indexes)

Since you're searching for a string inside a VARCHAR, Derby has to check 
all rows in the table. Are you using LIKE in the query?
I'm just guessing here based on the information that is already available.
Is there a way that you can reduce the number of rows that qualify?
Are you seeing the long wait on the first rs.next() ?
What about the subsequent rs.next() calls (if more than one row in the 
result)?

There are also some tips here: 
http://wiki.apache.org/db-derby/PerformanceDiagnosisTips

Regards,
-- 
Kristian

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Hi Emmanuel,

Emmanuel Cecchet wrote:

> Are you using getString("columnName") or getString(1) to access the
> result? Some databases only fetch ResultSetMetaData (including column
> names) when it needs them. Using the column index can dramatically improve
> performance. I am not sure if this is the case with Derby.

I am using getString(1).

I will post more detailed code samples in my reply to Kristian.

Thanks,
Harshad

Re: Performance issue of derby using JDBC

Posted by Emmanuel Cecchet <ma...@frogthinker.org>.

Harshad wrote:
> Harshad wrote:
>   
>> These seem to have paid off; the actual query time
>> (statement.executeQuery) is pretty fast : about 1 or 2 milliseconds.
>>
>> But reading from the resultSet using APIs such as getString, getLong,
>> takes about 250 ms or more!!
>>     
>
> Forgot to mention; the query usually (99%) results in just a singe row. The 250ms is consumed in just reading the data from that single row
Are you using getString("columnName") or getString(1) to access the result?
Some databases only fetch ResultSetMetaData (including column names) 
when it needs them. Using the column index can dramatically improve 
performance. I am not sure if this is the case with Derby.

Emmanuel

-- 
Emmanuel Cecchet
FTO @ Frog Thinker 
Open Source Development & Consulting
--
Web: http://www.frogthinker.org
email: manu@frogthinker.org
Skype: emmanuel_cecchet

Re: Performance issue of derby using JDBC

Posted by Harshad <ha...@gmail.com>.

Harshad wrote:
> These seem to have paid off; the actual query time
> (statement.executeQuery) is pretty fast : about 1 or 2 milliseconds.
> 
> But reading from the resultSet using APIs such as getString, getLong,
> takes about 250 ms or more!!

Forgot to mention; the query usually (99%) results in just a singe row. The 250ms is consumed in just reading the data from that single row!