Posted to derby-user@db.apache.org by Tim Dudgeon <td...@informaticsmatters.com> on 2009/03/25 20:29:34 UTC

Index performance

I found an interesting performance problem that I'd welcome some help in 
understanding.

I have a table of 4,000,000 rows that has a DOUBLE column containing 
floating point numbers.
I run a query like this:

select pk_column from the_table where the_double_column > 300 and 
the_double_column < 320

271136 rows are returned.
I then go through the ResultSet and extract all the pk_column values.
All of this is done using standard JDBC.

When I do this it takes 23 seconds, which I thought was not unreasonable 
as a full table scan was involved and the table was pretty big.

But of course I thought an index would help, so I added an index to the 
the_double_column and repeated the process. It took 720 seconds, 31x 
slower! I thought this was strange, but thought it might be because I 
was using standard settings and the 4MB page cache was much too small to 
hold the index.

So I increased the page cache size (derby.storage.pageCacheSize 
property) to 10x the size (10,000) and repeated the process. There was 
only a very minor improvement in speed.

In all cases the memory usage, as reported by:
Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()
was really no different, and the used memory was much less than the maximum 
available specified by the -Xmx setting.


Any ideas what to do?

Tim


Re: Index performance

Posted by Mike Matrigali <mi...@sbcglobal.net>.

The technique of adding extra columns to indexes works well for Derby if
it matches your application needs.  The docs usually refer to this as a
covering index, and the optimizer is pretty good at looking for cases 
where it can use a covering index and avoid going to the base table. 
Hopefully this will help your application.
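A covering index can be sketched in plain Java (an illustrative in-memory model, not Derby internals; `ix_cover`, the helper, and the map layout are all hypothetical). Because pk_column lives inside the index entry, a range scan like the one in this thread never has to visit the base table:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a covering index such as
//   CREATE INDEX ix_cover ON the_table (the_double_column, pk_column)
// (hypothetical name; a sketch, not Derby's actual B-tree).
public class CoveringIndexSketch {

    // Range scan over the index alone: because pk_column is stored in the
    // index entries, no base-table probe is ever needed.
    static List<Long> pksInRange(NavigableMap<Double, List<Long>> index,
                                 double lo, double hi) {
        List<Long> pks = new ArrayList<>();
        // Exclusive bounds, matching "> 300 and < 320" in the thread's query.
        for (List<Long> bucket : index.subMap(lo, false, hi, false).values())
            pks.addAll(bucket);  // every column the query needs is already here
        return pks;
    }

    public static void main(String[] args) {
        NavigableMap<Double, List<Long>> index = new TreeMap<>();
        index.computeIfAbsent(299.5, k -> new ArrayList<>()).add(1L);
        index.computeIfAbsent(305.0, k -> new ArrayList<>()).add(2L);
        index.computeIfAbsent(310.0, k -> new ArrayList<>()).add(3L);
        index.computeIfAbsent(320.0, k -> new ArrayList<>()).add(4L);
        System.out.println(pksInRange(index, 300, 320)); // prints [2, 3]
    }
}
```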

From the info posted, the query with the index is definitely doing an index
to base row lookup for every qualifying row. It looks like for this
data distribution and number of qualifying rows this is a worse plan than
just doing a full table scan.
The optimizer should have picked the base table scan
rather than the index to base row given how the 2 performed. I think
this is another case showing the costs for the optimizer need to be 
updated to reflect current technology.  There has been a lot of work to
make scans go fast and that is not reflected in the current costing.

The optimizer estimated 203425 rows for the index qualification and got
271136, which seems not too bad (5% vs. 6.8%).  This info comes from the
with-index.txt query plan.  The estimate assumes an equal distribution of
rows across all values, so maybe this range was a little hotter than
others.  Since the row count estimate looks close, I lean toward the base
costing as the problem.

It would be interesting to know how much of the index/baserow 
performance issue is that it keeps getting cache misses vs. the cpu 
overhead of just doing the base row look up for every row.  For this db
it would take a 50,000 page cache just to cache the base row plus 
whatever it takes to cache the index.

For this kind of distribution I have seen db's gather all the row 
pointers from the index and then do ordered probes into the base table.
This ensures good caching for the lookups.  Derby does not have this
technique available.
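The ordered-probe idea can be sketched in plain Java (a hypothetical page model, not Derby's RowLocation machinery): collect the qualifying row pointers from the index, sort them by page number, and only then probe the base table, so each page is fetched once instead of being revisited in random order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class OrderedProbes {
    // A row pointer as a (page, slot) pair -- a simplified model of what an
    // index leaf entry points at, not Derby's actual row location format.
    record RowPointer(int page, int slot) {}

    // Distinct page fetches needed when probing in the given order, assuming
    // a (pessimistic) cache that only keeps the most recently read page.
    static int pageFetches(List<RowPointer> probes) {
        int fetches = 0;
        int lastPage = -1;
        for (RowPointer rp : probes) {
            if (rp.page() != lastPage) {
                fetches++;
                lastPage = rp.page();
            }
        }
        return fetches;
    }

    // The ordered-probe trick: sort the pointers by page before fetching.
    static List<RowPointer> sortedByPage(List<RowPointer> probes) {
        List<RowPointer> copy = new ArrayList<>(probes);
        copy.sort(Comparator.comparingInt(RowPointer::page));
        return copy;
    }

    public static void main(String[] args) {
        // Pointers in index (key) order land on pages essentially at random
        // when the key is independent of physical row placement.
        List<RowPointer> fromIndex = new ArrayList<>();
        Random rnd = new Random(42);
        for (int slot = 0; slot < 1000; slot++)
            fromIndex.add(new RowPointer(rnd.nextInt(100), slot));
        System.out.println("index order: " + pageFetches(fromIndex) + " fetches");
        System.out.println("page order:  " + pageFetches(sortedByPage(fromIndex)) + " fetches");
    }
}
```

With 1000 probes spread over 100 pages, the sorted order can never need more than 100 fetches, while the unsorted order approaches one fetch per probe.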


Tim Dudgeon wrote:
> Thanks for the comments. Here is some more info.
> I attach the DDL for the table concerned, the simple test program I use 
> and the execution strategy with and without an index.
> 
> Some additional points:
> 
> 1. the query returning 7% of the table is certainly not an extreme case. 
> The exact query criteria are specified by the user in the UI, and can be 
> much worse than this case. I have no control over the nature of the 
> query that the user specifies.
> 
> 2. Yes, if the query is much more selective the index can be a help.
> 
> 3. The biggest data file in seg0 is 1452572672 bytes in size (i.e. 1.4 GB).
> 
> 4. the index was added after the rows were added.
> 
> 5. making the index also have the pk_column as the second indexed field 
> makes it go like lightning! search runs in 2 secs, about 14x faster.
> 
> 
> So in summary it seems like an index will be of no help to me in this 
> situation, unless I make it an index of both columns.
> 
> Many thanks
> 
> Tim
> 
> 
> 
> 
> 
> Tim Dudgeon wrote:
>> [SNIP]
> 


RE: Index performance

Posted by de...@segel.com.
Tim,

Look, sorry if this is a bit of a rant.
I don't think that your question is a bad question. In fact it's a good
question. What concerns me is that the responses to your question missed the
obvious.

This isn't a bug in Derby. So opening a Jira would be a waste of time.
Since you just created the index after the table was loaded, the statistics
should be fine. Maybe I'm making an assumption that the data was static for
the sake of this test...

I realize that a lot of people who are using and investigating JavaDB/Derby
are not 'classically' trained DBAs and some view the database as nothing
more than a persistence of their java objects. 

My point is that you have to think about what you're asking the database to
do and how indexes work. When you have a 'tight' cluster of data around a
set of values, you're not going to get good performance out of a B-Tree
index. (What happens when your index contains non-unique values? Do you then
walk through the set of non-unique values sequentially? )

I don't believe that Derby offers a Bitmap Index, which may be a better fit.
Usually it's for columns which have a low number of distinct values. Tim's
column is a floating-point column which could have a lot of distinct values;
however, in his specific instance, he has a lot of data tightly clustered in
a small set of values. 

For those not familiar with a Bitmap Index, here's a reference that might
help.
http://www.oracle.com/technology/pub/articles/sharma_indexes.html

I've added some comments below, and I again apologize for this mini-rant. 

HTH

-Mikey

> -----Original Message-----
> From: news [mailto:news@ger.gmane.org] On Behalf Of Tim Dudgeon
> Sent: Thursday, March 26, 2009 7:38 AM
> To: derby-user@db.apache.org
> Subject: Re: Index performance
> 
> Thanks for the comments. Here is some more info.
> I attach the DDL for the table concerned, the simple test program I use
> and the execution strategy with and without an index.
> 
> Some additional points:
> 
[SNIP] 
> 4. the index was added after the rows were added.
> 
This is a non-issue. Actually, when loading a large table, you would want to
drop the index, load the rows, and then re-add the index for better
performance. (Assuming that you're not talking about a clustered index...)

> 5. making the index also have the pk_column as the second indexed field
> makes it go like lightning! search runs in 2 secs, about 14x faster.
> 
No Duh!

I'm sorry, but you have to stop and think about what you were trying to do.
As I said in my earlier response, you had something like 13,500 rows with
the same indexed value. Think about what your index tree looked like.

By adding the pk_column as the second column of the index, you then had a
unique pair in your index.

Remember that your index is a B-tree index!
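Mikey's point about unique pairs can be illustrated with a small sketch (plain Java ordered collections standing in for a B-tree; not Derby's actual structures, and the row counts are just the thread's rough figures). A single-column ordered map cannot even represent the duplicate chains, while a composite (value, pk) key keeps every entry distinct and fully ordered:

```java
import java.util.TreeMap;
import java.util.TreeSet;

public class DuplicateKeySketch {
    // Composite key mirroring an index on (the_double_column, pk_column).
    record Key(double value, long pk) implements Comparable<Key> {
        public int compareTo(Key o) {
            int c = Double.compare(value, o.value);
            return c != 0 ? c : Long.compare(pk, o.pk);
        }
    }

    public static void main(String[] args) {
        // ~20 distinct values, thousands of rows each: a single-column
        // ordered map collapses duplicates, so a real B-tree must chain
        // them; the composite key sidesteps the problem entirely.
        TreeMap<Double, Long> singleColumn = new TreeMap<>();
        TreeSet<Key> composite = new TreeSet<>();
        for (long pk = 0; pk < 10_000; pk++) {
            double v = 300 + (pk % 20);          // heavy duplication
            singleColumn.put(v, pk);             // overwrites: only 20 survive
            composite.add(new Key(v, pk));       // all 10,000 entries unique
        }
        System.out.println(singleColumn.size()); // prints 20
        System.out.println(composite.size());    // prints 10000
    }
}
```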
> 
> So in summary it seems like an index will be of no help to me in this
> situation, unless I make it an index of both columns.
> 






Re: Index performance

Posted by Tim Dudgeon <td...@informaticsmatters.com>.
Thanks for the comments. Here is some more info.
I attach the DDL for the table concerned, the simple test program I use 
and the execution strategy with and without an index.

Some additional points:

1. the query returning 7% of the table is certainly not an extreme case. 
The exact query criteria are specified by the user in the UI, and can be 
much worse than this case. I have no control over the nature of the 
query that the user specifies.

2. Yes, if the query is much more selective the index can be a help.

3. The biggest data file in seg0 is 1452572672 bytes in size (i.e. 1.4 GB).

4. the index was added after the rows were added.

5. making the index also have the pk_column as the second indexed field 
makes it go like lightning! search runs in 2 secs, about 14x faster.


So in summary it seems like an index will be of no help to me in this 
situation, unless I make it an index of both columns.

Many thanks

Tim





Tim Dudgeon wrote:
> [SNIP]


Re: Index performance

Posted by Mike Matrigali <mi...@sbcglobal.net>.
A reproducible test case submitted to Jira is the best way to get help 
from the list.  This would answer all the following questions:
1) what is the row size?
2) what is the table size?  Assuming you have just one big table, easiest 
might be just to do a listing of seg0 and give the size of the biggest file.
3) could you post the JDBC code you use?
4) what is the query plan when you add the index.
5) what is the exact ddl of the created table and what is the exact ddl 
of all the indexes?
6) what happens if you add an index on (double column, pk_column) 
instead?  A possible problem with the query is that even though you only 
look at exactly the 271136 qualifying rows out of 4M in the index, the 
query then has to do a separate probe to the base table to get the 
pk_column for each of them.  Depending on row size and cache hits, the 
worst case could mean that each probe requires a real I/O, whereas the 
table scan is guaranteed to read each page only once, and it does so in 
order, which tends to get free big-block I/O on most OS's.
7) Do you create the index after or before loading the rows?

The following wiki may also help:

http://wiki.apache.org/db-derby/PerformanceDiagnosisTips?highlight=(performance)

Answers to above would help.
Tim Dudgeon wrote:
> [SNIP]


Re: Index performance

Posted by Bryan Pendleton <bp...@amberpoint.com>.
> But of course I thought an index would help, so I added an index to the 
> the_double_column and repeated the process. It took 720 seconds, 31x slower! 

It would be instructive to see the RUNTIMESTATISTICS information for each
of your two cases.

My guess, just based on reading your message, is that you'll find:

- for the first query, Derby extracted the pk_column and the_double_column
  values by scanning the table exactly once, as you suspected.

- for the second query, Derby traversed the index. For each matching row
  in the index, Derby then fetched the corresponding row from the base table,
  and retrieved the pk_column from that row.

That is, I'm guessing that the crucial piece of information is that your
query fetched 271,000 rows, which is 7% of the total table, and that what
you discovered is that it's vastly cheaper to do 1 table scan than to do
271,000 row fetches.

If your query had provided a WHERE clause which fetched, say, 200 rows,
I bet it would be MUCH faster to run it via the index, than by scanning the table.
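Bryan's argument can be put into rough numbers. The constants below are illustrative guesses about sequential vs. random page-read cost and page size, not Derby's real cost model, but they show why 271,000 individual row fetches can lose badly to one scan:

```java
public class ScanVsProbe {
    // Illustrative device model (assumed numbers, not Derby's cost model):
    // sequential pages stream far faster than random single-page reads.
    static final double SEQ_PAGE_MS = 0.05;    // assumed sequential read per page
    static final double RANDOM_PAGE_MS = 5.0;  // assumed random read per page

    static double tableScanMs(long pages) {
        return pages * SEQ_PAGE_MS;
    }

    // Worst case: every index-to-base-row probe misses the cache.
    static double indexProbesMs(long qualifyingRows) {
        return qualifyingRows * RANDOM_PAGE_MS;
    }

    public static void main(String[] args) {
        long pages = 1_452_572_672L / 32_768;  // the 1.4 GB file at an assumed 32 KB page size
        long rows = 271_136;                   // qualifying rows from the thread
        System.out.printf("full scan:    ~%.0f s%n", tableScanMs(pages) / 1000);
        System.out.printf("index probes: ~%.0f s%n", indexProbesMs(rows) / 1000);
    }
}
```

Whatever constants you pick, the probe path scales with qualifying rows while the scan scales with table pages, which is why a highly selective WHERE clause flips the comparison.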

thanks,

bryan


Re: Index performance

Posted by Mark Thornton <mt...@optrak.co.uk>.
Tim Dudgeon wrote:
> I found an interesting performance problem that I'd welcome some help 
> in understanding.
>
> I have a table of 4,000,000 rows that has a DOUBLE column containing 
> floating point numbers.
> I run a query like this:
>
> select pk_column from the_table where the_double_column > 300 and 
> the_double_column < 320
>
> 271136 rows are returned.
> I then go through the ResultSet and extract all the pk_column values.
> All of this is done using standard JDBC.
You are selecting about 7% of the rows. If the double column values are 
independent of the pk_column and a reasonable number of rows fit on a 
page then you are likely to have at least one desired row on most pages 
in the table. Without an index, the scan will proceed in the most 
convenient order through all the pages. With an index, it will still hit 
most of the pages, but in the most inconvenient order (plus of course 
having to read/decode the index pages).
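Mark's observation can be quantified. If a fraction p of rows qualify and row placement is independent of the key, a page of n rows contains at least one qualifying row with probability 1 - (1 - p)^n. A quick sketch (illustrative, not Derby code):

```java
public class PageHitProbability {
    // Probability that a page of n rows holds at least one qualifying row,
    // if each row qualifies independently with probability p.
    static double pageHasMatch(double p, int rowsPerPage) {
        return 1.0 - Math.pow(1.0 - p, rowsPerPage);
    }

    public static void main(String[] args) {
        double p = 271_136.0 / 4_000_000.0;  // ~6.8% selectivity from the thread
        for (int n : new int[] {10, 50, 100})
            System.out.printf("rows/page = %3d : P(page touched) = %.3f%n",
                              n, pageHasMatch(p, n));
    }
}
```

At 50 rows per page the index plan already touches about 97% of the pages the table scan would read, only in a far less convenient order.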

Mark Thornton


RE: Index performance

Posted by de...@segel.com.
Tim,

First, why are you using a floating point or a double inside the
database? Try using numeric data types like DECIMAL. (Unless you mean that
you're using a DECIMAL inside Derby but are storing it in your Java app as a
double...)

Putting an index on a floating point column? Ick! 

Then there's one other question... you have ~270K rows where the indexed
value is between 300 and 320. That's 270K rows where the value is one of 20
values, so on average each indexed value has ~13,500 rows associated with
it. An index isn't going to help out that much.

Does that make sense?

HTH

-Mike


> -----Original Message-----
> From: news [mailto:news@ger.gmane.org] On Behalf Of Tim Dudgeon
> Sent: Wednesday, March 25, 2009 2:30 PM
> To: derby-user@db.apache.org
> Subject: Index performance
> 
> [SNIP]