Posted to derby-user@db.apache.org by Michael Segel <ms...@segel.com> on 2009/01/13 15:19:10 UTC

Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT


> -----Original Message-----
> From: Knut.Hatlen@Sun.COM [mailto:Knut.Hatlen@Sun.COM]
> Sent: Tuesday, January 13, 2009 4:59 AM
> To: Derby Discussion
> Subject: Re: Bad performance with UPDATE on a nested SELECT
[SNIP]
> Yes, it looks like the same query plan. I see the following problems
> with this query plan:
> 
> 1) The IN clause is rewritten to a join with SUMMA_RECORDS as the outer
> table, and we have no information to limit the scan of the outer table,
> hence a full table scan is performed. It would probably have been better
> if SUMMA_RECORDS were the inner table. Then the C index could be used to
> enforce the restriction (childId='...') on the outer table,
> SUMMA_RELATIONS, and index I could be used to perform lookups in the
> inner table.
> 
> 2) The index scan on SUMMA_RELATIONS uses the index PC. Since that index
> is on the columns (parentId, childId), it is primarily sorted on
> parentId, so the scan needs to go through the entire index, for each row
> in the outer table, in order to match childId with the criterion in the
> WHERE clause of the nested select. It would probably be better to use
> the index on childId (index C) instead of PC (or reverse the order of
> the columns in the PC index).
> 
> Derby allows you to override some of the optimizer's decisions, but it's
> not as easy for UPDATE statements as it is for SELECT statements, so I
> don't know how to tell it not to pick the table scan. Using the C index
> instead of the PC index should be easy enough, though:
> 
> UPDATE summa_records SET base='my_base' WHERE id IN
>   (SELECT parentId FROM summa_relations --DERBY-PROPERTIES index=C
>       WHERE childId='horizon_2332668')
> 
> --
> Knut Anders

I think that you're missing something.

First, without knowing the whole application: why is there a compound index
PC instead of an index P and an index C? The relationship match table has
just two columns, so why do you need a compound index? It doesn't make
sense.

Will a parent have over 10,000 children? Then maybe it might make sense.
(Well, not really. If that were the case, you'd probably want to use a temp
table. Assuming that Derby supports the dynamic creation of temp tables like
Informix, but not like Oracle or DB2...)

I suggested that you redesign your tables so that instead of working with
varchars and indexing varchars all over the place, you instead create a
table so you can do a lookup of the varchar, matching to an integer ID. 
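
Something along these lines (table and column names here are purely
illustrative, not from your schema):

CREATE TABLE id_map (
    int_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    str_id VARCHAR(255) NOT NULL UNIQUE
);

-- the base and relations tables then carry the skinny int_id, and one
-- indexed lookup resolves the varchar:
SELECT int_id FROM id_map WHERE str_id = 'horizon_2332668';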

Indexing varchars in a main table is a *bad* idea. You end up with a lot of
fat indexes. Fat indexes don't perform as well as skinny indexes, especially
when you're talking about tables with millions of rows.

Also do not use query optimization statements except as a last resort. 

Again I *highly* recommend creating a table where you assign a unique id to
a given 'parentId'/'id'/'childId' and then use that id in these tables
instead of the varchar. 1.5 million rows of an index on a varchar column
means a lot of wasted space and overhead. (Just to re-iterate the point made
above.)

Look, no disrespect... But a lot of times, improving your table design will
improve your performance while keeping your solution database agnostic.

HTH

-Mike

But hey! What do I know? 
I'm not the guy who spent a billion dollars on MySQL when they could have
bought Informix for the same price almost a decade ago. ;-)



RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by Mikkel Kamstrup Erlandsen <mk...@statsbiblioteket.dk>.
On Wed, 2009-01-14 at 14:15 +0100, derby@segel.com wrote:
> 
> > -----Original Message-----
> > From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> > Sent: Wednesday, January 14, 2009 12:12 AM
> > To: Derby Discussion
> > Cc: msegel@segel.com
> > Subject: RE: Bad table design leads to bad performance...RE: Bad
> > performance with UPDATE on a nested SELECT
> [SNIP]
> > >
> > > As I said above, I just tried out your strategy. Using only integer
> > > handles, the query runs in about 4s. I still need a factor 100 better
> > > than that...
> > 
> > Sorry, forgot to add that this was on a base with ~700k rows only, not
> > the full 1.5M-row one...
> > 
> > Cheers,
> > Mikkel
> 
> Mikkel,
> 
> Sorry if I'm being a bit slow...
> 
> That 4 seconds was on 700K rows using integers instead of varchars?
> Does that also include the optimizer hint or was that without the optimizer
> hint?

Yes. And it was without the optimizer hint.

> The reason I ask is that if your query was the same and you just switched
> from varchar to integers, then there clearly is an issue with varchars and
> the optimizer. 

Yeah, the query was the exact same, just with INTEGER instead of
VARCHAR(255) for the id column. For reference, the query we are talking
about:

UPDATE records
SET base='my_base'
WHERE id IN (
  SELECT parentId
  FROM relations
  WHERE childId='id_1');

I am not sure that the issue is about varchars. As I stated in an
earlier mail I had a running time of 5-10s on the 700k-row base when I
was using varchars. This means that an identity index on an integer
column gives a small speed-up (maybe 30%, if I am to pull a random number
out of my hat). What I was hoping for was more like 3000%...

> If I understood you correctly, it sounded like when you tried the optimizer
> hint, using varchars, you got the same result as the integers. Is that the
> case?

More or less... The integer case might be slightly faster, but I don't
have precise timings to back this up. In any case the difference here is
very small.

Generally I would say that the optimizer hint doesn't make any real
difference.

> If you're going to loop through the result set, you may find it faster to
> delete the old row and then perform the insert. Updates are hard(er) on a
> database.

I am just doing a manual loop over the SELECT result set now (it usually
contains 0-5 rows) and doing an UPDATE for each row. It works at an
acceptable speed: < 1 ms per op.
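
Roughly like this (a sketch; the ? is a JDBC parameter marker bound to
each parentId the SELECT returns):

SELECT parentId FROM relations WHERE childId='id_1';

-- then one single-row update per returned parentId:
UPDATE records SET base='my_base' WHERE id = ?;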

> Derby isn't Postgres, and there are some neat things you can do in C that
> you can't do in Java that could give a C-based database an edge. (Pointers
> can be your friend. ;-) At the same time, there are design decisions that
> could have been made which are hampering performance now.

And C doesn't have a JIT compiler :-) Java programs can be blazingly
fast. Lucene impresses me on a daily basis :-)

I don't believe this issue is about Java. At this point I am quite
certain that it is a problem with the query optimizer in Derby.

Cheers,
Mikkel


RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by de...@segel.com.

> -----Original Message-----
> From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> Sent: Wednesday, January 14, 2009 12:12 AM
> To: Derby Discussion
> Cc: msegel@segel.com
> Subject: RE: Bad table design leads to bad performance...RE: Bad
> performance with UPDATE on a nested SELECT
[SNIP]
> >
> > As I said above, I just tried out your strategy. Using only integer
> > handles, the query runs in about 4s. I still need a factor 100 better
> > than that...
> 
> Sorry, forgot to add that this was on a base with ~700k rows only, not
> the full 1.5M-row one...
> 
> Cheers,
> Mikkel

Mikkel,

Sorry if I'm being a bit slow...

That 4 seconds was on 700K rows using integers instead of varchars?
Does that also include the optimizer hint or was that without the optimizer
hint?

The reason I ask is that if your query was the same and you just switched
from varchar to integers, then there clearly is an issue with varchars and
the optimizer. 

If I understood you correctly, it sounded like when you tried the optimizer
hint, using varchars, you got the same result as the integers. Is that the
case?
Again, it's evidence that something is wrong with the optimizer.

If you're going to loop through the result set, you may find it faster to
delete the old row and then perform the insert. Updates are hard(er) on a
database.
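
A sketch of the idea, showing only the two columns from your example query
(a real row would carry its remaining columns along):

DELETE FROM records WHERE id = ?;

-- re-insert the row with its old values, changing only base:
INSERT INTO records (id, base) VALUES (?, 'my_base');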

Derby isn't Postgres, and there are some neat things you can do in C that
you can't do in Java that could give a C-based database an edge. (Pointers
can be your friend. ;-) At the same time, there are design decisions that
could have been made which are hampering performance now.

-Mike




RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by Mikkel Kamstrup Erlandsen <mk...@statsbiblioteket.dk>.
On Tue, 2009-01-13 at 23:47 +0100, Mikkel Kamstrup Erlandsen wrote:
> On Tue, 2009-01-13 at 16:59 +0100, Michael Segel wrote:
> > Yes. My point is that Derby isn't PostgreSQL, and there are a lot of
> > internal factors that could cause the problems you are seeing. How database
> > engines handle varchars internally could be an issue when it comes to
> > performance. In terms of languages, C is very different from Java. There
> > are things that you can do in C that you can't do in Java.
> > 
> > With respect to your problem, I believe that there is an issue with the
> > query optimizer. By cleaning up your table design the optimizer may make a
> > better selection. As a last resort, you then apply the hint.
> > 
> > What I am suggesting is that, for testing purposes only, you create the
> > table that maps a varchar to an identity integer. Then for each row in your
> > base table, you insert a row into a new table that stores the integer id
> > instead of the varchar. You then do the same for the relationship table.
> > 
> > When you re-run the query you can either join the table in the inner select,
> > or you could just hardcode the integer value instead of the varchar value.
> > 
> > Yes, there is a minimal cost for the join, but you're talking milliseconds.
> > Weigh that against the time spent on a sequential scan of 1.5 million+
> > records...
> 
> As I said above, I just tried out your strategy. Using only integer
> handles, the query runs in about 4s. I still need a factor 100 better
> than that...

Sorry, forgot to add that this was on a base with ~700k rows only, not
the full 1.5M-row one...

Cheers,
Mikkel


RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by Mikkel Kamstrup Erlandsen <mk...@statsbiblioteket.dk>.
On Tue, 2009-01-13 at 16:59 +0100, Michael Segel wrote:
> See my comments mixed in...
> 
> Just a caveat: you're right, I don't know your entire application; I'm just
> looking at it from what little information you have shared in your posts...

I apologize if I am being cryptic; there are no secrets at all[1], but we do
have a lot of odd requirements on the data model that I won't bore you
with if I can avoid it :-)

[1]: I am working on an open source project called Summa:
http://wiki.statsbiblioteket.dk/summa

More comments inlined below...

> > -----Original Message-----
> > From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> > Sent: Tuesday, January 13, 2009 8:51 AM
> > To: Derby Discussion; msegel@segel.com
> > Subject: Re: Bad table design leads to bad performance...RE: Bad
> > performance with UPDATE on a nested SELECT
> > 
> [SNIP]
> > 
> > It will not be trivial for us to "just" use integer ids. The problem is
> > that either end of a relation will not necessarily exist in the base.
> > 
> 
> In your update statement, as written, unless there is a parent-to-child
> relationship established, you will not update a base record. Your statement
> matches the base record's id against a parentId which has a specific child
> record.
> 
> This means that you *can* have a base record that will not get updated.

Yes, this is intentional. In the "real" query I actually update the
timestamps of all parent records of a given record. I just used 'base'
here instead of timestamps because it is simpler to write (and has the
exact same execution pattern).
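
I.e. something like this ('mtime' is a made-up column name):

UPDATE records SET mtime = CURRENT_TIMESTAMP
WHERE id IN (SELECT parentId FROM relations WHERE childId='id_1');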

<SNIP>
> >   id: book_1
> >   children: book_2, book_3
> >   parents:
> > 
> > 7 days later we receive metadata for book_2:
> > 
> >   id: book_2
> >   children:
> >   parents:
<SNIP>
> Yes! That is exactly what I am suggesting. I wasn't sure, but you're
> indicating that 'book_1', 'book_2' and 'book_3' are all the same data type,
> where the role is table-dependent. So you can create a simple table that
> maps a book varchar to an identity value (a unique integer value with a
> backing index). This implies that there can only be one id for 'book_1'.
> Also, you would then want an index on the varchar only for this table.
> (One fat index instead of 3.)
> 
> By doing this, you do not have any need for anything more than allowing
> nulls in the column. 

Ok, I've tried this. More on this below...

> Using your example above, you say that book_1 has children (book_2 and
> book_3). Does this not imply that, for book_2, book_1 is its parent?

Yes.

> So do you add a parent-child relationship, or do you keep it null until
> there is a metadata record? And if the metadata record indicates that there
> are no relationships, do you then remove any relationships that might have
> been set prior to the new record?

When we put book_1 in the db we simply add the tuples (book_1,book_2)
and (book_1,book_3) to the relations table.
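
In SQL terms, simply:

INSERT INTO relations (parentId, childId) VALUES ('book_1', 'book_2');
INSERT INTO relations (parentId, childId) VALUES ('book_1', 'book_3');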

> Yes. My point is that Derby isn't PostgreSQL, and there are a lot of
> internal factors that could cause the problems you are seeing. How database
> engines handle varchars internally could be an issue when it comes to
> performance. In terms of languages, C is very different from Java. There
> are things that you can do in C that you can't do in Java.
> 
> With respect to your problem, I believe that there is an issue with the
> query optimizer. By cleaning up your table design the optimizer may make a
> better selection. As a last resort, you then apply the hint.
> 
> What I am suggesting is that, for testing purposes only, you create the
> table that maps a varchar to an identity integer. Then for each row in your
> base table, you insert a row into a new table that stores the integer id
> instead of the varchar. You then do the same for the relationship table.
> 
> When you re-run the query you can either join the table in the inner select,
> or you could just hardcode the integer value instead of the varchar value.
> 
> Yes, there is a minimal cost for the join, but you're talking milliseconds.
> Weigh that against the time spent on a sequential scan of 1.5 million+
> records...

As I said above, I just tried out your strategy. Using only integer
handles, the query runs in about 4s. I still need a factor 100 better
than that...

Instead of using Derby properties we will probably resort to hacking
around this in the code by doing the SELECT first and then iterating
manually over the result set, doing one UPDATE per row... Before it gets
to that I'll give it a final stab tomorrow when I am rested.

Cheers,
Mikkel


RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by Michael Segel <ms...@segel.com>.
See my comments mixed in...

Just a caveat: you're right, I don't know your entire application; I'm just
looking at it from what little information you have shared in your posts...

> -----Original Message-----
> From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> Sent: Tuesday, January 13, 2009 8:51 AM
> To: Derby Discussion; msegel@segel.com
> Subject: Re: Bad table design leads to bad performance...RE: Bad
> performance with UPDATE on a nested SELECT
> 
[SNIP]
> 
> It will not be trivial for us to "just" use integer ids. The problem is
> that either end of a relation will not necessarily exist in the base.
> 

In your update statement, as written, unless there is a parent-to-child
relationship established, you will not update a base record. Your statement
matches the base record's id against a parentId which has a specific child
record.

This means that you *can* have a base record that will not get updated.

> This means that if we use proxy integer ids we'd have to insert a dummy
> id in the relations table that we later must pair up with the real id if
> a record with the linked id shows up.
> 
No.
This is not the case.
(See below)

> To be concrete: I am working at a library and the "records" are books
> linking to each other. A case could be that we are instructed to put the
> following metadata in the base:
> 
>   id: book_1
>   children: book_2, book_3
>   parents:
> 
> 7 days later we receive metadata for book_2:
> 
>   id: book_2
>   children:
>   parents:
> 
> Note that book_2 does not mention its relation to book_1. This is
> (sadly) legal in our world.
> 
> I can see that your strategy can work out if we store
> integer_id->varchar_id in a separate table. Ids for non-existing records
> can just be added here as well. I am just concerned that this will bring
> even more JOIN overhead than we already have...
> 

Yes! That is exactly what I am suggesting. I wasn't sure, but you're
indicating that 'book_1', 'book_2' and 'book_3' are all the same data type,
where the role is table-dependent. So you can create a simple table that
maps a book varchar to an identity value (a unique integer value with a
backing index). This implies that there can only be one id for 'book_1'.
Also, you would then want an index on the varchar only for this table.
(One fat index instead of 3.)

By doing this, you do not have any need for anything more than allowing
nulls in the column. 

Using your example above, you say that book_1 has children (book_2 and
book_3). Does this not imply that, for book_2, book_1 is its parent?
So do you add a parent-child relationship, or do you keep it null until
there is a metadata record? And if the metadata record indicates that there
are no relationships, do you then remove any relationships that might have
been set prior to the new record?


> > Also do not use query optimization statements except as a last resort.
> >
> > Again I *highly* recommend creating a table where you assign a unique id to
> > a given 'parentId'/'id'/'childId' and then use that id in these tables
> > instead of the varchar. 1.5 million rows of an index on a varchar column
> > means a lot of wasted space and overhead. (Just to re-iterate the point made
> > above.)
> >
> > Look, no disrespect... But a lot of times, improving your table design will
> > improve your performance while keeping your solution database agnostic.
> 
> Your input is much appreciated and provides good food for thought :-)
> The problem is just that we have PostgreSQL doing this exact query (on
> the same data amounts) in < 1ms, but we would really, _really_ like to
> use Derby instead.
> 
> Cheers,
> Mikkel

Yes. My point is that Derby isn't PostgreSQL, and there are a lot of
internal factors that could cause the problems you are seeing. How database
engines handle varchars internally could be an issue when it comes to
performance. In terms of languages, C is very different from Java. There
are things that you can do in C that you can't do in Java.

With respect to your problem, I believe that there is an issue with the
query optimizer. By cleaning up your table design the optimizer may make a
better selection. As a last resort, you then apply the hint.

What I am suggesting is that, for testing purposes only, you create the
table that maps a varchar to an identity integer. Then for each row in your
base table, you insert a row into a new table that stores the integer id
instead of the varchar. You then do the same for the relationship table.

When you re-run the query you can either join the table in the inner select,
or you could just hardcode the integer value instead of the varchar value.

Yes, there is a minimal cost for the join, but you're talking milliseconds.
Weigh that against the time spent on a sequential scan of 1.5 million+
records...
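
For example, assuming a hypothetical id_map(int_id, str_id) mapping table
and integer ids in both tables:

UPDATE records SET base='my_base' WHERE id IN
  (SELECT r.parentId FROM relations r
     JOIN id_map m ON r.childId = m.int_id
   WHERE m.str_id = 'horizon_2332668');

-- or, with the looked-up integer hardcoded (42 is made up):
UPDATE records SET base='my_base' WHERE id IN
  (SELECT parentId FROM relations WHERE childId = 42);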

-Mike




Re: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT

Posted by Mikkel Kamstrup Erlandsen <mk...@statsbiblioteket.dk>.
On Tue, 2009-01-13 at 15:19 +0100, Michael Segel wrote:
> 
> > -----Original Message-----
> > From: Knut.Hatlen@Sun.COM [mailto:Knut.Hatlen@Sun.COM]
> > Sent: Tuesday, January 13, 2009 4:59 AM
> > To: Derby Discussion
> > Subject: Re: Bad performance with UPDATE on a nested SELECT
> [SNIP]
> > Yes, it looks like the same query plan. I see the following problems
> > with this query plan:
> > 
> > 1) The IN clause is rewritten to a join with SUMMA_RECORDS as the outer
> > table, and we have no information to limit the scan of the outer table,
> > hence a full table scan is performed. It would probably have been better
> > if SUMMA_RECORDS were the inner table. Then the C index could be used to
> > enforce the restriction (childId='...') on the outer table,
> > SUMMA_RELATIONS, and index I could be used to perform lookups in the
> > inner table.
> > 
> > 2) The index scan on SUMMA_RELATIONS uses the index PC. Since that index
> > is on the columns (parentId, childId), it is primarily sorted on
> > parentId, so the scan needs to go through the entire index, for each row
> > in the outer table, in order to match childId with the criterion in the
> > WHERE clause of the nested select. It would probably be better to use
> > the index on childId (index C) instead of PC (or reverse the order of
> > the columns in the PC index).
> > 
> > Derby allows you to override some of the optimizer's decisions, but it's
> > not as easy for UPDATE statements as it is for SELECT statements, so I
> > don't know how to tell it not to pick the table scan. Using the C index
> > instead of the PC index should be easy enough, though:
> > 
> > UPDATE summa_records SET base='my_base' WHERE id IN
> >   (SELECT parentId FROM summa_relations --DERBY-PROPERTIES index=C
> >       WHERE childId='horizon_2332668')
> > 
> > --
> > Knut Anders
> 
> I think that you're missing something.
> 
> First, without knowing the whole application: why is there a compound index
> PC instead of an index P and an index C? The relationship match table has
> just two columns, so why do you need a compound index? It doesn't make
> sense.

The idea was to have a unique index on the 'relations' table, but maybe
you are right...
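
For reference, roughly the definitions we have (table and index names as in
Knut's mail):

CREATE UNIQUE INDEX pc ON summa_relations (parentId, childId);
CREATE INDEX c ON summa_relations (childId);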

> Will a parent have over 10,000 children? Then maybe it might make sense.
> (Well, not really. If that were the case, you'd probably want to use a temp
> table. Assuming that Derby supports the dynamic creation of temp tables like
> Informix, but not like Oracle or DB2...)

A typical case is that parents might have 0-10 children (and conversely,
children might have 0-10 parents).

> I suggested that you redesign your tables so that instead of working with
> varchars and indexing varchars all over the place, you instead create a
> table so you can do a lookup of the varchar, matching to an integer ID. 
> 
> Indexing varchars in a main table is a *bad* idea. You end up with a lot of
> fat indexes. Fat indexes don't perform as well as skinny indexes, especially
> when you're talking about tables with millions of rows.

It will not be trivial for us to "just" use integer ids. The problem is
that either end of a relation will not necessarily exist in the base. 

This means that if we use proxy integer ids we'd have to insert a dummy
id in the relations table that we later must pair up with the real id if
a record with the linked id shows up.

To be concrete: I am working at a library and the "records" are books
linking to each other. A case could be that we are instructed to put the
following metadata in the base:

  id: book_1
  children: book_2, book_3
  parents:

7 days later we receive metadata for book_2:

  id: book_2
  children:
  parents:

Note that book_2 does not mention its relation to book_1. This is
(sadly) legal in our world.

I can see that your strategy can work out if we store
integer_id->varchar_id in a separate table. Ids for non-existing records
can just be added here as well. I am just concerned that this will bring
even more JOIN overhead than we already have...

> Also do not use query optimization statements except as a last resort. 
> 
> Again I *highly* recommend creating a table where you assign a unique id to
> a given 'parentId'/'id'/'childId' and then use that id in these tables
> instead of the varchar. 1.5 million rows of an index on a varchar column
> means a lot of wasted space and overhead. (Just to re-iterate the point made
> above.)
> 
> Look, no disrespect... But a lot of times, improving your table design will
> improve your performance while keeping your solution database agnostic.

Your input is much appreciated and provides good food for thought :-)
The problem is just that we have PostgreSQL doing this exact query (on
the same data amounts) in < 1ms, but we would really, _really_ like to
use Derby instead.

Cheers,
Mikkel