You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-dev@db.apache.org by Bryan Pendleton <bp...@amberpoint.com> on 2006/10/02 18:46:20 UTC

Re: Optimizer work/issues/discussions....

> good time to start some discussion around the various issues in the 
> current Derby optimizer

One place I struggle is that I find logged query plans to be rather
intimidating and hard to read. I've been thinking that it might be
useful to put some effort into either improving the output of the query
plan dumper, or perhaps coming up with some other tools for displaying
and analyzing query selection behaviors.

Is there prior art here that we can re-use? Are there good examples of
tools which analyze and visualize query execution strategies that we
can learn from and borrow?

thanks,

bryan

Re: logQueryPlan readability

Posted by Army <qo...@gmail.com>.

Bryan Pendleton wrote:
> 
> I was thinking that there might be some smaller, easier, things that we
> could do which would offer some smaller wins, but might still be worth it.

Thanks for taking an interest in this discussion, Bryan, and for writing-up some 
ideas!

I think that if these smaller wins also aid in the accomplishment of the bigger 
goal (i.e. emitting info as XML), then this is definitely a good approach.

> Ideas that I had include:
> 
>  1) logQueryPlan output intermixes timing information ("we processed this
>     many rows in this many milliseconds") with structural information
>     ("we did a distinct scalar aggregate of a sort of a union"). I was
>     thinking that we could refine the logQueryPlan behavior so that you
>     could say things like:
> 
>      logQueryPlan=[nodeTree | estTime | actTime | estRows | actRows]
> 
>     so that the user could choose a subset of the information if they were
>     only interested in the "shape" of the tree, say, and not the detailed
>     performance numbers; the idea is just that we give the user a way to
>     pick a smaller subset of the information to give them a more 
>     approachable amount of output.

If this same kind of "subsetting" will also be used for the XML formatting, then 
something like this could be useful.  In the particular breakdown shown in the 
example, though, it seems like "nodeTree" could get to be pretty large.  And if 
we're looking at time/row values for each node in the tree, would the values 
exist on top of the "nodeTree" structure?  I.e. "estTime" is "nodeTree" with 
additional info?

I think the big thing to figure out here is what "nodeTree" would be.  It should 
theoretically  be small enough to only display what's "relevant".  But what's 
relevant for one query (or user) may not be relevant for another.  So we'd have 
to either 1) dump anything which could potentially be "relevant", or 2) allow 
the person looking at the query plan to determine what s/he wants to look at. 
In the former case we end up back where we started with too large of a query 
plan (though use of the "brief | full" tag below could help); in the latter, 
we'd have to come up with a way to allow the user to retrieve bits and pieces of 
what is effectively an unstructured clob.  The latter seems like it could 
require a lot of effort that might be better spent working on an XML format, 
which then inherently has a means of allowing a user to retrieve the pieces s/he 
thinks are important (namley, via XPath).

Of course, as I write all of that I'm thinking of the several-thousand line 
query plans that we get from queries like those in DERBY-1205 and 
DERBY-1777--which is probably not what the average person has in mind ;)  For 
more manageable queries I agree with you that just being able to show a simple 
nodeTree with minimal info would be a big improvement over what we have now.

>  2) The indentation for the qualification information seems to get lost,
>     in my experience, making the display hard to read. Also, since "real"
>     query plans are often quite deep, I wonder whether we should display
>     things with a fixed 1-or-2 space indentation "step" rather than using
>     tabs. The advantage of emitting hard tabs is that the user can reset
>     their tab stops, but if they don't, the lines almost always wrap or
>     get truncated. Avoiding line wrap could also take the form of slightly
>     less wordy displays, so that instead of
> 
>       optimizer estimated row count:   1439201.17
> 
>     we could say
> 
>       est rows: 1439201
> 
>     so that as the indentation started to grow, we'd still not wrap lines.
>     Of course, this trades off line-wrapping for less-self-evident output.

This is an idea that I think would be useful if we were planning to stick with 
the current (non-XML) format in the long term.  But if the ultimate goal is to 
switch to XML with some related analysis/viewing tool, I'm not sure what the 
effect of the such a change would be?  I guess if we go with the simplest idea 
of using Derby's own SQL/XML operators to retrieve pieces of the query plan, 
then smaller indentation would indeed be useful (because Derby preserves 
whitespace).  And without knowing anything about where or how the query plans 
are actually written to logs, I would guess (hope) that this particular 
indentation change would be straightforward...

>  3) There's a bunch of kind of detailed internal information in the query
>     plan output, and in particular for many common queries we output a lot
>     of information about settings which the user probably didn't set, but
>     which are just set to their default value. I'm thinking about things
>     like "Ordered nulls: false" and "Unknown return value: false" and 
>     the like.  I wonder whether we could change the various output 
>     routines so that they only emitted information like this if it was
>     set to a non-default value, which seems like in the common case it 
>     might make the query plan display substantially shorter. If we didn't 
>     want to do this unconditionally, I thought that maybe we might give 
>     the user a knob like:
> 
>       logQueryPlanFormat=[brief | full]

Seems like a good suggestion, and one that could be useful for an XML format, as 
well.  The only potential drawback is the addition of yet another Derby 
property, but I don't know if that's really a concern or not?

> Do people think that things like this would be worth doing? Or would it be
> better just to bite the bullet and pursue the full-on "XML format and a
> separate analysis tool" proposals?

Insofar as these kinds of things can still be useful for the eventual (and 
currently just theoretical) XML format, I say Yes, they would be useful.

I do wonder if we have to worry about "backward compatibility" of the query 
plans?  Are there people who may have written tools/utilities that expect the 
log query plan to keep its current shape?  If so, then we'd probably have to add 
another property to toggle the new formatting on/off (with "off" being the 
default).  And that would mean that we'd have to keep logic for generating two 
different query plans around in the code.  I'm not sure how difficult that would 
be, nor whether use of a "full-on XML format" would make such logic better or 
worse...?

But all of that blabbering aside, if no one takes up the task of creating a full 
XML format, then anyone willing to implement changes like the ones mentioned 
above would still be doing query-plan-readers a favor.

So who's willing to fry what fish? :)

Army

Re: logQueryPlan readability

Posted by Bryan Pendleton <bp...@amberpoint.com>.

>> One place I struggle is that I find logged query plans to be rather
>> intimidating and hard to read. 

I liked the ideas about emitting the information in XML format and then
building a tool to analyze it.

It's kind of a big first step, though.

I was thinking that there might be some smaller, easier, things that we
could do which would offer some smaller wins, but might still be worth it.

Ideas that I had include:

  1) logQueryPlan output intermixes timing information ("we processed this
     many rows in this many milliseconds") with structural information
     ("we did a distinct scalar aggregate of a sort of a union"). I was
     thinking that we could refine the logQueryPlan behavior so that you
     could say things like:

      logQueryPlan=[nodeTree | estTime | actTime | estRows | actRows]

     so that the user could choose a subset of the information if they were
     only interested in the "shape" of the tree, say, and not the detailed
     performance numbers; the idea is just that we give the user a way to
     pick a smaller subset of the information to give them a more approachable
     amount of output.

  2) The indentation for the qualification information seems to get lost,
     in my experience, making the display hard to read. Also, since "real"
     query plans are often quite deep, I wonder whether we should display
     things with a fixed 1-or-2 space indentation "step" rather than using
     tabs. The advantage of emitting hard tabs is that the user can reset
     their tab stops, but if they don't, the lines almost always wrap or
     get truncated. Avoiding line wrap could also take the form of slightly
     less wordy displays, so that instead of

       optimizer estimated row count:   1439201.17

     we could say

       est rows: 1439201

     so that as the indentation started to grow, we'd still not wrap lines.
     Of course, this trades off line-wrapping for less-self-evident output.

  3) There's a bunch of kind of detailed internal information in the query
     plan output, and in particular for many common queries we output a lot
     of information about settings which the user probably didn't set, but
     which are just set to their default value. I'm thinking about things
     like "Ordered nulls: false" and "Unknown return value: false" and the like.
     I wonder whether we could change the various output routines so that they
     only emitted information like this if it was set to a non-default value,
     which seems like in the common case it might make the query plan display
     substantially shorter. If we didn't want to do this unconditionally, I
     thought that maybe we might give the user a knob like:

       logQueryPlanFormat=[brief | full]

Do people think that things like this would be worth doing? Or would it be
better just to bite the bullet and pursue the full-on "XML format and a
separate analysis tool" proposals?

thanks,

bryan

Re: Optimizer work/issues/discussions....

Posted by Army <qo...@gmail.com>.

Bryan Pendleton wrote:
> One place I struggle is that I find logged query plans to be rather
> intimidating and hard to read. I've been thinking that it might be
> useful to put some effort into either improving the output of the query
> plan dumper, or perhaps coming up with some other tools for displaying
> and analyzing query selection behaviors.

+1!

> Is there prior art here that we can re-use? Are there good examples of
> tools which analyze and visualize query execution strategies that we
> can learn from and borrow?

I don't know the answers to these questions, but they're great questions and 
hopefully someone else can chime in...

I searched the Derby archives and found an email from a student named Felix 
Beyer about something that sounds promising:

http://www.nabble.com/Feedback-for-extension-tf1967245.html#a5399097
http://www.nabble.com/Explain-Tables-and-Workload-Management-Extensions-tf2009196.html#a5519844

Of particular interest:

<begin quote>

  in a former project I managed to extend Derby to extract the
generated optimized query plans in form of XML files for visualizing
them in an external application. I used the GXL file format for export
and visualized the plans with the JGraph Framework. Internal changes
affected the current Derby structure in two ways: First of all a new
system function was added to toggle query extraction on or off and
second a visitor pattern was used to collect the required information
through a traverse of the query tree after the optimization step.

<end quote>

I don't know the specifics, but that certainly sounds like the right kind of 
thing.  Unfortunately the thread ends with a question from Felix:

"Are Derby Administrators or Application Developers interested in tuning tools 
like this, or could even the developers benefit during testing the optimizer or 
visualizing some internal state of Derby?"

and only one reply:

"IMHO, I'm not so sure how useful this tool would be. I view Derby as a 
lightweight RDBMS. As such, the queries will most likely be straight forward and 
I should be able to figure out the indexes that I'll need to get optimal 
performance."

So perhaps that was the end of that particular missed opportunity.  Unless Felix 
is by chance still following the derby-dev lists...?

Army

Re: Optimizer work/issues/discussions....

Posted by Yip Ng <yi...@gmail.com>.

Good idea, Bryan.  Perhaps dumping it in XML as a format is one option to
start with.
(I think this sort of facility can be applied to a more general case of dump
outputs.)  Then a explain tool can be used or implemented to read the XML
data and produce something more visually readable.  Of course, the plain
text format should be retained.

Yip

On 10/2/06, Bryan Pendleton <bp...@amberpoint.com> wrote:
>
> > good time to start some discussion around the various issues in the
> > current Derby optimizer
>
> One place I struggle is that I find logged query plans to be rather
> intimidating and hard to read. I've been thinking that it might be
> useful to put some effort into either improving the output of the query
> plan dumper, or perhaps coming up with some other tools for displaying
> and analyzing query selection behaviors.
>
> Is there prior art here that we can re-use? Are there good examples of
> tools which analyze and visualize query execution strategies that we
> can learn from and borrow?
>
> thanks,
>
> bryan
>
>

Re: Optimizer work/issues/discussions....

Posted by Army <qo...@gmail.com>.

Rick Hillegas wrote:

> Hi Bryan,
> 
> A modest first step might be to output the plans in xml. The ouptut 
> could then be post-processed into the format needed by your favorite 
> graph visualization tool. Even just pulling the xml into your browser 
> might be more readable than what we have today.

Agreed.

And as a bit of an "extra" from this, we could perhaps use Derby's own SQL/XML 
operators to parse the query plans and pull-out the "pieces of interest", which 
would in turn (in a dream world) allow us to write functional tests that can 
verify the query plan in a assertion-based way instead of just dumping thousands 
(and thousands) of lines of output to a master file and looking for the one or 
two lines that actually matter...

Army

Re: Optimizer work/issues/discussions....

Posted by Rick Hillegas <Ri...@Sun.COM>.

Hi Bryan,

A modest first step might be to output the plans in xml. The ouptut 
could then be post-processed into the format needed by your favorite 
graph visualization tool. Even just pulling the xml into your browser 
might be more readable than what we have today.

Regards,
-Rick

Bryan Pendleton wrote:

>> good time to start some discussion around the various issues in the 
>> current Derby optimizer
>
>
> One place I struggle is that I find logged query plans to be rather
> intimidating and hard to read. I've been thinking that it might be
> useful to put some effort into either improving the output of the query
> plan dumper, or perhaps coming up with some other tools for displaying
> and analyzing query selection behaviors.
>
> Is there prior art here that we can re-use? Are there good examples of
> tools which analyze and visualize query execution strategies that we
> can learn from and borrow?
>
> thanks,
>
> bryan
>