Posted to derby-dev@db.apache.org by "Kristian Waagan (JIRA)" <ji...@apache.org> on 2010/08/13 15:32:19 UTC

[jira] Updated: (DERBY-4771) Continue investigation of automatic creation/update of index statistics

     [ https://issues.apache.org/jira/browse/DERBY-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kristian Waagan updated DERBY-4771:
-----------------------------------

    Attachment: derby-4771-1a-prototype_code_dump.stat
                derby-4771-1a-prototype_code_dump.diff

Attached is a prototype of another attempt at implementing auto-update
of Derby index statistics. First I'll describe the patch briefly, then
I'll note some potential improvements and ideas.
I've omitted lots of details, so feel free to ask questions and to comment
on the suggested improvements; they need a lot more work...

The code is nowhere near complete; its primary purpose is to spur
discussion and hopefully guide us in the right direction.


[Prototype description]

The prototype performs some checks for whether the index statistics are
stale during statement compilation, as Mamta did under DERBY-3788. If
the statistics are considered stale, an update job to update all indexes
for the base table is scheduled with a "daemon". The daemon keeps track
of scheduled update jobs, and will execute them in a separate thread.
Only one job is processed at a time, and if the work queue is full, new
jobs are discarded; once a slot frees up, such jobs will eventually be
scheduled again. If there are no statistics, creating them is scheduled
(the daemon doesn't distinguish between creating and updating stats).
When a job is scheduled for a base table, this is recorded in the
associated index descriptors (transient state) to avoid having to query
the daemon too often.

As mentioned, the work is carried out in a separate thread, created as
required (there is no permanent background thread, it dies if the queue
is emptied). This seems appropriate as statistics update should be
rather infrequent compared to other operations in a database system.
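
To make the daemon behavior above a bit more concrete, here is a minimal,
self-contained sketch of the scheduling and worker-thread idea. The class
and method names (and the queue limit) are made up for illustration and
are not the prototype's actual classes:

// Illustrative sketch only -- not the prototype's actual classes. Shows a
// bounded work queue that discards jobs when full, and a worker thread
// that is created on demand and exits once the queue has been drained.
import java.util.ArrayDeque;
import java.util.Queue;

class IndexStatsDaemonSketch {
    private static final int MAX_QUEUE = 20;   // assumed queue limit
    private final Queue<String> queue = new ArrayDeque<>();
    private boolean workerRunning;             // no permanent background thread

    /** Schedule a stats update for a base table; returns false if discarded. */
    synchronized boolean schedule(String baseTable) {
        if (queue.contains(baseTable)) {
            return true;                       // already scheduled
        }
        if (queue.size() >= MAX_QUEUE) {
            return false;                      // queue full, job discarded
        }
        queue.add(baseTable);
        if (!workerRunning) {
            workerRunning = true;
            Thread worker = new Thread(this::drainQueue, "istat-daemon");
            worker.setDaemon(true);
            worker.start();
        }
        return true;
    }

    private void drainQueue() {
        while (true) {
            String table;
            synchronized (this) {
                table = queue.poll();
                if (table == null) {
                    workerRunning = false;     // thread exits when queue is empty
                    return;
                }
            }
            regenerateStatsFor(table);         // scan all indexes of the base table
        }
    }

    private void regenerateStatsFor(String baseTable) {
        // placeholder: scan the table's indexes and compute new statistics
    }
}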

When new statistics are computed for the indexes of a table, they are
stored in the daemon. They require little memory (a table identifier
plus, per index, the index identifier, two longs and one int).
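
As a rough illustration, the per-table result kept in the daemon could look
something like the following; the field meanings are only my guess at what
the "two longs and one int" might be, not the prototype's actual layout:

// Hypothetical in-memory holder for computed statistics, kept by the
// daemon until written to the data dictionary. Field meanings are assumed.
import java.util.List;
import java.util.UUID;

class TableStatsResult {
    UUID tableId;                  // table identifier
    List<IndexStats> indexStats;   // one entry per index on the table

    static class IndexStats {
        UUID indexId;              // index identifier
        long rowCount;             // long #1 (assumed meaning)
        long uniqueValueCount;     // long #2 (assumed meaning)
        int leadingColumnCount;    // the int (assumed meaning)
    }
}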

As a statement is compiled, the optimizer will consider the available
indexes. At this point the index statistics are checked, and if we see
that an update has been scheduled, we check again later in the
compilation process whether it has completed. If we find new
statistics for the query being compiled, we also write any other
completed statistics to the data dictionary. Writing to the data
dictionary is currently done with a nested read-write user transaction
in the user transaction (during statement compilation) - mainly to avoid
keeping locks for an extended period of time.
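
The compile-time hand-off could be sketched roughly as follows; every type
and method here is a placeholder standing in for the prototype's logic, not
actual Derby API:

// Placeholder sketch of the compile-time checks described above; none of
// these types or methods are actual Derby classes.
import java.util.List;
import java.util.UUID;

class CompileTimeStatsCheck {
    interface StatsDaemon {
        void schedule(UUID tableId);
        List<TableStatsResult> drainCompletedResults();
    }
    interface StatsWriter {
        // writes one result inside a nested read-write transaction
        void writeInNestedTransaction(TableStatsResult result);
    }

    // Called when the optimizer considers an index on 'tableId'.
    void onIndexConsidered(UUID tableId, StatsDaemon daemon) {
        if (statsMissingOrStale(tableId)) {
            daemon.schedule(tableId);   // compilation does not wait for the job
        }
    }

    // Called later in compilation: write any completed statistics (for this
    // and other tables) to the data dictionary via nested transactions so
    // that locks are held only briefly.
    void flushCompletedStats(StatsDaemon daemon, StatsWriter writer) {
        for (TableStatsResult result : daemon.drainCompletedResults()) {
            writer.writeInNestedTransaction(result);
        }
    }

    private boolean statsMissingOrStale(UUID tableId) {
        return false;   // staleness heuristic goes here
    }
}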

To be clear, statement compilation/execution will not wait for new
statistics to be generated. In the case of large tables, it could take
hours to generate new stats.

Obvious weaknesses:
 o code organization (I don't know the code well) - choices made based on
   what worked and on reducing overhead (i.e., checking indexes when we
   have already obtained handles to them)
 o the async/decoupled data dictionary update - done to avoid having to
   create a LanguageConnectionContext (lcc).
 o logic/thresholds for determining when stats are stale
 o the row estimate logic also has weaknesses (for instance when mixing
   setting absolute values and updating the estimate based on deltas)

Other notes/characteristics of the prototype:
 o stats not generated/updated for system tables (caused locking problems)
 o lower limit on the row estimate (don't generate for tables with few rows)
 o I considered exposing the NO_WAIT option in the call to add new
   descriptors to the data dictionary. I don't know if this is needed if
   we update stats in a separate transaction from the daemon; in that
   case we could either use TransactionControl.setNoLockWait() or maybe
   even just wait?
 o current staleness code is dependent on reasonable row estimates
 o the "unit of work" is currently a base table - when scheduled all
   associated index statistics will be regenerated.
 o I suspect that most tests in suites.All run with the DBO as the user,
   and I haven't done anything specific to handle missing privileges.
 

[Prototype state]

Runs suites.All and derby.all with only four failures, all in
OrderByAndSortAvoidance. The tests fail on an assert for whether a table
scan is performed. To me it looks like the new stats make the
compiler/optimizer choose a different plan (not necessarily better in
terms of pages visited, but that's a DBA/optimizer issue).

Currently two flags control the prototype behavior:
 o derby.language.disableIndexStatsUpdate=*false*|true
 o derby.language.logIndexStatsUpdate=*false*|true

If you grep for 'istat' in derby.log, you should get all the lines
relevant to automatic index statistics update.
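
For example, to turn on the logging (or disable the feature altogether)
you can set the properties above as system properties before the engine is
booted, or put them in derby.properties:

// Enable logging of the automatic index statistics work; must be set
// before the Derby engine is booted.
System.setProperty("derby.language.logIndexStatsUpdate", "true");

// To disable the prototype entirely:
// System.setProperty("derby.language.disableIndexStatsUpdate", "true");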


[Potential improvements]
 o update data dictionary from the daemon thread
   (must then be able to create an appropriate lcc)

 o drift in the number of unique values isn't handled.
   Some potential remedies (raw ideas):

   Mechanism                       Distinct value drift     Row count change
   =========================================================================
   (a) compilation check                    N                       Y
   (b) timed check                          N                       Y
   (c) timed unconditional update           Y                       Y
   (d) UPDATE table SET ...                 y                       N

   In short:
   (a) creates statistics when none exist and kicks off an update job as
   soon as they are considered stale (i.e., when we believe we should
   have had better stats). (b) helps systems that are in a steady state
   (all statements compiled and reused) - it would typically check all
   user tables with indexes and perform the staleness check from (a).
   (c) would help against "anything" - but potentially with a large
   delay, and is only useful for applications where the database is up
   for very long periods of time (days, weeks, months). Intervals for
   (b) and (c) would have to be configurable. Mechanism (d) would help
   for updates changing a large percentage of the rows, but would not
   catch many small updates changing the selectivity of an index.
   (A simple row-count based staleness check is sketched after this
   list.)
   It may be possible to reuse BasicDaemon for the timed checks
   (scheduling only, work would still be performed in a separate thread).
   
 o do we need to throttle (a) the index scans, or (b) the processing
   rate of the scheduled jobs?
   (I started playing with a crude utilization rate)

 o almost as above, but we should take care to avoid "infinite loops"

 o at which point may a change in either the number of rows or the field
   values be big enough to warrant a recalculation of the stats?
   What's more costly: a sub-optimal plan or reading all the data?
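
As a concrete (if naive) example of the kind of staleness check used in
(a)/(b), and of the threshold question in the last bullet, something along
these lines could work; the thresholds here are made up, not the
prototype's actual values:

// Naive staleness heuristic based only on row count drift; as the table
// above shows, this catches row count changes but not distinct value
// drift. The thresholds are illustrative only.
class StalenessCheck {
    static final long MIN_ROWS_FOR_STATS = 100;   // skip tiny tables (assumed)
    static final double MAX_ROW_DRIFT = 0.5;      // regenerate after 50% change (assumed)

    static boolean statsAreStale(long currentRowEstimate,
                                 long rowsWhenStatsCreated) {
        if (currentRowEstimate < MIN_ROWS_FOR_STATS) {
            return false;                         // lower limit on the row estimate
        }
        if (rowsWhenStatsCreated <= 0) {
            return true;                          // no statistics exist yet
        }
        double drift = Math.abs(currentRowEstimate - rowsWhenStatsCreated)
                / (double) rowsWhenStatsCreated;
        return drift > MAX_ROW_DRIFT;
    }
}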


I'll be away for some weeks, but plan to return to this issue when I'm back.
My next steps depend on the feedback I get, but one way forward may be
to try to do the data dictionary update from the daemon itself. Once we get
the core framework in place, we can start working on all the various issues
that have to be addressed.

> Continue investigation of automatic creation/update of index statistics
> -----------------------------------------------------------------------
>
>                 Key: DERBY-4771
>                 URL: https://issues.apache.org/jira/browse/DERBY-4771
>             Project: Derby
>          Issue Type: Task
>            Reporter: Kristian Waagan
>         Attachments: derby-4771-1a-prototype_code_dump.diff, derby-4771-1a-prototype_code_dump.stat
>
>
> Work was started to improve Derby's handling of index statistics. This issue tracks further discussion and work for this task.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.