You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by "Kevin Hore (JIRA)" <de...@db.apache.org> on 2005/11/14 17:03:28 UTC
[jira] Commented: (DERBY-47) Some possible improvements to IN optimization

    [ http://issues.apache.org/jira/browse/DERBY-47?page=comments#action_12357601 ] 

Kevin Hore commented on DERBY-47:
---------------------------------

I have also found this issue to be a problem and would like to know whether there are any plans to fix it?

What follows is a copy of discussion from the derby-users mailing list form 2005/11/11 "Poor query optimizer choices is making Derby unusable for large tables", it describes how the same behaviour is causing problems for a query with a GROUP BY clause:

-----------------------------------------------------------------------
Hi Kevin,

Kevin Hore wrote:


>> i) Does anyone have any plans to fix this problem?


Can you file an enhancement request for this? I think Derby could
improve it's handling of OR/IN clauses. Many databases don't optimize OR
clauses as much as possible, though some do better than others. It would
be great if Derby could internally process this as two different scans
(one for 'CONTACT' and another for 'ADD') and then combine the results.
Some databases can do this. However, the workaround suggested by Jeff L.
does this, though you have to rewrite the query.

Satheesh


>> ii) In the meantime, are there any work-arounds? I'd appreciate any
>> suggestions that would decrease the execution time of my second query
>> below (the one with with two search terms). Likewise, any general
>> strategies for avoiding this problem with IN clauses would be
>> appreciated.
>>
>>
>> ----PROBLEM DESCRIPTION----
>>
>> Consider the table:
>>
>> CREATE TABLE tblSearchDictionary
>> (
>> ObjectId int NOT NULL,
>> ObjectType int NOT NULL,
>> Word VARCHAR(64) NOT NULL,
>> WordLocation int NOT NULL,
>> CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>> );
>>
>> This table has an index on each of the four columns, it also has the
>> unique index across all four columns as defined above:
>>
>> CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
>> (ObjectId);
>> CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>> (ObjectType);
>> CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>> CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>> (WordLocation);
>>
>> The table contains about 260,000 rows.
>>
>> The following query selects all rows that match instances of string in
>> the Word column. It sums the WordLocation column having grouped by the
>> ObjectId column.
>>
>> SELECT ObjectId, SUM(WordLocation) AS Score
>> FROM tblSearchDictionary
>> WHERE Word = 'CONTACT'
>> GROUP BY ObjectId;
>>
>> On my machine this will usually complete in an acceptable time of
>> around 200ms.
>>
>> Now consider the following query which adds a second search term on
>> the same column.
>>
>> SELECT ObjectId, SUM(WordLocation) AS Score
>> FROM tblSearchDictionary
>> WHERE Word = 'CONTACT' OR Word = 'ADD'
>> GROUP BY ObjectId;
>>
>> This second query usually takes around 10000ms on my machine. My
>> understanding from the Derby optimizer docs and DERBY-47 is that this
>> is because Derby is re-writing the query along the following lines,
>> and then choosing to do a table scan:
>>
>> SELECT ObjectId, SUM(WordLocation) AS Score
>> FROM tblSearchDictionary
>> WHERE
>>   Word IN ('CONTACT', 'ADD')
>>   AND Word >= 'ADD'
>>   AND Word <= 'CONTACT'
>> GROUP BY ObjectId;
>>
>> The plan for the first query indicates that the tblSearchDictionaryWord
>> index is used to perform an index scan. However, the plan for the second
>> query indicates that the majority of the additional time is taken
>> performing a table scan over the entire table, instead of making use of
>> the indexes available. Our application uses IN quite frequently, so
>> this optimizer behaviour would seem to present a significant problem.
>>
>> ---QUERY PLAN FOR FIRST QUERY----
>>
>> Statement Name:
>>     null
>> Statement Text:
>>     SELECT
>>     ObjectId,
>>     SUM(WordLocation) AS Score
>> FROM tblSearchDictionary
>> WHERE
>>         Word = 'CONTACT'
>> GROUP BY
>>     ObjectId
>>
>> Parse Time: 0
>> Bind Time: 0
>> Optimize Time: 16
>> Generate Time: 0
>> Compile Time: 16
>> Execute Time: 0
>> Begin Compilation Timestamp : 2005-11-11 12:28:52.765
>> End Compilation Timestamp : 2005-11-11 12:28:52.781
>> Begin Execution Timestamp : 2005-11-11 13:08:15.406
>> End Execution Timestamp : 2005-11-11 13:08:15.406
>> Statement Execution Plan Text:
>> Project-Restrict ResultSet (5):
>> Number of opens = 1
>> Rows seen = 93
>> Rows filtered = 0
>> restriction = false
>> projection = true
>>     constructor time (milliseconds) = 0
>>     open time (milliseconds) = 0
>>     next time (milliseconds) = 0
>>     close time (milliseconds) = 0
>>     restriction time (milliseconds) = 0
>>     projection time (milliseconds) = 0
>>     optimizer estimated row count:            1.00
>>     optimizer estimated cost:          226.00
>>
>> Source result set:
>>     Grouped Aggregate ResultSet:
>>     Number of opens = 1
>>     Rows input = 113
>>     Has distinct aggregate = false
>>     In sorted order = false
>>     Sort information:
>>         Number of rows input=113
>>         Number of rows output=93
>>         Sort type=internal
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 0
>>         next time (milliseconds) = 0
>>         close time (milliseconds) = 0
>>         optimizer estimated row count:            1.00
>>         optimizer estimated cost:          226.00
>>
>>     Source result set:
>>         Project-Restrict ResultSet (4):
>>         Number of opens = 1
>>         Rows seen = 113
>>         Rows filtered = 0
>>         restriction = false
>>         projection = true
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 0
>>             next time (milliseconds) = 0
>>             close time (milliseconds) = 0
>>             restriction time (milliseconds) = 0
>>             projection time (milliseconds) = 0
>>             optimizer estimated row count:          118.00
>>             optimizer estimated cost:          226.00
>>
>>         Source result set:
>>             Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>>             Number of opens = 1
>>             Rows seen = 113
>>             Columns accessed from heap = {0, 3}
>>                 constructor time (milliseconds) = 0
>>                 open time (milliseconds) = 0
>>                 next time (milliseconds) = 0
>>                 close time (milliseconds) = 0
>>                 optimizer estimated row count:          118.00
>>                 optimizer estimated cost:          226.00
>>
>>                 Index Scan ResultSet for TBLSEARCHDICTIONARY using index
>> TBLSEARCHDICTIONARYWORD at read committed isolation level using share
>> row locking chosen by the optimizer
>>                 Number of opens = 1
>>                 Rows seen = 113
>>                 Rows filtered = 0
>>                 Fetch Size = 1
>>                     constructor time (milliseconds) = 0
>>                     open time (milliseconds) = 0
>>                     next time (milliseconds) = 0
>>                     close time (milliseconds) = 0
>>                     next time in milliseconds/row = 0
>>
>>                 scan information:
>>                     Bit set of columns fetched=All
>>                     Number of columns fetched=2
>>                     Number of deleted rows visited=0
>>                     Number of pages visited=4
>>                     Number of rows qualified=113
>>                     Number of rows visited=114
>>                     Scan type=btree
>>                     Tree height=3
>>                     start position:
>>     >= on first 1 column(s).
>>     Ordered null semantics on the following columns:
>> 0
>>                     stop position:
>>     > on first 1 column(s).
>>     Ordered null semantics on the following columns:
>> 0
>>                     qualifiers:
>> None
>>                     optimizer estimated row count:          118.00
>>                     optimizer estimated cost:          226.00
>>
>>
>> ---QUERY PLAN FOR SECOND QUERY----
>>
>> Statement Name:
>>     null
>> Statement Text:
>>     SELECT
>>     ObjectId,
>>     SUM(WordLocation) AS Score
>> FROM tblSearchDictionary
>> WHERE
>>         Word = 'CONTACT' OR  Word = 'ADD'
>> GROUP BY
>>     ObjectId
>>
>> Parse Time: 0
>> Bind Time: 0
>> Optimize Time: 0
>> Generate Time: 15
>> Compile Time: 15
>> Execute Time: 4250
>> Begin Compilation Timestamp : 2005-11-11 13:16:17.578
>> End Compilation Timestamp : 2005-11-11 13:16:17.593
>> Begin Execution Timestamp : 2005-11-11 13:16:17.593
>> End Execution Timestamp : 2005-11-11 13:16:27.437
>> Statement Execution Plan Text:
>> Project-Restrict ResultSet (5):
>> Number of opens = 1
>> Rows seen = 100
>> Rows filtered = 0
>> restriction = false
>> projection = true
>>     constructor time (milliseconds) = 0
>>     open time (milliseconds) = 4250
>>     next time (milliseconds) = 0
>>     close time (milliseconds) = 0
>>     restriction time (milliseconds) = 0
>>     projection time (milliseconds) = 0
>>     optimizer estimated row count:            1.00
>>     optimizer estimated cost:        82959.49
>>
>> Source result set:
>>     Grouped Aggregate ResultSet:
>>     Number of opens = 1
>>     Rows input = 712
>>     Has distinct aggregate = false
>>     In sorted order = false
>>     Sort information:
>>         Number of rows input=712
>>         Number of rows output=593
>>         Sort type=internal
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 4250
>>         next time (milliseconds) = 0
>>         close time (milliseconds) = 0
>>         optimizer estimated row count:            1.00
>>         optimizer estimated cost:        82959.49
>>
>>     Source result set:
>>         Project-Restrict ResultSet (4):
>>         Number of opens = 1
>>         Rows seen = 712
>>         Rows filtered = 0
>>         restriction = false
>>         projection = true
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 0
>>             next time (milliseconds) = 4219
>>             close time (milliseconds) = 15
>>             restriction time (milliseconds) = 0
>>             projection time (milliseconds) = 0
>>             optimizer estimated row count:        19200.45
>>             optimizer estimated cost:        82959.49
>>
>>         Source result set:
>>             Project-Restrict ResultSet (3):
>>             Number of opens = 1
>>             Rows seen = 40806
>>             Rows filtered = 40094
>>             restriction = true
>>             projection = false
>>                 constructor time (milliseconds) = 0
>>                 open time (milliseconds) = 0
>>                 next time (milliseconds) = 4219
>>                 close time (milliseconds) = 15
>>                 restriction time (milliseconds) = 124
>>                 projection time (milliseconds) = 0
>>                 optimizer estimated row count:        19200.45
>>                 optimizer estimated cost:        82959.49
>>
>>             Source result set:
>>                 Table Scan ResultSet for TBLSEARCHDICTIONARY at read
>> committed
>> isolation level using share row locking chosen by the optimizer
>>                 Number of opens = 1
>>                 Rows seen = 40806
>>                 Rows filtered = 0
>>                 Fetch Size = 1
>>                     constructor time (milliseconds) = 0
>>                     open time (milliseconds) = 0
>>                     next time (milliseconds) = 4001
>>                     close time (milliseconds) = 15
>>                     next time in milliseconds/row = 0
>>
>>                 scan information:
>>                     Bit set of columns fetched={0, 2, 3}
>>                     Number of columns fetched=3
>>                     Number of pages visited=2978
>>                     Number of rows qualified=40806
>>                     Number of rows visited=256001
>>                     Scan type=heap
>>                     start position:
>> null                    stop position:
>> null                    qualifiers:
>> Column[0][0] Id: 2
>> Operator: <
>> Ordered nulls: false
>> Unknown return value: true
>> Negate comparison result: true
>> Column[0][1] Id: 2
>> Operator: <=
>> Ordered nulls: false
>> Unknown return value: false
>> Negate comparison result: false
>>
>>                     optimizer estimated row count:        19200.45
>>                     optimizer estimated cost:        82959.49
>>
>> ----------
>>
>> Thanks in advance for any help!
>>
>> Kind regards,
>>
>>
>> Kevin Hore
>>
>>
>>

> Some possible improvements to IN optimization
> ---------------------------------------------
>
>          Key: DERBY-47
>          URL: http://issues.apache.org/jira/browse/DERBY-47
>      Project: Derby
>         Type: Improvement
>   Components: SQL
>     Versions: 10.0.2.0
>  Environment: all
>     Reporter: Sunitha Kambhampati

>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has only two values and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN list to get the start and stop values for the scan. Thus the range of the values in the "IN" list here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an index scan would be chosen.
> It would be nice if we could do something clever in this case where there is clearly an index on the field and the number of values in the IN list is known. Maybe use the rowcount estimate and the IN list size to do some optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use the IN list values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a temporary table and do a join. It is most likely that the optimizer will choose the table with "IN" list here as the outer table in the join and thus will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for some similar queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from scanfixed where i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet (2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:          750.38
> 	optimizer estimated cost:         8579.46
> Source result set:
> 	Table Scan ResultSet for SCANFIXED at read committed isolation level using instantaneous share row locking chosen by the optimizer
> 	Number of opens = 1
> 	Rows seen = 10000
> 	Rows filtered = 0
> 	Fetch Size = 16
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		next time in milliseconds/row = 0
> 	scan information: 
> 		Bit set of columns fetched=All
> 		Number of columns fetched=9
> 		Number of pages visited=417
> 		Number of rows qualified=10000
> 		Number of rows visited=10000
> 		Scan type=heap
> 		start position: 
> null		stop position: 
> null		qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> 		optimizer estimated row count:          750.38
> 		optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 0), select * from scanfixed where i1 in (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:            4.80
> 	optimizer estimated cost:           39.53
> Source result set:
> 	Index Row to Base Row ResultSet for SCANFIXED:
> 	Number of opens = 1
> 	Rows seen = 10
> 	Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		optimizer estimated row count:            4.80
> 		optimizer estimated cost:           39.53
> 		Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at read committed isolation level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 10
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information: 
> 			Bit set of columns fetched=All
> 			Number of columns fetched=2
> 			Number of deleted rows visited=0
> 			Number of pages visited=2
> 			Number of rows qualified=10
> 			Number of rows visited=10
> 			Scan type=btree
> 			Tree height=2
> 			start position: 
> 	>= on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			stop position: 
> 	> on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			qualifiers:
> None
> 			optimizer estimated row count:            4.80
> 			optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira