You are viewing a plain text version of this content. The canonical link for it is here.

Posted to derby-user@db.apache.org by Kevin Hore <kh...@araxis.com> on 2005/11/11 15:59:51 UTC

Poor query optimizer choices is making Derby unusable for large tables

The Derby query optimizer (this is with 10.1.1.0) decides to perform an 
expensive table scan in certain circumstances, when it could make use of 
available indexes. The resulting poor performance is rendering Derby 
useless for our application.

I believe that this behaviour might be related to DERBY-47 
(http://issues.apache.org/jira/browse/DERBY-47), which was opened in 
October 2004. However, this seems such a severe and fundamental problem 
that I'm surprised firstly that a significant portion of the Derby user 
base isn't affected by this and, secondly, that DERBY-47 hasn't been 
considered a critical defect to be fixed with the utmost urgency.

I've described the problem in detail below, and I'd appreciate any 
assistance. Specifically:

i) Does anyone have any plans to fix this problem?

ii) In the meantime, are there any work-arounds? I’d appreciate any 
suggestions that would decrease the execution time of my second query 
below (the one with with two search terms). Likewise, any general 
strategies for avoiding this problem with IN clauses would be appreciated.


----PROBLEM DESCRIPTION----

Consider the table:

CREATE TABLE tblSearchDictionary
(
ObjectId int NOT NULL,
ObjectType int NOT NULL,
Word VARCHAR(64) NOT NULL,
WordLocation int NOT NULL,
CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
);

This table has an index on each of the four columns, it also has the
unique index across all four columns as defined above:

CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary (ObjectId);
CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
(ObjectType);
CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
(WordLocation);

The table contains about 260,000 rows.

The following query selects all rows that match instances of string in
the Word column. It sums the WordLocation column having grouped by the
ObjectId column.

SELECT ObjectId, SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE Word = 'CONTACT'
GROUP BY ObjectId;

On my machine this will usually complete in an acceptable time of around 
200ms.

Now consider the following query which adds a second search term on the 
same column.

SELECT ObjectId, SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE Word = 'CONTACT' OR Word = 'ADD'
GROUP BY ObjectId;

This second query usually takes around 10000ms on my machine. My 
understanding from the Derby optimizer docs and DERBY-47 is that this is 
because Derby is re-writing the query along the following lines, and 
then choosing to do a table scan:

SELECT ObjectId, SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE
   Word IN ('CONTACT', 'ADD')
   AND Word >= 'ADD'
   AND Word <= 'CONTACT'
GROUP BY ObjectId;

The plan for the first query indicates that the tblSearchDictionaryWord
index is used to perform an index scan. However, the plan for the second
query indicates that the majority of the additional time is taken
performing a table scan over the entire table, instead of making use of
the indexes available. Our application uses IN quite frequently, so this 
optimizer behaviour would seem to present a significant problem.

---QUERY PLAN FOR FIRST QUERY----

Statement Name:
     null
Statement Text:
     SELECT
     ObjectId,
     SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE
         Word = 'CONTACT'
GROUP BY
     ObjectId

Parse Time: 0
Bind Time: 0
Optimize Time: 16
Generate Time: 0
Compile Time: 16
Execute Time: 0
Begin Compilation Timestamp : 2005-11-11 12:28:52.765
End Compilation Timestamp : 2005-11-11 12:28:52.781
Begin Execution Timestamp : 2005-11-11 13:08:15.406
End Execution Timestamp : 2005-11-11 13:08:15.406
Statement Execution Plan Text:
Project-Restrict ResultSet (5):
Number of opens = 1
Rows seen = 93
Rows filtered = 0
restriction = false
projection = true
     constructor time (milliseconds) = 0
     open time (milliseconds) = 0
     next time (milliseconds) = 0
     close time (milliseconds) = 0
     restriction time (milliseconds) = 0
     projection time (milliseconds) = 0
     optimizer estimated row count:            1.00
     optimizer estimated cost:          226.00

Source result set:
     Grouped Aggregate ResultSet:
     Number of opens = 1
     Rows input = 113
     Has distinct aggregate = false
     In sorted order = false
     Sort information:
         Number of rows input=113
         Number of rows output=93
         Sort type=internal
         constructor time (milliseconds) = 0
         open time (milliseconds) = 0
         next time (milliseconds) = 0
         close time (milliseconds) = 0
         optimizer estimated row count:            1.00
         optimizer estimated cost:          226.00

     Source result set:
         Project-Restrict ResultSet (4):
         Number of opens = 1
         Rows seen = 113
         Rows filtered = 0
         restriction = false
         projection = true
             constructor time (milliseconds) = 0
             open time (milliseconds) = 0
             next time (milliseconds) = 0
             close time (milliseconds) = 0
             restriction time (milliseconds) = 0
             projection time (milliseconds) = 0
             optimizer estimated row count:          118.00
             optimizer estimated cost:          226.00

         Source result set:
             Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
             Number of opens = 1
             Rows seen = 113
             Columns accessed from heap = {0, 3}
                 constructor time (milliseconds) = 0
                 open time (milliseconds) = 0
                 next time (milliseconds) = 0
                 close time (milliseconds) = 0
                 optimizer estimated row count:          118.00
                 optimizer estimated cost:          226.00

                 Index Scan ResultSet for TBLSEARCHDICTIONARY using index
TBLSEARCHDICTIONARYWORD at read committed isolation level using share
row locking chosen by the optimizer
                 Number of opens = 1
                 Rows seen = 113
                 Rows filtered = 0
                 Fetch Size = 1
                     constructor time (milliseconds) = 0
                     open time (milliseconds) = 0
                     next time (milliseconds) = 0
                     close time (milliseconds) = 0
                     next time in milliseconds/row = 0

                 scan information:
                     Bit set of columns fetched=All
                     Number of columns fetched=2
                     Number of deleted rows visited=0
                     Number of pages visited=4
                     Number of rows qualified=113
                     Number of rows visited=114
                     Scan type=btree
                     Tree height=3
                     start position:
     >= on first 1 column(s).
     Ordered null semantics on the following columns:
0
                     stop position:
     > on first 1 column(s).
     Ordered null semantics on the following columns:
0
                     qualifiers:
None
                     optimizer estimated row count:          118.00
                     optimizer estimated cost:          226.00


---QUERY PLAN FOR SECOND QUERY----

Statement Name:
     null
Statement Text:
     SELECT
     ObjectId,
     SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE
         Word = 'CONTACT' OR  Word = 'ADD'
GROUP BY
     ObjectId

Parse Time: 0
Bind Time: 0
Optimize Time: 0
Generate Time: 15
Compile Time: 15
Execute Time: 4250
Begin Compilation Timestamp : 2005-11-11 13:16:17.578
End Compilation Timestamp : 2005-11-11 13:16:17.593
Begin Execution Timestamp : 2005-11-11 13:16:17.593
End Execution Timestamp : 2005-11-11 13:16:27.437
Statement Execution Plan Text:
Project-Restrict ResultSet (5):
Number of opens = 1
Rows seen = 100
Rows filtered = 0
restriction = false
projection = true
     constructor time (milliseconds) = 0
     open time (milliseconds) = 4250
     next time (milliseconds) = 0
     close time (milliseconds) = 0
     restriction time (milliseconds) = 0
     projection time (milliseconds) = 0
     optimizer estimated row count:            1.00
     optimizer estimated cost:        82959.49

Source result set:
     Grouped Aggregate ResultSet:
     Number of opens = 1
     Rows input = 712
     Has distinct aggregate = false
     In sorted order = false
     Sort information:
         Number of rows input=712
         Number of rows output=593
         Sort type=internal
         constructor time (milliseconds) = 0
         open time (milliseconds) = 4250
         next time (milliseconds) = 0
         close time (milliseconds) = 0
         optimizer estimated row count:            1.00
         optimizer estimated cost:        82959.49

     Source result set:
         Project-Restrict ResultSet (4):
         Number of opens = 1
         Rows seen = 712
         Rows filtered = 0
         restriction = false
         projection = true
             constructor time (milliseconds) = 0
             open time (milliseconds) = 0
             next time (milliseconds) = 4219
             close time (milliseconds) = 15
             restriction time (milliseconds) = 0
             projection time (milliseconds) = 0
             optimizer estimated row count:        19200.45
             optimizer estimated cost:        82959.49

         Source result set:
             Project-Restrict ResultSet (3):
             Number of opens = 1
             Rows seen = 40806
             Rows filtered = 40094
             restriction = true
             projection = false
                 constructor time (milliseconds) = 0
                 open time (milliseconds) = 0
                 next time (milliseconds) = 4219
                 close time (milliseconds) = 15
                 restriction time (milliseconds) = 124
                 projection time (milliseconds) = 0
                 optimizer estimated row count:        19200.45
                 optimizer estimated cost:        82959.49

             Source result set:
                 Table Scan ResultSet for TBLSEARCHDICTIONARY at read 
committed
isolation level using share row locking chosen by the optimizer
                 Number of opens = 1
                 Rows seen = 40806
                 Rows filtered = 0
                 Fetch Size = 1
                     constructor time (milliseconds) = 0
                     open time (milliseconds) = 0
                     next time (milliseconds) = 4001
                     close time (milliseconds) = 15
                     next time in milliseconds/row = 0

                 scan information:
                     Bit set of columns fetched={0, 2, 3}
                     Number of columns fetched=3
                     Number of pages visited=2978
                     Number of rows qualified=40806
                     Number of rows visited=256001
                     Scan type=heap
                     start position:
null                    stop position:
null                    qualifiers:
Column[0][0] Id: 2
Operator: <
Ordered nulls: false
Unknown return value: true
Negate comparison result: true
Column[0][1] Id: 2
Operator: <=
Ordered nulls: false
Unknown return value: false
Negate comparison result: false

                     optimizer estimated row count:        19200.45
                     optimizer estimated cost:        82959.49

----------

Thanks in advance for any help!

Kind regards,


Kevin Hore

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Kevin Hore <kh...@araxis.com>.

Hi Satheesh

Satheesh Bandaram wrote:
> Hi Kevin,
> 
> Kevin Hore wrote:
> 
> 
>>i) Does anyone have any plans to fix this problem?
> 
> 
> Can you file an enhancement request for this? I think Derby could
> improve it's handling of OR/IN clauses. Many databases don't optimize OR
> clauses as much as possible, though some do better than others. It would
> be great if Derby could internally process this as two different scans
> (one for 'CONTACT' and another for 'ADD') and then combine the results.
> Some databases can do this. However, the workaround suggested by Jeff L.
> does this, though you have to rewrite the query.
I've commented on the re-write solution elsewhere. Regarding an
enhancement, I think DERBY-47 pretty much covers the problem, but I'll
add a comment to that describing my circumstances.

> 
> Satheesh
> 
> 
>>ii) In the meantime, are there any work-arounds? I’d appreciate any
>>suggestions that would decrease the execution time of my second query
>>below (the one with with two search terms). Likewise, any general
>>strategies for avoiding this problem with IN clauses would be
>>appreciated.
>>
>>
>>----PROBLEM DESCRIPTION----
>>
>>Consider the table:
>>
>>CREATE TABLE tblSearchDictionary
>>(
>>ObjectId int NOT NULL,
>>ObjectType int NOT NULL,
>>Word VARCHAR(64) NOT NULL,
>>WordLocation int NOT NULL,
>>CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>>);
>>
>>This table has an index on each of the four columns, it also has the
>>unique index across all four columns as defined above:
>>
>>CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
>>(ObjectId);
>>CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>>(ObjectType);
>>CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>>CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>>(WordLocation);
>>
>>The table contains about 260,000 rows.
>>
>>The following query selects all rows that match instances of string in
>>the Word column. It sums the WordLocation column having grouped by the
>>ObjectId column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT'
>>GROUP BY ObjectId;
>>
>>On my machine this will usually complete in an acceptable time of
>>around 200ms.
>>
>>Now consider the following query which adds a second search term on
>>the same column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT' OR Word = 'ADD'
>>GROUP BY ObjectId;
>>
>>This second query usually takes around 10000ms on my machine. My
>>understanding from the Derby optimizer docs and DERBY-47 is that this
>>is because Derby is re-writing the query along the following lines,
>>and then choosing to do a table scan:
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>  Word IN ('CONTACT', 'ADD')
>>  AND Word >= 'ADD'
>>  AND Word <= 'CONTACT'
>>GROUP BY ObjectId;
>>
>>The plan for the first query indicates that the tblSearchDictionaryWord
>>index is used to perform an index scan. However, the plan for the second
>>query indicates that the majority of the additional time is taken
>>performing a table scan over the entire table, instead of making use of
>>the indexes available. Our application uses IN quite frequently, so
>>this optimizer behaviour would seem to present a significant problem.
>>
>>---QUERY PLAN FOR FIRST QUERY----
>>
>>Statement Name:
>>    null
>>Statement Text:
>>    SELECT
>>    ObjectId,
>>    SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>        Word = 'CONTACT'
>>GROUP BY
>>    ObjectId
>>
>>Parse Time: 0
>>Bind Time: 0
>>Optimize Time: 16
>>Generate Time: 0
>>Compile Time: 16
>>Execute Time: 0
>>Begin Compilation Timestamp : 2005-11-11 12:28:52.765
>>End Compilation Timestamp : 2005-11-11 12:28:52.781
>>Begin Execution Timestamp : 2005-11-11 13:08:15.406
>>End Execution Timestamp : 2005-11-11 13:08:15.406
>>Statement Execution Plan Text:
>>Project-Restrict ResultSet (5):
>>Number of opens = 1
>>Rows seen = 93
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>    constructor time (milliseconds) = 0
>>    open time (milliseconds) = 0
>>    next time (milliseconds) = 0
>>    close time (milliseconds) = 0
>>    restriction time (milliseconds) = 0
>>    projection time (milliseconds) = 0
>>    optimizer estimated row count:            1.00
>>    optimizer estimated cost:          226.00
>>
>>Source result set:
>>    Grouped Aggregate ResultSet:
>>    Number of opens = 1
>>    Rows input = 113
>>    Has distinct aggregate = false
>>    In sorted order = false
>>    Sort information:
>>        Number of rows input=113
>>        Number of rows output=93
>>        Sort type=internal
>>        constructor time (milliseconds) = 0
>>        open time (milliseconds) = 0
>>        next time (milliseconds) = 0
>>        close time (milliseconds) = 0
>>        optimizer estimated row count:            1.00
>>        optimizer estimated cost:          226.00
>>
>>    Source result set:
>>        Project-Restrict ResultSet (4):
>>        Number of opens = 1
>>        Rows seen = 113
>>        Rows filtered = 0
>>        restriction = false
>>        projection = true
>>            constructor time (milliseconds) = 0
>>            open time (milliseconds) = 0
>>            next time (milliseconds) = 0
>>            close time (milliseconds) = 0
>>            restriction time (milliseconds) = 0
>>            projection time (milliseconds) = 0
>>            optimizer estimated row count:          118.00
>>            optimizer estimated cost:          226.00
>>
>>        Source result set:
>>            Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>>            Number of opens = 1
>>            Rows seen = 113
>>            Columns accessed from heap = {0, 3}
>>                constructor time (milliseconds) = 0
>>                open time (milliseconds) = 0
>>                next time (milliseconds) = 0
>>                close time (milliseconds) = 0
>>                optimizer estimated row count:          118.00
>>                optimizer estimated cost:          226.00
>>
>>                Index Scan ResultSet for TBLSEARCHDICTIONARY using index
>>TBLSEARCHDICTIONARYWORD at read committed isolation level using share
>>row locking chosen by the optimizer
>>                Number of opens = 1
>>                Rows seen = 113
>>                Rows filtered = 0
>>                Fetch Size = 1
>>                    constructor time (milliseconds) = 0
>>                    open time (milliseconds) = 0
>>                    next time (milliseconds) = 0
>>                    close time (milliseconds) = 0
>>                    next time in milliseconds/row = 0
>>
>>                scan information:
>>                    Bit set of columns fetched=All
>>                    Number of columns fetched=2
>>                    Number of deleted rows visited=0
>>                    Number of pages visited=4
>>                    Number of rows qualified=113
>>                    Number of rows visited=114
>>                    Scan type=btree
>>                    Tree height=3
>>                    start position:
>>    >= on first 1 column(s).
>>    Ordered null semantics on the following columns:
>>0
>>                    stop position:
>>    > on first 1 column(s).
>>    Ordered null semantics on the following columns:
>>0
>>                    qualifiers:
>>None
>>                    optimizer estimated row count:          118.00
>>                    optimizer estimated cost:          226.00
>>
>>
>>---QUERY PLAN FOR SECOND QUERY----
>>
>>Statement Name:
>>    null
>>Statement Text:
>>    SELECT
>>    ObjectId,
>>    SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>        Word = 'CONTACT' OR  Word = 'ADD'
>>GROUP BY
>>    ObjectId
>>
>>Parse Time: 0
>>Bind Time: 0
>>Optimize Time: 0
>>Generate Time: 15
>>Compile Time: 15
>>Execute Time: 4250
>>Begin Compilation Timestamp : 2005-11-11 13:16:17.578
>>End Compilation Timestamp : 2005-11-11 13:16:17.593
>>Begin Execution Timestamp : 2005-11-11 13:16:17.593
>>End Execution Timestamp : 2005-11-11 13:16:27.437
>>Statement Execution Plan Text:
>>Project-Restrict ResultSet (5):
>>Number of opens = 1
>>Rows seen = 100
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>    constructor time (milliseconds) = 0
>>    open time (milliseconds) = 4250
>>    next time (milliseconds) = 0
>>    close time (milliseconds) = 0
>>    restriction time (milliseconds) = 0
>>    projection time (milliseconds) = 0
>>    optimizer estimated row count:            1.00
>>    optimizer estimated cost:        82959.49
>>
>>Source result set:
>>    Grouped Aggregate ResultSet:
>>    Number of opens = 1
>>    Rows input = 712
>>    Has distinct aggregate = false
>>    In sorted order = false
>>    Sort information:
>>        Number of rows input=712
>>        Number of rows output=593
>>        Sort type=internal
>>        constructor time (milliseconds) = 0
>>        open time (milliseconds) = 4250
>>        next time (milliseconds) = 0
>>        close time (milliseconds) = 0
>>        optimizer estimated row count:            1.00
>>        optimizer estimated cost:        82959.49
>>
>>    Source result set:
>>        Project-Restrict ResultSet (4):
>>        Number of opens = 1
>>        Rows seen = 712
>>        Rows filtered = 0
>>        restriction = false
>>        projection = true
>>            constructor time (milliseconds) = 0
>>            open time (milliseconds) = 0
>>            next time (milliseconds) = 4219
>>            close time (milliseconds) = 15
>>            restriction time (milliseconds) = 0
>>            projection time (milliseconds) = 0
>>            optimizer estimated row count:        19200.45
>>            optimizer estimated cost:        82959.49
>>
>>        Source result set:
>>            Project-Restrict ResultSet (3):
>>            Number of opens = 1
>>            Rows seen = 40806
>>            Rows filtered = 40094
>>            restriction = true
>>            projection = false
>>                constructor time (milliseconds) = 0
>>                open time (milliseconds) = 0
>>                next time (milliseconds) = 4219
>>                close time (milliseconds) = 15
>>                restriction time (milliseconds) = 124
>>                projection time (milliseconds) = 0
>>                optimizer estimated row count:        19200.45
>>                optimizer estimated cost:        82959.49
>>
>>            Source result set:
>>                Table Scan ResultSet for TBLSEARCHDICTIONARY at read
>>committed
>>isolation level using share row locking chosen by the optimizer
>>                Number of opens = 1
>>                Rows seen = 40806
>>                Rows filtered = 0
>>                Fetch Size = 1
>>                    constructor time (milliseconds) = 0
>>                    open time (milliseconds) = 0
>>                    next time (milliseconds) = 4001
>>                    close time (milliseconds) = 15
>>                    next time in milliseconds/row = 0
>>
>>                scan information:
>>                    Bit set of columns fetched={0, 2, 3}
>>                    Number of columns fetched=3
>>                    Number of pages visited=2978
>>                    Number of rows qualified=40806
>>                    Number of rows visited=256001
>>                    Scan type=heap
>>                    start position:
>>null                    stop position:
>>null                    qualifiers:
>>Column[0][0] Id: 2
>>Operator: <
>>Ordered nulls: false
>>Unknown return value: true
>>Negate comparison result: true
>>Column[0][1] Id: 2
>>Operator: <=
>>Ordered nulls: false
>>Unknown return value: false
>>Negate comparison result: false
>>
>>                    optimizer estimated row count:        19200.45
>>                    optimizer estimated cost:        82959.49
>>
>>----------
>>
>>Thanks in advance for any help!
>>
>>Kind regards,
>>
>>
>>Kevin Hore
>>
>>
>>
> 
>

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Satheesh Bandaram <sa...@Sourcery.Org>.

Hi Kevin,

Kevin Hore wrote:

> i) Does anyone have any plans to fix this problem?

Can you file an enhancement request for this? I think Derby could
improve it's handling of OR/IN clauses. Many databases don't optimize OR
clauses as much as possible, though some do better than others. It would
be great if Derby could internally process this as two different scans
(one for 'CONTACT' and another for 'ADD') and then combine the results.
Some databases can do this. However, the workaround suggested by Jeff L.
does this, though you have to rewrite the query.

Satheesh

> ii) In the meantime, are there any work-arounds? I’d appreciate any
> suggestions that would decrease the execution time of my second query
> below (the one with with two search terms). Likewise, any general
> strategies for avoiding this problem with IN clauses would be
> appreciated.
>
>
> ----PROBLEM DESCRIPTION----
>
> Consider the table:
>
> CREATE TABLE tblSearchDictionary
> (
> ObjectId int NOT NULL,
> ObjectType int NOT NULL,
> Word VARCHAR(64) NOT NULL,
> WordLocation int NOT NULL,
> CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
> );
>
> This table has an index on each of the four columns, it also has the
> unique index across all four columns as defined above:
>
> CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
> (ObjectId);
> CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
> (ObjectType);
> CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
> CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
> (WordLocation);
>
> The table contains about 260,000 rows.
>
> The following query selects all rows that match instances of string in
> the Word column. It sums the WordLocation column having grouped by the
> ObjectId column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT'
> GROUP BY ObjectId;
>
> On my machine this will usually complete in an acceptable time of
> around 200ms.
>
> Now consider the following query which adds a second search term on
> the same column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT' OR Word = 'ADD'
> GROUP BY ObjectId;
>
> This second query usually takes around 10000ms on my machine. My
> understanding from the Derby optimizer docs and DERBY-47 is that this
> is because Derby is re-writing the query along the following lines,
> and then choosing to do a table scan:
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
>   Word IN ('CONTACT', 'ADD')
>   AND Word >= 'ADD'
>   AND Word <= 'CONTACT'
> GROUP BY ObjectId;
>
> The plan for the first query indicates that the tblSearchDictionaryWord
> index is used to perform an index scan. However, the plan for the second
> query indicates that the majority of the additional time is taken
> performing a table scan over the entire table, instead of making use of
> the indexes available. Our application uses IN quite frequently, so
> this optimizer behaviour would seem to present a significant problem.
>
> ---QUERY PLAN FOR FIRST QUERY----
>
> Statement Name:
>     null
> Statement Text:
>     SELECT
>     ObjectId,
>     SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
>         Word = 'CONTACT'
> GROUP BY
>     ObjectId
>
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 16
> Generate Time: 0
> Compile Time: 16
> Execute Time: 0
> Begin Compilation Timestamp : 2005-11-11 12:28:52.765
> End Compilation Timestamp : 2005-11-11 12:28:52.781
> Begin Execution Timestamp : 2005-11-11 13:08:15.406
> End Execution Timestamp : 2005-11-11 13:08:15.406
> Statement Execution Plan Text:
> Project-Restrict ResultSet (5):
> Number of opens = 1
> Rows seen = 93
> Rows filtered = 0
> restriction = false
> projection = true
>     constructor time (milliseconds) = 0
>     open time (milliseconds) = 0
>     next time (milliseconds) = 0
>     close time (milliseconds) = 0
>     restriction time (milliseconds) = 0
>     projection time (milliseconds) = 0
>     optimizer estimated row count:            1.00
>     optimizer estimated cost:          226.00
>
> Source result set:
>     Grouped Aggregate ResultSet:
>     Number of opens = 1
>     Rows input = 113
>     Has distinct aggregate = false
>     In sorted order = false
>     Sort information:
>         Number of rows input=113
>         Number of rows output=93
>         Sort type=internal
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 0
>         next time (milliseconds) = 0
>         close time (milliseconds) = 0
>         optimizer estimated row count:            1.00
>         optimizer estimated cost:          226.00
>
>     Source result set:
>         Project-Restrict ResultSet (4):
>         Number of opens = 1
>         Rows seen = 113
>         Rows filtered = 0
>         restriction = false
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 0
>             close time (milliseconds) = 0
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count:          118.00
>             optimizer estimated cost:          226.00
>
>         Source result set:
>             Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>             Number of opens = 1
>             Rows seen = 113
>             Columns accessed from heap = {0, 3}
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 0
>                 close time (milliseconds) = 0
>                 optimizer estimated row count:          118.00
>                 optimizer estimated cost:          226.00
>
>                 Index Scan ResultSet for TBLSEARCHDICTIONARY using index
> TBLSEARCHDICTIONARYWORD at read committed isolation level using share
> row locking chosen by the optimizer
>                 Number of opens = 1
>                 Rows seen = 113
>                 Rows filtered = 0
>                 Fetch Size = 1
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 0
>                     close time (milliseconds) = 0
>                     next time in milliseconds/row = 0
>
>                 scan information:
>                     Bit set of columns fetched=All
>                     Number of columns fetched=2
>                     Number of deleted rows visited=0
>                     Number of pages visited=4
>                     Number of rows qualified=113
>                     Number of rows visited=114
>                     Scan type=btree
>                     Tree height=3
>                     start position:
>     >= on first 1 column(s).
>     Ordered null semantics on the following columns:
> 0
>                     stop position:
>     > on first 1 column(s).
>     Ordered null semantics on the following columns:
> 0
>                     qualifiers:
> None
>                     optimizer estimated row count:          118.00
>                     optimizer estimated cost:          226.00
>
>
> ---QUERY PLAN FOR SECOND QUERY----
>
> Statement Name:
>     null
> Statement Text:
>     SELECT
>     ObjectId,
>     SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
>         Word = 'CONTACT' OR  Word = 'ADD'
> GROUP BY
>     ObjectId
>
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 0
> Generate Time: 15
> Compile Time: 15
> Execute Time: 4250
> Begin Compilation Timestamp : 2005-11-11 13:16:17.578
> End Compilation Timestamp : 2005-11-11 13:16:17.593
> Begin Execution Timestamp : 2005-11-11 13:16:17.593
> End Execution Timestamp : 2005-11-11 13:16:27.437
> Statement Execution Plan Text:
> Project-Restrict ResultSet (5):
> Number of opens = 1
> Rows seen = 100
> Rows filtered = 0
> restriction = false
> projection = true
>     constructor time (milliseconds) = 0
>     open time (milliseconds) = 4250
>     next time (milliseconds) = 0
>     close time (milliseconds) = 0
>     restriction time (milliseconds) = 0
>     projection time (milliseconds) = 0
>     optimizer estimated row count:            1.00
>     optimizer estimated cost:        82959.49
>
> Source result set:
>     Grouped Aggregate ResultSet:
>     Number of opens = 1
>     Rows input = 712
>     Has distinct aggregate = false
>     In sorted order = false
>     Sort information:
>         Number of rows input=712
>         Number of rows output=593
>         Sort type=internal
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 4250
>         next time (milliseconds) = 0
>         close time (milliseconds) = 0
>         optimizer estimated row count:            1.00
>         optimizer estimated cost:        82959.49
>
>     Source result set:
>         Project-Restrict ResultSet (4):
>         Number of opens = 1
>         Rows seen = 712
>         Rows filtered = 0
>         restriction = false
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 4219
>             close time (milliseconds) = 15
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count:        19200.45
>             optimizer estimated cost:        82959.49
>
>         Source result set:
>             Project-Restrict ResultSet (3):
>             Number of opens = 1
>             Rows seen = 40806
>             Rows filtered = 40094
>             restriction = true
>             projection = false
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 4219
>                 close time (milliseconds) = 15
>                 restriction time (milliseconds) = 124
>                 projection time (milliseconds) = 0
>                 optimizer estimated row count:        19200.45
>                 optimizer estimated cost:        82959.49
>
>             Source result set:
>                 Table Scan ResultSet for TBLSEARCHDICTIONARY at read
> committed
> isolation level using share row locking chosen by the optimizer
>                 Number of opens = 1
>                 Rows seen = 40806
>                 Rows filtered = 0
>                 Fetch Size = 1
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 4001
>                     close time (milliseconds) = 15
>                     next time in milliseconds/row = 0
>
>                 scan information:
>                     Bit set of columns fetched={0, 2, 3}
>                     Number of columns fetched=3
>                     Number of pages visited=2978
>                     Number of rows qualified=40806
>                     Number of rows visited=256001
>                     Scan type=heap
>                     start position:
> null                    stop position:
> null                    qualifiers:
> Column[0][0] Id: 2
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> Column[0][1] Id: 2
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
>
>                     optimizer estimated row count:        19200.45
>                     optimizer estimated cost:        82959.49
>
> ----------
>
> Thanks in advance for any help!
>
> Kind regards,
>
>
> Kevin Hore
>
>
>

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Kevin Hore <kh...@araxis.com>.

Hi Rick,

Thanks for your suggested re-write, but I'm really looking for a single
query. Unfortunately, we have quite a number of queries affected by this
and it would be significant work to re-written them all in this way. In
addition, some are considerably more complex than the example I gave,
and not so amenable to a relatively straightforward rewrite along the
lines you suggest.

Thanks again though, I appreciate the suggestion.

Kind regards,

Kevin Hore

Rick Hillegas wrote:
> Hi Kevin,
> 
> You might also try using a temporary table to split your scan up into a 
> series of optimzable queries. Regards-Rick
> 
> declare global temporary table session.accumulator
> (
>   ObjectId int NOT NULL,
>   WordLocation int NOT NULL
> )
> not logged;
> 
> insert into session.accumulator
>    SELECT ObjectId, WordLocation
>    FROM tblSearchDictionary
>    WHERE Word = 'CONTACT'
> ;
> insert into session.accumulator
>    SELECT ObjectId, WordLocation
>    FROM tblSearchDictionary
>    WHERE Word = 'ADD'
> ;
> 
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM session.accumulator
> GROUP BY ObjectId;
> 
> 
> Mamta Satoor wrote:
> 
>> Hi Kevin,
>>  
>> I haven't investigated Derby-47 to know how to fix the problem but I 
>> do have an optimizer overrides patch waiting for review on the derby 
>> developer list which will let user specify their own optimizer 
>> properties to help the optimizer pick a specific plan. The JIRA entry 
>> for optimizer overrides is Derby-573. The patch is for Derby 10.2.
>>  
>> These optimizer properties can be handy in a case like yours where 
>> until the optimizer is fixed, a user can tell the optimizer what plan 
>> to use. Just an FYI as I know it doesn't help with your particular 
>> case since there is no optimizer overrides support for 10.1.
>>  
>> Mamta
>>  
>> On 11/11/05, *Kevin Hore* <kh@araxis.com <ma...@araxis.com>> wrote:
>>
>>     The Derby query optimizer (this is with 10.1.1.0
>>     <http://10.1.1.0>) decides to perform an
>>     expensive table scan in certain circumstances, when it could make
>>     use of
>>     available indexes. The resulting poor performance is rendering Derby
>>     useless for our application.
>>
>>     I believe that this behaviour might be related to DERBY-47
>>     ( http://issues.apache.org/jira/browse/DERBY-47), which was opened in
>>     October 2004. However, this seems such a severe and fundamental
>>     problem
>>     that I'm surprised firstly that a significant portion of the Derby
>>     user
>>     base isn't affected by this and, secondly, that DERBY-47 hasn't been
>>     considered a critical defect to be fixed with the utmost urgency.
>>
>>     I've described the problem in detail below, and I'd appreciate any
>>     assistance. Specifically:
>>
>>     i) Does anyone have any plans to fix this problem?
>>
>>     ii) In the meantime, are there any work-arounds? I'd appreciate any
>>     suggestions that would decrease the execution time of my second query
>>     below (the one with with two search terms). Likewise, any general
>>     strategies for avoiding this problem with IN clauses would be
>>     appreciated.
>>
>>
>>     ----PROBLEM DESCRIPTION----
>>
>>     Consider the table:
>>
>>     CREATE TABLE tblSearchDictionary
>>     (
>>     ObjectId int NOT NULL,
>>     ObjectType int NOT NULL,
>>     Word VARCHAR(64) NOT NULL,
>>     WordLocation int NOT NULL,
>>     CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>>     );
>>
>>     This table has an index on each of the four columns, it also has the
>>     unique index across all four columns as defined above:
>>
>>     CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
>>     (ObjectId);
>>     CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>>     (ObjectType);
>>     CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>>     CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>>     (WordLocation);
>>
>>     The table contains about 260,000 rows.
>>
>>     The following query selects all rows that match instances of 
>> string in
>>     the Word column. It sums the WordLocation column having grouped by
>>     the
>>     ObjectId column.
>>
>>     SELECT ObjectId, SUM(WordLocation) AS Score
>>     FROM tblSearchDictionary
>>     WHERE Word = 'CONTACT'
>>     GROUP BY ObjectId;
>>
>>     On my machine this will usually complete in an acceptable time of
>>     around
>>     200ms.
>>
>>     Now consider the following query which adds a second search term
>>     on the
>>     same column.
>>
>>     SELECT ObjectId, SUM(WordLocation) AS Score
>>     FROM tblSearchDictionary
>>     WHERE Word = 'CONTACT' OR Word = 'ADD'
>>     GROUP BY ObjectId;
>>
>>     This second query usually takes around 10000ms on my machine. My
>>     understanding from the Derby optimizer docs and DERBY-47 is that
>>     this is
>>     because Derby is re-writing the query along the following lines, and
>>     then choosing to do a table scan:
>>
>>     SELECT ObjectId, SUM(WordLocation) AS Score
>>     FROM tblSearchDictionary
>>     WHERE
>>       Word IN ('CONTACT', 'ADD')
>>       AND Word >= 'ADD'
>>       AND Word <= 'CONTACT'
>>     GROUP BY ObjectId;
>>
>>     The plan for the first query indicates that the
>>     tblSearchDictionaryWord
>>     index is used to perform an index scan. However, the plan for the
>>     second
>>     query indicates that the majority of the additional time is taken
>>     performing a table scan over the entire table, instead of making
>>     use of
>>     the indexes available. Our application uses IN quite frequently,
>>     so this
>>     optimizer behaviour would seem to present a significant problem.
>>
>>     ---QUERY PLAN FOR FIRST QUERY----
>>
>>     Statement Name:
>>         null
>>     Statement Text:
>>         SELECT
>>         ObjectId,
>>         SUM(WordLocation) AS Score
>>     FROM tblSearchDictionary
>>     WHERE
>>             Word = 'CONTACT'
>>     GROUP BY
>>         ObjectId
>>
>>     Parse Time: 0
>>     Bind Time: 0
>>     Optimize Time: 16
>>     Generate Time: 0
>>     Compile Time: 16
>>     Execute Time: 0
>>     Begin Compilation Timestamp : 2005-11-11 12:28:52.765
>>     End Compilation Timestamp : 2005-11-11 12:28: 52.781
>>     Begin Execution Timestamp : 2005-11-11 13:08:15.406
>>     End Execution Timestamp : 2005-11-11 13:08:15.406
>>     Statement Execution Plan Text:
>>     Project-Restrict ResultSet (5):
>>     Number of opens = 1
>>     Rows seen = 93
>>     Rows filtered = 0
>>     restriction = false
>>     projection = true
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 0
>>         next time (milliseconds) = 0
>>         close time (milliseconds) = 0
>>         restriction time (milliseconds) = 0
>>         projection time (milliseconds) = 0
>>         optimizer estimated row count:            1.00
>>         optimizer estimated cost:          226.00
>>
>>     Source result set:
>>         Grouped Aggregate ResultSet:
>>         Number of opens = 1
>>         Rows input = 113
>>         Has distinct aggregate = false
>>         In sorted order = false
>>         Sort information:
>>             Number of rows input=113
>>             Number of rows output=93
>>             Sort type=internal
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 0
>>             next time (milliseconds) = 0
>>             close time (milliseconds) = 0
>>             optimizer estimated row count:             1.00
>>             optimizer estimated cost:          226.00
>>
>>         Source result set:
>>             Project-Restrict ResultSet (4):
>>             Number of opens = 1
>>             Rows seen = 113
>>             Rows filtered = 0
>>             restriction = false
>>             projection = true
>>                 constructor time (milliseconds) = 0
>>                 open time (milliseconds) = 0
>>                 next time (milliseconds) = 0
>>                 close time (milliseconds) = 0
>>                 restriction time (milliseconds) = 0
>>                 projection time (milliseconds) = 0
>>                 optimizer estimated row count:          118.00
>>                 optimizer estimated cost:          226.00
>>
>>             Source result set:
>>                 Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>>                 Number of opens = 1
>>                 Rows seen = 113
>>                 Columns accessed from heap = {0, 3}
>>                     constructor time (milliseconds) = 0
>>                     open time (milliseconds) = 0
>>                     next time (milliseconds) = 0
>>                     close time (milliseconds) = 0
>>                     optimizer estimated row count:           118.00
>>                     optimizer estimated cost:          226.00
>>
>>                     Index Scan ResultSet for TBLSEARCHDICTIONARY using
>>     index
>>     TBLSEARCHDICTIONARYWORD at read committed isolation level using share
>>     row locking chosen by the optimizer
>>                     Number of opens = 1
>>                     Rows seen = 113
>>                     Rows filtered = 0
>>                     Fetch Size = 1
>>                         constructor time (milliseconds) = 0
>>                         open time (milliseconds) = 0
>>                         next time (milliseconds) = 0
>>                         close time (milliseconds) = 0
>>                         next time in milliseconds/row = 0
>>
>>                     scan information:
>>                         Bit set of columns fetched=All
>>                         Number of columns fetched=2
>>                         Number of deleted rows visited=0
>>                         Number of pages visited=4
>>                         Number of rows qualified=113
>>                         Number of rows visited=114
>>                         Scan type=btree
>>                         Tree height=3
>>                         start position:
>>         >= on first 1 column(s).
>>         Ordered null semantics on the following columns:
>>     0
>>                         stop position:
>>         > on first 1 column(s).
>>         Ordered null semantics on the following columns:
>>     0
>>                         qualifiers:
>>     None
>>                         optimizer estimated row count:          118.00
>>                         optimizer estimated cost:          226.00
>>
>>
>>     ---QUERY PLAN FOR SECOND QUERY----
>>
>>     Statement Name:
>>         null
>>     Statement Text:
>>         SELECT
>>         ObjectId,
>>         SUM(WordLocation) AS Score
>>     FROM tblSearchDictionary
>>     WHERE
>>             Word = 'CONTACT' OR  Word = 'ADD'
>>     GROUP BY
>>         ObjectId
>>
>>     Parse Time: 0
>>     Bind Time: 0
>>     Optimize Time: 0
>>     Generate Time: 15
>>     Compile Time: 15
>>     Execute Time: 4250
>>     Begin Compilation Timestamp : 2005-11-11 13:16:17.578
>>     End Compilation Timestamp : 2005-11-11 13:16: 17.593
>>     Begin Execution Timestamp : 2005-11-11 13:16:17.593
>>     End Execution Timestamp : 2005-11-11 13:16:27.437
>>     Statement Execution Plan Text:
>>     Project-Restrict ResultSet (5):
>>     Number of opens = 1
>>     Rows seen = 100
>>     Rows filtered = 0
>>     restriction = false
>>     projection = true
>>         constructor time (milliseconds) = 0
>>         open time (milliseconds) = 4250
>>         next time (milliseconds) = 0
>>         close time (milliseconds) = 0
>>         restriction time (milliseconds) = 0
>>         projection time (milliseconds) = 0
>>         optimizer estimated row count:            1.00
>>         optimizer estimated cost:        82959.49
>>
>>     Source result set:
>>         Grouped Aggregate ResultSet:
>>         Number of opens = 1
>>         Rows input = 712
>>         Has distinct aggregate = false
>>         In sorted order = false
>>         Sort information:
>>             Number of rows input=712
>>             Number of rows output=593
>>             Sort type=internal
>>             constructor time (milliseconds) = 0
>>             open time (milliseconds) = 4250
>>             next time (milliseconds) = 0
>>             close time (milliseconds) = 0
>>             optimizer estimated row count:             1.00
>>             optimizer estimated cost:        82959.49
>>
>>         Source result set:
>>             Project-Restrict ResultSet (4):
>>             Number of opens = 1
>>             Rows seen = 712
>>             Rows filtered = 0
>>             restriction = false
>>             projection = true
>>                 constructor time (milliseconds) = 0
>>                 open time (milliseconds) = 0
>>                 next time (milliseconds) = 4219
>>                 close time (milliseconds) = 15
>>                 restriction time (milliseconds) = 0
>>                 projection time (milliseconds) = 0
>>                 optimizer estimated row count:        19200.45
>>                 optimizer estimated cost:        82959.49
>>
>>             Source result set:
>>                 Project-Restrict ResultSet (3):
>>                 Number of opens = 1
>>                 Rows seen = 40806
>>                 Rows filtered = 40094
>>                 restriction = true
>>                 projection = false
>>                     constructor time (milliseconds) = 0
>>                     open time (milliseconds) = 0
>>                     next time (milliseconds) = 4219
>>                     close time (milliseconds) = 15
>>                     restriction time (milliseconds) = 124
>>                     projection time (milliseconds) = 0
>>                     optimizer estimated row count:        19200.45
>>                     optimizer estimated cost:        82959.49
>>
>>                 Source result set:
>>                     Table Scan ResultSet for TBLSEARCHDICTIONARY at read
>>     committed
>>     isolation level using share row locking chosen by the optimizer
>>                     Number of opens = 1
>>                     Rows seen = 40806
>>                     Rows filtered = 0
>>                     Fetch Size = 1
>>                         constructor time (milliseconds) = 0
>>                         open time (milliseconds) = 0
>>                         next time (milliseconds) = 4001
>>                         close time (milliseconds) = 15
>>                         next time in milliseconds/row = 0
>>
>>                     scan information:
>>                         Bit set of columns fetched={0, 2, 3}
>>                         Number of columns fetched=3
>>                         Number of pages visited=2978
>>                         Number of rows qualified=40806
>>                         Number of rows visited=256001
>>                         Scan type=heap
>>                         start position:
>>     null                    stop position:
>>     null                    qualifiers:
>>     Column[0][0] Id: 2
>>     Operator: <
>>     Ordered nulls: false
>>     Unknown return value: true
>>     Negate comparison result: true
>>     Column[0][1] Id: 2
>>     Operator: <=
>>     Ordered nulls: false
>>     Unknown return value: false
>>     Negate comparison result: false
>>
>>                         optimizer estimated row count:        19200.45
>>                         optimizer estimated cost:         82959.49
>>
>>     ----------
>>
>>     Thanks in advance for any help!
>>
>>     Kind regards,
>>
>>
>>     Kevin Hore
>>
>>
>

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Rick Hillegas <Ri...@Sun.COM>.

Hi Kevin,

You might also try using a temporary table to split your scan up into a 
series of optimzable queries. Regards-Rick

declare global temporary table session.accumulator
(
   ObjectId int NOT NULL,
   WordLocation int NOT NULL
)
not logged;

insert into session.accumulator
    SELECT ObjectId, WordLocation
    FROM tblSearchDictionary
    WHERE Word = 'CONTACT'
;
insert into session.accumulator
    SELECT ObjectId, WordLocation
    FROM tblSearchDictionary
    WHERE Word = 'ADD'
;

SELECT ObjectId, SUM(WordLocation) AS Score
FROM session.accumulator
GROUP BY ObjectId;


Mamta Satoor wrote:

> Hi Kevin,
>  
> I haven't investigated Derby-47 to know how to fix the problem but I 
> do have an optimizer overrides patch waiting for review on the derby 
> developer list which will let user specify their own optimizer 
> properties to help the optimizer pick a specific plan. The JIRA entry 
> for optimizer overrides is Derby-573. The patch is for Derby 10.2.
>  
> These optimizer properties can be handy in a case like yours where 
> until the optimizer is fixed, a user can tell the optimizer what plan 
> to use. Just an FYI as I know it doesn't help with your particular 
> case since there is no optimizer overrides support for 10.1.
>  
> Mamta
>  
> On 11/11/05, *Kevin Hore* <kh@araxis.com <ma...@araxis.com>> wrote:
>
>     The Derby query optimizer (this is with 10.1.1.0
>     <http://10.1.1.0>) decides to perform an
>     expensive table scan in certain circumstances, when it could make
>     use of
>     available indexes. The resulting poor performance is rendering Derby
>     useless for our application.
>
>     I believe that this behaviour might be related to DERBY-47
>     ( http://issues.apache.org/jira/browse/DERBY-47), which was opened in
>     October 2004. However, this seems such a severe and fundamental
>     problem
>     that I'm surprised firstly that a significant portion of the Derby
>     user
>     base isn't affected by this and, secondly, that DERBY-47 hasn't been
>     considered a critical defect to be fixed with the utmost urgency.
>
>     I've described the problem in detail below, and I'd appreciate any
>     assistance. Specifically:
>
>     i) Does anyone have any plans to fix this problem?
>
>     ii) In the meantime, are there any work-arounds? I'd appreciate any
>     suggestions that would decrease the execution time of my second query
>     below (the one with with two search terms). Likewise, any general
>     strategies for avoiding this problem with IN clauses would be
>     appreciated.
>
>
>     ----PROBLEM DESCRIPTION----
>
>     Consider the table:
>
>     CREATE TABLE tblSearchDictionary
>     (
>     ObjectId int NOT NULL,
>     ObjectType int NOT NULL,
>     Word VARCHAR(64) NOT NULL,
>     WordLocation int NOT NULL,
>     CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>     );
>
>     This table has an index on each of the four columns, it also has the
>     unique index across all four columns as defined above:
>
>     CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
>     (ObjectId);
>     CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>     (ObjectType);
>     CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>     CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>     (WordLocation);
>
>     The table contains about 260,000 rows.
>
>     The following query selects all rows that match instances of string in
>     the Word column. It sums the WordLocation column having grouped by
>     the
>     ObjectId column.
>
>     SELECT ObjectId, SUM(WordLocation) AS Score
>     FROM tblSearchDictionary
>     WHERE Word = 'CONTACT'
>     GROUP BY ObjectId;
>
>     On my machine this will usually complete in an acceptable time of
>     around
>     200ms.
>
>     Now consider the following query which adds a second search term
>     on the
>     same column.
>
>     SELECT ObjectId, SUM(WordLocation) AS Score
>     FROM tblSearchDictionary
>     WHERE Word = 'CONTACT' OR Word = 'ADD'
>     GROUP BY ObjectId;
>
>     This second query usually takes around 10000ms on my machine. My
>     understanding from the Derby optimizer docs and DERBY-47 is that
>     this is
>     because Derby is re-writing the query along the following lines, and
>     then choosing to do a table scan:
>
>     SELECT ObjectId, SUM(WordLocation) AS Score
>     FROM tblSearchDictionary
>     WHERE
>       Word IN ('CONTACT', 'ADD')
>       AND Word >= 'ADD'
>       AND Word <= 'CONTACT'
>     GROUP BY ObjectId;
>
>     The plan for the first query indicates that the
>     tblSearchDictionaryWord
>     index is used to perform an index scan. However, the plan for the
>     second
>     query indicates that the majority of the additional time is taken
>     performing a table scan over the entire table, instead of making
>     use of
>     the indexes available. Our application uses IN quite frequently,
>     so this
>     optimizer behaviour would seem to present a significant problem.
>
>     ---QUERY PLAN FOR FIRST QUERY----
>
>     Statement Name:
>         null
>     Statement Text:
>         SELECT
>         ObjectId,
>         SUM(WordLocation) AS Score
>     FROM tblSearchDictionary
>     WHERE
>             Word = 'CONTACT'
>     GROUP BY
>         ObjectId
>
>     Parse Time: 0
>     Bind Time: 0
>     Optimize Time: 16
>     Generate Time: 0
>     Compile Time: 16
>     Execute Time: 0
>     Begin Compilation Timestamp : 2005-11-11 12:28:52.765
>     End Compilation Timestamp : 2005-11-11 12:28: 52.781
>     Begin Execution Timestamp : 2005-11-11 13:08:15.406
>     End Execution Timestamp : 2005-11-11 13:08:15.406
>     Statement Execution Plan Text:
>     Project-Restrict ResultSet (5):
>     Number of opens = 1
>     Rows seen = 93
>     Rows filtered = 0
>     restriction = false
>     projection = true
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 0
>         next time (milliseconds) = 0
>         close time (milliseconds) = 0
>         restriction time (milliseconds) = 0
>         projection time (milliseconds) = 0
>         optimizer estimated row count:            1.00
>         optimizer estimated cost:          226.00
>
>     Source result set:
>         Grouped Aggregate ResultSet:
>         Number of opens = 1
>         Rows input = 113
>         Has distinct aggregate = false
>         In sorted order = false
>         Sort information:
>             Number of rows input=113
>             Number of rows output=93
>             Sort type=internal
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 0
>             close time (milliseconds) = 0
>             optimizer estimated row count:             1.00
>             optimizer estimated cost:          226.00
>
>         Source result set:
>             Project-Restrict ResultSet (4):
>             Number of opens = 1
>             Rows seen = 113
>             Rows filtered = 0
>             restriction = false
>             projection = true
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 0
>                 close time (milliseconds) = 0
>                 restriction time (milliseconds) = 0
>                 projection time (milliseconds) = 0
>                 optimizer estimated row count:          118.00
>                 optimizer estimated cost:          226.00
>
>             Source result set:
>                 Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>                 Number of opens = 1
>                 Rows seen = 113
>                 Columns accessed from heap = {0, 3}
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 0
>                     close time (milliseconds) = 0
>                     optimizer estimated row count:           118.00
>                     optimizer estimated cost:          226.00
>
>                     Index Scan ResultSet for TBLSEARCHDICTIONARY using
>     index
>     TBLSEARCHDICTIONARYWORD at read committed isolation level using share
>     row locking chosen by the optimizer
>                     Number of opens = 1
>                     Rows seen = 113
>                     Rows filtered = 0
>                     Fetch Size = 1
>                         constructor time (milliseconds) = 0
>                         open time (milliseconds) = 0
>                         next time (milliseconds) = 0
>                         close time (milliseconds) = 0
>                         next time in milliseconds/row = 0
>
>                     scan information:
>                         Bit set of columns fetched=All
>                         Number of columns fetched=2
>                         Number of deleted rows visited=0
>                         Number of pages visited=4
>                         Number of rows qualified=113
>                         Number of rows visited=114
>                         Scan type=btree
>                         Tree height=3
>                         start position:
>         >= on first 1 column(s).
>         Ordered null semantics on the following columns:
>     0
>                         stop position:
>         > on first 1 column(s).
>         Ordered null semantics on the following columns:
>     0
>                         qualifiers:
>     None
>                         optimizer estimated row count:          118.00
>                         optimizer estimated cost:          226.00
>
>
>     ---QUERY PLAN FOR SECOND QUERY----
>
>     Statement Name:
>         null
>     Statement Text:
>         SELECT
>         ObjectId,
>         SUM(WordLocation) AS Score
>     FROM tblSearchDictionary
>     WHERE
>             Word = 'CONTACT' OR  Word = 'ADD'
>     GROUP BY
>         ObjectId
>
>     Parse Time: 0
>     Bind Time: 0
>     Optimize Time: 0
>     Generate Time: 15
>     Compile Time: 15
>     Execute Time: 4250
>     Begin Compilation Timestamp : 2005-11-11 13:16:17.578
>     End Compilation Timestamp : 2005-11-11 13:16: 17.593
>     Begin Execution Timestamp : 2005-11-11 13:16:17.593
>     End Execution Timestamp : 2005-11-11 13:16:27.437
>     Statement Execution Plan Text:
>     Project-Restrict ResultSet (5):
>     Number of opens = 1
>     Rows seen = 100
>     Rows filtered = 0
>     restriction = false
>     projection = true
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 4250
>         next time (milliseconds) = 0
>         close time (milliseconds) = 0
>         restriction time (milliseconds) = 0
>         projection time (milliseconds) = 0
>         optimizer estimated row count:            1.00
>         optimizer estimated cost:        82959.49
>
>     Source result set:
>         Grouped Aggregate ResultSet:
>         Number of opens = 1
>         Rows input = 712
>         Has distinct aggregate = false
>         In sorted order = false
>         Sort information:
>             Number of rows input=712
>             Number of rows output=593
>             Sort type=internal
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 4250
>             next time (milliseconds) = 0
>             close time (milliseconds) = 0
>             optimizer estimated row count:             1.00
>             optimizer estimated cost:        82959.49
>
>         Source result set:
>             Project-Restrict ResultSet (4):
>             Number of opens = 1
>             Rows seen = 712
>             Rows filtered = 0
>             restriction = false
>             projection = true
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 4219
>                 close time (milliseconds) = 15
>                 restriction time (milliseconds) = 0
>                 projection time (milliseconds) = 0
>                 optimizer estimated row count:        19200.45
>                 optimizer estimated cost:        82959.49
>
>             Source result set:
>                 Project-Restrict ResultSet (3):
>                 Number of opens = 1
>                 Rows seen = 40806
>                 Rows filtered = 40094
>                 restriction = true
>                 projection = false
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 4219
>                     close time (milliseconds) = 15
>                     restriction time (milliseconds) = 124
>                     projection time (milliseconds) = 0
>                     optimizer estimated row count:        19200.45
>                     optimizer estimated cost:        82959.49
>
>                 Source result set:
>                     Table Scan ResultSet for TBLSEARCHDICTIONARY at read
>     committed
>     isolation level using share row locking chosen by the optimizer
>                     Number of opens = 1
>                     Rows seen = 40806
>                     Rows filtered = 0
>                     Fetch Size = 1
>                         constructor time (milliseconds) = 0
>                         open time (milliseconds) = 0
>                         next time (milliseconds) = 4001
>                         close time (milliseconds) = 15
>                         next time in milliseconds/row = 0
>
>                     scan information:
>                         Bit set of columns fetched={0, 2, 3}
>                         Number of columns fetched=3
>                         Number of pages visited=2978
>                         Number of rows qualified=40806
>                         Number of rows visited=256001
>                         Scan type=heap
>                         start position:
>     null                    stop position:
>     null                    qualifiers:
>     Column[0][0] Id: 2
>     Operator: <
>     Ordered nulls: false
>     Unknown return value: true
>     Negate comparison result: true
>     Column[0][1] Id: 2
>     Operator: <=
>     Ordered nulls: false
>     Unknown return value: false
>     Negate comparison result: false
>
>                         optimizer estimated row count:        19200.45
>                         optimizer estimated cost:         82959.49
>
>     ----------
>
>     Thanks in advance for any help!
>
>     Kind regards,
>
>
>     Kevin Hore
>
>

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Kevin Hore <kh...@araxis.com>.

Hi Mamta,

Thank you for your suggestion, but this is for a production system and
so, as you suspected, we don't really want to use an alpha version of Derby.

Kind regards,

Kevin Hore

Mamta Satoor wrote:
> Hi Kevin,
>  I haven't investigated Derby-47 to know how to fix the problem but I do
> have an optimizer overrides patch waiting for review on the derby developer
> list which will let user specify their own optimizer properties to help the
> optimizer pick a specific plan. The JIRA entry for optimizer overrides is
> Derby-573. The patch is for Derby 10.2.
>  These optimizer properties can be handy in a case like yours where until
> the optimizer is fixed, a user can tell the optimizer what plan to use. Just
> an FYI as I know it doesn't help with your particular case since there is no
> optimizer overrides support for 10.1.
>  Mamta
>  On 11/11/05, Kevin Hore <kh...@araxis.com> wrote:
> 
>>The Derby query optimizer (this is with 10.1.1.0 <http://10.1.1.0>)
>>decides to perform an
>>expensive table scan in certain circumstances, when it could make use of
>>available indexes. The resulting poor performance is rendering Derby
>>useless for our application.
>>
>>I believe that this behaviour might be related to DERBY-47
>>(http://issues.apache.org/jira/browse/DERBY-47), which was opened in
>>October 2004. However, this seems such a severe and fundamental problem
>>that I'm surprised firstly that a significant portion of the Derby user
>>base isn't affected by this and, secondly, that DERBY-47 hasn't been
>>considered a critical defect to be fixed with the utmost urgency.
>>
>>I've described the problem in detail below, and I'd appreciate any
>>assistance. Specifically:
>>
>>i) Does anyone have any plans to fix this problem?
>>
>>ii) In the meantime, are there any work-arounds? I'd appreciate any
>>suggestions that would decrease the execution time of my second query
>>below (the one with with two search terms). Likewise, any general
>>strategies for avoiding this problem with IN clauses would be appreciated.
>>
>>
>>----PROBLEM DESCRIPTION----
>>
>>Consider the table:
>>
>>CREATE TABLE tblSearchDictionary
>>(
>>ObjectId int NOT NULL,
>>ObjectType int NOT NULL,
>>Word VARCHAR(64) NOT NULL,
>>WordLocation int NOT NULL,
>>CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>>);
>>
>>This table has an index on each of the four columns, it also has the
>>unique index across all four columns as defined above:
>>
>>CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
>>(ObjectId);
>>CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>>(ObjectType);
>>CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>>CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>>(WordLocation);
>>
>>The table contains about 260,000 rows.
>>
>>The following query selects all rows that match instances of string in
>>the Word column. It sums the WordLocation column having grouped by the
>>ObjectId column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT'
>>GROUP BY ObjectId;
>>
>>On my machine this will usually complete in an acceptable time of around
>>200ms.
>>
>>Now consider the following query which adds a second search term on the
>>same column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT' OR Word = 'ADD'
>>GROUP BY ObjectId;
>>
>>This second query usually takes around 10000ms on my machine. My
>>understanding from the Derby optimizer docs and DERBY-47 is that this is
>>because Derby is re-writing the query along the following lines, and
>>then choosing to do a table scan:
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>Word IN ('CONTACT', 'ADD')
>>AND Word >= 'ADD'
>>AND Word <= 'CONTACT'
>>GROUP BY ObjectId;
>>
>>The plan for the first query indicates that the tblSearchDictionaryWord
>>index is used to perform an index scan. However, the plan for the second
>>query indicates that the majority of the additional time is taken
>>performing a table scan over the entire table, instead of making use of
>>the indexes available. Our application uses IN quite frequently, so this
>>optimizer behaviour would seem to present a significant problem.
>>
>>---QUERY PLAN FOR FIRST QUERY----
>>
>>Statement Name:
>>null
>>Statement Text:
>>SELECT
>>ObjectId,
>>SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>Word = 'CONTACT'
>>GROUP BY
>>ObjectId
>>
>>Parse Time: 0
>>Bind Time: 0
>>Optimize Time: 16
>>Generate Time: 0
>>Compile Time: 16
>>Execute Time: 0
>>Begin Compilation Timestamp : 2005-11-11 12:28:52.765
>>End Compilation Timestamp : 2005-11-11 12:28:52.781
>>Begin Execution Timestamp : 2005-11-11 13:08:15.406
>>End Execution Timestamp : 2005-11-11 13:08:15.406
>>Statement Execution Plan Text:
>>Project-Restrict ResultSet (5):
>>Number of opens = 1
>>Rows seen = 93
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>restriction time (milliseconds) = 0
>>projection time (milliseconds) = 0
>>optimizer estimated row count: 1.00
>>optimizer estimated cost: 226.00
>>
>>Source result set:
>>Grouped Aggregate ResultSet:
>>Number of opens = 1
>>Rows input = 113
>>Has distinct aggregate = false
>>In sorted order = false
>>Sort information:
>>Number of rows input=113
>>Number of rows output=93
>>Sort type=internal
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>optimizer estimated row count: 1.00
>>optimizer estimated cost: 226.00
>>
>>Source result set:
>>Project-Restrict ResultSet (4):
>>Number of opens = 1
>>Rows seen = 113
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>restriction time (milliseconds) = 0
>>projection time (milliseconds) = 0
>>optimizer estimated row count: 118.00
>>optimizer estimated cost: 226.00
>>
>>Source result set:
>>Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
>>Number of opens = 1
>>Rows seen = 113
>>Columns accessed from heap = {0, 3}
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>optimizer estimated row count: 118.00
>>optimizer estimated cost: 226.00
>>
>>Index Scan ResultSet for TBLSEARCHDICTIONARY using index
>>TBLSEARCHDICTIONARYWORD at read committed isolation level using share
>>row locking chosen by the optimizer
>>Number of opens = 1
>>Rows seen = 113
>>Rows filtered = 0
>>Fetch Size = 1
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>next time in milliseconds/row = 0
>>
>>scan information:
>>Bit set of columns fetched=All
>>Number of columns fetched=2
>>Number of deleted rows visited=0
>>Number of pages visited=4
>>Number of rows qualified=113
>>Number of rows visited=114
>>Scan type=btree
>>Tree height=3
>>start position:
>>
>>>= on first 1 column(s).
>>
>>Ordered null semantics on the following columns:
>>0
>>stop position:
>>
>>>on first 1 column(s).
>>
>>Ordered null semantics on the following columns:
>>0
>>qualifiers:
>>None
>>optimizer estimated row count: 118.00
>>optimizer estimated cost: 226.00
>>
>>
>>---QUERY PLAN FOR SECOND QUERY----
>>
>>Statement Name:
>>null
>>Statement Text:
>>SELECT
>>ObjectId,
>>SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE
>>Word = 'CONTACT' OR Word = 'ADD'
>>GROUP BY
>>ObjectId
>>
>>Parse Time: 0
>>Bind Time: 0
>>Optimize Time: 0
>>Generate Time: 15
>>Compile Time: 15
>>Execute Time: 4250
>>Begin Compilation Timestamp : 2005-11-11 13:16:17.578
>>End Compilation Timestamp : 2005-11-11 13:16:17.593
>>Begin Execution Timestamp : 2005-11-11 13:16:17.593
>>End Execution Timestamp : 2005-11-11 13:16:27.437
>>Statement Execution Plan Text:
>>Project-Restrict ResultSet (5):
>>Number of opens = 1
>>Rows seen = 100
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 4250
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>restriction time (milliseconds) = 0
>>projection time (milliseconds) = 0
>>optimizer estimated row count: 1.00
>>optimizer estimated cost: 82959.49
>>
>>Source result set:
>>Grouped Aggregate ResultSet:
>>Number of opens = 1
>>Rows input = 712
>>Has distinct aggregate = false
>>In sorted order = false
>>Sort information:
>>Number of rows input=712
>>Number of rows output=593
>>Sort type=internal
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 4250
>>next time (milliseconds) = 0
>>close time (milliseconds) = 0
>>optimizer estimated row count: 1.00
>>optimizer estimated cost: 82959.49
>>
>>Source result set:
>>Project-Restrict ResultSet (4):
>>Number of opens = 1
>>Rows seen = 712
>>Rows filtered = 0
>>restriction = false
>>projection = true
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 4219
>>close time (milliseconds) = 15
>>restriction time (milliseconds) = 0
>>projection time (milliseconds) = 0
>>optimizer estimated row count: 19200.45
>>optimizer estimated cost: 82959.49
>>
>>Source result set:
>>Project-Restrict ResultSet (3):
>>Number of opens = 1
>>Rows seen = 40806
>>Rows filtered = 40094
>>restriction = true
>>projection = false
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 4219
>>close time (milliseconds) = 15
>>restriction time (milliseconds) = 124
>>projection time (milliseconds) = 0
>>optimizer estimated row count: 19200.45
>>optimizer estimated cost: 82959.49
>>
>>Source result set:
>>Table Scan ResultSet for TBLSEARCHDICTIONARY at read
>>committed
>>isolation level using share row locking chosen by the optimizer
>>Number of opens = 1
>>Rows seen = 40806
>>Rows filtered = 0
>>Fetch Size = 1
>>constructor time (milliseconds) = 0
>>open time (milliseconds) = 0
>>next time (milliseconds) = 4001
>>close time (milliseconds) = 15
>>next time in milliseconds/row = 0
>>
>>scan information:
>>Bit set of columns fetched={0, 2, 3}
>>Number of columns fetched=3
>>Number of pages visited=2978
>>Number of rows qualified=40806
>>Number of rows visited=256001
>>Scan type=heap
>>start position:
>>null stop position:
>>null qualifiers:
>>Column[0][0] Id: 2
>>Operator: <
>>Ordered nulls: false
>>Unknown return value: true
>>Negate comparison result: true
>>Column[0][1] Id: 2
>>Operator: <=
>>Ordered nulls: false
>>Unknown return value: false
>>Negate comparison result: false
>>
>>optimizer estimated row count: 19200.45
>>optimizer estimated cost: 82959.49
>>
>>----------
>>
>>Thanks in advance for any help!
>>
>>Kind regards,
>>
>>
>>Kevin Hore
>>
>>
> 
>

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Mamta Satoor <ms...@gmail.com>.

Hi Kevin,
 I haven't investigated Derby-47 to know how to fix the problem but I do
have an optimizer overrides patch waiting for review on the derby developer
list which will let user specify their own optimizer properties to help the
optimizer pick a specific plan. The JIRA entry for optimizer overrides is
Derby-573. The patch is for Derby 10.2.
 These optimizer properties can be handy in a case like yours where until
the optimizer is fixed, a user can tell the optimizer what plan to use. Just
an FYI as I know it doesn't help with your particular case since there is no
optimizer overrides support for 10.1.
 Mamta
 On 11/11/05, Kevin Hore <kh...@araxis.com> wrote:
>
> The Derby query optimizer (this is with 10.1.1.0 <http://10.1.1.0>)
> decides to perform an
> expensive table scan in certain circumstances, when it could make use of
> available indexes. The resulting poor performance is rendering Derby
> useless for our application.
>
> I believe that this behaviour might be related to DERBY-47
> (http://issues.apache.org/jira/browse/DERBY-47), which was opened in
> October 2004. However, this seems such a severe and fundamental problem
> that I'm surprised firstly that a significant portion of the Derby user
> base isn't affected by this and, secondly, that DERBY-47 hasn't been
> considered a critical defect to be fixed with the utmost urgency.
>
> I've described the problem in detail below, and I'd appreciate any
> assistance. Specifically:
>
> i) Does anyone have any plans to fix this problem?
>
> ii) In the meantime, are there any work-arounds? I'd appreciate any
> suggestions that would decrease the execution time of my second query
> below (the one with with two search terms). Likewise, any general
> strategies for avoiding this problem with IN clauses would be appreciated.
>
>
> ----PROBLEM DESCRIPTION----
>
> Consider the table:
>
> CREATE TABLE tblSearchDictionary
> (
> ObjectId int NOT NULL,
> ObjectType int NOT NULL,
> Word VARCHAR(64) NOT NULL,
> WordLocation int NOT NULL,
> CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
> );
>
> This table has an index on each of the four columns, it also has the
> unique index across all four columns as defined above:
>
> CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary
> (ObjectId);
> CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
> (ObjectType);
> CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
> CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
> (WordLocation);
>
> The table contains about 260,000 rows.
>
> The following query selects all rows that match instances of string in
> the Word column. It sums the WordLocation column having grouped by the
> ObjectId column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT'
> GROUP BY ObjectId;
>
> On my machine this will usually complete in an acceptable time of around
> 200ms.
>
> Now consider the following query which adds a second search term on the
> same column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT' OR Word = 'ADD'
> GROUP BY ObjectId;
>
> This second query usually takes around 10000ms on my machine. My
> understanding from the Derby optimizer docs and DERBY-47 is that this is
> because Derby is re-writing the query along the following lines, and
> then choosing to do a table scan:
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
> Word IN ('CONTACT', 'ADD')
> AND Word >= 'ADD'
> AND Word <= 'CONTACT'
> GROUP BY ObjectId;
>
> The plan for the first query indicates that the tblSearchDictionaryWord
> index is used to perform an index scan. However, the plan for the second
> query indicates that the majority of the additional time is taken
> performing a table scan over the entire table, instead of making use of
> the indexes available. Our application uses IN quite frequently, so this
> optimizer behaviour would seem to present a significant problem.
>
> ---QUERY PLAN FOR FIRST QUERY----
>
> Statement Name:
> null
> Statement Text:
> SELECT
> ObjectId,
> SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
> Word = 'CONTACT'
> GROUP BY
> ObjectId
>
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 16
> Generate Time: 0
> Compile Time: 16
> Execute Time: 0
> Begin Compilation Timestamp : 2005-11-11 12:28:52.765
> End Compilation Timestamp : 2005-11-11 12:28:52.781
> Begin Execution Timestamp : 2005-11-11 13:08:15.406
> End Execution Timestamp : 2005-11-11 13:08:15.406
> Statement Execution Plan Text:
> Project-Restrict ResultSet (5):
> Number of opens = 1
> Rows seen = 93
> Rows filtered = 0
> restriction = false
> projection = true
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> restriction time (milliseconds) = 0
> projection time (milliseconds) = 0
> optimizer estimated row count: 1.00
> optimizer estimated cost: 226.00
>
> Source result set:
> Grouped Aggregate ResultSet:
> Number of opens = 1
> Rows input = 113
> Has distinct aggregate = false
> In sorted order = false
> Sort information:
> Number of rows input=113
> Number of rows output=93
> Sort type=internal
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> optimizer estimated row count: 1.00
> optimizer estimated cost: 226.00
>
> Source result set:
> Project-Restrict ResultSet (4):
> Number of opens = 1
> Rows seen = 113
> Rows filtered = 0
> restriction = false
> projection = true
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> restriction time (milliseconds) = 0
> projection time (milliseconds) = 0
> optimizer estimated row count: 118.00
> optimizer estimated cost: 226.00
>
> Source result set:
> Index Row to Base Row ResultSet for TBLSEARCHDICTIONARY:
> Number of opens = 1
> Rows seen = 113
> Columns accessed from heap = {0, 3}
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> optimizer estimated row count: 118.00
> optimizer estimated cost: 226.00
>
> Index Scan ResultSet for TBLSEARCHDICTIONARY using index
> TBLSEARCHDICTIONARYWORD at read committed isolation level using share
> row locking chosen by the optimizer
> Number of opens = 1
> Rows seen = 113
> Rows filtered = 0
> Fetch Size = 1
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> next time in milliseconds/row = 0
>
> scan information:
> Bit set of columns fetched=All
> Number of columns fetched=2
> Number of deleted rows visited=0
> Number of pages visited=4
> Number of rows qualified=113
> Number of rows visited=114
> Scan type=btree
> Tree height=3
> start position:
> >= on first 1 column(s).
> Ordered null semantics on the following columns:
> 0
> stop position:
> > on first 1 column(s).
> Ordered null semantics on the following columns:
> 0
> qualifiers:
> None
> optimizer estimated row count: 118.00
> optimizer estimated cost: 226.00
>
>
> ---QUERY PLAN FOR SECOND QUERY----
>
> Statement Name:
> null
> Statement Text:
> SELECT
> ObjectId,
> SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE
> Word = 'CONTACT' OR Word = 'ADD'
> GROUP BY
> ObjectId
>
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 0
> Generate Time: 15
> Compile Time: 15
> Execute Time: 4250
> Begin Compilation Timestamp : 2005-11-11 13:16:17.578
> End Compilation Timestamp : 2005-11-11 13:16:17.593
> Begin Execution Timestamp : 2005-11-11 13:16:17.593
> End Execution Timestamp : 2005-11-11 13:16:27.437
> Statement Execution Plan Text:
> Project-Restrict ResultSet (5):
> Number of opens = 1
> Rows seen = 100
> Rows filtered = 0
> restriction = false
> projection = true
> constructor time (milliseconds) = 0
> open time (milliseconds) = 4250
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> restriction time (milliseconds) = 0
> projection time (milliseconds) = 0
> optimizer estimated row count: 1.00
> optimizer estimated cost: 82959.49
>
> Source result set:
> Grouped Aggregate ResultSet:
> Number of opens = 1
> Rows input = 712
> Has distinct aggregate = false
> In sorted order = false
> Sort information:
> Number of rows input=712
> Number of rows output=593
> Sort type=internal
> constructor time (milliseconds) = 0
> open time (milliseconds) = 4250
> next time (milliseconds) = 0
> close time (milliseconds) = 0
> optimizer estimated row count: 1.00
> optimizer estimated cost: 82959.49
>
> Source result set:
> Project-Restrict ResultSet (4):
> Number of opens = 1
> Rows seen = 712
> Rows filtered = 0
> restriction = false
> projection = true
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 4219
> close time (milliseconds) = 15
> restriction time (milliseconds) = 0
> projection time (milliseconds) = 0
> optimizer estimated row count: 19200.45
> optimizer estimated cost: 82959.49
>
> Source result set:
> Project-Restrict ResultSet (3):
> Number of opens = 1
> Rows seen = 40806
> Rows filtered = 40094
> restriction = true
> projection = false
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 4219
> close time (milliseconds) = 15
> restriction time (milliseconds) = 124
> projection time (milliseconds) = 0
> optimizer estimated row count: 19200.45
> optimizer estimated cost: 82959.49
>
> Source result set:
> Table Scan ResultSet for TBLSEARCHDICTIONARY at read
> committed
> isolation level using share row locking chosen by the optimizer
> Number of opens = 1
> Rows seen = 40806
> Rows filtered = 0
> Fetch Size = 1
> constructor time (milliseconds) = 0
> open time (milliseconds) = 0
> next time (milliseconds) = 4001
> close time (milliseconds) = 15
> next time in milliseconds/row = 0
>
> scan information:
> Bit set of columns fetched={0, 2, 3}
> Number of columns fetched=3
> Number of pages visited=2978
> Number of rows qualified=40806
> Number of rows visited=256001
> Scan type=heap
> start position:
> null stop position:
> null qualifiers:
> Column[0][0] Id: 2
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> Column[0][1] Id: 2
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
>
> optimizer estimated row count: 19200.45
> optimizer estimated cost: 82959.49
>
> ----------
>
> Thanks in advance for any help!
>
> Kind regards,
>
>
> Kevin Hore
>
>

SV: SV: Is Hibernate inappropriate for embedded databases?

Posted by Henrik Johansson <he...@vocab.se>.

I am not sure how the Store layer cache is used in standard JDBC
usage but it seems like using Hibernate would not consume more memory
than executing an SQL statement and from the resultset create objects
that you would use in your application.

On the contrary, having caching in the DAO layer will probably make it
more efficient since you ask the cache rather than the DB for data.

I am not at all an expert in this field but the simplicity that comes with
using Hibernate should not be discarded without careful consideration.

Maybe we have some gurus out there that have more experience with the
Derby/Hibernate combination...?

/ Henrik

-----Ursprungligt meddelande-----
Från: Michael McCutcheon [mailto:michaelm001@frontiernet.net]
Skickat: den 18 november 2005 15:47
Till: Derby Discussion
Ämne: Re: SV: Is Hibernate inappropriate for embedded databases?

But what about the 'double caching' problems for an embedded scenario?

Henrik Johansson wrote:

>I tried JBoss Embedded which uses Hibernate for a small test
>and it worked like a charm with Derby.
>
>Using Hibernate alone (or with JBoss Embedded) seems to be a
>very good idea in my opinion for all sorts of reasons.
>
>Too bad its seemingly not production ready...
>
>Hibernate alone would be clever since it is very well tested and
>there is no idea to reinvent the wheel.
>
>/ Henrik
>
>
>-----Ursprungligt meddelande-----
>Från: Michael McCutcheon [mailto:michaelm001@frontiernet.net]
>Skickat: den 18 november 2005 05:33
>Till: Derby Discussion
>Ämne: Is Hibernate inappropriate for embedded databases?
>
>
>I'm still struggling to figure out if I should go with strait JDBC,
>stored procedures or something like Hibernate for my data access/update
>in my web app running on Tomcat.
>
>I've been looking at Hibernate, and it seems that it's claim to fame is
>all of the fancy caching it can do to speed performance.
>
>However, doesn't derby itself cache data in memory (once the data is
>accessed)?
>
>If the caching functionality in Hibernate is used for an embedded
>database, doesn't that just mean that the data will be cached twice
>(once in derby, once in hibernate), resulting in memory bloat and
>potentially worse performance?
>
>Any thoughts on this issue would be appreciated.
>
>Mike
>
>
>
>
>
>

Re: SV: Is Hibernate inappropriate for embedded databases?

Posted by Michael McCutcheon <mi...@frontiernet.net>.

But what about the 'double caching' problems for an embedded scenario?



Henrik Johansson wrote:

>I tried JBoss Embedded which uses Hibernate for a small test
>and it worked like a charm with Derby.
>
>Using Hibernate alone (or with JBoss Embedded) seems to be a
>very good idea in my opinion for all sorts of reasons.
>
>Too bad its seemingly not production ready...
>
>Hibernate alone would be clever since it is very well tested and
>there is no idea to reinvent the wheel.
>
>/ Henrik
>
>
>-----Ursprungligt meddelande-----
>Från: Michael McCutcheon [mailto:michaelm001@frontiernet.net]
>Skickat: den 18 november 2005 05:33
>Till: Derby Discussion
>Ämne: Is Hibernate inappropriate for embedded databases?
>
>
>I'm still struggling to figure out if I should go with strait JDBC,
>stored procedures or something like Hibernate for my data access/update
>in my web app running on Tomcat.
>
>I've been looking at Hibernate, and it seems that it's claim to fame is
>all of the fancy caching it can do to speed performance.
>
>However, doesn't derby itself cache data in memory (once the data is
>accessed)?
>
>If the caching functionality in Hibernate is used for an embedded
>database, doesn't that just mean that the data will be cached twice
>(once in derby, once in hibernate), resulting in memory bloat and
>potentially worse performance?
>
>Any thoughts on this issue would be appreciated.
>
>Mike
>
>
>
>
>  
>

SV: Is Hibernate inappropriate for embedded databases?

Posted by Henrik Johansson <he...@vocab.se>.

I tried JBoss Embedded which uses Hibernate for a small test
and it worked like a charm with Derby.

Using Hibernate alone (or with JBoss Embedded) seems to be a
very good idea in my opinion for all sorts of reasons.

Too bad its seemingly not production ready...

Hibernate alone would be clever since it is very well tested and
there is no idea to reinvent the wheel.

/ Henrik


-----Ursprungligt meddelande-----
Från: Michael McCutcheon [mailto:michaelm001@frontiernet.net]
Skickat: den 18 november 2005 05:33
Till: Derby Discussion
Ämne: Is Hibernate inappropriate for embedded databases?


I'm still struggling to figure out if I should go with strait JDBC,
stored procedures or something like Hibernate for my data access/update
in my web app running on Tomcat.

I've been looking at Hibernate, and it seems that it's claim to fame is
all of the fancy caching it can do to speed performance.

However, doesn't derby itself cache data in memory (once the data is
accessed)?

If the caching functionality in Hibernate is used for an embedded
database, doesn't that just mean that the data will be cached twice
(once in derby, once in hibernate), resulting in memory bloat and
potentially worse performance?

Any thoughts on this issue would be appreciated.

Mike

Is Hibernate inappropriate for embedded databases?

Posted by Michael McCutcheon <mi...@frontiernet.net>.

I'm still struggling to figure out if I should go with strait JDBC, 
stored procedures or something like Hibernate for my data access/update 
in my web app running on Tomcat.

I've been looking at Hibernate, and it seems that it's claim to fame is 
all of the fancy caching it can do to speed performance. 

However, doesn't derby itself cache data in memory (once the data is 
accessed)?

If the caching functionality in Hibernate is used for an embedded 
database, doesn't that just mean that the data will be cached twice 
(once in derby, once in hibernate), resulting in memory bloat and 
potentially worse performance?

Any thoughts on this issue would be appreciated.

Mike

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Kevin Hore <kh...@araxis.com>.

Hi Michael,

Thank you for yours suggestions. You've obviously very kindly given it
quite a bit a thought.

I'd already tried a bunch of the things that you have suggested.
Defining a composite key and removing the existing indexes did nothing
to improve performance. Re-writing using IN clause produced no
improvement (indeed, I'm pretty sure that this is what Derby is doing
anyway).

I ought to point out here that my comments on how the optimizer would
re-write my query is based on the Derby documentation found at
http://db.apache.org/derby/docs/10.1/tuning/rtuntransform582.html

Unfortunately (for me) this is merely one simple example of a whole
class of problems with different queries involving different tables in
the system I'm working on, and the problem is fundamentally that
multiple IN/WHERE terms aren't being handled sensibly, as summarized by
Satheesh Bandaram in his post. Derby is able to choose a sensible
strategy with just one term in the WHERE/IN clause -- it's just the case
of multiple terms where it chooses inappropriate strategies.

I will add my thoughts to DERBY-47 regarding this.

Thanks again,

Kevin Hore


Michael Segel wrote:
> On Friday 11 November 2005 08:59, Kevin Hore wrote:
> Hi,
> 
> Before complaining about the query optimizer, lets look at your design first.
> Yeah, I know that Derby, like every other database is not perfect and could 
> always use improvements. But the reality is that the first place to look for 
> improvements is in your own code.
> 
> Since you've got a small table and only 260K rows, lets try these simple 
> experiments to see if it will help.
> 
> 
>>----PROBLEM DESCRIPTION----
>>
>>Consider the table:
>>
>>CREATE TABLE tblSearchDictionary
>>(
>>ObjectId int NOT NULL,
>>ObjectType int NOT NULL,
>>Word VARCHAR(64) NOT NULL,
>>WordLocation int NOT NULL,
>>CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
>>);
>>
> 
> Just a first suggestion, and its really a nit, could you alter the table to 
> have Word as the last column? 
> 
> 	CREATE TABLE tblSearchDictionary
> 	(
> 		 objectId		INT NOT NULL,
> 		 objectType 	INT NOT NULL,
> 		 wordLocation	INT NOT NULL,
> 		 word 		VARCHAR(64) NOT NULL,
> 		CONSTRAINT CONSd0e222 UNIQUE (objectId,objectType,word,wordLocation)
> 	);
> 
>>This table has an index on each of the four columns, it also has the
>>unique index across all four columns as defined above:
>>
> 
> Ok first suggestion. Drop all of your Indexes.
> Based on your examples, these indexes are part of the problem.
> 
> For example an index on objectType doesn't make any sense.
> First, in your example, you don't even use objectType in any of your queries.
> Second. How many objectTypes are there? 10,100, 1000? The point is that 
> objectType is not an unique enough identifier to justify being its own index.
> The same could be said for wordLocation. So why have an index on this column?
> 
> 
> 
>>CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary (ObjectId);
>>CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
>>(ObjectType);
>>CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
>>CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
>>(WordLocation);
>>
> 
> 
> Ok, since you believe that the Optimizer is having problems selecting the 
> correct Index to use, lets create your new indexes one by one to see how they 
> can improve performance.
> 
> Lets review your query statements:
> 
> 
>>The following query selects all rows that match instances of string in
>>the Word column. It sums the WordLocation column having grouped by the
>>ObjectId column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT'
>>GROUP BY ObjectId;
>>
>>On my machine this will usually complete in an acceptable time of around
>>200ms.
>>
> 
> Ok, this is pretty straight forward enough.
> 
> A good index would be a composite index on objectID and word.
> 
> 
>>Now consider the following query which adds a second search term on the
>>same column.
>>
>>SELECT ObjectId, SUM(WordLocation) AS Score
>>FROM tblSearchDictionary
>>WHERE Word = 'CONTACT' OR Word = 'ADD'
>>GROUP BY ObjectId;
>>
> 
> 
> Ok, again if you create a composite index on objectID and Word, you will get 
> better performance.
> 
> As you point out below, the query may get rewritten.  If you wanted to, you 
> could just rewrite your query to say "WHERE word IN ('CONTACT', 'ADD') but 
> I'd save that for an option to test as a further tweak.
> 
> 
> So, as a test, do the following:
> 
> 1) DROP ALL CURRENT INDEXes.
> 2) CREATE a composite index on objectID and Word.
> 
> Run your query tests to see what happens to performance.
> 
> 3) To test your theory about the optimizer, rewrite your second query to use 
> the IN (xxx,xxx) option rather than OR.
> 
> In addition:
> 
> 1) Recreate your table with your VARCHAR column as the last column.
> 2) Alter your table and replace your VARCHAR column with a fixed length 
> CHAR(64) column.
> 
> If you can, please post your results.
> Do each change and see what happens to your query times.
> 
> HTH
> 
> -Gumby

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Michael Segel <ms...@segel.com>.

On Saturday 12 November 2005 16:04, Michael Segel wrote:
> On Saturday 12 November 2005 10:28, Michael Segel wrote:
> > 	CREATE TABLE tblSearchDictionary
> > 	(
> > 		 objectId		INT NOT NULL,
> > 		 objectType 	INT NOT NULL,
> > 		 wordLocation	INT NOT NULL,
> > 		 word 		VARCHAR(64) NOT NULL,
> > 		CONSTRAINT CONSd0e222 UNIQUE (objectId,objectType,word,wordLocation)
> > 	);
> >
> > > This table has an index on each of the four columns, it also has the
> > > unique index across all four columns as defined above:
>
> As a follow up...
> Change the constraint order to UNIQUE(objectId, word, wordLocation,
> objectType);
>
> You probably can drop the index I asked you to create if you make this mod.
> Then your query should use the unique constraint's "backing index"...
>
> HTH.
>
> -G

If you want to make this a little bit more efficient, you could try the 
following:

1) Create two tables. 

CREATE TABLE wordKey
(
	wordID	INT  NOT NULL GENERATED ALWAYS AS IDENTITY,
	word		VARCHAR(128) NOT NULL,
	CONSTRAINT	wordKeyCONS01 UNIQUE (wordID, word)
);
// Note this should create the backing INDEX so you don't need to create an 
INDEX

	CREATE TABLE tblSearchDictionary
  	(
  		 objectId		INT NOT NULL,
 		 objectType 	INT NOT NULL,
 		 wordLocation	INT NOT NULL,
  		 wordID 		INT NOT NULL,
		CONSTRAINT CONSd0e222 UNIQUE (objectId,wordID, wordLocation, objectType)
 	);
// Note: I didn't see the SQL Syntax for creating a foreign key on wordID but 
you'll want to do that and to also for cascading deletes, even though you may 
not ever delete a word from wordKey...

It looks like you're trying to build your own word search index of documents.
If you you'll want to automate it rather than use static queries.

So lets say you're getting a text input of "CONSTRAINT  OR BAR AND FOO".
I would interpret this to mean select documents that either contain the word 
"CONSTRAINT" or the words "FOO" AND "BAR".
Note that you're going to create a parse tree that says node one contains 1 
item, "CONSTRAINT", node two contains 2 items, "FOO", "BAR"...

So you would want to create a SELECT statement that is equivlent to:

SELECT objectID, SUM(wordLocation) SCORE 
FROM	tblSearchDictionary
WHERE  wordID = xxx
UNION (
	SELECT objectID, SUM(wordLocation) SCORE
	FROM tblSearchDictionary
	WHERE wordID IN (yyy,zzz)
)...
GROUP BY objectID

Where xxx, yyy, zzz are integer values that you got from your look up.

While this isn't the complete solution, it should point you in the correct 
direction and should work independent of some kludges in the optimizer.

Your wordKey look up statement would be a prepared statement, while your 
actual scoring routine would be a dynamic statement.  

This should give you what you want.

I believe that there may be more to this for example what happens if you have  
a document that contains all three words ....

I'd also recommend the lookup table as a way to decrease the size of your 
overall solution and it may change how your optimizer looks at things. 
(You're doing your main query all on ints...)

But hey, what do I know? I am just having my first cup of joe and you get what 
you paid for... ;-)

-G

-- 
Michael Segel
Principal 
MSCC

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Michael Segel <ms...@segel.com>.

On Saturday 12 November 2005 10:28, Michael Segel wrote:

> 	CREATE TABLE tblSearchDictionary
> 	(
> 		 objectId		INT NOT NULL,
> 		 objectType 	INT NOT NULL,
> 		 wordLocation	INT NOT NULL,
> 		 word 		VARCHAR(64) NOT NULL,
> 		CONSTRAINT CONSd0e222 UNIQUE (objectId,objectType,word,wordLocation)
> 	);
>
> > This table has an index on each of the four columns, it also has the
> > unique index across all four columns as defined above:
>
As a follow up...
Change the constraint order to UNIQUE(objectId, word, wordLocation, 
objectType);

You probably can drop the index I asked you to create if you make this mod. 
Then your query should use the unique constraint's "backing index"...

HTH.

-G


-- 
Michael Segel
Principal 
MSCC

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Michael Segel <ms...@segel.com>.

On Friday 11 November 2005 08:59, Kevin Hore wrote:
Hi,

Before complaining about the query optimizer, lets look at your design first.
Yeah, I know that Derby, like every other database is not perfect and could 
always use improvements. But the reality is that the first place to look for 
improvements is in your own code.

Since you've got a small table and only 260K rows, lets try these simple 
experiments to see if it will help.

> ----PROBLEM DESCRIPTION----
>
> Consider the table:
>
> CREATE TABLE tblSearchDictionary
> (
> ObjectId int NOT NULL,
> ObjectType int NOT NULL,
> Word VARCHAR(64) NOT NULL,
> WordLocation int NOT NULL,
> CONSTRAINT CONSd0e222 UNIQUE (ObjectId,ObjectType,Word,WordLocation)
> );
>
Just a first suggestion, and its really a nit, could you alter the table to 
have Word as the last column? 

	CREATE TABLE tblSearchDictionary
	(
		 objectId		INT NOT NULL,
		 objectType 	INT NOT NULL,
		 wordLocation	INT NOT NULL,
		 word 		VARCHAR(64) NOT NULL,
		CONSTRAINT CONSd0e222 UNIQUE (objectId,objectType,word,wordLocation)
	);
> This table has an index on each of the four columns, it also has the
> unique index across all four columns as defined above:
>
Ok first suggestion. Drop all of your Indexes.
Based on your examples, these indexes are part of the problem.

For example an index on objectType doesn't make any sense.
First, in your example, you don't even use objectType in any of your queries.
Second. How many objectTypes are there? 10,100, 1000? The point is that 
objectType is not an unique enough identifier to justify being its own index.
The same could be said for wordLocation. So why have an index on this column?

> CREATE INDEX tblSearchDictionaryObjectId ON tblSearchDictionary (ObjectId);
> CREATE INDEX tblSearchDictionaryObjectType ON tblSearchDictionary
> (ObjectType);
> CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word);
> CREATE INDEX tblSearchDictionaryWordLocation ON tblSearchDictionary
> (WordLocation);
>

Ok, since you believe that the Optimizer is having problems selecting the 
correct Index to use, lets create your new indexes one by one to see how they 
can improve performance.

Lets review your query statements:

> The following query selects all rows that match instances of string in
> the Word column. It sums the WordLocation column having grouped by the
> ObjectId column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT'
> GROUP BY ObjectId;
>
> On my machine this will usually complete in an acceptable time of around
> 200ms.
>
Ok, this is pretty straight forward enough.

A good index would be a composite index on objectID and word.

> Now consider the following query which adds a second search term on the
> same column.
>
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT' OR Word = 'ADD'
> GROUP BY ObjectId;
>

Ok, again if you create a composite index on objectID and Word, you will get 
better performance.

As you point out below, the query may get rewritten.  If you wanted to, you 
could just rewrite your query to say "WHERE word IN ('CONTACT', 'ADD') but 
I'd save that for an option to test as a further tweak.

So, as a test, do the following:

1) DROP ALL CURRENT INDEXes.
2) CREATE a composite index on objectID and Word.

Run your query tests to see what happens to performance.

3) To test your theory about the optimizer, rewrite your second query to use 
the IN (xxx,xxx) option rather than OR.

In addition:

1) Recreate your table with your VARCHAR column as the last column.
2) Alter your table and replace your VARCHAR column with a fixed length 
CHAR(64) column.

If you can, please post your results.
Do each change and see what happens to your query times.

HTH

-Gumby
-- 
Michael Segel
Principal 
MSCC

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Kevin Hore <kh...@araxis.com>.

Hi Jeffrey,

Thank you for your response. I think you are probably right about Derby
not having a strategy for doing multiple scans. I think DERBY-47
probably covers this, but I'll perhaps add a note to that covering my
circumstances.

I had thought of using UNION but, unfortunately the IN clause must cope
with a varying number of values, and I don't want the system to be
constructing complex statements on-the-fly to avoid this particular
Derby problem. We have quite a number of queries affected by this, not
just the (simple) example under consideration, so using UNION would
require major surgery to our application.

Thanks again.

Kind regards

Kevin Hore

Jeffrey Lichtman wrote:
> 
>> I've described the problem in detail below, and I'd appreciate any 
>> assistance. Specifically:
>>
>> i) Does anyone have any plans to fix this problem?
> 
> 
> I believe the real problem is that Derby doesn't have any strategy for 
> doing multiple scans for OR/IN clauses. This is a useful feature, but I 
> don't know if anyone has entered an enhancement request.
> 
>> ii) In the meantime, are there any work-arounds? I'd appreciate any 
>> suggestions that would decrease the execution time of my second query 
>> below (the one with with two search terms). Likewise, any general 
>> strategies for avoiding this problem with IN clauses would be 
>> appreciated.
> 
> 
> You could re-write the query to use UNION:
> 
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM tblSearchDictionary
> WHERE Word = 'CONTACT' OR Word = 'ADD'
> GROUP BY ObjectId;
> 
> would become:
> 
> SELECT ObjectId, SUM(WordLocation) AS Score
> FROM
>     (SELECT ObjectId, WordLocation
>         FROM tblSearchDictionary
>         WHERE Word = 'CONTACT'
>      UNION ALL
>      SELECT ObjectId, WordLocation
>         FROM tblSearchDictionary
>         WHERE Word = 'ADD') t
> GROUP BY ObjectId;
> 
> 
>                        -        Jeff Lichtman
>                                 swazoo@rcn.com
>                                 Check out Swazoo Koolak's Web Jukebox at
>                                 http://swazoo.com/

Re: Poor query optimizer choices is making Derby unusable for large tables

Posted by Jeffrey Lichtman <sw...@rcn.com>.

>I've described the problem in detail below, and I'd appreciate any 
>assistance. Specifically:
>
>i) Does anyone have any plans to fix this problem?

I believe the real problem is that Derby doesn't have any strategy 
for doing multiple scans for OR/IN clauses. This is a useful feature, 
but I don't know if anyone has entered an enhancement request.

>ii) In the meantime, are there any work-arounds? I'd appreciate any 
>suggestions that would decrease the execution time of my second 
>query below (the one with with two search terms). Likewise, any 
>general strategies for avoiding this problem with IN clauses would 
>be appreciated.

You could re-write the query to use UNION:

SELECT ObjectId, SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE Word = 'CONTACT' OR Word = 'ADD'
GROUP BY ObjectId;

would become:

SELECT ObjectId, SUM(WordLocation) AS Score
FROM
     (SELECT ObjectId, WordLocation
         FROM tblSearchDictionary
         WHERE Word = 'CONTACT'
      UNION ALL
      SELECT ObjectId, WordLocation
         FROM tblSearchDictionary
         WHERE Word = 'ADD') t
GROUP BY ObjectId;


                        -        Jeff Lichtman
                                 swazoo@rcn.com
                                 Check out Swazoo Koolak's Web Jukebox at
                                 http://swazoo.com/