You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-user@db.apache.org by John English <jo...@gmail.com> on 2021/09/25 14:39:30 UTC

Using indexes

I asked something like this some months ago, but am still not getting 
anywhere...

I have a table used for logging events, which current contains just 
under 400,000 rows:

CREATE TABLE system_log (
   id          INTEGER       GENERATED ALWAYS AS IDENTITY,
   time        TIMESTAMP     DEFAULT NULL,
   username    VARCHAR(15),
   name        VARCHAR(520),
   facility    VARCHAR(15)   NOT NULL,
   event       VARCHAR(31)   NOT NULL,
   details     VARCHAR(32000),
   CONSTRAINT systemlog_pk   PRIMARY KEY (id)
);

I generally want to display most recent events first, so have defined an 
index for the purpose:

CREATE INDEX log_index ON system_log (id DESC);

I execute the following query:

SELECT id,DateTimeFormat(time,null) AS 
t_time,name,username,facility,event,details
FROM system_log
ORDER BY id DESC
NULLS LAST
FETCH FIRST 20 ROWS ONLY;

(DateTimeFormat is a simple function that converts a timestamp into 
something human-readable like "25-Sep-2021 at 12:30".)

A timing test showed that it takes 6.348 seconds to execute this. If I 
order by id ASC, it takes 0.124 seconds.

I used SYSCS_SET_RUNTIMESTATISTICS(1) and 
SYSCS_SET_STATISTICS_TIMING(1), and got a dump showing that it's doing 
an external sort and not using the index using ORDER BY id DESC, but 
using the PK index for ORDER BY id ASC (see below)

Why isn't the first query using the index I created, and what can I do 
to persuade Derby to use it?

Rick Hillegas mentioned "covering indexes", but thanks to the large 
"details" column I was unable to create such a beast. Would a CLOB be 
any better?

TIA,
-- 
John English

-------------------------------------------------------------------------
Using ORDER BY id DESC:

Statement Text:
	SELECT id,DateTimeFormat(time,null) AS 
t_time,name,username,facility,event,details
FROM system_log
ORDER BY id DESC
NULLS LAST
FETCH FIRST 20 ROWS ONLY
Parse Time: 20
Bind Time: 1
Optimize Time: 1
Generate Time: 1
Compile Time: 23
Execute Time: 6305
Begin Compilation Timestamp : 2021-09-25 17:15:19.947
End Compilation Timestamp : 2021-09-25 17:15:19.97
Begin Execution Timestamp : 2021-09-25 17:15:19.971
End Execution Timestamp : 2021-09-25 17:15:26.295
Statement Execution Plan Text:
Scroll Insensitive ResultSet:
Number of opens = 1
Rows seen = 20
Number of reads from hash table = 20
Number of writes to hash table = 20
	constructor time (milliseconds) = 0
	open time (milliseconds) = 6302
	next time (milliseconds) = 3
	close time (milliseconds) = 0
	optimizer estimated row count: 388223.00
	optimizer estimated cost: 398112.27
Source result set:
	Row Count (1):
	Number of opens = 1
	Rows seen = 20
	Rows filtered = 0
		constructor time (milliseconds) = 0
		open time (milliseconds) = 6302
		next time (milliseconds) = 3
		close time (milliseconds) = 18
		optimizer estimated row count: 388223.00
		optimizer estimated cost: 398112.27
	Source result set:
		Sort ResultSet:
		Number of opens = 1
		Rows input = 388218
		Rows returned = 20
		Eliminate duplicates = false
		In sorted order = false
		Sort information:
			Number of merge runs=11
			Number of rows input=388218
			Number of rows output=388218
			Size of merge runs=[32767, 32767, 32767, 32767, 32767, 32767, 32767, 
32767, 32767, 32767, 32767]
			Sort type=external
			constructor time (milliseconds) = 0
			open time (milliseconds) = 6302
			next time (milliseconds) = 3
			close time (milliseconds) = 18
			optimizer estimated row count: 388223.00
			optimizer estimated cost: 398112.27
		Source result set:
			Project-Restrict ResultSet (3):
			Number of opens = 1
			Rows seen = 388218
			Rows filtered = 0
			restriction = false
			projection = true
				constructor time (milliseconds) = 0
				open time (milliseconds) = 0
				next time (milliseconds) = 3881
				close time (milliseconds) = 0
				restriction time (milliseconds) = 0
				projection time (milliseconds) = 1434
				optimizer estimated row count: 388223.00
				optimizer estimated cost: 398112.27
			Source result set:
				Table Scan ResultSet for SYSTEM_LOG at read uncommitted isolation 
level using share row locking chosen by the optimizer
				Number of opens = 1
				Rows seen = 388218
				Rows filtered = 0
				Fetch Size = 16
					constructor time (milliseconds) = 0
					open time (milliseconds) = 0
					next time (milliseconds) = 2423
					close time (milliseconds) = 0
					next time in milliseconds/row = 0

				scan information:
					Bit set of columns fetched={0, 1, 2, 3, 4, 7, 8}
					Number of columns fetched=7
					Number of pages visited=2661
					Number of rows qualified=388218
					Number of rows visited=388404
					Scan type=heap
					start position:
						null
					stop position:
						null
					qualifiers:
						None
					optimizer estimated row count: 388223.00
					optimizer estimated cost: 398112.27

-------------------------------------------------------------------------
Using ORDER BY id ASC:

Statement Name:
	null
Statement Text:
	SELECT id,DateTimeFormat(time,null) AS 
t_time,name,username,facility,event,details
FROM system_log
ORDER BY id ASC
NULLS LAST
FETCH FIRST 20 ROWS ONLY
Parse Time: 24
Bind Time: 1
Optimize Time: 1
Generate Time: 1
Compile Time: 27
Execute Time: 123
Begin Compilation Timestamp : 2021-09-25 17:14:57.128
End Compilation Timestamp : 2021-09-25 17:14:57.155
Begin Execution Timestamp : 2021-09-25 17:20:31.615
End Execution Timestamp : 2021-09-25 17:20:31.738
Statement Execution Plan Text:
Scroll Insensitive ResultSet:
Number of opens = 1
Rows seen = 20
Number of reads from hash table = 20
Number of writes to hash table = 20
	constructor time (milliseconds) = 0
	open time (milliseconds) = 0
	next time (milliseconds) = 123
	close time (milliseconds) = 0
	optimizer estimated row count: 388223.00
	optimizer estimated cost: 1035759.04
Source result set:
	Row Count (1):
	Number of opens = 1
	Rows seen = 20
	Rows filtered = 0
		constructor time (milliseconds) = 0
		open time (milliseconds) = 0
		next time (milliseconds) = 123
		close time (milliseconds) = 0
		optimizer estimated row count: 388223.00
		optimizer estimated cost: 1035759.04
	Source result set:
		Project-Restrict ResultSet (4):
		Number of opens = 1
		Rows seen = 20
		Rows filtered = 0
		restriction = false
		projection = true
			constructor time (milliseconds) = 0
			open time (milliseconds) = 0
			next time (milliseconds) = 123
			close time (milliseconds) = 0
			restriction time (milliseconds) = 0
			projection time (milliseconds) = 1
			optimizer estimated row count: 388223.00
			optimizer estimated cost: 1035759.04
		Source result set:
			Index Row to Base Row ResultSet for SYSTEM_LOG:
			Number of opens = 1
			Rows seen = 20
			Columns accessed from heap = {0, 1, 2, 3, 4, 7, 8}
				constructor time (milliseconds) = 0
				open time (milliseconds) = 0
				next time (milliseconds) = 122
				close time (milliseconds) = 0
				optimizer estimated row count: 388223.00
				optimizer estimated cost: 1035759.04
				Index Scan ResultSet for SYSTEM_LOG using constraint SYSTEMLOG_PK at 
read uncommitted isolation level using share row locking chosen by the 
optimizer
				Number of opens = 1
				Rows seen = 20
				Rows filtered = 0
				Fetch Size = 1
					constructor time (milliseconds) = 0
					open time (milliseconds) = 0
					next time (milliseconds) = 121
					close time (milliseconds) = 0
					next time in milliseconds/row = 6

				scan information:
					Bit set of columns fetched={1}
					Number of columns fetched=1
					Number of deleted rows visited=94
					Number of pages visited=1187
					Number of rows qualified=20
					Number of rows visited=114
					Scan type=btree
					Tree height=3
					start position:
						None
					stop position:
						None
					qualifiers:
						None
					optimizer estimated row count: 388223.00
					optimizer estimated cost: 1035759.04

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by John English <jo...@gmail.com>.
On Tue, 28 Sep 2021, 17:52 Rick Hillegas, <ri...@gmail.com> wrote:

> Glad that you have made progress on this puzzle. One comment inline...
>
> On 9/28/21 4:10 AM, John English wrote:
> > The WHERE clause seems to prevent the index from being used.
> The index is not usable in this query because username is not the
> leading column in the index.
>

Ah, right.

>

Re: Using indexes

Posted by Rick Hillegas <ri...@gmail.com>.
Glad that you have made progress on this puzzle. One comment inline...

On 9/28/21 4:10 AM, John English wrote:
> A couple more data points, from testing different variants of the 
> inner select:
>
> 1) SELECT time FROM system_log
>      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY;
>
> 419ms, "Index Scan ResultSet for SYSTEM_LOG using index LOG_INDEX at 
> read uncommitted isolation level using share row locking chosen by the 
> optimizer"
>
> 2) SELECT time FROM system_log
>      WHERE username='foo'
>      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY;
>
> 2245ms, "Table Scan ResultSet for SYSTEM_LOG at read uncommitted 
> isolation level using share row locking chosen by the optimizer"
>
> Removing "FETCH NEXT 20 ROWS ONLY":
>
> 1) 388,237 rows in 3859ms, otherwise same as above (uses index).
> 2) 23,211 rows in 2199ms, otherwise same as above (doesn't use index).
>
> The WHERE clause seems to prevent the index from being used.
The index is not usable in this query because username is not the 
leading column in the index.



Re: Using indexes

Posted by John English <jo...@gmail.com>.
A couple more data points, from testing different variants of the inner 
select:

1) SELECT time FROM system_log
      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY;

419ms, "Index Scan ResultSet for SYSTEM_LOG using index LOG_INDEX at 
read uncommitted isolation level using share row locking chosen by the 
optimizer"

2) SELECT time FROM system_log
      WHERE username='foo'
      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY;

2245ms, "Table Scan ResultSet for SYSTEM_LOG at read uncommitted 
isolation level using share row locking chosen by the optimizer"

Removing "FETCH NEXT 20 ROWS ONLY":

1) 388,237 rows in 3859ms, otherwise same as above (uses index).
2) 23,211 rows in 2199ms, otherwise same as above (doesn't use index).

The WHERE clause seems to prevent the index from being used.

-- 
John English

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by John English <jo...@gmail.com>.
On 27/09/2021 18:23, John English wrote:
> I'll try the temporary table approach as soon as I get a minute, and 
> will let you know what happens.

I'm finally getting somewhere. Having (a) moved the row counting out to 
a separate query, (b) renamed columns in the table to match what my code 
requires so I can access the table directly instead of using a view, and 
(c) using your (Rick's) original idea of a nested select (which I 
decided to try again before going for a temporary table), I get an 
execution time of 455ms (woo-hoo!) for this query:

SELECT DateTimeFormat(t_time,null) AS t_time,facility,event,details
FROM system_log,
     (SELECT time FROM system_log
      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY) AS x
WHERE system_log.time=x.time;

However, if I add in a WHERE clause:

SELECT DateTimeFormat(t_time,null) AS t_time,facility,event,details
FROM system_log,
     (SELECT time FROM system_log
      WHERE username='foo'
      ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY) AS x
WHERE system_log.time=x.time;

the execution time goes up to 2 seconds (I'm still happy!). The left 
result set (the nested select, by the look of it) still doesn't use the 
index:

Left result set:
   Row Count (2):
   Number of opens = 1
   Rows seen = 20
   Rows filtered = 0
   ...
   Source result set:
     Sort ResultSet:
     Number of opens = 1
     Rows input = 23211
     Rows returned = 20
     Eliminate duplicates = false
     In sorted order = false
     Sort information:
       Number of rows input=23211
       Number of rows output=23211
       Sort type=internal
       ...
       optimizer estimated row count: 38824.20
       optimizer estimated cost: 399695.58
     Source result set:
       Project-Restrict ResultSet (4):
       Number of opens = 1
       Rows seen = 23211
       Rows filtered = 0
       restriction = false
       projection = true
       ...
       Source result set:
         Table Scan ResultSet for SYSTEM_LOG at read uncommitted 
isolation level using share row locking chosen by the optimizer
         Number of opens = 1
         Rows seen = 23211

but the right result set uses the PK index.

So, I'm in a much better position as the result of Rick's advice, but I 
still have no idea why the descending PK index isn't ever used!!!

-- 
John English

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by John English <jo...@gmail.com>.
On 26/09/2021 22:57, Rick Hillegas wrote:
> The support for FETCH/OFFSET is pretty minimal. Probably, the optimizer 
> isn't smart enough to know that the subquery returns only 20 small rows.
> 
> What happens if you dump the results of the subquery into a temporary 
> table and then join that with system_log?

Thanks, I'll give it a whirl. Meanwhile,I've got it down to about 5 or 6 
seconds by (a) eliminating the view in favour of a direct table access, 
and (b) using a separate SELECT COUNT(*) to get the row count, again 
using the table directly and without using an ORDER BY clause. (I 
renamed the id and time columns to time and t_time respectively, so the 
names match what the code expects to get back from the view, and then 
recreated the index.)

It still doesn't seem to be using the index (see trace below) but it's 
an improvement. Sorting by time ASC use the PK index and takes about 
200ms, but time DESC doesn't use an index and takes about 3s. And I 
still don't understand why it's ignoring the index.

I'll try the temporary table approach as soon as I get a minute, and 
will let you know what happens.

Thanks for the help,
-- 
John English

----------------------------------------------------
Statement Text:
	SELECT DateTimeFormat(t_time,null) AS 
t_time,facility,event,details,name,username FROM system_log
     ORDER BY time DESC NULLS LAST FETCH NEXT 20 ROWS ONLY
Parse Time: 19
Bind Time: 1
Optimize Time: 1
Generate Time: 0
Compile Time: 21
Execute Time: 4474
Begin Compilation Timestamp : 2021-09-27 17:52:23.802
End Compilation Timestamp : 2021-09-27 17:52:23.823
Begin Execution Timestamp : 2021-09-27 17:52:23.824
End Execution Timestamp : 2021-09-27 17:52:28.298
Statement Execution Plan Text:
Scroll Insensitive ResultSet:
Number of opens = 1
Rows seen = 20
Number of reads from hash table = 20
Number of writes to hash table = 20
	constructor time (milliseconds) = 0
	open time (milliseconds) = 4474
	next time (milliseconds) = 0
	close time (milliseconds) = 0
	optimizer estimated row count: 388236.00
	optimizer estimated cost: 399685.32
Source result set:
	Row Count (1):
	Number of opens = 1
	Rows seen = 20
	Rows filtered = 0
		constructor time (milliseconds) = 0
		open time (milliseconds) = 4474
		next time (milliseconds) = 0
		close time (milliseconds) = 0
		optimizer estimated row count: 388236.00
		optimizer estimated cost: 399685.32
	Source result set:
		Project-Restrict ResultSet (5):
		Number of opens = 1
		Rows seen = 20
		Rows filtered = 0
		restriction = false
		projection = true
			constructor time (milliseconds) = 0
			open time (milliseconds) = 4474
			next time (milliseconds) = 0
			close time (milliseconds) = 0
			restriction time (milliseconds) = 0
			projection time (milliseconds) = 0
			optimizer estimated row count: 388236.00
			optimizer estimated cost: 399685.32
		Source result set:
			Sort ResultSet:
			Number of opens = 1
			Rows input = 388231
			Rows returned = 20
			Eliminate duplicates = false
			In sorted order = false
			Sort information:
				Number of rows input=388231
				Number of rows output=388231
				Sort type=internal
				constructor time (milliseconds) = 0
				open time (milliseconds) = 4474
				next time (milliseconds) = 0
				close time (milliseconds) = 0
				optimizer estimated row count: 388236.00
				optimizer estimated cost: 399685.32
			Source result set:
				Project-Restrict ResultSet (3):
				Number of opens = 1
				Rows seen = 388231
				Rows filtered = 0
				restriction = false
				projection = true
					constructor time (milliseconds) = 0
					open time (milliseconds) = 0
					next time (milliseconds) = 4232
					close time (milliseconds) = 3
					restriction time (milliseconds) = 0
					projection time (milliseconds) = 1807
					optimizer estimated row count: 388236.00
					optimizer estimated cost: 399685.32
				Source result set:
					Table Scan ResultSet for SYSTEM_LOG at read uncommitted isolation 
level using share row locking chosen by the optimizer
					Number of opens = 1
					Rows seen = 388231
					Rows filtered = 0
					Fetch Size = 16
						constructor time (milliseconds) = 0
						open time (milliseconds) = 0
						next time (milliseconds) = 2390
						close time (milliseconds) = 3
						next time in milliseconds/row = 0

					scan information:
						Bit set of columns fetched={0, 1, 2, 3, 4, 7, 8}
						Number of columns fetched=7
						Number of pages visited=2666
						Number of rows qualified=388231
						Number of rows visited=388417
						Scan type=heap
						start position:
							null
						stop position:
							null
						qualifiers:
							None
						optimizer estimated row count: 388236.00
						optimizer estimated cost: 399685.32

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by Rick Hillegas <ri...@gmail.com>.
On 9/25/21 11:59 AM, John English wrote:
> On 25/09/2021 21:14, Rick Hillegas wrote:
>> On 9/25/21 7:39 AM, John English wrote:
>>> SELECT id,DateTimeFormat(time,null) AS 
>>> t_time,name,username,facility,event,details
>>> FROM system_log
>>> ORDER BY id DESC
>>> NULLS LAST
>>> FETCH FIRST 20 ROWS ONLY;
>>
>> I can remember whether you tried to rewrite the query to use a 
>> subquery. Something like this:
>>
>> SELECT id, time AS t_time,name,username,facility,event,details
>> FROM
>>    system_log s,
>>    (
>>      SELECT id AS log_id
>>      FROM system_log
>>      ORDER BY id DESC
>>      NULLS LAST
>>      FETCH FIRST 20 ROWS ONLY
>>    ) t
>> WHERE s.id = t.log_id
>> ;
>>
>> Does that help?
>>
>> -Rick
>
> No, when I tried it, it made it even worse if anything. The subquery 
> still fetches all 400,000 rows, sorts them without using the index, 
> and then uses the results to select again in the outer query. I really 
> don't understand why the index is ignored.
>
The support for FETCH/OFFSET is pretty minimal. Probably, the optimizer 
isn't smart enough to know that the subquery returns only 20 small rows.

What happens if you dump the results of the subquery into a temporary 
table and then join that with system_log?



Re: Using indexes

Posted by John English <jo...@gmail.com>.
On 25/09/2021 21:14, Rick Hillegas wrote:
> On 9/25/21 7:39 AM, John English wrote:
>> SELECT id,DateTimeFormat(time,null) AS 
>> t_time,name,username,facility,event,details
>> FROM system_log
>> ORDER BY id DESC
>> NULLS LAST
>> FETCH FIRST 20 ROWS ONLY;
> 
> I can remember whether you tried to rewrite the query to use a subquery. 
> Something like this:
> 
> SELECT id, time AS t_time,name,username,facility,event,details
> FROM
>    system_log s,
>    (
>      SELECT id AS log_id
>      FROM system_log
>      ORDER BY id DESC
>      NULLS LAST
>      FETCH FIRST 20 ROWS ONLY
>    ) t
> WHERE s.id = t.log_id
> ;
> 
> Does that help?
> 
> -Rick

No, when I tried it, it made it even worse if anything. The subquery 
still fetches all 400,000 rows, sorts them without using the index, and 
then uses the results to select again in the outer query. I really don't 
understand why the index is ignored.

-- 
John English

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by John English <jo...@gmail.com>.
On 25/09/2021 21:14, Rick Hillegas wrote:
> SELECT id, time AS t_time,name,username,facility,event,details
> FROM
>    system_log s,
>    (
>      SELECT id AS log_id
>      FROM system_log
>      ORDER BY id DESC
>      NULLS LAST
>      FETCH FIRST 20 ROWS ONLY
>    ) t
> WHERE s.id = t.log_id
> ;

Ideally I want to access the table through a view, defined like this:

CREATE VIEW system_log_view AS
   SELECT  time AS t_time,
           facility,
           event,
           details,
           name,
           username,
           id AS time
   FROM    system_log

(The renamings may seem strange but they serve a purpose!)

Selecting the ids directly from the table takes 0.4s:

   SELECT id FROM system_log
   ORDER BY id DESC NULLS LAST
   FETCH FIRST 20 ROWS ONLY;

Going through the view takes much longer (2.9s):

   SELECT time FROM system_log_view
   ORDER BY time DESC NULLS LAST
   FETCH FIRST 20 ROWS ONLY;

It seems like the view is materializing all the rows, and then 20 rows 
are being selected from the 400,000:

	Source result set:
		Sort ResultSet:
		Number of opens = 1
		Rows input = 388226
		Rows returned = 20

In any case I want to know the total number of rows, even though I 
display them in pages of 20 at a time. I thought I might get away with 
using a separate query for this, but no:

   SELECT COUNT(*) FROM system_log;		-- 0.271s
   SELECT COUNT(*) FROM system_log_view;		-- 2.184s

I don't understand why the view is so much slower than accessing the 
table directly. The direct table access says this:

Index Scan ResultSet for SYSTEM_LOG using index LOG_INDEX at read 
uncommitted isolation level using share row locking chosen by the optimizer

The corresponding line from executing the view doesn't mention the index:

Table Scan ResultSet for SYSTEM_LOG at read uncommitted isolation level 
using share row locking chosen by the optimizer

What am I not understanding about views here?
-- 
John English

-- 
This email has been checked for viruses by AVG.
https://www.avg.com


Re: Using indexes

Posted by Rick Hillegas <ri...@gmail.com>.
On 9/25/21 7:39 AM, John English wrote:
> SELECT id,DateTimeFormat(time,null) AS 
> t_time,name,username,facility,event,details
> FROM system_log
> ORDER BY id DESC
> NULLS LAST
> FETCH FIRST 20 ROWS ONLY;

I can remember whether you tried to rewrite the query to use a subquery. 
Something like this:

SELECT id, time AS t_time,name,username,facility,event,details
FROM
   system_log s,
   (
     SELECT id AS log_id
     FROM system_log
     ORDER BY id DESC
     NULLS LAST
     FETCH FIRST 20 ROWS ONLY
   ) t
WHERE s.id = t.log_id
;

Does that help?

-Rick