You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-dev@db.apache.org by Jeffrey Lichtman <sw...@rcn.com> on 2005/11/03 20:09:43 UTC

Re: Derby, SQL 2003, UNIQUE constraints, unique indices and NULL (long)

> > CREATE TABLE t2 (i INTEGER, j INTEGER NOT NULL);
> > CREATE INDEX idx3 on t2(i);
> > CREATE UNIQUE INDEX idx4 on t2(j);
> >
> > SELECT i FROM t2;  -- use index idx3 (covers query)
> > SELECT j FROM t2;  -- Do a table scan
>
>Shouldn't the last SELECT use idx4 since the index is UNIQUE *and* j
>is NOT NULL?
>--
>Bernt Marius Johnsen, Database Technology Group,

Yes, you're right. I should have said:

CREATE TABLE t2 (i INTEGER, j INTEGER);
CREATE INDEX idx3 on t2(i);
CREATE UNIQUE INDEX idx4 on t2(j);

SELECT i FROM t2;  -- use index idx3 (covers query)
SELECT j FROM t2;  -- Do a table scan

I've done a little more thinking about this problem. I suspect that 
the store code uses the results of the same compare() method 
invocation to determine where to insert a row into an index and also 
to determine whether the row is a duplicate in a unique index. If 
Derby is to allow multiple nulls in a unique index, the store must 
use the two different null semantics for the two cases. For the 
question of where to insert the row, nulls must be considered to be 
equal to each other, and for the question of whether they are 
duplicates, they must be considered not equal to each other.

If I'm right, it should be easy to change Derby to allow multiple 
nulls in unique indexes. We still haven't heard from any of the store 
experts, though. Mike?

                        -        Jeff Lichtman
                                 swazoo@rcn.com
                                 Check out Swazoo Koolak's Web Jukebox at
                                 http://swazoo.com/ 


Re: Derby, SQL 2003, UNIQUE constraints, unique indices and NULL (long)

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Jeffrey Lichtman wrote:
> 
>> > CREATE TABLE t2 (i INTEGER, j INTEGER NOT NULL);
>> > CREATE INDEX idx3 on t2(i);
>> > CREATE UNIQUE INDEX idx4 on t2(j);
>> >
>> > SELECT i FROM t2;  -- use index idx3 (covers query)
>> > SELECT j FROM t2;  -- Do a table scan
>>
>> Shouldn't the last SELECT use idx4 since the index is UNIQUE *and* j
>> is NOT NULL?
>> -- 
>> Bernt Marius Johnsen, Database Technology Group,
> 
> 
> Yes, you're right. I should have said:
> 
> CREATE TABLE t2 (i INTEGER, j INTEGER);
> CREATE INDEX idx3 on t2(i);
> CREATE UNIQUE INDEX idx4 on t2(j);
> 
> SELECT i FROM t2;  -- use index idx3 (covers query)
> SELECT j FROM t2;  -- Do a table scan
> 
> I've done a little more thinking about this problem. I suspect that the 
> store code uses the results of the same compare() method invocation to 
> determine where to insert a row into an index and also to determine 
> whether the row is a duplicate in a unique index. If Derby is to allow 
> multiple nulls in a unique index, the store must use the two different 
> null semantics for the two cases. For the question of where to insert 
> the row, nulls must be considered to be equal to each other, and for the 
> question of whether they are duplicates, they must be considered not 
> equal to each other.
> 
> If I'm right, it should be easy to change Derby to allow multiple nulls 
> in unique indexes. We still haven't heard from any of the store experts, 
> though. Mike?

I haven't thought this one all the way through, but thought I would
respond with at least what I know - I have been reticent to weigh
in on this uniqueness issue given the flack being thrown about.  So
I asked for some patience if I say something stupid below.

When I last looked at this issue
I looked at special casing columns with null's in the unique index and
that turned into a rat hole where all unique key index lookups would
go slower to support the multiple null case.

The current derby index implementation assumes all rows in the index are
unique and have a deterministic ordering, where a row in the index is 
all the key columns plus the the row location column which points to the
row in the base table.  So for "non-unique" indexes searching is always
done using all columns plus the row location column.  This basic 
assumption is used for searching, branch key specification, logical
undo recovery, and serializable locking considerations.  So from store
point of view during searching null's must compare equal.

For unique indexes searching is done using only the keys, and not the
the row location column.  In this way when a search is done to insert
a row into the unique index, it searches just for the keys (not 
including the unique row location) and if it
finds an existing non-deleted match it throws a duplicate key exception.

As jeff points out there may be a way to do the search using one compare
and the duplicate check using another, but no matter what from the store
point of view the null's are the same (yes I know SQL does not agree).
Recovery on a unique key index expects that searching for a key will
find ONE entry for the key and it will be the right one.  There are also
locking, searching and cost assumptions/optimizations that assume if you 
are looking for a key in the tree you will only get 1 row back.

So I think if it were decided to allow multiple null's in a constraint
or an index, then the actual underlying btree implementation to use 
would be the non-unique index implementation with some extra code
stuck in to do the datatype specific comparison necessary at insert 
time.  I think that code would look something like:
     o position beginning of scan using insert key, using current 
compare semantics.
     o scan all rows which match the key using current semantics
     o for each row use alternate compare semantics and if alternate
       compare says equal disallow insert.

This could be
done today with no store changes by having the execution engine first
do a serializable lookup on the row to be inserted and then inserting
if the row met SQL defined uniqueness definitions - but implementing
in store would be more efficient as at least it would only do one
search vs. 2.

I would worry about an implementation like the one Bernt suggests, as it
just seems to invite weird bugs when using the wrong index; but this
may just reflect my inexperience with the optimizer and execution engine.

The other day I heard a good use for the SQL constraint null semantics.
Imagine an application that tracked people by SS# but also included
people with no SS#.  In that case the SQL constraint would insure
uniqueness of the SS# while allowing null's for those with no SS#.

> 
>                        -        Jeff Lichtman
>                                 swazoo@rcn.com
>                                 Check out Swazoo Koolak's Web Jukebox at
>                                 http://swazoo.com/
> 
>