Posted to derby-dev@db.apache.org by Daniel John Debrunner <dj...@debrunners.com> on 2005/02/18 19:00:16 UTC

Type system notes

Inspired by Dibyendu I thought I would add some notes about the Derby
type system, since that's where I have been working for the last few
weeks for JSR 169. Though I wasn't motivated enough to write in XML for
Forrest, I added the information as a package.html in iapi.types
directory, so it would be picked up by the engine javadoc.

As Dibyendu pointed out, there is a lot of information in the existing
javadoc, close to the code where I think design docs should be. The
javadoc can be created by executing the top level ant target javadoc.

cd ${derby.source}
ant javadoc

This then creates javadoc in

${derby.source}/javadoc

The complete engine javadoc can be accessed through

${derby.source}/javadoc/engine/index.html

Hopefully at some point this javadoc can be generated automatically and
added to the site.

Adding more package files would be a great goal for Derby, and I think
that Dibyendu's documents should be linked to from various store
package.html's.

Here is a direct link to the types' package.html, from svn.

https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html

Dan.

Re: Type system notes

Posted by Daniel John Debrunner <dj...@debrunners.com>.

RPost wrote:

> 
> So it appears that the 'result' parameter that SumAggregator's accumulate
> method passes on to the SQLInteger.plus method may 'morph' from a tiny or
> small int to an integer to a long to a big decimal. Thus for this use of
> 'plus' it doesn't appear that the result is necessarily the same class as
> the addend parameters. I may have missed something in the way that
> polymorphism may take care of this.

You are correct, that slipped my mind, even though I was just working on
it. The reason for this is that avg(x) is not as simple as executing
sum(x)/count(x). This is due to overflow: sum(x) on an INT column
returns an INT and thus can throw out of range errors. For example,
imagine just two rows, each with the value Integer.MAX_VALUE. The sum is
out of range, but the average needs to be Integer.MAX_VALUE, so using
SQLInteger as the summing mechanism doesn't work. Thus the average code
uses the natural type initially for the sum part of average and then
moves up to a type with a greater range if an out of range exception is
seen. This gradual change is for performance: SQLInteger will be faster
than SQLLong or SQLDecimal for addition, and SQLDecimal in particular
will be much slower.
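
A minimal sketch of the promote-on-overflow idea described above. It is
not the AvgAggregator code: the class name PromotingAverage is
hypothetical, plain int/long/BigInteger fields stand in for SQLInteger,
SQLLong and SQLDecimal, and Math.addExact stands in for the out of range
exception.

```java
import java.math.BigInteger;

// Hypothetical illustration only; this is not a Derby class.
final class PromotingAverage {
    private int intSum;         // used while the running sum fits in an int
    private long longSum;       // used after the first int overflow
    private BigInteger bigSum;  // used after a long overflow
    private int stage = 0;      // 0 = int, 1 = long, 2 = BigInteger
    private long count;

    void accumulate(int value) {
        count++;
        if (stage == 0) {
            try {
                intSum = Math.addExact(intSum, value); // cheap path
                return;
            } catch (ArithmeticException overflow) {
                longSum = intSum;   // promote, then retry in the wider type
                stage = 1;
            }
        }
        if (stage == 1) {
            try {
                longSum = Math.addExact(longSum, (long) value);
                return;
            } catch (ArithmeticException overflow) {
                bigSum = BigInteger.valueOf(longSum);
                stage = 2;
            }
        }
        bigSum = bigSum.add(BigInteger.valueOf(value)); // slowest path
    }

    int average() {
        if (count == 0) {
            throw new IllegalStateException("no rows accumulated");
        }
        switch (stage) {
            case 0:  return (int) (intSum / count);
            case 1:  return (int) (longSum / count);
            default: return bigSum.divide(BigInteger.valueOf(count)).intValueExact();
        }
    }

    public static void main(String[] args) {
        PromotingAverage avg = new PromotingAverage();
        avg.accumulate(Integer.MAX_VALUE);
        avg.accumulate(Integer.MAX_VALUE); // overflows int, promotes to long
        System.out.println(avg.average() == Integer.MAX_VALUE);
    }
}
```

The two-row case from the text stays on the int path for the first row
and only pays for the wider type once the overflow is actually seen.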

> 
> There is also a comment by Dan in the AvgAggregator accumulate method:
> 
>   // this code creates data type objects directly, it is anticipating
>   // the time they move into the defined api of the type system. (djd).
> 
> Maybe Dan can comment on what this means.

This is the DataValueFactory issue in those notes. Basically, the code
assumes that it's ok to new SQLLong (for example) directly without
going through the type factory, on the assumption that one day that will
be the standard practice for most type implementations. The types package
also used to be split across the impl/iapi groupings.

Dan.

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

>"Daniel John Debrunner" wrote:
>Operators and self
>It seems the operator methods should almost always be acting on their own
>value, e.g. the plus() method should only take one input and the result is
>the value of the receiver (self) added to the input. Currently the plus
>takes two inputs and probably in most if not all cases the left input is
>the receiver. The result would be smaller code and possibly faster, as the
>method calls on self would not be through an interface.

The 'accumulate' method in SumAggregator calls the NumberDataValue 'plus'
method. Naturally, since this call was the first one I looked at, it passes
'self' as the 2nd addend, but it does act on its own value and replaces its
'value' with the result.

Accumulate checks itself for 'null' on every call and clones the 'addend'
parameter if needed. Thus for an accumulation of 1 million rows it will
check itself for 'null' 1 million times. If it were possible to always
initialize the aggregator at creation this null check could be removed.

AvgAggregator (via its own accumulate method) is one user of the
'accumulate' method in SumAggregator. The accumulate method used in this
class catches the 'value out of range' exception and promotes its 'value' to
a class that can handle larger values; it then calls 'super.accumulate'
(SumAggregator).

So it appears that the 'result' parameter that SumAggregator's accumulate
method passes on to the SQLInteger.plus method may 'morph' from a tiny or
small int to an integer to a long to a big decimal. Thus for this use of
'plus' it doesn't appear that the result is necessarily the same class as
the addend parameters. I may have missed something in the way that
polymorphism may take care of this.

There is also a comment by Dan in the AvgAggregator accumulate method:

  // this code creates data type objects directly, it is anticipating
  // the time they move into the defined api of the type system. (djd).

Maybe Dan can comment on what this means.

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

Still researching type system issues with a view to testing some altered
'plus' operator code.
Found a code construct I haven't seen before.

The DataValueDescriptor interface has several methods that return a
BooleanDataValue object. Curiously, though, the BooleanDataValue interface
extends the DataValueDescriptor interface.
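
The construct described above is legal Java: two interfaces in the same
compilation may refer to each other, so a base interface can name its own
sub-interface as a return type. A minimal sketch, with hypothetical names
(ValueDescriptor, BoolValue, SimpleInt, SimpleBool) that only echo Derby's
and with invented bodies:

```java
// Hypothetical miniature of the pattern; these are not Derby's interfaces.
class CyclicInterfacesDemo {
    public static void main(String[] args) {
        ValueDescriptor three = new SimpleInt(3);
        BoolValue same = three.isEqualTo(new SimpleInt(3));
        System.out.println(same.getBoolean());
    }
}

interface ValueDescriptor {
    boolean isNull();
    // The base interface returns its own sub-interface.
    BoolValue isEqualTo(ValueDescriptor other);
}

// The sub-interface extends the interface that refers to it.
interface BoolValue extends ValueDescriptor {
    boolean getBoolean();
}

final class SimpleBool implements BoolValue {
    private final boolean value;
    SimpleBool(boolean value) { this.value = value; }
    public boolean isNull() { return false; }
    public boolean getBoolean() { return value; }
    public BoolValue isEqualTo(ValueDescriptor other) {
        return new SimpleBool(other instanceof SimpleBool
                && ((SimpleBool) other).value == value);
    }
}

final class SimpleInt implements ValueDescriptor {
    private final int value;
    SimpleInt(int value) { this.value = value; }
    public boolean isNull() { return false; }
    public BoolValue isEqualTo(ValueDescriptor other) {
        return new SimpleBool(other instanceof SimpleInt
                && ((SimpleInt) other).value == value);
    }
}
```

Because the comparison result is itself a value in the type system, it
can be compared, stored or tested for null like any other value. The
cost is that the two interfaces cannot be compiled or packaged
separately.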

I've seen a lot of low-level code before but I haven't seen code where
lower-level interfaces have a built-in dependency on an extending interface.
Everything I've seen only goes in one direction.

Is this common?

Would anyone care to provide some expert insight into the whys and
wherefores (and pitfalls if any) of using such a construct?

The more I look into this code the more impressed I am with the level of
development that went into it.

Re: Type system notes

Posted by Daniel John Debrunner <dj...@debrunners.com>.

RPost wrote:

> The SQLInteger.java has a 'plus' method whose signature is:
>    public NumberDataValue plus(NumberDataValue addend1,
>                                                   NumberDataValue addend2,
>                                                   NumberDataValue result)
> 
> All of the parameters and the return value are interfaces.
> 
> Would you shed some light on the actual technical requirements for the
> 'plus' method?
> 
> In other words, can the type of each parameter and result value literally be
> any and all present and future implementations of the NumberDataValue
> interface? That is, is it a derby requirement that the 'plus' method be
> capable of being called with 'addend1', 'addend2' and 'result' parameters
> being three distinct classes? Or is this just a convenience to make the code
> generation more efficient?

I think the current use and requirements are that all four types are
the same, and typically, if not always, addend1 is the same instance
as the receiver.

The technical requirement could be more flexible, requiring that the
result is suitable to hold the result of operation and that addend1 and
addend2 can be correctly represented by the receiver.

Thus, given the implementations today, this would work

SQLInteger.plus(SQLInteger, SQLShort, SQLLong)

This would give incorrect results

SQLInteger.plus(SQLInteger, SQLLong, SQLInteger)

Incorrect results would occur because the plus methods 'pull' arguments
from addend1 and addend2 using the getXXX methods, in this case with
addend2 SQLLong.getInt(). This SQLLong.getInt() matches the Java
operation (int) longValue, thus silently truncating the top 32 bits.
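
The Java narrowing conversion mentioned above can be seen in isolation.
This is a sketch in plain Java, not Derby code: NarrowingDemo and its
methods are hypothetical, and whether SQLLong.getInt() behaves exactly
like the unchecked variant is taken from the description above.

```java
// Hypothetical demo of int narrowing; no Derby classes are involved.
final class NarrowingDemo {
    // Mimics pulling an int out of a value that really holds a long:
    // the cast keeps only the low 32 bits.
    static int getIntUnchecked(long stored) {
        return (int) stored;
    }

    // A checked alternative that fails loudly instead of truncating.
    static int getIntChecked(long stored) {
        return Math.toIntExact(stored); // throws ArithmeticException if out of range
    }

    public static void main(String[] args) {
        int addend1 = 1;
        long addend2 = (1L << 32) + 5;  // does not fit in 32 bits

        // Analogous to an int-based plus pulling an int from a long operand.
        int wrong = addend1 + getIntUnchecked(addend2);
        long right = addend1 + addend2;

        System.out.println(wrong);  // 6
        System.out.println(right);  // 4294967302
    }
}
```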

I was saying that the methods should be simplified to say the method
executes on the receiver (self), thus plus (I think) should be defined as

// Perform result = this + val
void plus(NumberDataValue val, NumberDataValue result)

and thinking about interfaces, I think it might be possible to define
plus in NumberDataValue as

// set result to this + val
void plus(DataValueDescriptor val, DataValueDescriptor result)

The advantage of going to the primary type interface, DVD, is that it
removes the need to add a cast before calling plus, as type values
are handled generically in most of the code as DataValueDescriptors.
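
A minimal sketch of the receiver-based shape proposed above. The
interface Num and the class IntNum are hypothetical stand-ins for
NumberDataValue and SQLInteger, SQL NULL handling is left out, and the
result holder is assumed to be non-null.

```java
// Hypothetical stand-ins; these are not Derby's interfaces or classes.
class ReceiverPlusDemo {
    public static void main(String[] args) {
        IntNum left = new IntNum(2);
        IntNum right = new IntNum(3);
        IntNum result = new IntNum(0);
        left.plus(right, result);       // result = left + right
        System.out.println(result.getInt());
    }
}

interface Num {
    int getInt();
    void setValue(int value);
    // Perform result = this + val; result must never be null.
    void plus(Num val, Num result);
}

final class IntNum implements Num {
    private int value;
    IntNum(int value) { this.value = value; }
    public int getInt() { return value; }
    public void setValue(int value) { this.value = value; }
    public void plus(Num val, Num result) {
        // The left operand is read straight from the receiver's field,
        // so only the right operand and the result go through the interface.
        result.setValue(Math.addExact(value, val.getInt()));
    }
}
```

Compared with the three-argument form, the receiver can no longer
disagree with the left operand, which removes one of the mismatched-type
cases described earlier.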

Dan.

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

>"Daniel John Debrunner" wrote:

>Language Compilation
>Much of the generated code for language involves the type system. E.g. SQL
>operators are converted to method calls on interfaces within the type

Method calls on interfaces certainly represent polymorphism at its finest
and I am not questioning their use in Derby. But they also make it more
difficult, for me at least, to abstract the actual technical requirements
from the code itself.

The SQLInteger.java has a 'plus' method whose signature is:
   public NumberDataValue plus(NumberDataValue addend1,
                                                  NumberDataValue addend2,
                                                  NumberDataValue result)

All of the parameters and the return value are interfaces.

Would you shed some light on the actual technical requirements for the
'plus' method?

In other words, can the type of each parameter and result value literally be
any and all present and future implementations of the NumberDataValue
interface? That is, is it a derby requirement that the 'plus' method be
capable of being called with 'addend1', 'addend2' and 'result' parameters
being three distinct classes? Or is this just a convenience to make the code
generation more efficient?

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

>"Daniel John Debrunner" wrote:

>I think it's actually fairly easy to test this. Write really simple
>tests using direct creation of SQLIntegers, completely outside of the
>engine. Work on plus, write a simple addition using plus that simulates
>what the engine does, execute one million additions using

>1) references through NumberDataValue
>2) references through NumberDataType
>3+) various modified plus() methods as described in the type's package.html.

Initial tests (all tests pass non-null arguments) give these typical
results:

1 Million iterations

Total 'native int' time: [0]
Total NumberDataValue time: [140]
Total SQLInteger time: [47]
Total DataValueDescriptor time: [141]
Total int time: [47]

10 Million iterations

Total 'native int' time: [47]
Total NumberDataValue time: [1281]
Total SQLInteger time: [547]
Total DataValueDescriptor time: [1234]
Total int time: [344]

'native int' is just adding two 'int' values together in java - no method
call involved.
'NumberDataValue' is the existing 'plus' function.
'SQLInteger' calls 'plus1' which defines all parameters and the return as
SQLInteger.
'DataValueDescriptor' calls 'plus2' which defines all parameters and the
return as DataValueDescriptor. (The NumberDataValue interface extends the
DataValueDescriptor interface.)
'int' calls 'plus3' which defines all parameters and the return as int.
Since ints can't be null much of the method was removed.

Not much difference between NumberDataValue and DataValueDescriptor which
are 2-3 times slower than SQLInteger.

SQLInteger has pretty low overhead given that it is an Object and performs
all of the 'null' and overflow checking.

Still to do: tests using methods defined on 'self' per Dan's suggestion.
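
For anyone wanting to repeat this kind of comparison, here is a rough
sketch of a timing harness. It is not the test code used for the numbers
above: PlusBenchSketch, Adder and IntAdder are hypothetical stand-ins,
and hand-rolled timings like this are easily distorted by JIT warm-up and
inlining, so treat any output as indicative only. With a single
implementing class loaded, a modern JVM may well make the two call sites
equally fast.

```java
// Hypothetical micro-benchmark sketch; none of these types are Derby classes.
final class PlusBenchSketch {
    interface Adder {
        int plus(int left, int right);
    }

    static final class IntAdder implements Adder {
        public int plus(int left, int right) {
            return Math.addExact(left, right); // overflow-checked add
        }
    }

    // Call site typed as the interface.
    static long viaInterface(Adder adder, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += adder.plus(i, 1);
        }
        return total;
    }

    // Call site typed as the final class.
    static long viaClass(IntAdder adder, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += adder.plus(i, 1);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        IntAdder adder = new IntAdder();
        viaInterface(adder, n);   // warm-up runs, results discarded
        viaClass(adder, n);

        long t0 = System.nanoTime();
        long a = viaInterface(adder, n);
        long t1 = System.nanoTime();
        long b = viaClass(adder, n);
        long t2 = System.nanoTime();

        System.out.println("interface ms: " + (t1 - t0) / 1_000_000);
        System.out.println("class ms:     " + (t2 - t1) / 1_000_000);
        System.out.println("same total:   " + (a == b));
    }
}
```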

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

My question 'Are these same DVDs also then used by the log module to create
log records?' was related to the interaction with the log functionality.

Is the array of DVDs created for a row, perhaps as part of an insert/update,
used by the log code to create the log records? Or does the log code create
its own DVDs for log purposes?

Re: Type system notes

Posted by Daniel John Debrunner <dj...@debrunners.com>.

RPost wrote:
> Thanks for the extended notes. These are very helpful.
> 
> 
>>"Daniel John Debrunner" wrote:
> 
> 
>>https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html
> 
> 
>>Generally the Derby engine works upon an array of DVD's that represent a
>>row
> 
> Are these same DVDs also then used by the log module to create log records?

Not sure what this question is really asking, but there is only one
DataValueDescriptor interface. Some store classes are implementations of it.

>>Interaction with Store
> 
> What interaction is there with other modules (or functionality) such as
> log/restore/recovery or the catalog? Are external versions of type
> descriptors used to create the catalog descriptions of the columns? Used in
> metadata queries?

That would indeed be a good write up, deeper insight into the TypeId
side. Since I wasn't working in that area it wasn't fresh in my mind.

>>DataTypeDescriptor
>>Note that a DataValueDescriptor is not tied to any DataTypeDescriptor
> 
> 
> Is this for the same performance reasons given in the DataValueDescriptor
> section? There you said: 'For example in  reading rows from the store a
> single DVD is used to read a column's value for all the rows processed'. I
> assume that not tying the value and type descriptors together means that the
> value descriptors don't need to validate the type when being reused during
> reads from the store.

One reason is memory overhead: tying a DVD to a DTD would mean each DVD
has an extra instance field that is the reference to the DTD. Another is
that these objects were previously written in network protocols and thus
needed to be created context free.
I think you are also correct in the performance assumption, as Derby can
avoid the normalization step if input and output types are compatible,
e.g. CHAR(5) to CHAR(10) does not need a length check.

>>Issues
>>Interfaces or Classes
>>Code would be smaller and faster if the interfaces were removed
> 
> 
> Do you have any sense or 'guesstimate' as to what the maximum potential size
> or speed savings could be?
> 
> Do you think this may be necessary (as opposed to desirable) for certain
> environments such as mobile or wireless?
> 
> Is it conceptually possible to design a 'proof of concept' that might
> provide at least an estimate of the savings that might be achieved? That is,
> is there any specific test case that might be useful to see if it is worth
> exploring further or would the changes be extensive even to perform a
> limited test? Obviously the simpler the case the better.

I don't think it's necessary but it offends me :-). Looking at XP
practices, it would fall into the 'refactor' bucket.

I think it's actually fairly easy to test this. Write really simple
tests using direct creation of SQLIntegers, completely outside of the
engine. Work on plus, write a simple addition using plus that simulates
what the engine does, execute one million additions using

1) references through NumberDataValue
2) references through NumberDataType
3+) various modified plus() methods as described in the type's package.html.

>>Result Holder Generation
>>The dynamic creation of result holders (see language section) means that
> 
> all operators have to check for the result reference being passed in being
> null, and if so create a new instance of the desired type
> 
> Could a result holder cache/factory be used effectively for this? Perhaps a
> separate thread that maintains a cache of new instances of various types.
> The size of the cache could be configurable by introducing a new property.
> This would allow the null checks to be removed from the operator code and
> the operator code would not have to wait synchronously for instance
> creation. Obviously there would be asynchronous waits since the cache would
> never be big enough for large numbers of rows.

I don't think a cache is a good idea, it's too much complexity. All I
was trying to say was that the generated code could ensure the field was
initialized at statement initialization time, thus ensuring the field
was never null when the operator is executed. That removes the need in
the operators to check whether the result holder is null, i.e. their
api is defined as the result holder passed in must never be null, and
removes the need in the generated code to set the field with the return
(result) of the method call. A possible extra step is to define the
methods as void, as the caller already has the reference to the result.
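
The difference between the two styles can be sketched as follows. The
names ResultHolderSketch, IntHolder, plusLazy and plusEager are
hypothetical and this is not Derby's generated code; it only contrasts an
operator that tolerates a null holder with one whose contract says the
holder is never null.

```java
// Hypothetical sketch; not Derby's generated code or type classes.
final class ResultHolderSketch {
    // Current style: the operator checks for a null holder on every call
    // and returns the holder so the caller can store it in its field.
    static IntHolder plusLazy(IntHolder left, IntHolder right, IntHolder result) {
        if (result == null) {
            result = new IntHolder();
        }
        result.value = Math.addExact(left.value, right.value);
        return result;
    }

    // Proposed style: the holder is created once, up front, so the
    // operator needs no null check and can be declared void.
    static void plusEager(IntHolder left, IntHolder right, IntHolder result) {
        result.value = Math.addExact(left.value, right.value);
    }

    public static void main(String[] args) {
        IntHolder left = new IntHolder();
        IntHolder right = new IntHolder();
        left.value = 2;
        right.value = 3;

        IntHolder lazyField = null;                    // set on first use
        lazyField = plusLazy(left, right, lazyField);  // caller must re-assign

        IntHolder eagerField = new IntHolder();        // initialized up front
        plusEager(left, right, eagerField);            // nothing to re-assign

        System.out.println(lazyField.value + " " + eagerField.value);
    }
}

// Minimal mutable holder standing in for a data value object.
final class IntHolder {
    int value;
}
```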

Thanks for the questions!
Dan.

Re: Type system notes

Posted by RPost <rp...@pacbell.net>.

Thanks for the extended notes. These are very helpful.

> "Daniel John Debrunner" wrote:

>https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html
>

>Generally the Derby engine works upon an array of DVD's that represent a
>row

Are these same DVDs also then used by the log module to create log records?

>Interaction with Store
What interaction is there with other modules (or functionality) such as
log/restore/recovery or the catalog? Are external versions of type
descriptors used to create the catalog descriptions of the columns? Used in
metadata queries?

>DataTypeDescriptor
>Note that a DataValueDescriptor is not tied to any DataTypeDescriptor

Is this for the same performance reasons given in the DataValueDescriptor
section? There you said: 'For example in  reading rows from the store a
single DVD is used to read a column's value for all the rows processed'. I
assume that not tying the value and type descriptors together means that the
value descriptors don't need to validate the type when being reused during
reads from the store.

>Issues
>Interfaces or Classes
>Code would be smaller and faster if the interfaces were removed

Do you have any sense or 'guesstimate' as to what the maximum potential size
or speed savings could be?

Do you think this may be necessary (as opposed to desirable) for certain
environments such as mobile or wireless?

Is it conceptually possible to design a 'proof of concept' that might
provide at least an estimate of the savings that might be achieved? That is,
is there any specific test case that might be useful to see if it is worth
exploring further or would the changes be extensive even to perform a
limited test? Obviously the simpler the case the better.

>Result Holder Generation
>The dynamic creation of result holders (see language section) means that
>all operators have to check for the result reference being passed in being
>null, and if so create a new instance of the desired type

Could a result holder cache/factory be used effectively for this? Perhaps a
separate thread that maintains a cache of new instances of various types.
The size of the cache could be configurable by introducing a new property.

This would allow the null checks to be removed from the operator code and
the operator code would not have to wait synchronously for instance
creation. Obviously there would be asynchronous waits since the cache would
never be big enough for large numbers of rows.

Re: Type system notes

Posted by "Jean T. Anderson" <jt...@bristowhill.com>.

thanks! I'll add a link to it.

  -jean

Dibyendu Majumdar wrote:
> Hi Jean,
> 
> This is the one I was referring to.
> ----- Original Message ----- 
> From: "Daniel John Debrunner" <dj...@debrunners.com>
> To: <de...@db.apache.org>
> Sent: Friday, February 18, 2005 6:00 PM
> Subject: Type system notes
> 
> 
> 
>>Here is a direct link to the types' package.html, from svn.
>>
>>
> 
> https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html
> 
>>Dan.
>>
> 
> 
>

Re: Type system notes

Posted by je...@videotron.ca.

Hi,
could you tell me if there is a formal specification somewhere for the SQL
type system?
Thanks
-Jean

On 2 March 2005 18:36, Dibyendu Majumdar wrote:
> > Here is a direct link to the types' package.html, from svn.
>
> https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html

Re: Type system notes

Posted by Dibyendu Majumdar <di...@mazumdar.demon.co.uk>.

Hi Jean,

This is the one I was referring to.
----- Original Message ----- 
From: "Daniel John Debrunner" <dj...@debrunners.com>
To: <de...@db.apache.org>
Sent: Friday, February 18, 2005 6:00 PM
Subject: Type system notes


>
> Here is a direct link to the types' package.html, from svn.
>
>
https://svn.apache.org/viewcvs.cgi/*checkout*/incubator/derby/code/trunk/java/engine/org/apache/derby/iapi/types/package.html
>
> Dan.
>