Posted to issues@quickstep.apache.org by "Jianqiao Zhu (JIRA)" <ji...@apache.org> on 2017/10/12 18:35:00 UTC

[jira] [Created] (QUICKSTEP-109) Refactor type system to provide better extensibility of types and functions

Jianqiao Zhu created QUICKSTEP-109:
--------------------------------------

             Summary: Refactor type system to provide better extensibility of types and functions
                 Key: QUICKSTEP-109
                 URL: https://issues.apache.org/jira/browse/QUICKSTEP-109
             Project: Apache Quickstep
          Issue Type: Improvement
          Components: Expressions, Parser, Query Optimizer, Storage, Types
            Reporter: Jianqiao Zhu


This is an initial PR that provides an overall view of the type system refactoring work. Many constructs are at their initial designs and may be further improved.

The PR is intended for reviewing the refactoring design at the "architecture" level. Detailed code style and unit test issues may be addressed later in subsequent concrete PRs.

The overall purpose of the refactoring is to improve the extensibility of the existing type/function system (i.e. support more kinds of types/functions and make it easier to add new types and functions), while retaining the performance of the current system.

### Major Changes
#### Part I. Type System
---
##### 1. Categorize all types into four [_memory layouts_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/TypeID.hpp#L64).

The four memory layouts are:
* __CxxInlinePod__ <sub>(C++ plain old data)</sub>
* __ParInlinePod__ <sub>(Parameterized inline plain old data)</sub>
* __ParOutOfLinePod__ <sub>(Parameterized out-of-line plain old data)</sub>
* __CxxGeneric__ <sub>(C++ generic types)</sub>

Memory layout decides how the corresponding type's values are stored and represented.

Briefly speaking,
* _CxxInlinePod_ corresponds to C++ primitive types or POD structs.
  * E.g. _int_, _double_, _struct { double x; double y; }_.
  * The size of a CxxInlinePod value is known at C++ compile time (e.g. _double_ has size 8, _struct { double x; double y; }_ has size 16).
* _ParInlinePod_ corresponds to database defined "fixed length" types.
  * E.g. _Char(8)_, _Char(20)_.
  * The size of such a type's values is not known at C++ compile time. Instead, the type is parameterized by an unsigned integer, where the parameter's value is known at SQL query compile time (which is C++ run-time).
* _ParOutOfLinePod_ corresponds to database defined "variable length" types.
  * E.g. _Varchar(20)_.
  * The size of such a type's values is not known until SQL query run-time.
* _CxxGeneric_ corresponds to C++ general types (i.e. any C++ type).
  * E.g. _std::set&lt;int&gt;_, _std::vector&lt;const Type*&gt;_.
  * Such types have to implement serialization/deserialization methods to have storage support.
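
As a rough illustration of the categorization above, the four layouts and the point at which a value's size becomes known could be modeled as below. This is only a sketch: `MemoryLayout`, `SizeKnownAt`, `sizeKnownAt` and `Point` are hypothetical names, not the definitions in _TypeID.hpp_, and the _CxxGeneric_ case is an assumption derived from the serialization requirement.

```cpp
// Hypothetical mirror of the four memory layouts; the real enum lives in
// types/TypeID.hpp and may be spelled differently.
enum class MemoryLayout {
  kCxxInlinePod,
  kParInlinePod,
  kParOutOfLinePod,
  kCxxGeneric
};

// The point in time at which the byte size of a value becomes known.
enum class SizeKnownAt {
  kCxxCompileTime,     // e.g. int, double, POD structs
  kSqlCompileTime,     // e.g. Char(8): fixed once the SQL query is compiled
  kSqlRunTime,         // e.g. Varchar(20): depends on each individual value
  kAfterSerialization  // assumption: the text only says CxxGeneric needs
                       // serialization/deserialization for storage support
};

// Maps each layout to the stage described in the list above.
constexpr SizeKnownAt sizeKnownAt(MemoryLayout layout) {
  return layout == MemoryLayout::kCxxInlinePod    ? SizeKnownAt::kCxxCompileTime
       : layout == MemoryLayout::kParInlinePod    ? SizeKnownAt::kSqlCompileTime
       : layout == MemoryLayout::kParOutOfLinePod ? SizeKnownAt::kSqlRunTime
                                                  : SizeKnownAt::kAfterSerialization;
}

// A CxxInlinePod-style value: its size is available to the C++ compiler.
struct Point { double x; double y; };
static_assert(sizeof(Point) == 2 * sizeof(double),
              "POD size is a C++ compile-time constant");
```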
---
##### 2. Use [_TypeIDTrait_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/TypeRegistrar.hpp#L59) to make much type information available at compile time.

With this per-type trait information, we can avoid much of the boilerplate code for each subclass of _Type_ by using template techniques that specialize on the memory layout. See [_TypeSynthesizer_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/TypeSynthesizer.hpp) and [_TypeFactory_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/TypeFactory.cpp#L69).

_TypeIDTrait_ is also extensively used in many other places as it provides all the required compile-time information about a type.
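
To give a feel for the idea, here is a minimal sketch of a per-type trait and of a helper that is specialized once per memory layout instead of once per type. The members `cpptype` and `kMemoryLayout`, the helper `ValueSize`, and the rule that _Char(n)_ occupies _n_ bytes are assumptions made for illustration; the actual trait in _TypeRegistrar.hpp_ may look different.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical subset of type IDs and the four memory layouts.
enum class TypeID { kInt, kDouble, kChar };
enum class MemoryLayout { kCxxInlinePod, kParInlinePod, kParOutOfLinePod, kCxxGeneric };

// Primary template is only declared; each type provides a specialization.
template <TypeID id>
struct TypeIDTrait;

template <>
struct TypeIDTrait<TypeID::kInt> {
  using cpptype = std::int32_t;
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kDouble> {
  using cpptype = double;
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kCxxInlinePod;
};

template <>
struct TypeIDTrait<TypeID::kChar> {
  // No C++ value type: the size depends on the SQL-level length parameter.
  static constexpr MemoryLayout kMemoryLayout = MemoryLayout::kParInlinePod;
};

// Hypothetical helper selected by memory layout rather than by concrete type.
template <TypeID id, MemoryLayout layout = TypeIDTrait<id>::kMemoryLayout>
struct ValueSize;

// All CxxInlinePod types share this one implementation.
template <TypeID id>
struct ValueSize<id, MemoryLayout::kCxxInlinePod> {
  static constexpr std::size_t get(std::size_t /* parameter is ignored */) {
    return sizeof(typename TypeIDTrait<id>::cpptype);
  }
};

// All ParInlinePod types share this one (assumed: size equals the parameter).
template <TypeID id>
struct ValueSize<id, MemoryLayout::kParInlinePod> {
  static constexpr std::size_t get(std::size_t length) { return length; }
};
```

Adding another _CxxInlinePod_ type in this sketch only requires a new `TypeIDTrait` specialization; `ValueSize` picks it up without further code.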

---

##### 3. Support more types.
Details will be written later about how to add a new type into the Quickstep system.

The current PR has some example types added:
* The [_Bool_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/BoolType.hpp) type. It will be used later for connecting scalar functions and predicates.
* The [_Text_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/TextType.hpp) type. A general non-parameterized string type.
  * __TODO:__ We need some updates in the storage block module (potentially also other places) to handle the "infinite maximum byte size" types.
* The [_MetaType_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/MetaType-decl.hpp) type. It is the "type of types", i.e. a value of _MetaType_ has C++ type _const Type*_.
* The [_Array_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/ArrayType.hpp) type. A generic type that represents an array. This type takes a MetaType value as parameter, where the parameter specifies the array's element type.
  * __TODO__: We need specialized array types such as _IntArray_ and _TextArray_ for performance reasons.

---
##### 4. Improve the type casting mechanism.

Type casting (coercion) is an important feature that is needed in practice from time to time.

This PR's design defines a general [template](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/unary_operations/CastFunctorOverloads.hpp#L41)
```
template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;
```
which is then specialized by different source/target types.

The coercibility between two types is then [inferred](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/utility/CastUtil.cpp#L58) according to whether the corresponding specialization exists. Thus it suffices to just specialize _CastFunctor_ when adding a new casting operation, and all the dependent places (e.g. _Type::isCoercibleFrom()_) will mostly be auto-generated by the system (unless the target type is a parameterized type and you want to do some further checks).

Note that _safe-coercibility_ is a separate issue and needs to be taken care of mostly manually, by overriding _Type::isSafelyCoercibleFrom()_.
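
The specialization-based inference described above can be sketched roughly as follows. This is a simplified illustration, not the code in _CastUtil.cpp_: it uses plain C++ types instead of Quickstep _Type_ classes, and `IsCoercible` is a hypothetical helper that checks whether `CastFunctor<S, T>` is a complete type.

```cpp
#include <string>
#include <type_traits>

// Primary template: declared but never defined, as in the PR.
template <typename SourceType, typename TargetType, typename Enable = void>
struct CastFunctor;

// Adding a cast means adding a specialization. Here: int -> double.
template <>
struct CastFunctor<int, double> {
  double operator()(int value) const { return static_cast<double>(value); }
};

// A family of casts at once: any arithmetic type -> std::string.
template <typename SourceType>
struct CastFunctor<SourceType, std::string,
                   typename std::enable_if<std::is_arithmetic<SourceType>::value>::type> {
  std::string operator()(SourceType value) const { return std::to_string(value); }
};

// Hypothetical detector: false unless CastFunctor<S, T> is a complete type.
template <typename S, typename T, typename = void>
struct IsCoercible : std::false_type {};

// sizeof() on an incomplete type is a substitution failure, so this partial
// specialization is chosen only when a CastFunctor specialization exists.
template <typename S, typename T>
struct IsCoercible<S, T, decltype(void(sizeof(CastFunctor<S, T>)))> : std::true_type {};

static_assert(IsCoercible<int, double>::value, "explicit specialization exists");
static_assert(IsCoercible<float, std::string>::value, "covered by the arithmetic family");
static_assert(!IsCoercible<double, int>::value, "no specialization was written");
```

With this shape, nothing besides the new `CastFunctor` specialization has to be touched for `IsCoercible` to report the new cast.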

Explicit casting is supported with a PostgreSQL-like syntax. E.g.

(1)
```
SELECT (i::text + (i+1)::text)::int AS result FROM generate_series(1, 3) AS g(i);

--
+-----------+
|result     |
+-----------+
|         12|
|         23|
|         34|
+-----------+
```
(2)
```
CREATE TABLE r(x varchar(16));

INSERT INTO r SELECT pow(10, i)::varchar(10) FROM generate_series(1, 3) AS g(i);

SELECT 'There are ' + length(x)::varchar(10) + ' characters in ' + x AS result FROM r;

--
+---------------------------------------------------+
|result                                             |
+---------------------------------------------------+
|                       There are 2 characters in 10|
|                      There are 3 characters in 100|
|                     There are 4 characters in 1000|
+---------------------------------------------------+
```

(3)
```
SELECT {1,2,3}::array(double) AS result from generate_series(1, 1);

--
+--------------------------------+
|result                          |
+--------------------------------+
|                         {1,2,3}|
+--------------------------------+
```

__NOTE__: The work is not yet fully completed so there may be `LOG(FATAL)` aborts for some combinations of queries.


Implicit coercion is supported when resolving scalar functions; see [here](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/OperationFactory.cpp#L292). For example, we have support for the _sqrt_ function where the parameter can be a _Float_ or _Double_ value. Consider the query
```
SELECT sqrt(x) FROM r;
```
where `x` has _Int_ type. An implicit coercion from _Int_ to _Float_ will then be added.
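
One way to picture this resolution step is the sketch below. It is not the logic in _OperationFactory.cpp_; the names `isCoercible`, `Resolution` and `resolveSqrt`, the coercion table, and the "exact match first, then first coercible candidate" order are all assumptions chosen to reproduce the _sqrt_ example.

```cpp
// Hypothetical subset of type IDs.
enum class TypeID { kInt, kFloat, kDouble, kVarChar };

// Assumed coercion table for this sketch only.
constexpr bool isCoercible(TypeID from, TypeID to) {
  return from == to
      || (from == TypeID::kInt && (to == TypeID::kFloat || to == TypeID::kDouble))
      || (from == TypeID::kFloat && to == TypeID::kDouble);
}

struct Resolution {
  bool found;         // whether any overload accepts the argument
  TypeID param_type;  // parameter type of the chosen overload
  bool needs_cast;    // whether an implicit coercion must be inserted
};

// sqrt is registered for Float and Double parameters, as in the example above.
constexpr TypeID kSqrtParamTypes[] = {TypeID::kFloat, TypeID::kDouble};

constexpr Resolution resolveSqrt(TypeID argument_type) {
  // Pass 1: prefer an overload whose parameter type matches exactly.
  for (TypeID param : kSqrtParamTypes) {
    if (param == argument_type) return {true, param, false};
  }
  // Pass 2: otherwise take the first overload the argument can be coerced to.
  for (TypeID param : kSqrtParamTypes) {
    if (isCoercible(argument_type, param)) return {true, param, true};
  }
  return {false, argument_type, false};
}
```

For an _Int_ argument this sketch selects the _Float_ overload and flags that a cast is needed, which matches the behavior described for `sqrt(x)`.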

---
##### 5. Add [_GenericValue_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/GenericValue.hpp) to represent typed-values of all four memory layouts.

The original _TypedValue_ is not sufficient to represent _CxxGeneric_ values, as we need to embed the overall _Type_ information in order to handle value allocation/copy/destruction. However, for performance reasons, we cannot simply replace _TypedValue_ with a more generic but slower implementation. Thus, a separate _GenericValue_ is added, and we still use _TypedValue_ when handling storage-related operations.
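
The reason the _Type_ has to travel with the value can be seen in a small sketch like the one below, where copying and destruction are delegated to the embedded type object. All names here (`Type`, `CxxGenericType`, `cloneValue`, `destroyValue`, `Create`, `as`) are hypothetical stand-ins, and the real _GenericValue_ interface is likely different.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for Quickstep's Type: knows how to copy/destroy values.
class Type {
 public:
  virtual ~Type() = default;
  virtual void* cloneValue(const void* value) const = 0;
  virtual void destroyValue(void* value) const = 0;
};

// One singleton Type object per C++ value type T.
template <typename T>
class CxxGenericType : public Type {
 public:
  static const CxxGenericType& Instance() {
    static const CxxGenericType instance{};
    return instance;
  }
  void* cloneValue(const void* value) const override {
    return new T(*static_cast<const T*>(value));
  }
  void destroyValue(void* value) const override {
    delete static_cast<T*>(value);
  }
};

// A value that carries its Type so it can manage arbitrary C++ objects.
class GenericValue {
 public:
  template <typename T>
  static GenericValue Create(const T& value) {
    return GenericValue(&CxxGenericType<T>::Instance(), new T(value));
  }
  GenericValue(const GenericValue& other)
      : type_(other.type_), value_(other.type_->cloneValue(other.value_)) {}
  GenericValue& operator=(const GenericValue&) = delete;  // omitted for brevity
  ~GenericValue() { type_->destroyValue(value_); }

  // Typed access; the caller must ask for the type the value was created with.
  template <typename T>
  const T& as() const {
    assert(type_ == &CxxGenericType<T>::Instance());
    return *static_cast<const T*>(value_);
  }

 private:
  GenericValue(const Type* type, void* value) : type_(type), value_(value) {}
  const Type* type_;
  void* value_;
};
```

The virtual calls and heap allocation on every copy hint at why such a representation would be slower than _TypedValue_ on storage-related paths.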

---
##### 6. Move type resolving from parser to resolver.

This avoids the need to modify _SqlParser.ypp_ when adding a new type.

See [_ParseDataType_](https://github.com/apache/incubator-quickstep/blob/refactor-type/parser/ParseDataType.hpp) and [_Resolver::resolveDataType()_](https://github.com/apache/incubator-quickstep/blob/refactor-type/query_optimizer/resolver/Resolver.cpp#L1196).

~

#### Part II. Scalar Function
---
##### 1. Implement [_UnaryOperationSynthesizer_/_UncheckedUnaryOperatorSynthesizer_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/unary_operations/UnaryOperationSynthesizer.hpp#L58) to make it easier to add unary functions.

Example unary functions:
* [Arithmetic](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/unary_operations/ArithmeticUnaryFunctors.hpp#L60)
* [String](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/unary_operations/AsciiStringUnaryFunctors.hpp#L106)
* [Math](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/unary_operations/CMathUnaryFunctors.hpp#L70)
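
The general idea of a synthesizer can be sketched as follows: a functor declares only its argument type, result type and per-value logic, and a template generates the rest. This is an assumption-laden illustration; `NegateFunctor`, `IsPositiveFunctor`, `UnarySketch` and the member names are hypothetical and do not reflect the actual _UnaryOperationSynthesizer_ interface.

```cpp
#include <vector>

// Hypothetical functors: only types and per-value logic are written by hand.
struct NegateFunctor {
  using ArgumentType = int;
  using ResultType = int;
  static constexpr ResultType apply(ArgumentType value) { return -value; }
};

struct IsPositiveFunctor {
  using ArgumentType = double;
  using ResultType = bool;
  static constexpr ResultType apply(ArgumentType value) { return value > 0.0; }
};

// Hypothetical synthesizer: derives value-level and column-level evaluation
// from any functor that follows the convention above.
template <typename Functor>
struct UnarySketch {
  using Arg = typename Functor::ArgumentType;
  using Res = typename Functor::ResultType;

  static constexpr Res applyToValue(Arg value) { return Functor::apply(value); }

  static std::vector<Res> applyToColumn(const std::vector<Arg>& column) {
    std::vector<Res> result;
    result.reserve(column.size());
    for (const Arg& value : column) {
      result.push_back(Functor::apply(value));
    }
    return result;
  }
};
```

In this sketch a new unary function costs one small struct, and `UnarySketch<NewFunctor>` supplies the evaluation loops.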

##### 2. Implement [_BinaryOperationSynthesizer_/_UncheckedBinaryOperatorSynthesizer_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/binary_operations/BinaryOperationSynthesizer.hpp#L62) to make it easier to add binary functions.

Example binary functions:
* [Arithmetic](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/binary_operations/ArithmeticBinaryFunctors.hpp#L94)
* [String](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/binary_operations/AsciiStringBinaryFunctors.hpp#L127)
* [Math](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/binary_operations/CMathBinaryFunctors.hpp#L66)

##### 3. Use [_OperationSignature_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/OperationSignature.hpp#L45) and [_OperationFactory_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/OperationFactory.hpp#L48) to support general operation resolution.

* See [_OperationFactory::OperationFactory()_](https://github.com/apache/incubator-quickstep/blob/refactor-type/types/operations/OperationFactory.cpp#L85) for how operations are registered.
* See [_Resolver::resolveScalarFunction()_](https://github.com/apache/incubator-quickstep/blob/refactor-type/query_optimizer/resolver/Resolver.cpp#L2889) for how a function in a SQL query gets resolved.


~

#### Part III. TODOs
* A lot of _TODO(refactor-type)_ in the code to be fixed.
* Refactor the predicate system (we will have something like _ComparisonSynthesizer_).
* A lot of unit tests are broken (due to API changes) and need to be fixed.
* Comments and style of template metaprogramming code.
* More to be added ...



