Posted to dev@impala.apache.org by Xiaoqing Gao <28...@qq.com> on 2022/06/27 12:14:23 UTC

Add check constraint syntax

Hi, are you considering adding the feature of CONSTRAINT syntax?

When we add the clause CONSTRAINT c1 CHECK (col1_day = DAY(col1_time)) to the DDL, Impala could recognize the constraint and rewrite "col_time between $time_lower and $time_upper" into "col_day between $day_lower and $day_upper and col_time between $time_lower and $time_upper".

E.g.
CREATE TABLE rawdata.event_p1 (
  day INT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
  id BIGINT NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
  time TIMESTAMP NOT NULL ENCODING BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION,
  p__ip STRING NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  CONSTRAINT c1 CHECK (day = DAY(time))
) PARTITIONED BY ( day INT )
STORED AS PARQUET

explain select * from event_p1 where time > '2018-07-18 19:36:42'

01:EXCHANGE [UNPARTITIONED]
|
00:SCAN HDFS [rawdata.event_ros_p7]
   partition predicates: `day` > 17730 and time > '2018-07-18 19:36:42'
   partitions=425/439 files=3420 size=1.71TB
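The proposed rewrite can be sketched in Python (a toy illustration, not Impala code; the function name is made up, and the days-since-epoch encoding of `day` is an assumption inferred from the plan above, where `day` > 17730 corresponds to 2018-07-18):

```python
from datetime import date, datetime

def derive_day_predicate(ts_literal: str) -> str:
    """Toy rewrite rule for CHECK (day = DAY(time)): given a predicate
    `time > ts`, derive an extra partition predicate on `day`,
    assuming `day` stores days since the Unix epoch."""
    ts = datetime.fromisoformat(ts_literal)
    epoch_days = (ts.date() - date(1970, 1, 1)).days
    # Any row with time > ts must fall on that calendar day or later.
    return f"`day` >= {epoch_days}"

print(derive_day_predicate("2018-07-18 19:36:42"))  # `day` >= 17730
```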


Thanks

gaoxiaoqing

Re: Add check constraint syntax

Posted by David Rorke <dr...@cloudera.com>.
There has been some initial work on PK/FK constraints described in
https://issues.apache.org/jira/browse/IMPALA-3531.   AFAICT the status of
this is roughly that we support the syntax to define PK/FK constraints as
part of "create table" but don't support adding them via "alter table".
Also as far as I know we aren't enforcing/validating these yet (so you can
only define them with "novalidate") and the Impala planner isn't doing any
meaningful optimizations yet based on the defined constraints.

On Mon, Jun 27, 2022 at 7:26 AM Zoltán Borók-Nagy <bo...@cloudera.com>
wrote:

> Currently we are not working on CONSTRAINTs AFAICT. But feel free to open a
> Jira for it, someone might pick it up. You can also work on implementing it
> if you want.
>
> During writes, Impala would just verify the constraints, i.e. "col_day =
> DAY(col_time)".
>
> Your example also includes query optimization based on constraints, i.e.
> adding extra predicates on columns with constraints. This seems to be the
> trickier part, and in some cases might require that the rewrite rule knows
> the functions in the CONSTRAINT.
>
> But I'd also like to point out that the above use case works already with
> Iceberg tables.
>
> E.g.:
>
> CREATE TABLE ice_t (ts TIMESTAMP, d INT)
> PARTITIONED BY SPEC (DAY(ts))
> STORED AS ICEBERG;
>
> INSERT INTO ice_t VALUES (now(), 42);
>
> SELECT * FROM ice_t where ts > '2018-07-18 19:36:42';
>
> During insertion the partition value is automatically calculated, and
> during the planning of the SELECT statement the Iceberg table is smart
> enough to apply the ts > '2018-07-18 19:36:42' predicate (rewritten to
> "day(ts) >= '2018-07-18'") on the partition column.
>
> Cheers,
>      Zoltan
>
>
> On Mon, Jun 27, 2022 at 2:14 PM Xiaoqing Gao <28...@qq.com> wrote:
>
> > Hi, are you considering adding the feature of CONSTRAINT syntax?
> >
> > When we add the clause CONSTRAINT c1 CHECK (col1_day = DAY(col1_time)) to
> > the DDL, Impala could recognize the constraint and rewrite "col_time
> > between $time_lower and $time_upper" into "col_day between $day_lower and
> > $day_upper and col_time between $time_lower and $time_upper".
> >
> > E.g.
> > CREATE TABLE rawdata.event_p1 ( day INT NOT NULL ENCODING BIT_SHUFFLE
> > COMPRESSION DEFAULT_COMPRESSION, id BIGINT NOT NULL ENCODING BIT_SHUFFLE
> > COMPRESSION DEFAULT_COMPRESSION, time TIMESTAMP NOT NULL ENCODING
> > BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION, p__ip STRING NULL ENCODING
> > DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION, CONSTRAINT c1 CHECK (day =
> > DAY(time)) ) PARTITIONED BY ( day INT ) STORED AS PARQUET
> >
> > explain select * from event_p1 where time > '2018-07-18 19:36:42'
> >
> > 01:EXCHANGE [UNPARTITIONED] | 00:SCAN HDFS [rawdata.event_ros_p7]
> > partition predicates: `day` > 17730 and time > '2018-07-18 19:36:42'
> > partitions=425/439 files=3420 size=1.71TB
> >
> >
> > Thanks
> >
> > gaoxiaoqing
>

Re: Add check constraint syntax

Posted by Zoltán Borók-Nagy <bo...@cloudera.com>.
Currently we are not working on CONSTRAINTs AFAICT. But feel free to open a
Jira for it, someone might pick it up. You can also work on implementing it
if you want.

During writes, Impala would just verify the constraints, i.e. "col_day =
DAY(col_time)".
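Verifying such a constraint at write time amounts to evaluating the CHECK expression per row; a minimal sketch (hypothetical names, not Impala internals, taking DAY() as the day-of-month extraction):

```python
from datetime import datetime

def check_constraint(row: dict) -> bool:
    """CHECK (col_day = DAY(col_time)): reject rows where the derived
    day column disagrees with the timestamp's day of month."""
    return row["col_day"] == row["col_time"].day

ok = {"col_day": 18, "col_time": datetime(2018, 7, 18, 19, 36)}
bad = {"col_day": 17, "col_time": datetime(2018, 7, 18, 19, 36)}
print(check_constraint(ok), check_constraint(bad))  # True False
```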

Your example also includes query optimization based on constraints, i.e.
adding extra predicates on columns with constraints. This seems to be the
trickier part, and in some cases might require that the rewrite rule knows
the functions in the CONSTRAINT.

But I'd also like to point out that the above use case works already with
Iceberg tables.

E.g.:

CREATE TABLE ice_t (ts TIMESTAMP, d INT)
PARTITIONED BY SPEC (DAY(ts))
STORED AS ICEBERG;

INSERT INTO ice_t VALUES (now(), 42);

SELECT * FROM ice_t where ts > '2018-07-18 19:36:42';

During insertion the partition value is automatically calculated, and
during the planning of the SELECT statement the Iceberg table is smart
enough to apply the ts > '2018-07-18 19:36:42' predicate (rewritten to
"day(ts) >= '2018-07-18'") on the partition column.
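The pruning step can be modeled with a toy version of Iceberg's day partition transform (a sketch assuming the standard Iceberg behavior of storing the transform result as days since the Unix epoch; the function names are made up, not the Iceberg API):

```python
from datetime import date, datetime

def day_transform(ts: datetime) -> int:
    """Iceberg-style day transform: days since the Unix epoch."""
    return (ts.date() - date(1970, 1, 1)).days

def may_match(partition_day: int, lower_bound: datetime) -> bool:
    """Pruning for `ts > lower_bound`: a partition can only contain
    matching rows if its day is on or after the bound's day."""
    return partition_day >= day_transform(lower_bound)

bound = datetime(2018, 7, 18, 19, 36, 42)
print(may_match(day_transform(datetime(2018, 7, 18)), bound))  # True
print(may_match(day_transform(datetime(2018, 7, 17)), bound))  # False
```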

Cheers,
     Zoltan


On Mon, Jun 27, 2022 at 2:14 PM Xiaoqing Gao <28...@qq.com> wrote:

> Hi, are you considering adding the feature of CONSTRAINT syntax?
>
> When we add the clause CONSTRAINT c1 CHECK (col1_day = DAY(col1_time)) to the
> DDL, Impala could recognize the constraint and rewrite "col_time between
> $time_lower and $time_upper" into "col_day between $day_lower and $day_upper
> and col_time between $time_lower and $time_upper".
>
> E.g.
> CREATE TABLE rawdata.event_p1 ( day INT NOT NULL ENCODING BIT_SHUFFLE
> COMPRESSION DEFAULT_COMPRESSION, id BIGINT NOT NULL ENCODING BIT_SHUFFLE
> COMPRESSION DEFAULT_COMPRESSION, time TIMESTAMP NOT NULL ENCODING
> BIT_SHUFFLE COMPRESSION DEFAULT_COMPRESSION, p__ip STRING NULL ENCODING
> DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION, CONSTRAINT c1 CHECK (day =
> DAY(time)) ) PARTITIONED BY ( day INT ) STORED AS PARQUET
>
> explain select * from event_p1 where time > '2018-07-18 19:36:42'
>
> 01:EXCHANGE [UNPARTITIONED] | 00:SCAN HDFS [rawdata.event_ros_p7]
> partition predicates: `day` > 17730 and time > '2018-07-18 19:36:42'
> partitions=425/439 files=3420 size=1.71TB
>
>
> Thanks
>
> gaoxiaoqing