Posted to issues@spark.apache.org by "Song Jun (JIRA)" <ji...@apache.org> on 2019/04/17 08:22:00 UTC

[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

    [ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819844#comment-16819844 ] 

Song Jun commented on SPARK-19842:
----------------------------------

I think constraint support should be designed together with DataSource v2, and it can cover more than this JIRA.

Constraints can be used for:
1. data integrity (not included in this JIRA)
2. query rewrites in the optimizer to gain performance (not just PK/FK; UNIQUE/NOT NULL are also useful)

For data integrity, we have two scenarios:
1.1 The DataSource natively supports data integrity, such as MySQL/Oracle and so on.
    Spark should only call the read/write API of this DataSource and do nothing about data integrity.
1.2 The DataSource does not support data integrity, such as CSV/JSON/Parquet and so on.
    Spark can provide data integrity for this DataSource like Hive does (perhaps with a switch to turn it off), and we can discuss which kinds of constraints to support.
    For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL enforcement check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan (https://issues.apache.org/jira/browse/HIVE-16605). A rough sketch of such a Spark-side check is shown right below this list.
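
A minimal sketch of what a Spark-side NOT NULL check could look like for a source that cannot enforce it. The helper name ConstraintChecks.enforceNotNull is hypothetical, and this is only an eager, DataFrame-level approximation of the idea; Hive instead injects GenericUDFEnforceNotNullConstraint into the plan:

// Hypothetical sketch only: an eager NOT NULL check Spark could run before
// writing to a DataSource that cannot enforce the constraint itself.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ConstraintChecks {
  // Throws if any of the given columns contains a null; otherwise returns the input unchanged.
  def enforceNotNull(df: DataFrame, columns: Seq[String]): DataFrame = {
    columns.foreach { c =>
      if (df.filter(col(c).isNull).limit(1).count() > 0) {
        throw new IllegalStateException(s"NOT NULL constraint violated for column '$c'")
      }
    }
    df
  }
}

// Usage (assuming a DataFrame `input` whose column `b` is declared NOT NULL):
// ConstraintChecks.enforceNotNull(input, Seq("b")).write.parquet("/tmp/t")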

For optimizer query rewrites:
2.1 We can add constraint information to the CatalogTable returned by the catalog.getTable API. The optimizer can then use it to rewrite queries (see the sketch after this list).
2.2 If we cannot get constraint information from the catalog, we can supply it through hints in the SQL.
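
To make 2.1 concrete, here is an illustrative sketch of the kind of join elimination that PK/FK information would enable. The table and column names (fact_sales, dim_customer) are made up for the example, and this only shows the rewrite itself, not an existing optimizer rule:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ri-rewrite-sketch").getOrCreate()

// Original query: the dimension table is joined but none of its columns are projected.
val original = spark.sql(
  """SELECT f.amount
    |FROM fact_sales f
    |JOIN dim_customer c ON f.customer_id = c.id""".stripMargin)

// If the catalog declared dim_customer.id as a PRIMARY KEY and
// fact_sales.customer_id as a NOT NULL FOREIGN KEY referencing it (RELY),
// every fact row matches exactly one dimension row, so the inner join neither
// drops nor duplicates rows and the optimizer could rewrite the query to:
val rewritten = spark.sql("SELECT f.amount FROM fact_sales f")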

Based on the above, we can bring the constraint feature into the DataSource v2 design:
a) to support feature 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#)
b) to support data integrity, we can add a ConstraintSupport mix-in for DataSource v2 (a sketch of a possible shape follows this list):
  if a DataSource supports constraints, then Spark does nothing when inserting data;
  if a DataSource does not support constraints but still wants constraint checks, then Spark should do the constraint check like Hive does (e.g. for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan);
  if a DataSource does not support constraints and does not want constraint checks, then Spark does nothing.
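
A rough sketch of how such a ConstraintSupport mix-in could look. All of the names below (Constraint, NotNull, PrimaryKey, ForeignKey, ConstraintSupport, enforcedConstraints, informationalConstraints) are hypothetical and only meant to show the shape of the proposal:

// Hypothetical sketch only: none of these types exist in Spark today.
sealed trait Constraint
case class NotNull(column: String) extends Constraint
case class PrimaryKey(name: String, columns: Seq[String]) extends Constraint
case class ForeignKey(name: String, columns: Seq[String],
                      refTable: String, refColumns: Seq[String]) extends Constraint

trait ConstraintSupport {
  // Constraints the source enforces itself; Spark does nothing extra on insert.
  def enforcedConstraints: Seq[Constraint]

  // Constraints the source only declares; Spark may optionally add checks to
  // the write plan (similar to Hive's GenericUDFEnforceNotNullConstraint),
  // or skip the check entirely if the source does not want it.
  def informationalConstraints: Seq[Constraint]
}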


The Hive catalog supports constraints, so we can implement this logic in the createTable/alterTable API. Then we can use Spark SQL DDL to create a table with constraints, which are stored in the HiveMetaStore through the Hive catalog API.
For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet;

As for how to store constraints: Hive 2.1 already provides constraint APIs in Hive.java, so we can call them directly in the createTable/alterTable API of the Hive catalog. There is no need for Spark to store this constraint information in table properties. There are some concerns about using the Hive 2.1 catalog API directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1, but the upgrade to Hive 2.3.4 is in progress (https://issues.apache.org/jira/browse/SPARK-23710).

[~cloud_fan] [~ioana-delaney]
If this proposal is reasonable, please give me some feedback. Thanks!

> Informational Referential Integrity Constraints Support in Spark
> ----------------------------------------------------------------
>
>                 Key: SPARK-19842
>                 URL: https://issues.apache.org/jira/browse/SPARK-19842
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Ioana Delaney
>            Priority: Major
>         Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key (referential integrity) constraints_ in Spark. The main purpose is to open up an area of query optimization techniques that rely on referential integrity constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a _unique_, _primary key_, _foreign key_, or _check constraint_, that can be used by Spark to improve query performance. Informational constraints are not enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize the query processing. They provide semantics information that allows Catalyst to rewrite queries to eliminate joins, push down aggregates, remove unnecessary Distinct operations, and perform a number of other optimizations. Informational constraints are primarily targeted to applications that load and analyze data that originated from a data warehouse. For such applications, the conditions for a given constraint are known to be true, so the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, constraint validation, and maintenance. The document shows many examples of query performance improvements that utilize referential integrity constraints and can be implemented in Spark.
> Link to the google doc: [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]


