Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/17 08:28:57 UTC

[GitHub] [spark] windpiger commented on issue #18994: [SPARK-21784][SQL] Adds support for defining informational primary key and foreign key constraints using ALTER TABLE DDL.

URL: https://github.com/apache/spark/pull/18994#issuecomment-483989274
 
 
   I think constraints should be designed together with DataSource v2, and they can do more than [SPARK-19842](https://issues.apache.org/jira/browse/SPARK-19842).
   
   Constraints can be used for:
   1. data integrity (not included in [SPARK-19842](https://issues.apache.org/jira/browse/SPARK-19842))
   2. query optimization: the optimizer can use them to rewrite queries for better performance (not just PK/FK; UNIQUE/NOT NULL are also useful)
   
   For data integrity, we have two scenarios:
   1.1 The DataSource natively supports data integrity, such as MySQL/Oracle and so on.
   Spark should only call the read/write API of this DataSource and do nothing about data integrity.
   1.2 The DataSource does not support data integrity, such as csv/json/parquet and so on.
   Spark can provide data integrity for this DataSource like Hive does (maybe with a switch to turn it off), and we can discuss which kinds of constraints to support.
   For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the NOT NULL ENFORCE check is implemented by adding an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan ([HIVE-16605](https://issues.apache.org/jira/browse/HIVE-16605)).
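   As a rough illustration of the Hive-style enforcement above, an engine-side NOT NULL check could look like the following. This is a minimal sketch; `EnforceNotNull` and its method are hypothetical names for illustration, not actual Hive or Spark APIs:

   ```java
   // Hypothetical sketch of enforcing a NOT NULL constraint at write time,
   // analogous to Hive's GenericUDFEnforceNotNullConstraint (HIVE-16605):
   // the check is applied to each value before it reaches the sink, and the
   // write fails fast on a violation.
   public class EnforceNotNull {
       // Returns the value unchanged, or fails the write if it is null.
       public static Object check(Object value, String column) {
           if (value == null) {
               throw new IllegalArgumentException(
                   "NOT NULL constraint violated for column: " + column);
           }
           return value;
       }
   }
   ```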
   
   For optimizer query rewrites:
   2.1 We can add constraint information to the CatalogTable returned by the catalog.getTable API. The optimizer can then use it to rewrite queries.
   2.2 If we cannot get constraint information, we can fall back to SQL hints.
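   To make 2.1 concrete, one classic constraint-driven rewrite is dropping a redundant DISTINCT when the projected columns cover a declared primary/unique key. This is a sketch only; the class and method names are illustrative and not Spark's actual optimizer rules:

   ```java
   import java.util.Set;

   public class ConstraintRewrites {
       // If the projected columns contain a declared unique/primary key,
       // every output row is already distinct, so a DISTINCT over them is
       // redundant and the optimizer may remove it.
       public static boolean distinctIsRedundant(Set<String> projectedColumns,
                                                 Set<String> uniqueKeyColumns) {
           return !uniqueKeyColumns.isEmpty()
               && projectedColumns.containsAll(uniqueKeyColumns);
       }
   }
   ```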
   
   Above all, we can bring the constraint feature into the DataSource v2 design:
   a) to support feature 2.1, we can add constraint information to the createTable/alterTable/getTable APIs in this SPIP (https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#)
   b) to support data integrity, we can add a ConstraintSupport mix-in for DataSource v2:
   if a DataSource supports constraints, then Spark does nothing when inserting data;
   if a DataSource does not support constraints but still wants constraint checks, then Spark should do the check like Hive does (e.g. for NOT NULL, Hive adds an extra UDF, GenericUDFEnforceNotNullConstraint, to the plan);
   if a DataSource does not support constraints and does not want constraint checks, then Spark does nothing.
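   The three cases above could be expressed as a mix-in roughly like this. This is a sketch: `ConstraintSupport` and its methods are hypothetical names, not part of the actual DataSource v2 API or the SPIP:

   ```java
   // Hypothetical ConstraintSupport mix-in for a DataSource v2 table.
   public interface ConstraintSupport {
       // The source enforces constraints natively (e.g. MySQL/Oracle),
       // so Spark does nothing extra on insert.
       boolean supportsConstraints();

       // The source cannot enforce constraints itself but asks the engine
       // to check on its behalf (e.g. csv/json/parquet with checks enabled).
       boolean wantsEngineSideChecks();

       // Decides who performs the constraint check for an insert,
       // mirroring the three cases above.
       static String checker(ConstraintSupport source) {
           if (source.supportsConstraints()) return "source";
           if (source.wantsEngineSideChecks()) return "engine";
           return "none";
       }
   }
   ```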
   
   The Hive catalog supports constraints, so we can implement this logic in its createTable/alterTable APIs. Then we can use Spark SQL DDL to create a table with constraints, which are stored to the HiveMetaStore via the Hive catalog API.
   For example: `CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 PRIMARY KEY (a) DISABLE) USING parquet;`
   
   **_As for how to store constraints_**: because Hive 2.1 provides a constraint API in Hive.java, we can call it directly in the createTable/alterTable APIs of the Hive catalog. There is no need for Spark to store this constraint information in table properties. There are some concerns about using the Hive 2.1 catalog API directly in the docs (https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9), such as Spark's built-in Hive being 1.2.1, but the upgrade of Hive to 2.3.4 is in progress ([SPARK-23710](https://issues.apache.org/jira/browse/SPARK-23710)).
   
   @cloud-fan @gatorsmile @sureshthalamati @ioana-delaney
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org