You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/07 02:14:21 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2560: Spark: add spark extension for identifier fields

jackye1995 opened a new pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560


   continuation of #2556 , will need to rebase after that is merged.
   
   Here I propose the following syntax:
   
   ```sql
   ALTER TABLE table REPLACE IDENTIFIER FIELDS (field1, field2, ...)
   
   -- drop all
   ALTER TABLE table DROP IDENTIFIER FIELDS
   ```
   
   `IDENTIFIER` and `FIELDS` are new keywords. Because we already have `FIELD` as a keyword, I think adding `FIELDS` can be beneficial for other batch operations in the future. It also makes the statement more clear instead of saying `ALTER TABLE table REPLACE IDENTIFIER`.
   
   The reason for having another drop statement to drop all is because typically in SQL statements a `( ... )` clause always has at least one element. We can do `ALTER TABLE table REPLACE IDENTIFIER FIELDS ()` instead, but it just feels awkward to me, not sure how other people think about this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-854311244


   @rdblue updated based on the suggestions to use `DROP`. Also fixed a bug along the way regarding nested field in `Schema`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-866283555


   Okay let me create an issue to discuss that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r628676396



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier REPLACE ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #replaceIdentifierFields

Review comment:
       +1 for @yyanyy 's comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r627877475



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -22,10 +22,12 @@
 import java.util.List;
 import java.util.Locale;

Review comment:
       contents in this file will be rebased out.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r629804140



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier REPLACE ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #replaceIdentifierFields

Review comment:
       yeah `SET` makes sense to me. Was trying to reuse `REPLACE` keyword, but it does feel awkward.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r645239265



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -159,7 +161,7 @@ quotedIdentifier
 
 nonReserved
     : ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
-    | DISTRIBUTED | LOCALLY | UNORDERED
+    | DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS | SET | UNSET

Review comment:
       we forgot to add REPLACE in an older PR, so I added it here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637272965



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
+    | ALTER TABLE multipartIdentifier UNSET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #unsetIdentifierFields

Review comment:
       What about `DROP` instead of `UNSET` since this is dropping identifier fields from the set?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637268342



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields

Review comment:
       Should `fields` be a list of `multipartIdentifier` since we support nested fields?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-837702516


   @yyanyy I changed to use `SET` and `UNSET` instead of `REPLACE` and `DROP`, please see it that feels better, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637269199



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -174,15 +176,19 @@ DESC: 'DESC';
 DISTRIBUTED: 'DISTRIBUTED';
 DROP: 'DROP';
 FIELD: 'FIELD';
+FIELDS: 'FIELDS';
 FIRST: 'FIRST';
 LAST: 'LAST';
 LOCALLY: 'LOCALLY';
 NULLS: 'NULLS';
 ORDERED: 'ORDERED';
 PARTITION: 'PARTITION';
 REPLACE: 'REPLACE';
+ROW_IDENTIFIER: 'IDENTIFIER';

Review comment:
       What about naming this `IDENTIFIER_KW` or something so it is more clear that this is `identifier` as a keyword?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637271860



##########
File path: spark3-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/UnsetIdentifierFieldsExec.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.relocated.com.google.common.collect.Sets
+import org.apache.iceberg.spark.source.SparkTable
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.TableCatalog
+
+case class UnsetIdentifierFieldsExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    fields: Seq[String]) extends V2CommandExec {
+  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    catalog.loadTable(ident) match {
+      case iceberg: SparkTable =>
+        val identifierFieldNames = iceberg.table.schema().identifierFieldNames()

Review comment:
       Nit: it's better to omit parens with Java APIs when they would be omitted in a Scala API. Since this is a getter, I would omit the parens on `schema` and `identifierFieldNames`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r629804738



##########
File path: spark3-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ReplaceIdentifierFieldsExec.scala
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.spark.source.SparkTable
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.TableCatalog
+
+case class ReplaceIdentifierFieldsExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    fields: Seq[String]) extends V2CommandExec {
+  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    catalog.loadTable(ident) match {
+      case iceberg: SparkTable =>
+        iceberg.table.updateSchema()
+          .setIdentifierFields(scala.collection.JavaConverters.seqAsJavaList(fields))

Review comment:
       Yeah that was what I was debating about with myself. Technically we can do it, but I have never seen a `(...)` clause with no element in SQL, so I am hesitated to support that.
   
   If we go with the `SET` route, I can add a `UNSET` for removing fields. What do you feel about that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r656520832



##########
File path: spark3-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DropIdentifierFieldsExec.scala
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions
+import org.apache.iceberg.relocated.com.google.common.collect.Sets
+import org.apache.iceberg.spark.source.SparkTable
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.TableCatalog
+
+case class DropIdentifierFieldsExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    fields: Seq[String]) extends V2CommandExec {
+  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    catalog.loadTable(ident) match {
+      case iceberg: SparkTable =>
+        val schema = iceberg.table.schema
+        val identifierFieldNames = Sets.newHashSet(schema.identifierFieldNames)
+
+        for (name <- fields) {
+          Preconditions.checkArgument(schema.findField(name) != null,
+            "Cannot complete drop identifier fields operation: field %s not found", name)

Review comment:
       Nit: you can omit "complete" and "operation" to be more concise. "Cannot drop identifier field: %s (not found)" says basically the same thing.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-854311244


   @rdblue updated based on the suggestions to use `DROP`. Also fixed a bug along the way regarding nested field in `Schema`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r653859623



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields

Review comment:
       > I would make the parens optional
   
   Do you mean to also not have parens optional when there are multiple fields, like:
   
   ```
   SET IDENTIFIER FIELDS first_name, last_name
   ```
   
   Or is it only for the case when there is 1 field?

##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields

Review comment:
       Okay that makes sense. I will update with all the other comments, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637268694



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -159,7 +161,7 @@ quotedIdentifier
 
 nonReserved
     : ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
-    | DISTRIBUTED | LOCALLY | UNORDERED
+    | DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS | SET | UNSET

Review comment:
       Looks like `REPLACE` is here, but should not be?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

openinx commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-834981100


   I would like to invite @aokolnychyi to review this patch !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] yyanyy commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

yyanyy commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r628533052



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier REPLACE ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #replaceIdentifierFields

Review comment:
       do we now have `SET ROW_IDENTIFIER`? if not, wondering if it would seem odd if user can only "replace" even when the table doesn't have row identifiers yet

##########
File path: spark3-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ReplaceIdentifierFieldsExec.scala
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.spark.source.SparkTable
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.TableCatalog
+
+case class ReplaceIdentifierFieldsExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    fields: Seq[String]) extends V2CommandExec {
+  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    catalog.loadTable(ident) match {
+      case iceberg: SparkTable =>
+        iceberg.table.updateSchema()
+          .setIdentifierFields(scala.collection.JavaConverters.seqAsJavaList(fields))

Review comment:
       not familiar with this code base, curious - is it possible that input `fields` is null here and we encounter NPE? If it's nullable and NPE safe, do we want `replace` with empty input to be the same as `drop`?

##########
File path: spark3-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestAlterTableSchema.java
##########
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.extensions;
+
+import java.util.Map;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestAlterTableSchema extends SparkExtensionsTestBase {

Review comment:
       great to see this! 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637269714



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE multipartIdentifier SET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #setIdentifierFields
+    | ALTER TABLE multipartIdentifier UNSET ROW_IDENTIFIER FIELDS '(' fields+=identifier (',' fields+=identifier)* ')' #unsetIdentifierFields

Review comment:
       What is the behavior of `UNSET IDENTIFIER FIELDS`? That isn't obvious to me so I'm wondering if we should change it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r645239265



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -159,7 +161,7 @@ quotedIdentifier
 
 nonReserved
     : ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
-    | DISTRIBUTED | LOCALLY | UNORDERED
+    | DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS | SET | UNSET

Review comment:
       we forgot to add REPLACE in an older PR, so I added it here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-834008832


   @openinx @rdblue @yyanyy could you take a look? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r627877793



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -159,7 +161,7 @@ quotedIdentifier
 
 nonReserved
     : ADD | ALTER | AS | ASC | BY | CALL | DESC | DROP | FIELD | FIRST | LAST | NULLS | ORDERED | PARTITION | TABLE | WRITE
-    | DISTRIBUTED | LOCALLY | UNORDERED
+    | DISTRIBUTED | LOCALLY | UNORDERED | REPLACE | WITH | ROW_IDENTIFIER | FIELDS

Review comment:
       also added missing keywords that I forgot to add in #2365 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-865699352


   @rdblue sorry I was stuck with some internal work and did not get time to update this. the latest code addressed most of the comments.
   
   I have been thinking about the syntax over the last few days, I think I was very biased with the Hive DDL syntax, but as you pointed out, if we prefer the other way, it might be better to instead use a syntax like:
   
   ```sql
   ALTER TABLE table WRITE IDENTIFIED BY first_name,last_name;
   ```
   
   And we don't support any other syntax like add and drop.
   
   This avoids the confusion and it should be clear that we are replacing all the identifier fields. It is closer to the existing syntax of sort order, plus it does not need parentheses. What do you think?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#issuecomment-866275467


   @jackye1995, sorry, I didn't read your last comment before looking at the updates. I approved and merged this.
   
   I like your idea to change this to `WRITE IDENTIFIED BY`. I'd like to hear what @aokolnychyi thinks about it. I'm happy with the current syntax as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r653856249



##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields

Review comment:
       My comment was based on similar reasoning for other DDL. For example, it is better to support both `ADD COLUMNS` and `ADD COLUMN` interchangeably so that users don't get needless syntax errors when they use a plural or singular term with the "wrong" number of columns.
   
   I think that the problem extending that to this is that it isn't clear that `SET IDENTIFIER FIELD x` actually replaces all existing identifier fields, which I think is why you're mentioning not wanting to use the `ADD IDENTIFIER FIELD` syntax. I think that if there is confusion now, then there would be more confusion for users, so maybe it is better to only support `SET IDENTIFIER FIELDS (...)`. Though I would make the parens optional.
   
   > . . . given that all Hive DDLs are in this batch format
   
   I don't think that we want to take inspiration from Hive very much. Better to use Postgres or Trino as reference points.
   
   > If we want to have a way to only add new fields without the need to know current fields, I can also add a ADD IDENTIFIER FIELDS.
   
   This wasn't my intent and I would not want to have confusion, so let's not allow using `FIELD` interchangeably.

##########
File path: spark3-extensions/src/main/antlr/org.apache.spark.sql.catalyst.parser.extensions/IcebergSqlExtensions.g4
##########
@@ -71,6 +71,8 @@ statement
     | ALTER TABLE multipartIdentifier DROP PARTITION FIELD transform                        #dropPartitionField
     | ALTER TABLE multipartIdentifier REPLACE PARTITION FIELD transform WITH transform (AS name=identifier)? #replacePartitionField
     | ALTER TABLE multipartIdentifier WRITE writeSpec                                       #setWriteDistributionAndOrdering
+    | ALTER TABLE table=multipartIdentifier SET IDENTIFIER_KW FIELDS '(' fields+=multipartIdentifier (',' fields+=multipartIdentifier)* ')' #setIdentifierFields

Review comment:
       Yeah, I would make it possible to use that syntax. No need for users to remember that they need parens when we don't actually need them to parse correctly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2560: Spark: add spark extension for identifier fields

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2560:
URL: https://github.com/apache/iceberg/pull/2560#discussion_r637272508



##########
File path: spark3-extensions/src/main/scala/org/apache/spark/sql/execution/datasources/v2/UnsetIdentifierFieldsExec.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import org.apache.iceberg.relocated.com.google.common.collect.Sets
+import org.apache.iceberg.spark.source.SparkTable
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.connector.catalog.Identifier
+import org.apache.spark.sql.connector.catalog.TableCatalog
+
+case class UnsetIdentifierFieldsExec(
+    catalog: TableCatalog,
+    ident: Identifier,
+    fields: Seq[String]) extends V2CommandExec {
+  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+
+  override lazy val output: Seq[Attribute] = Nil
+
+  override protected def run(): Seq[InternalRow] = {
+    catalog.loadTable(ident) match {
+      case iceberg: SparkTable =>
+        val identifierFieldNames = iceberg.table.schema().identifierFieldNames()
+        fields.map(f => identifierFieldNames.remove(f))

Review comment:
       This should not assume that it is safe to modify `identifierFieldNames`. The class is free to return an immutable collection, and arguably should return one. We don't want to rely on the existing behavior.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org