You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/30 12:10:26 UTC

[GitHub] [iceberg] openinx opened a new pull request #2010: Core: Add primary key spec.

openinx opened a new pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-806579536


   Close this now because we've had a PR #2354 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ismailsimsek edited a comment on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
ismailsimsek edited a comment on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-760880495


   > One thing in the discussion was whether we should call this `primary key` or something like a `default equality key`. Saying primary key would make people feel the uniqueness is enforced, but it actually cannot enforce it. I don't have a strong opinion on this, but I constantly hear customers using Redshift complaining about this. Also, once we introduce primary key, there will definitely be an ask to have a foreign key concept. Just something to think about if we decide to go with this approach.
   
   +1 for having `primary key` without enforcement. Most of the Analytic databases are supporting informational `Primary Key` and `Foreign Key` Definitions without enforcing it, to use it as metadata. ex [snowflake](https://docs.snowflake.com/en/sql-reference/constraints-overview.html)
   
   Vertica has good balance between both it enforces/checks [constraints during the read,join](http://vertica-forums.com/viewtopic.php?t=101#p143) and not checks it during insert.
   
   it could be good feature to have [ANALYZE_CONSTRAINTS](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_CONSTRAINTS.htm) command with it user can validate constraint manually on a table. for-example after insert.
   
   in-case of enforcement it could be good idea to have flag on table level constraints ENABLED/DISABLED
   not sure if this would be possible (for small and less frequently changing tables)
   1. append data to a table
   2. if `constraints_enabled` check constraint before commit!
      - commit or rollback
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ismailsimsek edited a comment on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
ismailsimsek edited a comment on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-760880495


   > One thing in the discussion was whether we should call this `primary key` or something like a `default equality key`. Saying primary key would make people feel the uniqueness is enforced, but it actually cannot enforce it. I don't have a strong opinion on this, but I constantly hear customers using Redshift complaining about this. Also, once we introduce primary key, there will definitely be an ask to have a foreign key concept. Just something to think about if we decide to go with this approach.
   
   +1 for having `primary key` without enforcement. Most of the Analytic databases are supporting informational `Primary Key` and `Foreign Key` Definitions without enforcing it, to use it as metadata. ex [snowflake](https://docs.snowflake.com/en/sql-reference/constraints-overview.html)
   
   Vertica has good balance between both it enforces/checks [constraints during the read,join](http://vertica-forums.com/viewtopic.php?t=101#p143) and not checks it during insert.
   
   it could be good feature to have [ANALYZE_CONSTRAINTS](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_CONSTRAINTS.htm) command which the user can validate constraint manually on a table. for-example after insert.
   
   in-case of enforcement it could be good idea to have flag on table level constraints ENABLED/DISABLED
   not sure if this would be possible (for small and less frequently changing tables)
   1. append data to a table
   2. if `constraints_enabled` check constraint before commit!
      - commit or rollback
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx closed pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx closed pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#discussion_r551184526



##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final boolean enforceUniqueness;
+  private final Integer[] sourceIds;

Review comment:
       I saw there'is customized `Transform` for each fields in `PartitionSpec` and `SortOrder`.  The `PartitionSpec` needs it because of  hidden partition feature, the `SortOder` need it because of `NullOrder` to determinate whether null should be sort first or last. 
   
   For primary key,  we don't have null columns and I did not think of other reasons that we will need an extra `Transform`,  so I just add the `sourceId` here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#discussion_r551179592



##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());

Review comment:
       `0` is reserved for `NON_PRIMARY_KEY`,  the newly created primary key's key id will start from 1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-800824166


   @jackye1995   Looks great !   Then I will close this PR if you publish it,  will be glad to be a reviewer for your patch, thanks ! 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#discussion_r551175643



##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final boolean enforceUniqueness;
+  private final Integer[] sourceIds;
+
+  private transient volatile List<Integer> sourceIdList;
+
+  private PrimaryKey(Schema schema, int keyId, boolean enforceUniqueness, List<Integer> sourceIds) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.enforceUniqueness = enforceUniqueness;
+    this.sourceIds = sourceIds.toArray(new Integer[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} for this primary key.
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Returns this ID of this primary key.
+   */
+  public int keyId() {
+    return keyId;
+  }
+
+  /**
+   * Returns true if the uniqueness should be guaranteed when writing iceberg table.
+   */
+  public boolean enforceUniqueness() {
+    return enforceUniqueness;
+  }
+
+  /**
+   * Returns the list of source field ids for this primary key.
+   */
+  public List<Integer> sourceIds() {
+    if (sourceIdList == null) {
+      synchronized (this) {
+        if (sourceIdList == null) {
+          this.sourceIdList = ImmutableList.copyOf(sourceIds);
+        }
+      }
+    }
+    return sourceIdList;
+  }
+
+  /**
+   * Returns true if the primary key has no column.
+   */
+  public boolean isNonPrimaryKey() {
+    return sourceIds.length == 0;
+  }
+
+  /**
+   * Returns a dummy primary key that has no column.
+   */
+  public static PrimaryKey nonPrimaryKey() {
+    return NON_PRIMARY_KEY;
+  }
+
+  /**
+   * Checks whether this primary key is equivalent to another primary key while ignoring the primary key id.
+   *
+   * @param other a different primary key.
+   * @return true if this key is equivalent to the given key.
+   */
+  public boolean samePrimaryKey(PrimaryKey other) {
+    return Arrays.equals(sourceIds, other.sourceIds) && enforceUniqueness == other.enforceUniqueness;
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof PrimaryKey)) {
+      return false;
+    }
+
+    PrimaryKey that = (PrimaryKey) other;
+    if (this.keyId != that.keyId) {
+      return false;
+    }
+
+    if (this.enforceUniqueness != that.enforceUniqueness) {
+      return false;
+    }
+
+    return Arrays.equals(sourceIds, that.sourceIds);
+  }
+
+  @Override
+  public int hashCode() {
+    int hash = 31 * keyId;
+    hash = hash + (enforceUniqueness ? 1 : 0);
+    hash += Arrays.hashCode(sourceIds);
+    return hash;
+  }
+
+  @Override
+  public String toString() {
+    return MoreObjects.toStringHelper(this)
+        .add("keyId", keyId)
+        .add("enforceUniqueness", enforceUniqueness)
+        .add("sourceIds", sourceIds())
+        .toString();
+  }
+
+  /**
+   * Creates a new {@link Builder primary key builder} for the given {@link Schema}.
+   *
+   * @param schema a schema
+   * @return a primary key builder for the given schema.
+   */
+  public static Builder builderFor(Schema schema) {
+    return new Builder(schema);
+  }
+
+  /**
+   * A builder to create valid {@link PrimaryKey primary keys}. Call {@link #builderFor(Schema)} to create a new
+   * builder.
+   */
+  public static class Builder {
+    private final Schema schema;
+    private final List<Integer> sourceIds = Lists.newArrayList();
+    // Default ID to 1 as 0 is reserved for non primary key.
+    private int keyId = 1;
+    private boolean enforceUniqueness = false;
+
+    private Builder(Schema schema) {
+      this.schema = schema;
+    }
+
+    public Builder withKeyId(int newKeyId) {
+      this.keyId = newKeyId;
+      return this;
+    }
+
+    public Builder withEnforceUniqueness(boolean enable) {
+      this.enforceUniqueness = enable;
+      return this;
+    }
+
+    public Builder addField(String name) {
+      Types.NestedField column = schema.findField(name);
+
+      Preconditions.checkNotNull(column, "Cannot find source column: %s", name);
+      Preconditions.checkArgument(column.isRequired(), "Cannot add optional source field to primary key: %s", name);
+
+      Type sourceType = column.type();
+      ValidationException.check(sourceType.isPrimitiveType(), "Cannot add non-primitive field: %s", sourceType);
+
+      sourceIds.add(column.fieldId());
+      return this;
+    }
+
+    public Builder addField(int sourceId) {
+      Types.NestedField column = schema.findField(sourceId);
+      Preconditions.checkNotNull(column, "Cannot find source column: %s", sourceId);
+      Preconditions.checkArgument(column.isRequired(), "Cannot add optional source field to primary key: %s", sourceId);

Review comment:
       Other databases (such as MySQL) also do not allow to add `nullable` columns in the primary keys.  see https://dev.mysql.com/doc/refman/8.0/en/create-table.html: 
   
   > A unique index where all key columns must be defined as NOT NULL. If they are not explicitly declared as NOT NULL, MySQL declares them so implicitly (and silently). A table can have only one PRIMARY KEY. The name of a PRIMARY KEY is always PRIMARY, which thus cannot be used as the name for any other kind of index. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ismailsimsek edited a comment on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
ismailsimsek edited a comment on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-760880495


   > One thing in the discussion was whether we should call this `primary key` or something like a `default equality key`. Saying primary key would make people feel the uniqueness is enforced, but it actually cannot enforce it. I don't have a strong opinion on this, but I constantly hear customers using Redshift complaining about this. Also, once we introduce primary key, there will definitely be an ask to have a foreign key concept. Just something to think about if we decide to go with this approach.
   
   +1 for having `primary key` without enforcement. Most of the Analytic databases are supporting informational `Primary Key` and `Foreign Key` Definitions without enforcing it, to use it as metadata. ex [snowflake](https://docs.snowflake.com/en/sql-reference/constraints-overview.html)
   
   Vertica has good balance between both it enforces/checks [constraints during the read,join](http://vertica-forums.com/viewtopic.php?t=101#p143) and not checks it during insert.
   
   it could be good feature to have [ANALYZE_CONSTRAINTS](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_CONSTRAINTS.htm) command with it user can validate constraint manually on a table. for-example after insert.
   
   in-case of enforcement it could be good idea to have flag on table level constraints ENABLED/DISABLED
   if its possible something like (for small and less frequently changing tables)
   1. append data to a table
   2. if `constraints_enabled` check constraint before commit!
      - commit or rollback
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-781105164


   @openinx happy new year, any update for this? I think we can derive the row id feature for format v2 based on this PR. I can raise another one if you don't have time to do it here. And @rdblue any thoughts on this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#discussion_r556878173



##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Objects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final boolean enforceUniqueness;
+  private final Integer[] sourceIds;
+
+  private transient volatile List<Integer> sourceIdList;
+
+  private PrimaryKey(Schema schema, int keyId, boolean enforceUniqueness, List<Integer> sourceIds) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.enforceUniqueness = enforceUniqueness;

Review comment:
       I don't think we should add have `enforceUniqueness` in Iceberg API, because even if a user specifies primary key must be enforced,  it is hard to enforce it at engine level. Primary key should always be enforced at "best effort". If Flink can enforce it through upsert, I feel it should be a flag in Flink to do so.

##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopTables.java
##########
@@ -125,17 +126,21 @@ private Table loadMetadataTable(String location, String metadataTableName, Metad
    * Create a table using the FileSystem implementation resolve from
    * location.
    *
-   * @param schema iceberg schema used to create the table
-   * @param spec partitioning spec, if null the table will be unpartitioned
+   * @param schema     iceberg schema used to create the table

Review comment:
       nit: unnecessary change in spacing

##########
File path: core/src/main/java/org/apache/iceberg/TableMetadataParser.java
##########
@@ -242,6 +253,7 @@ public static TableMetadata read(FileIO io, InputFile file) {
     }
   }
 
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")

Review comment:
       should probably refactor the method into methods like `parsePartitionSpecs`, `parseSortOrders` and `parsePrimaryKeys`, but it is not urgent, we can do it in another PR.

##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());
+
+  private final Schema schema;
+  private final int keyId;
+  private final boolean enforceUniqueness;
+  private final Integer[] sourceIds;
+
+  private transient volatile List<Integer> sourceIdList;
+
+  private PrimaryKey(Schema schema, int keyId, boolean enforceUniqueness, List<Integer> sourceIds) {
+    this.schema = schema;
+    this.keyId = keyId;
+    this.enforceUniqueness = enforceUniqueness;
+    this.sourceIds = sourceIds.toArray(new Integer[0]);
+  }
+
+  /**
+   * Returns the {@link Schema} for this primary key.
+   */
+  public Schema schema() {
+    return schema;
+  }
+
+  /**
+   * Returns this ID of this primary key.
+   */
+  public int keyId() {
+    return keyId;
+  }
+
+  /**
+   * Returns true if the uniqueness should be guaranteed when writing iceberg table.
+   */
+  public boolean enforceUniqueness() {
+    return enforceUniqueness;
+  }
+
+  /**
+   * Returns the list of source field ids for this primary key.
+   */
+  public List<Integer> sourceIds() {
+    if (sourceIdList == null) {
+      synchronized (this) {
+        if (sourceIdList == null) {
+          this.sourceIdList = ImmutableList.copyOf(sourceIds);
+        }
+      }
+    }
+    return sourceIdList;
+  }
+
+  /**
+   * Returns true if the primary key has no column.
+   */
+  public boolean isNonPrimaryKey() {
+    return sourceIds.length == 0;
+  }
+
+  /**
+   * Returns a dummy primary key that has no column.
+   */
+  public static PrimaryKey nonPrimaryKey() {
+    return NON_PRIMARY_KEY;
+  }
+
+  /**
+   * Checks whether this primary key is equivalent to another primary key while ignoring the primary key id.
+   *
+   * @param other a different primary key.
+   * @return true if this key is equivalent to the given key.
+   */
+  public boolean samePrimaryKey(PrimaryKey other) {
+    return Arrays.equals(sourceIds, other.sourceIds) && enforceUniqueness == other.enforceUniqueness;
+  }
+
+  @Override
+  public boolean equals(Object other) {
+    if (this == other) {
+      return true;
+    } else if (!(other instanceof PrimaryKey)) {
+      return false;
+    }
+
+    PrimaryKey that = (PrimaryKey) other;
+    if (this.keyId != that.keyId) {
+      return false;
+    }
+
+    if (this.enforceUniqueness != that.enforceUniqueness) {
+      return false;
+    }
+
+    return Arrays.equals(sourceIds, that.sourceIds);
+  }
+
+  @Override
+  public int hashCode() {
+    int hash = 31 * keyId;
+    hash = hash + (enforceUniqueness ? 1 : 0);
+    hash += Arrays.hashCode(sourceIds);
+    return hash;
+  }
+
+  @Override
+  public String toString() {
+    return MoreObjects.toStringHelper(this)
+        .add("keyId", keyId)
+        .add("enforceUniqueness", enforceUniqueness)
+        .add("sourceIds", sourceIds())
+        .toString();
+  }
+
+  /**
+   * Creates a new {@link Builder primary key builder} for the given {@link Schema}.
+   *
+   * @param schema a schema
+   * @return a primary key builder for the given schema.
+   */
+  public static Builder builderFor(Schema schema) {
+    return new Builder(schema);
+  }
+
+  /**
+   * A builder to create valid {@link PrimaryKey primary keys}. Call {@link #builderFor(Schema)} to create a new
+   * builder.
+   */
+  public static class Builder {
+    private final Schema schema;
+    private final List<Integer> sourceIds = Lists.newArrayList();
+    // Default ID to 1 as 0 is reserved for non primary key.
+    private int keyId = 1;
+    private boolean enforceUniqueness = false;
+
+    private Builder(Schema schema) {
+      this.schema = schema;
+    }
+
+    public Builder withKeyId(int newKeyId) {
+      this.keyId = newKeyId;
+      return this;
+    }
+
+    public Builder withEnforceUniqueness(boolean enable) {
+      this.enforceUniqueness = enable;
+      return this;
+    }
+
+    public Builder addField(String name) {
+      Types.NestedField column = schema.findField(name);
+
+      Preconditions.checkNotNull(column, "Cannot find source column: %s", name);
+      Preconditions.checkArgument(column.isRequired(), "Cannot add optional source field to primary key: %s", name);
+
+      Type sourceType = column.type();
+      ValidationException.check(sourceType.isPrimitiveType(), "Cannot add non-primitive field: %s", sourceType);
+
+      sourceIds.add(column.fieldId());
+      return this;
+    }
+
+    public Builder addField(int sourceId) {
+      Types.NestedField column = schema.findField(sourceId);
+      Preconditions.checkNotNull(column, "Cannot find source column: %s", sourceId);
+      Preconditions.checkArgument(column.isRequired(), "Cannot add optional source field to primary key: %s", sourceId);

Review comment:
       agree, primary key should be non-null.
   nit: duplicated logic with L183, can you refactor the checks?

##########
File path: api/src/main/java/org/apache/iceberg/PrimaryKey.java
##########
@@ -0,0 +1,221 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+
+/**
+ * A primary key that defines which columns will be unique in this table.
+ */
+public class PrimaryKey implements Serializable {
+
+  private static final PrimaryKey NON_PRIMARY_KEY = new PrimaryKey(null, 0, false, ImmutableList.of());

Review comment:
       To be consistent with `PartitionSpec` and `SortOrder`, better to use `new PrimaryKey(new Schema(), ...)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] XuQianJin-Stars commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
XuQianJin-Stars commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-805465979


   hi @jackye1995 What is the status of this primary key spec?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
openinx commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-800769586


   Okay,  after reconsidered the primary key uniqueness issues,  it's hard to guarantee the uniqueness in an embedded table format lib,  If both the spark job and flink streaming job are writing the same iceberg table I couldn't think of a good and efficient way to guarantee the uniqueness of primary key.  If we have an online server in front of those data files, then it's will be easy to guarantee the uniqueness because all of the write requests will be send to the same online server and the server could decide how to reject those duplicated write request, while for an iceberg table format it's hard to synchronize between different computation job. 
   
   So I'm fine to introduce the primary key without enforced uniqueness.  @jackye1995   Did you start this work in your repo ? Should we update this PR based the above discussion ?  ( Sorry about the delay).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ismailsimsek commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
ismailsimsek commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-760880495


   > One thing in the discussion was whether we should call this `primary key` or something like a `default equality key`. Saying primary key would make people feel the uniqueness is enforced, but it actually cannot enforce it. I don't have a strong opinion on this, but I constantly hear customers using Redshift complaining about this. Also, once we introduce primary key, there will definitely be an ask to have a foreign key concept. Just something to think about if we decide to go with this approach.
   
   +1 for having `primary key` without enforcement. Most of the Analytic databases are supporting informational `Primary Key` and `Foreign Key` Definitions without enforcing it, to use it as metadata. ex [snowflake](https://docs.snowflake.com/en/sql-reference/constraints-overview.html)
   
   Vertica has good balance between both it enforces/checks [constraints during the read,join](http://vertica-forums.com/viewtopic.php?t=101#p143) and not checks it during insert.
   
   it could be good feature to have [ANALYZE_CONSTRAINTS](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_CONSTRAINTS.htm) command which the user can validate constraint manually on a table. for-example after insert.
   
   in-case of enforcement it could be good idea to have flag on table level constraints ENABLED/DISABLED


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-800783133


   thanks for the reply! yup I have a version locally that I am working on, which includes changes similar to this PR without the uniqueness constraint. I will publish it tomorrow or the day after tomorrow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2010: Core: Add primary key spec.

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2010:
URL: https://github.com/apache/iceberg/pull/2010#issuecomment-805470676


   @XuQianJin-Stars #2354 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org