Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/15 01:34:32 UTC

[GitHub] [iceberg] karuppayya opened a new pull request #1933: Refactor Spark DF read and Write options

karuppayya opened a new pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933


   Refactor Spark DF read and Write options


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
karuppayya commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-748298442


   @holdenk Thanks for the feedback. I can add a column to the Spark configs in site/docs/configuration.md. Would that work?




[GitHub] [iceberg] rdblue commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r546858248



##########
File path: site/docs/spark.md
##########
@@ -513,6 +535,27 @@ data.write
 
 The behavior of DataFrameWriter overwrite mode was undefined in Spark 2.4, but is required to overwrite the entire table in Spark 3. Because of this new requirement, the Iceberg source's behavior changed in Spark 3. In Spark 2.4, the behavior was to dynamically overwrite partitions. To use the Spark 2.4 behavior, add option `overwrite-mode=dynamic`.
 
+**Note**: Dataframe write options are available as static constants in [SparkWriteOptions](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/types/SparkWriteOptions.java) class.
+
+| Spark write option           | Constant                    |
+| ---------------------- | -------------------------- |
+| write-format           | WRITE_FORMAT |
+| target-file-size-bytes | TARGET_FILE_SIZE_BYTES      |
+| check-nullability      | CHECK_NULLABILITY                       |
+| snapshot-property._custom-key_    | SNAPSHOT_PROPERTY_PREFIX._custom-key_            |
+| fanout-enabled       | FANOUT_ENABLED        |
+| check-ordering       | CHECK_ORDERING        |
+
+Usage:
+```
+ex: spark

Review comment:
       Same comment here as above.






[GitHub] [iceberg] rdblue commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r546858189



##########
File path: site/docs/spark.md
##########
@@ -347,6 +347,28 @@ df.createOrReplaceTempView("table")
 spark.sql("""select count(1) from table""").show()
 ```
 
+**Note**: Dataframe read options are available as static constants in [SparkReadOptions](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/types/SparkReadOptions.java) class.
+
+| Spark Read option    | Constant               |
+| --------------- | --------------------- |
+| snapshot-id     | SNAPSHOT_ID              |
+| as-of-timestamp | AS_OF_TIMESTAMP              |
+| split-size      | SPLIT_SIZE |
+| lookback        | LOOKBACK |
+| file-open-cost  | FILE_OPEN_COST |
+| vectorization-enabled  | VECTORIZATION_ENABLED |
+| batch-size  | VECTORIZATION_BATCH_SIZE |

Review comment:
       I'm not sure that we want to encourage users to refer to constants in Iceberg. What do you think, @aokolnychyi, @holdenk, @RussellSpitzer?






[GitHub] [iceberg] RussellSpitzer commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-748302457


   > @holdenk Thanks for the feedback. I can add a column to the Spark configs in site/docs/configuration.md. Would that work?
   
   I think you may want to add these to the Spark section directly, since they don't apply to configuration generally.




[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r543263516



##########
File path: spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
##########
@@ -107,7 +108,7 @@
       boolean caseSensitive, DataSourceOptions options) {
     this.table = table;
     this.snapshotId = options.get("snapshot-id").map(Long::parseLong).orElse(null);

Review comment:
       Shall we use `SparkReadOptions.SNAPSHOT_ID` here?
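
A minimal sketch of what that substitution could look like. To keep it runnable without Spark on the classpath, `lookup` below is a hypothetical stand-in for `DataSourceOptions.get`, which returns an `Optional<String>`. `ReadOptionNames` and `SnapshotIdLookup` are likewise illustrative stand-ins for `SparkReadOptions` and the reader, not Iceberg code.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for SparkReadOptions; only the constant used here is shown.
final class ReadOptionNames {
  static final String SNAPSHOT_ID = "snapshot-id";

  private ReadOptionNames() {
  }
}

final class SnapshotIdLookup {
  private SnapshotIdLookup() {
  }

  // Stand-in for DataSourceOptions.get(key): empty when the option was not set.
  static Optional<String> lookup(Map<String, String> options, String key) {
    return Optional.ofNullable(options.get(key));
  }

  // Same shape as the line under review, with the string literal replaced by the constant.
  static Long snapshotId(Map<String, String> options) {
    return lookup(options, ReadOptionNames.SNAPSHOT_ID).map(Long::parseLong).orElse(null);
  }

  public static void main(String[] args) {
    // Option present: parsed to a Long. Option absent: null, as in the original line.
    System.out.println(snapshotId(Collections.singletonMap("snapshot-id", "42")));
    System.out.println(snapshotId(Collections.emptyMap()));
  }
}
```

The behavior is unchanged; the only difference is that a misspelled constant name fails at compile time, whereas a misspelled literal would silently fall through to `orElse(null)`.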






[GitHub] [iceberg] rdblue commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-749122909


   Looks fine to me. Just a question for other people on the docs and a couple of nits in the docs.




[GitHub] [iceberg] karuppayya commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
karuppayya commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r546922331



##########
File path: site/docs/spark.md
##########
@@ -513,6 +535,27 @@ data.write
 
 The behavior of DataFrameWriter overwrite mode was undefined in Spark 2.4, but is required to overwrite the entire table in Spark 3. Because of this new requirement, the Iceberg source's behavior changed in Spark 3. In Spark 2.4, the behavior was to dynamically overwrite partitions. To use the Spark 2.4 behavior, add option `overwrite-mode=dynamic`.
 
+**Note**: Dataframe write options are available as static constants in [SparkWriteOptions](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/types/SparkWriteOptions.java) class.
+
+| Spark write option           | Constant                    |
+| ---------------------- | -------------------------- |
+| write-format           | WRITE_FORMAT |
+| target-file-size-bytes | TARGET_FILE_SIZE_BYTES      |
+| check-nullability      | CHECK_NULLABILITY                       |
+| snapshot-property._custom-key_    | SNAPSHOT_PROPERTY_PREFIX._custom-key_            |
+| fanout-enabled       | FANOUT_ENABLED        |
+| check-ordering       | CHECK_ORDERING        |
+
+Usage:
+```
+ex: spark

Review comment:
       done






[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r543262780



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkReadOptions.java
##########
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark;
+
+/**
+ * Spark DF read options
+ */
+public class SparkReadOptions {
+
+  private SparkReadOptions() {
+  }
+
+  // Snapshot ID of the table snapshot to read
+  public static final String SNAPSHOT_ID = "snapshot-id";

Review comment:
       There are also more read options. Why not include all of them?
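
For reference, a sketch of what a fuller constants class might look like, using the option strings and constant names from the read-options table proposed for `site/docs/spark.md` in this PR. The class name `ReadOptionsSketch` and the `all()` helper are hypothetical, and apart from `SNAPSHOT_ID` the comments are paraphrased from the option names rather than copied from the source. The set of options actually merged may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; option strings and constant names come from the docs table in this PR.
final class ReadOptionsSketch {

  private ReadOptionsSketch() {
  }

  // Snapshot ID of the table snapshot to read
  static final String SNAPSHOT_ID = "snapshot-id";

  // Timestamp used to select the snapshot to read
  static final String AS_OF_TIMESTAMP = "as-of-timestamp";

  // Split size used when planning the scan
  static final String SPLIT_SIZE = "split-size";

  // Lookback used when combining splits
  static final String LOOKBACK = "lookback";

  // Estimated cost of opening a file, used when combining splits
  static final String FILE_OPEN_COST = "file-open-cost";

  // Whether vectorized reads are enabled
  static final String VECTORIZATION_ENABLED = "vectorization-enabled";

  // Batch size for vectorized reads
  static final String VECTORIZATION_BATCH_SIZE = "batch-size";

  // Illustrative helper (not part of the PR): lists every option name defined above.
  static List<String> all() {
    return Arrays.asList(
        SNAPSHOT_ID, AS_OF_TIMESTAMP, SPLIT_SIZE, LOOKBACK,
        FILE_OPEN_COST, VECTORIZATION_ENABLED, VECTORIZATION_BATCH_SIZE);
  }

  public static void main(String[] args) {
    all().forEach(System.out::println);
  }
}
```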






[GitHub] [iceberg] holdenk commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-748277392


   I like the idea of this change. Would it make sense to update the documentation as well, so folks can see how to use it? Being less prone to typos in the data read/write config path is <3 <3




[GitHub] [iceberg] rdblue commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r546857821



##########
File path: site/docs/spark.md
##########
@@ -347,6 +347,28 @@ df.createOrReplaceTempView("table")
 spark.sql("""select count(1) from table""").show()
 ```
 
+**Note**: Dataframe read options are available as static constants in [SparkReadOptions](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/types/SparkReadOptions.java) class.
+
+| Spark Read option    | Constant               |
+| --------------- | --------------------- |
+| snapshot-id     | SNAPSHOT_ID              |
+| as-of-timestamp | AS_OF_TIMESTAMP              |
+| split-size      | SPLIT_SIZE |
+| lookback        | LOOKBACK |
+| file-open-cost  | FILE_OPEN_COST |
+| vectorization-enabled  | VECTORIZATION_ENABLED |
+| batch-size  | VECTORIZATION_BATCH_SIZE |
+
+Usage:
+```
+ex: spark

Review comment:
       Can you remove "ex:" here and add a language to the block so that it is highlighted correctly?






[GitHub] [iceberg] rdblue commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-751876351


   Thanks, @karuppayya! Because there is still open discussion on whether we will recommend using the Iceberg constants, I've removed those docs from this PR and committed the rest.




[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#discussion_r543262570



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java
##########
@@ -29,4 +29,7 @@ private SparkWriteOptions() {
 
   // Overrides table property write.spark.fanout.enabled(default: false)
   public static final String FANOUT_ENABLED = "fanout-enabled";
+
+  // Fileformat for write operations
+  public static final String WRITE_FORMAT = "write-format";

Review comment:
       I feel like there are more write properties. Is there a reason we don't include others?
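
A sketch of a fuller write-options class, using the names from the write-options table proposed for the docs in this PR. `WriteOptionsSketch` and the `snapshotProperty` helper are hypothetical. The value of `SNAPSHOT_PROPERTY_PREFIX` is an assumption inferred from the `snapshot-property._custom-key_` row of that table, so check the merged class for the actual definitions.

```java
// Hypothetical sketch; option strings and constant names come from the docs table in this PR.
final class WriteOptionsSketch {

  private WriteOptionsSketch() {
  }

  // File format for write operations
  static final String WRITE_FORMAT = "write-format";

  // Target size of written data files, in bytes
  static final String TARGET_FILE_SIZE_BYTES = "target-file-size-bytes";

  // Whether to check field nullability on write
  static final String CHECK_NULLABILITY = "check-nullability";

  // Overrides table property write.spark.fanout.enabled (default: false)
  static final String FANOUT_ENABLED = "fanout-enabled";

  // Whether to check field ordering on write
  static final String CHECK_ORDERING = "check-ordering";

  // ASSUMPTION: prefix value inferred from the docs table row, not taken from the merged code.
  static final String SNAPSHOT_PROPERTY_PREFIX = "snapshot-property";

  // Illustrative helper (not part of the PR): builds the option name for a custom snapshot property.
  static String snapshotProperty(String customKey) {
    return SNAPSHOT_PROPERTY_PREFIX + "." + customKey;
  }

  public static void main(String[] args) {
    System.out.println(WRITE_FORMAT);
    System.out.println(snapshotProperty("job-id"));
  }
}
```

Unlike the other options, the snapshot-property entry is a prefix rather than a complete key, which is why it needs a composition step such as the helper above.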






[GitHub] [iceberg] rdblue merged pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933


   




[GitHub] [iceberg] karuppayya commented on pull request #1933: Refactor Spark DF read and Write options

Posted by GitBox <gi...@apache.org>.
karuppayya commented on pull request #1933:
URL: https://github.com/apache/iceberg/pull/1933#issuecomment-748389578


   > > @holdenk Thanks for the feedback. I can add a column to the Spark configs in site/docs/configuration.md. Would that work?
   > 
   > I think you may want to add these to the Spark section directly, since they don't apply to configuration generally.
   
   I have added it as a "Note". I'm open to other suggestions for how to introduce it in the docs.

