You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/04 17:38:50 UTC

[GitHub] [iceberg] karuppayya opened a new pull request #1877: Rename fanout configs

karuppayya opened a new pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877


   Renaming the fanout configs.
   1. Renamed DF write option from `partitioned.fanout.enabled` to `fanout-enabled`(similar to other Spark options)
   2. Renamed TableProperty `write.partitioned.fanout.enabled` to `write.spark.partitioned.fanout.enabled`, as it applies to only Spark.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-744206842


   Raised a PR #1929 for documenting fanout.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739848182


   @HeartSaVioR, do you happen to have a link to the open comment from the previous PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739850438


   I am not against having a table property but I think we should clearly document the concern of enabling this and discourage using fanout writers in batch jobs. The cost of a local sort is minimal and Spark does this implicitly in batch jobs for built-in V1 tables.
   
   I think the documentation should be clear about the potential consequences and give a good example.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536472489



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java
##########
@@ -67,7 +68,7 @@
 
   @Parameterized.Parameters(name = "format = {0}")
   public static Object[] parameters() {
-    return new Object[] { "parquet", "avro", "orc" };
+    return new Object[] { "parquet", "avro", "orc"};

Review comment:
       Nit: can you revert the whitespace change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739883152


   @aokolnychyi 
   Here it is. https://github.com/apache/iceberg/pull/1774#issuecomment-728736862
   
   I agree we could probably discourage using the option in doc. Probably they would only want to set the table property when their table is only written by streaming query.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739122730


   If you don't mind, this might be a good chance to update the guide on partitioned write in Spark: https://github.com/apache/iceberg/blob/master/site/docs/spark.md#writing-against-partitioned-table
   
   Fanout writer can be an alternative to explicit sort, hence would be better to add it there. If you'd just like to finalize your work with the code change, I can take a step for that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739426240


   Thanks @karuppayya! Looks good so I merged it.
   
   @HeartSaVioR, let's update the docs separately. I agree that it would probably be helpful, but I think we want to be careful about how we document this. This is primarily for streaming and I would not recommend people avoid adding a sort for batch work. If you're interested in working on the docs, that would be great! Thank you!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] HeartSaVioR commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739437218


   Yeah I'm happy to document it. Thanks!
   
   One thing I wonder is how much it is helpful or even it hurts to have table property for this, as we are in consensus that this might be dangerous on batch query if they don't notice the behavior (depending on cardinality of partitions). I also commented on previous PR and it got merged without answering it. For now we don't explain what is fanout writer in the doc, so they have no idea and fear to enable this, but once we document the behavior without proper warn, they may be misunderstanding the behavior as good for all cases and update the table property. (I guess you're concerning about documenting this due to this, do I understand correctly?)
   
   Personally I'm not 100% sure we'd like to add table property which is only good for specific workload, but at least we could document this with warning that this opens files for cardinality of partitions in the data in each task, so only recommended to use it in streaming write (not mentioning table property here). WDYT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
karuppayya commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536484938



##########
File path: site/docs/configuration.md
##########
@@ -156,5 +156,5 @@ df.write
 | target-file-size-bytes | As per table property      | Overrides this table's write.target-file-size-bytes          |
 | check-nullability      | true                       | Sets the nullable check on fields                            |
 | snapshot-property._custom-key_    | null            | Adds an entry with custom-key and corresponding value in the snapshot summary  |
-| partitioned.fanout.enabled       | false        | Overrides this table's write.partitioned.fanout.enabled  |
+| fanout-enabled       | false        | Overrides this table's write.spark.partitioned.fanout.enabled  |

Review comment:
       @rdblue, I had a chat with @aokolnychyi earlier today and  I got to know the rationale behind adding `spark`.
   I will rename it to `write.spark.fanout.enabled`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r537418909



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java
##########
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark;
+
+/**
+ * Spark DF write options
+ */
+public class SparkWriteOptions {

Review comment:
       @rdblue @karuppayya, shall we add `SparkReadOptions` and move all read and write options to those classes? I think we need a place for them. It would be great to build a utility class that would take into account table properties, SQL conf, write/read options and provide the actual value to be used in the future. That being said, I think we should start simple and `SparkReadOptions` and `SparkWriteOptions` sounds reasonable to me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
karuppayya commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536266275



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java
##########
@@ -0,0 +1,19 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.spark;
+
+public class SparkWriteOptions {
+    public static final String FANOUT_ENABLED = "fanout-enabled";

Review comment:
       Do we want to add the other write options in this PR or keep the refactor separate?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739850845


   Thanks for working on this, @karuppayya!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
karuppayya commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536485305



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java
##########
@@ -67,7 +68,7 @@
 
   @Parameterized.Parameters(name = "format = {0}")
   public static Object[] parameters() {
-    return new Object[] { "parquet", "avro", "orc" };
+    return new Object[] { "parquet", "avro", "orc"};

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536472434



##########
File path: spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java
##########
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark;
+
+/**
+ * Spark DF write options
+ */
+public class SparkWriteOptions {

Review comment:
       Thanks, I like adding a class for these.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#discussion_r536472286



##########
File path: site/docs/configuration.md
##########
@@ -156,5 +156,5 @@ df.write
 | target-file-size-bytes | As per table property      | Overrides this table's write.target-file-size-bytes          |
 | check-nullability      | true                       | Sets the nullable check on fields                            |
 | snapshot-property._custom-key_    | null            | Adds an entry with custom-key and corresponding value in the snapshot summary  |
-| partitioned.fanout.enabled       | false        | Overrides this table's write.partitioned.fanout.enabled  |
+| fanout-enabled       | false        | Overrides this table's write.spark.partitioned.fanout.enabled  |

Review comment:
       Thanks! @aokolnychyi, @RussellSpitzer, and I were just talking about making this change yesterday.
   
   We talked about using `fanout-enabled` like you have here to match the other write options, but we came to consensus on using `write.spark.fanout.enabled` for the table property. You're probably familiar with the rationale for using `spark`. We also didn't see the need for `partitioned` in the property since it only makes sense for partitioned writes. We could think of it as a global option that only has a behavior change when the table is partitioned.
   
   What do you think about using `write.spark.fanout.enabled` instead of `write.spark.partition.fanout.enabled`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] karuppayya commented on pull request #1877: Rename fanout configs

Posted by GitBox <gi...@apache.org>.
karuppayya commented on pull request #1877:
URL: https://github.com/apache/iceberg/pull/1877#issuecomment-739089541


   @rdblue @aokolnychyi Can you please help review the change? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org