You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/21 06:13:30 UTC

[GitHub] [iceberg] jackye1995 opened a new pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

jackye1995 opened a new pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845


   when table property `write.object-storage.path` is not set, use the default table data path which is `tableLocation/data`
   
   @yyanyy 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687290562



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       Correct. Table location default is defined by the catalog implementation, which is likely `s3://bucket/warehouse/db/tablename`, or a custom override during create table. So that path slash data will always be there when writing data.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687295370



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       My only concern is with some potential confusion around the existence of `write.object-storage.*` and then falling back to `write.folder-storage.*`. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders  🙂, but I don't think that's necessarily a large enough concern to introduce new table properties.
   
   Given that it's an existing property though, I do think it makes sense.
   
   In this example, with object storage location provider, where would the hash / entropy characters be placed in the path?
   
   Am I correct in assuming that the transformation would be as follows?
   `s3://bucket/warehouse/db/tablename` -> `s3://bucket/warehouse/db/tablename/<hash>/data`
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r683141762



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       @openinx any additional thoughts on this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r683141762



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       @openinx any additional thoughts on this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r677796039



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       Yes I totally agree `TableProperties.WRITE_NEW_DATA_LOCATION` is probably not a good name, but it is what it is, we might want to deprecate and rename this if we have a major version update...
   
   The resolution strategy I am trying to create here is that:
   - if `write.object-storage.path` is set, use it
   - if not found, fallback to `write.folder-storage.path`
   - if not found, use `<tableLocation>/data`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r675464207



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       To be honestly,  I did not get what's the real meaning for `TableProperties.WRITE_NEW_DATA_LOCATION`.   If it's the configured default location, then why did it name as `WRITE_NEW_DATA_LOCATION` with its real name as `write.folder-storage.path`.  
   
   Now in this patch,  we are trying to fallback to the `write.folder-storage.path`  when people did not set the `write.object-storage.path`.  It even make this more confuse to understand :-) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687290562



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       Correct. Table location default is defined by the catalog implementation, which is likely `s3://bucket/warehouse/db/tablename`, or a custom override during create table. So that path slash data will always be there when writing data, and this PR tries to use that as the default.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687295370



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       My only concern is with some potential confusion around the existence of `write.object-storage.*` and then falling back to `write.folder-storage.*`. I admittedly have enough trouble as it is convincing people that object-storage keys are not folders 
   🙂.
   
   Given that it's an existing property though, I do think it makes sense.
   
   In this example, with object storage location provider, where would the hash / entropy characters be placed in the path?
   
   Am I correct in assuming that the transformation would be as follows?
   `s3://bucket/warehouse/db/tablename` -> `s3://bucket/warehouse/db/tablename/<hash>/data`
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#issuecomment-897257645


   @jackye1995 has there been any further discussion on this?
   
   I was also going to open a PR to allow for a default location, as many users create many tables and don't want to configure a location per table (e.g. they want a warehouse-like location, but with object storage randomness in the path).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687285432



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       I agree on the fallback strategy, though I'd like to see a warehouse like set up for the whole catalog.
   
   E.g. if I don't set `path`, per table, I'd like for it to work as s3://bucket/warehouse/path/<tableLocation and/or just table name>/<objectstorage_hash>/data
   
   Is that more or less what this config is achieving?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687300481



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       Or would we wind up getting `s3://bucket/warehouse/<hash>/db/tablename/data`?
   
   My preference would be for the hash to be just after the warehouse for data files if `write.object-storage.path` is not specified, though I can understand that we don't want to complicate the resolution to much. However, keeping the entropy high enough in the path for S3 partitioning seems like a reasonable goal for allowing it to be somewhat convoluted.
   
   I have not made much usage of  `WRITE_NEW_DATA_LOCATION` to have a strong opinion there. I agree the naming is a bit strange (`write.folder-storage.path`), but given that this config already exists (with the given name), I do agree that it would be a bit strange to ignore it entirely if users have set it.
   
   Is there no convention around `write.folder-storage.path`, and `write.folder-storage.*` to only be applied to non-object storage layouts? Given that the property name is `TableProperties.WRITE_NEW_DATA_LOCATION`, I'm guessing that the name is just an arguably unfortunate side effect, but that we should respect the intention of this property.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 merged pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 merged pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r687291449



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       The main discussion point I think is if we should have `write.folder-storage.path` as a part of the fallback, which I think is worth having, but I do agree it increases the complexity for path resolution which is already quite complicated in Iceberg. If no one else objects the current approach, I will merge this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#issuecomment-897900861


   @kbendick thanks for the approval. We discussed offline, and I think the use cases that we are trying to achieve require this fallback strategy.
   
   Use case 1: user creates table and want to use `ObjectStorageLocationProvider` with the default table path. Currently this will throw a NPE because it requires `write.object-storage.path` to be set. But the user might not really know what is the root path of the table, and we should not break this abstraction. So a fallback to the default table root path should be desirable.
   
   Use case 2: user has an existing table with `write.folder-storage.path` set, and would like to enable object storage mode by updating the table property. With change to use case 1, this will directly fallback to the default table location instead of the folder storage path, but the correct behavior is for the Iceberg table to respect the `write.folder-storage.path` and use it as the root.
   
   So the 2 levels of fallback is required. I will go merge the PR, thanks for all the discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cobookman commented on a change in pull request #2845: Core: allow default data location for ObjectStorageLocationProvider

Posted by GitBox <gi...@apache.org>.
cobookman commented on a change in pull request #2845:
URL: https://github.com/apache/iceberg/pull/2845#discussion_r688201887



##########
File path: core/src/main/java/org/apache/iceberg/LocationProviders.java
##########
@@ -68,13 +68,15 @@ public static LocationProvider locationsFor(String location, Map<String, String>
     }
   }
 
+  private static String defaultDataLocation(String tableLocation, Map<String, String> properties) {
+    return properties.getOrDefault(TableProperties.WRITE_NEW_DATA_LOCATION, String.format("%s/data", tableLocation));

Review comment:
       > To be honestly, I did not get what's the real meaning for `TableProperties.WRITE_NEW_DATA_LOCATION`. If it's the configured default location, then why did it name as `WRITE_NEW_DATA_LOCATION` with its real name as `write.folder-storage.path`.
   
   @openinx opened a Github issue. As a new user to Iceberg the weird naming here is making the codebase hard to understand.
   https://github.com/apache/iceberg/issues/2964
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org