Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/15 21:22:56 UTC

[GitHub] [iceberg] pavibhai opened a new issue #1617: Support relative paths in Table Metadata

pavibhai opened a new issue #1617:
URL: https://github.com/apache/iceberg/issues/1617


   ## Background <a id="Background"></a>
   [Iceberg specification][spec] captures file references for the following:
   * Table metadata references `location` that determines the base location of the table.
   * Snapshot references `manifests` and `manifest-list` to determine the manifests that make up the snapshot.
   * Manifest list references `manifest_path` that identifies the location of a manifest file.
   * Manifest references `file_path` that identifies a data file.
   * Position-based delete file references `file_path`, which identifies the data file to which a position-based delete
   is applied.
   
   All the file references are absolute paths right now.
   
   ## Challenge <a id="Challenge"></a>
   Table copy to another location could arise in the following use cases:
   * **Replication**: Copy the table state and history of state changes to another data center or availability zone.
   * **Backup**: Copy the table state and history of state changes to an archive storage for backup and recovery purposes.
   
   With absolute file references, every file reference must be rewritten to reflect the new target location before the
   copied table can be consumed.
   
   ## Solution Option <a id="SolutionOption"></a>
   We could support relative paths as follows:
   * Table metadata `location` shall always reference the absolute location of the table
   * All other path references shall support both relative and absolute references.
       * A relative reference shall be resolved against the `location` from the table metadata.
       * An absolute reference is used directly.
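
A minimal sketch of the resolution rule above. `ReferenceResolver` is a hypothetical helper, not part of the Iceberg API, and it assumes that a reference counts as absolute when it carries a URI scheme (such as `hdfs://`) or starts with `/`.

```java
// Hypothetical helper for illustration only; not an Iceberg class.
final class ReferenceResolver {
    private ReferenceResolver() {}

    // Assumption: a reference is absolute if it has a URI scheme
    // (e.g. "s3://", "hdfs://") or starts with "/".
    static boolean isAbsolute(String reference) {
        return reference.contains("://") || reference.startsWith("/");
    }

    // Returns an absolute reference unchanged; otherwise joins the reference
    // to the table location with exactly one "/" between the two parts.
    static String resolve(String tableLocation, String reference) {
        if (isAbsolute(reference)) {
            return reference;
        }
        String base = tableLocation.endsWith("/")
            ? tableLocation.substring(0, tableLocation.length() - 1)
            : tableLocation;
        return base + "/" + reference;
    }
}
```

With this rule, a copied table only needs a new `location`; every relative reference follows it automatically.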
   
   We might further consider splitting the Table metadata into two pieces:
   * Definition: This identifies elements of the metadata that do not usually change:
       * `format-version`
       * `table-uuid`
       * `location`
   * Transactional: This identifies elements of the metadata that are expected to change with every transaction
       * `sequence-number`
       * `current-snapshot-id`
       * `schema`
       * etc
   
   This will ensure the following:
   * Initial replication requires only a revision to the `location` attribute in the table metadata.
   * All subsequent replications do not require any manipulation of any file reference and can happen incrementally.
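
To illustrate the split, here is a rough sketch with two hypothetical classes, `TableDefinition` and `TransactionalState`. Neither exists in Iceberg. The field names mirror the metadata keys listed above, and the single relative `manifestList` field is a simplification added only to show the effect of relocation.

```java
// Hypothetical types for illustration; not Iceberg classes.
final class TableDefinition {
    final int formatVersion;
    final String tableUuid;
    final String location;

    TableDefinition(int formatVersion, String tableUuid, String location) {
        this.formatVersion = formatVersion;
        this.tableUuid = tableUuid;
        this.location = location;
    }

    // Initial replication rewrites only this small piece.
    TableDefinition withLocation(String newLocation) {
        return new TableDefinition(formatVersion, tableUuid, newLocation);
    }
}

final class TransactionalState {
    final long sequenceNumber;
    final long currentSnapshotId;
    // Assumption: stored relative to TableDefinition.location.
    final String manifestList;

    TransactionalState(long sequenceNumber, long currentSnapshotId, String manifestList) {
        this.sequenceNumber = sequenceNumber;
        this.currentSnapshotId = currentSnapshotId;
        this.manifestList = manifestList;
    }

    // The same transactional state resolves against whichever definition it is paired with.
    String manifestListLocation(TableDefinition definition) {
        return definition.location + "/" + manifestList;
    }
}
```

The transactional piece can then be copied byte for byte on every incremental sync, because nothing in it names the source location.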
   
   [spec]: https://iceberg.apache.org/spec/


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] omalley edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
omalley edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-730527801


   Relative paths would help us too.
   
   We certainly need to have support for a federated namespace with a virtual file system. We're moving to HDFS federation and moving our tables to be located in a virtual filesystem (e.g. `gridfs://cluster1/table`) so that we can move the data around without changing all of the metadata. The challenge here is that we need to make sure that the manifests don't contain the physical HDFS path, which is likely to break.
   
   For data access control, it seems like a really good invariant that all data is under the table's path.




[GitHub] [iceberg] pavibhai commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pavibhai commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-730523094


   @jackye1995 and @rdblue 
   Thanks for your feedback.
   
   Our current thought process is as follows:
   1. Only the table path shall be an absolute path; all other paths in the metadata files shall be relative paths.
   2. The relative path should not be visible outside of the metadata component, i.e. the relative path should be translated to an absolute path using the table path before other areas of Iceberg consume it, so only the metadata files shall store the relative path.
   3. The relative path references shall be confined to the table path, i.e. `../` shall not be allowed.
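
The three rules can be sketched roughly as follows. `MetadataPaths` and its method names are hypothetical and only illustrate the idea: strip the table path when writing metadata, restore it when reading, and reject any `..` segment.

```java
// Hypothetical helper for illustration only; not an Iceberg class.
final class MetadataPaths {
    private MetadataPaths() {}

    // Write side: store only the part of the path below the table path.
    static String toRelative(String tablePath, String absolutePath) {
        String prefix = tablePath.endsWith("/") ? tablePath : tablePath + "/";
        if (!absolutePath.startsWith(prefix)) {
            throw new IllegalArgumentException("Not under table path: " + absolutePath);
        }
        return checkConfined(absolutePath.substring(prefix.length()));
    }

    // Read side: expand the stored value before the rest of the code sees it.
    static String toAbsolute(String tablePath, String relativePath) {
        String prefix = tablePath.endsWith("/") ? tablePath : tablePath + "/";
        return prefix + checkConfined(relativePath);
    }

    // Rule 3: a ".." segment could escape the table path, so it is rejected.
    static String checkConfined(String relativePath) {
        for (String segment : relativePath.split("/")) {
            if (segment.equals("..")) {
                throw new IllegalArgumentException("'..' is not allowed: " + relativePath);
            }
        }
        return relativePath;
    }
}
```

Because the conversion happens only at the metadata read and write boundary, the rest of the code base would keep working with absolute paths.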
   
   I don't think the scenario called out about a wrong HDFS location being cleaned will be worsened or improved as a result of this change. If the absolute path of the table is incorrectly interpreted to be something else then problems shall still happen.
   
   We will explore the option of FileIO in parallel while we discuss further the complications of using relative paths.




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-925266292


   The [design doc](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#heading=h.hxmtkjthp8hm) has been updated to add initial support for relative paths and multiple root locations for a table. Please review the design.
   




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-767157268


   Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the doc?
   
   Thanks,
   Anurag




[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878800783


   Thanks for the detailed answer @flyrain! This really helps to have a clear understanding of the tasks at hand. 
   
   I would like to share my thoughts about 2 points:
   
   > The hard part is to enable incremental sync-up between them and bidirectional replication, which are quite common DR(Disaster recovery) use cases.
   
   I think that if we choose the relative path approach then replication becomes quite straightforward, since we do not have to handle the mix of different paths (source absolute path for the new metadata files, destination absolute path for the old metadata files). We just need to copy the metadata files for one-directional replication. Bidirectional replication is a different kettle of fish because of the commit resolution complexity, but I think it is also easier since we do not have to care about manifests and manifest lists.
   
   > However, the relative-path approach requires the minimal metadata file rewrite, probably only metadata.json per our discussion.
   
   What do we want to change in the JSON? Is it only the path of the table, or do we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new data file locations) to make the path resolution pluggable and store only the config in the table?
   
   Thanks, Peter 
   




[GitHub] [iceberg] vamen edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
vamen edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-924888360


   Is anybody working on this?
   I would love to pick this up otherwise.






[GitHub] [iceberg] flyrain commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-852553776


   Thanks @anuragmantri for filing the PR and updating the design doc. It is great that there is no spec change in the latest design.




[GitHub] [iceberg] anuragmantri edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-880068307


   Catching up on the comments from the past couple of days. Thanks everyone for providing input.
   
   @flyrain - Although it will increase the scope of this design to include bi-directional replication, we should consider covering that here. 
   
   Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation. 
   
   > Then these 4 properties are used to get 2 absolute root paths:
   > 
   > root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
   > root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generate paths for metadata files.
   
   In the initial design doc, my thinking was that the writes mentioned above would not need to change at all. However, reading the comments from @pvary above, we may indeed have to store a `root` location somewhere for cases where we want to read inaccessible paths. Whether we want an explicit method to set this root before accessing a table, or a mechanism for replicated tables to register themselves as replicas and then use retry logic, can be discussed.
   
   My takeaways from this are:
   
   1. Metadata writers need to know
      a) Relative path boolean 
      b) True root of the table
      c) Metadata path if set
      Additionally, they will also convert paths to relative in metadata files.
      
   2. Data writers need to know
      a) Data path if set
      b) True root of the table 
      
   An update transaction will change all the above except relative path boolean. Please correct me if my understanding is incorrect.
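
A rough sketch of the inputs listed in the takeaways above. `WriterLocations` is a hypothetical holder, not an Iceberg class, and falling back to `<root>/metadata` and `<root>/data` when no explicit path is set is an assumption made for the example.

```java
// Hypothetical holder for illustration only; not an Iceberg class.
final class WriterLocations {
    final boolean useRelativePaths;
    final String root;          // true root of the table
    final String metadataPath;  // null when not set
    final String dataPath;      // null when not set

    WriterLocations(boolean useRelativePaths, String root, String metadataPath, String dataPath) {
        this.useRelativePaths = useRelativePaths;
        this.root = root;
        this.metadataPath = metadataPath;
        this.dataPath = dataPath;
    }

    // Assumption: fall back to "<root>/metadata" when no metadata path is set.
    String metadataRoot() {
        return metadataPath != null ? metadataPath : root + "/metadata";
    }

    // Assumption: fall back to "<root>/data" when no data path is set.
    String dataRoot() {
        return dataPath != null ? dataPath : root + "/data";
    }

    // Value a metadata writer would store for a file it references:
    // relative when enabled and the file lies under the root, else unchanged.
    String storedReference(String absolutePath) {
        String prefix = root + "/";
        if (useRelativePaths && absolutePath.startsWith(prefix)) {
            return absolutePath.substring(prefix.length());
        }
        return absolutePath;
    }
}
```

In this sketch a file written under a custom data path outside the root keeps its absolute reference, which is one of the cases the design would have to decide on.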




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-833688123


   @flyrain, in my latest reply to Ryan's comment in the [design doc](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#), I proposed starting with "supporting relative paths in metadata", which does not need changes to the spec. We can then look at refactoring the fields of the metadata.json file. This will make it easier to implement and review the change. What are your thoughts?




[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-733917263


   I'll add this as a discussion topic for the next Iceberg sync.




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-765637049


   Thanks for your comments @jackye1995. We discussed your proposal and here are our thoughts.
   
   - Making no changes to the file paths and relying on `FileIO` implementations to internally interpret them differently can cause confusion since the same path could mean totally different locations based on the underlying implementation. 
   
   With our approach, 
   - The path we see in metadata files is the path that gets acted upon. We feel this is cleaner and causes less confusion.
   - Further, when we use relative paths, they will always be relative to the table. This invariant makes it easier to interpret the paths.
   
   For the reasons above, we prefer the approach laid out in the design doc. We understand the concerns around the Iceberg spec, with the v2 format already changing a lot. We are open to ideas on shipping this with the least interference (even if it means we wait till v2 gets out).
   
   Please let us know your thoughts. 




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-880080917


   > Why not expose these methods instead:
   > 
   > // initialize the provider with the root paths and the `useRelativePath` config,
   > // or have different implementations like:
   > // - AbsoluteLocationProvider,
   > // - RelativeLocationProvider?
   > LocationProvider(Properties tableProperties);
   > 
   > // generate the absolute path for the metadata files
   > Path metaDataLocation(String file);
   > // generate the absolute path for the data files
   > Path dataLocation(String file);
   > 
   > // generate the value we put into the avro files - absolute or relative path
   > String metaDataKey(Path metadataFile);
   > // generate the value we put into the avro files - absolute or relative path
   > String dataKey(Path dataFile);
   
   @pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.
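
For discussion, the quoted methods could look roughly like this. `PathResolvingProvider` is a hypothetical interface modeled on the suggestion; it is not Iceberg's existing `LocationProvider`, `String` stands in for `Path` to keep the sketch dependency-free, and the `metadata/` and `data/` subdirectories are assumptions.

```java
// Hypothetical interface for illustration; not Iceberg's LocationProvider.
interface PathResolvingProvider {
    String metadataLocation(String file);   // absolute path for a metadata file
    String dataLocation(String file);       // absolute path for a data file
    String metadataKey(String metadataFile); // value stored in the Avro files
    String dataKey(String dataFile);         // value stored in the Avro files
}

// Stores the absolute path as the key.
final class AbsoluteLocationProvider implements PathResolvingProvider {
    private final String root;

    AbsoluteLocationProvider(String root) {
        this.root = root;
    }

    public String metadataLocation(String file) { return root + "/metadata/" + file; }
    public String dataLocation(String file) { return root + "/data/" + file; }
    public String metadataKey(String metadataFile) { return metadataFile; }
    public String dataKey(String dataFile) { return dataFile; }
}

// Stores the path relative to the table root as the key.
final class RelativeLocationProvider implements PathResolvingProvider {
    private final String root;

    RelativeLocationProvider(String root) {
        this.root = root;
    }

    public String metadataLocation(String file) { return root + "/metadata/" + file; }
    public String dataLocation(String file) { return root + "/data/" + file; }
    public String metadataKey(String metadataFile) { return relativize(metadataFile); }
    public String dataKey(String dataFile) { return relativize(dataFile); }

    // Files outside the root keep their absolute path.
    private String relativize(String absoluteFile) {
        String prefix = root + "/";
        return absoluteFile.startsWith(prefix)
            ? absoluteFile.substring(prefix.length())
            : absoluteFile;
    }
}
```

Both implementations write files to the same absolute locations; they differ only in the key that ends up in the metadata.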




[GitHub] [iceberg] jackye1995 commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-764936126


   Thanks for the design doc! I wonder if this can be done through changes in Catalog and FileIO without touching the Iceberg specification. That would be much simpler to ship and would not conflict with the many changes already in flight for the format v2 specification.
   
   The key difficulty I see is that table properties are part of the table metadata file, so we cannot know whether a table should use a special file path override without actually reading that file. If we had a way to know it without using FileIO, we could decide that, for a specific table, all files should be read through a region-specific client for replication, or through a different URI scheme for backup and HDFS federation.
   
   So here is my alternative proposal:
   1. add interface `TablePropertyIO` that is dedicated to read and write table properties.
   2. update catalog implementations (mostly in `BaseMetastoreCatalog`) to use `TablePropertyIO` when necessary, by adding it as a part of the `TableOperations` interface similar to `FileIO io()`.
   3. add new table property `table-io-impl`, that can be customized to do things like replication, backup, etc.
   4. implement those specific table `FileIO` classes.
   
   Could this proposal satisfy your requirements?
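
A very rough sketch of steps 1 and 3, to make the idea concrete. Only the names `TablePropertyIO` and `table-io-impl` come from the proposal; the method signatures, the in-memory implementation, and the `FileIoSelector` helper are assumptions for illustration and do not exist in Iceberg.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the proposed interface; signatures are assumptions.
interface TablePropertyIO {
    Map<String, String> readProperties(String tableName);
    void writeProperties(String tableName, Map<String, String> properties);
}

// In-memory stand-in for a store that lives outside the table metadata file.
final class InMemoryTablePropertyIO implements TablePropertyIO {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public Map<String, String> readProperties(String tableName) {
        return store.getOrDefault(tableName, Collections.emptyMap());
    }

    public void writeProperties(String tableName, Map<String, String> properties) {
        store.put(tableName, new HashMap<>(properties));
    }
}

// Hypothetical helper: picks the FileIO class name before any file is opened.
final class FileIoSelector {
    private FileIoSelector() {}

    static String implFor(TablePropertyIO propertyIO, String tableName, String defaultImpl) {
        return propertyIO.readProperties(tableName).getOrDefault("table-io-impl", defaultImpl);
    }
}
```

The point of the sketch is the ordering: the property is read without touching the table's own files, so the right `FileIO` is known before the metadata file is loaded.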




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-764855499


   Here is a design doc for this proposal: https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing






[GitHub] [iceberg] anuragmantri edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-764855499


   We've put together a [design doc](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing) for this proposal. We thought through a few scenarios and added an implementation. We will be glad to consider other ideas as well.








[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-869880358


   @pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.




[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-729315067


   I don't think it is a good idea in general to use relative paths. We recently had an issue where using an `hdfs` location without authority caused a user's data to be deleted by the `RemoveOrphanFiles` action because the resolution of the table root changed. The main problem is that places in Iceberg would need to have some idea of "equivalent" paths and path resolution. Full URIs are much easier to work with and more reliable.
   
   But there is still a way to do both. Catalogs and tables can inject their own `FileIO` implementation, which is what is used to open files. That can do any resolution that you want based on environment. So you could use an implementation that allows you to override a portion of the file URI and read it from a different underlying location. I think that works better overall because there are no mistakes about equivalent URIs, but you can still read a table copy without rewriting the metadata.
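
As a sketch of the override idea, a custom `FileIO` could apply a prefix rewrite like the one below before opening a file. `LocationOverride` is a hypothetical helper, not an Iceberg class, and the sketch deliberately does not implement the real `FileIO` interface.

```java
// Hypothetical helper for illustration only; not an Iceberg class.
final class LocationOverride {
    private final String storedPrefix;
    private final String actualPrefix;

    LocationOverride(String storedPrefix, String actualPrefix) {
        this.storedPrefix = storedPrefix;
        this.actualPrefix = actualPrefix;
    }

    // Rewrites only URIs that start with the stored prefix; all other URIs
    // are returned unchanged.
    String apply(String uri) {
        if (uri.startsWith(storedPrefix)) {
            return actualPrefix + uri.substring(storedPrefix.length());
        }
        return uri;
    }
}
```

The metadata keeps its full URIs, and only the environment-specific mapping decides where the bytes are actually read from.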




[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-765667338


   Hi Anurag,
   
   I read the doc through the Proposal and Approach sections and I am quite
   confused about what your proposal is. Could you clarify it a bit? It may be
   obvious from the following example sections what the changes are and why
   they are needed, but I think that information should be in the
   proposal/approach.
   
   rb
   
   On Fri, Jan 22, 2021 at 11:31 AM Anurag Mantripragada <
   notifications@github.com> wrote:
   
   > Thanks for your comments @jackye1995 <https://github.com/jackye1995>. We
   > discussed your proposal and here are our thoughts.
   >
   >    - Making no changes to the file paths and relying on FileIO
   >    implementations to internally interpret them differently can cause
   >    confusion since the same path could mean totally different locations based
   >    on the underlying implementation.
   >
   > With our approach,
   >
   >    - The path we see in metadata files is the path that gets acted upon.
   >    We feel this is cleaner and causes less confusion.
   >    - Further, when we use relative paths, they will always be relative to
   >    the table. This invariant makes it easier to interpret the paths.
   >
   > For the reasons above, we prefer the approach laid out in the design doc.
   > We understand the concerns around Iceberg spec with v2 format already
   > changing a lot. We are open to ideas on shipping this with the least
   > interference (even if it means we wait till v2 gets out).
   >
   > Please let us know your thoughts.
   
   
   -- 
   Ryan Blue
   


[GitHub] [iceberg] rdblue edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-765667338


   I read the doc through the Proposal and Approach sections and I am quite confused about what your proposal is. Could you clarify it a bit? It may be obvious from the following example sections what the changes are and why they are needed, but I think that information should be in the proposal/approach.


[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878961343


   Thanks @jackye1995 for highlighting the properties which we need to handle!
   
   I did not know about `write.object-storage.path` before. Do I understand correctly that this is a `LocationProvider` specific parameter for `ObjectStoreLocationProvider`, which is more or less the same as `write.folder-storage.path` for the `DefaultLocationProvider`? If so it highlights how tricky it could be if we want to make sure that every `LocationProvider` specific parameter is handled correctly when we are updating the locations in the metadata json. Maybe we should delegate this path generation to the `LocationProvider` altogether, or something similar.
   
   > All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction
   
   The process you have described above sounds useful to me if we have a place where both of the replicas are accessible, like a central node where we can initiate the copy of the data and then call UpdateProperties + UpdateLocation, so the replica could have its own metadata json which holds the correct info.
   
   In our case, for the on-prem to cloud migration, it is possible that the file copy is done by a different method and then we need to read and update the metadata json file, in which the locations are not accessible. So it would be good if we could run UpdateProperties + UpdateLocation on an "invalid" table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] anuragmantri edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-767157268


   Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the [document](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing)?
   
   Thanks,
   Anurag


[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-868901836


   @flyrain and @anuragmantri, see my comment above. Thanks for working on this!


[GitHub] [iceberg] jackye1995 commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-725596052


   I remember we discussed this use case some time ago in a sync-up meeting, and the related issue was #1531. The general feedback was that it is better to expose this as a Spark procedure and rewrite the manifests with new file path URIs to reflect the new location after replication or backup. This avoids the trouble of redefining the Iceberg spec. Any thoughts?


[GitHub] [iceberg] jackye1995 commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-879193232


   > If so it highlights how tricky it could be if we want to make sure that every LocationProvider specific parameter is handled correctly when we are updating the locations in the metadata json
   
   Yes, that is why I was proposing that, if we decide to go with relative paths, `LocationProvider` should expose a method that describes the true file path root it is using, maybe something like `LocationProvider.root()`, and any plugin implementation can override this method if necessary. But `LocationProvider` only governs the path for data files, and metadata file paths are governed by `TableOperations.metadataFileLocation(String fileName)`, so we should also have a new method like `TableOperations.metadataRoot()` to describe the true root. Those 2 roots should be sufficient for any reader and writer to use if they stick with the Iceberg core IO library.
   
   On the Trino side it's quite a bit more complicated because some paths are hard coded and directly generated based on Hive conventions, but I think that is an issue to fix on the Trino side: https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java#L303-L308
   
   But just keep in mind that there is always a risk for a compute engine to not stick with those things, and it will be hard to enforce.
   
   


[GitHub] [iceberg] flyrain commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-875123252


   > For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make metadata_location and previous_metadata_location relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
   
   Hi @rdblue, we might leave metadata_location and previous_metadata_location as is. They are table properties in a metastore like HMS and are not affected when we just move files from the source table to the target. They are absolute paths, but they still point to the right place without any change.


[GitHub] [iceberg] anuragmantri edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-764855499


   We've put together a design doc [1] for this proposal
   https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing


[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-880068307


   Catching up on the comments from the past couple of days. Thanks everyone for providing input.
   
   @flyrain - Although it will increase the scope of this design to include bi-directional replication, we should consider covering that here. 
   
   Overall, I agree with @jackye1995's proposal. Thanks for the detailed explanation. 
   
   > Then these 4 properties are used to get 2 absolute root paths:
   > 
   > root path to write data: the LocationProvider would read some fields in this list to determine the base root path, and generate absolute paths to write data files.
   > root path to write metadata: similarly, the TableOperations.metadataFileLocation reads some fields above to determine the base root path, and generate paths for metadata files.
   
   In the initial design doc, my thinking was that the writes mentioned above would not need to change at all. However, reading the comments from @pvary above, we may indeed have to store a `root` location somewhere for cases where we want to read inaccessible paths. Whether we want to have an explicit method to set this root before accessing a table, or a mechanism for replicated tables to register themselves as replicas and then use retry logic, can be discussed.
   
   > Why not expose these methods instead:
   > 
   > // initialize the provider with the root paths and the `useRelativePath` config,
   > // or have different implementations like:
   > // - AbsoluteLocationProvider,
   > // - RelativeLocationProvider?
   > LocationProvider(Properties tableProperties);
   > 
   > // generate the absolute path for the metadata files
   > Path metaDataLocation(String file);
   > // generate the absolute path for the data files
   > Path dataLocation(String file);
   > 
   > // generate the value we put into the avro files - absolute or relative path
   > String metaDataKey(Path metadataFile);
   > // generate the value we put into the avro files - absolute or relative path
   > String dataKey(Path dataFile);
   
   @pvary - This may make the code change larger. Other than that, I don't see why this cannot be done unless I'm missing something.
   
   


[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-868901758


   I looked through the design doc. It seems to be much better than the last version, so thank you for the update.
   
   Overall, I think it is still incorrect on a few points and is quite a bit longer than it will need to be in the end. The main confusion seems to be where the table location comes from. Iceberg [table metadata tracks a location](https://iceberg.apache.org/spec/#table-metadata-fields) string that is written into all metadata.json files. This is the table's location. Iceberg does not require any location other than this one and there isn't a way to pass a different location in the `TableOperations` API; the only way to pass a table location is through `TableMetadata`.
   
   There are also a few table properties that can affect locations:
   * Metadata can be redirected by setting `write.metadata.path`, which is defaulted to the `metadata` folder under the table location
   * The default location provider writes data underneath `write.folder-storage.path`, which is defaulted to the `data` folder under the table location
   * The object storage location provider shards data underneath `write.object-storage.path`
   
   There are two broad classes of tables:
   1. Hadoop tables identify the table itself by location and enforce that the location in `TableMetadata` is identical to the location used to identify the table when it is created. For Hadoop tables, it makes sense that the table location used for relative paths is the location that gets passed to create [`HadoopTableOperations`](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/hadoop/HadoopTableOperations.java#L71). But then there could be a problem that table metadata actually points to a different location unless it is modified.
   2. Metastore tables do not have a location that is external to table metadata. However, when a metastore table is tracked by Hive, there is a location that Hive tracks and we set that to the table's location.
   
   For table metadata files, there is only an external location for Hadoop tables. Otherwise, the table location must come from the table's metadata.json file. That makes it difficult to make `metadata_location` and `previous_metadata_location` relative paths. I would probably not attempt to make these paths relative, but I'm open to ideas.
   
   We could update the spec to allow relative metadata.json paths for Hadoop tables, but will need to state how to handle a different `location` in the metadata file. (Ignore?)
   
   Metadata locations aside, I think that it is a good plan to make the relative paths transparent. Most of the library should continue to operate on absolute paths and relative paths should be made while writing and made absolute while reading. That makes it so the changes are fairly easily tested. If you do this, then I think there is a lot less to cover in the design doc.
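   
   The transparent conversion described above could look roughly like the following sketch: paths are relativized against the table location just before they are written into metadata and made absolute again right after they are read, so the rest of the library only ever sees absolute paths. `RelativePaths` and its method names are hypothetical, and treating any value that starts with `/` or with a URI scheme as absolute is an assumption made for illustration, not something the spec defines.
   
   ```java
   final class RelativePaths {
     private RelativePaths() {}

     // Write side: store the path relative to the table location when it lies underneath it.
     static String toStored(String tableLocation, String absolutePath) {
       String root = withTrailingSlash(tableLocation);
       return absolutePath.startsWith(root) ? absolutePath.substring(root.length()) : absolutePath;
     }

     // Read side: turn a stored value back into an absolute path.
     static String toAbsolute(String tableLocation, String storedPath) {
       return isAbsolute(storedPath) ? storedPath : withTrailingSlash(tableLocation) + storedPath;
     }

     // Assumption: a value is absolute if it starts with "/" or with a URI scheme such as "s3:".
     static boolean isAbsolute(String path) {
       return path.startsWith("/") || path.matches("^[A-Za-z][A-Za-z0-9+.-]*:.*");
     }

     private static String withTrailingSlash(String location) {
       return location.endsWith("/") ? location : location + "/";
     }

     public static void main(String[] args) {
       // Hypothetical locations: written under one root, read back under another.
       String stored = toStored("s3://bucket/db/tbl", "s3://bucket/db/tbl/metadata/m0.avro");
       System.out.println(stored);
       System.out.println(toAbsolute("s3://replica/db/tbl", stored));
     }
   }
   ```
   
   Because the conversion is confined to two functions at the IO boundary, it can be tested in isolation with round-trip checks.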


[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-879737391


   Why just get the root for the data/metadata from the LocationProvider?
   
   Why not expose these methods instead:
   ```
   // initialize the provider with the root paths and the `useRelativePath` config,
   // or have different implementations like:
   // - AbsoluteLocationProvider,
   // - RelativeLocationProvider?
   LocationProvider(Properties tableProperties);
   
   // generate the absolute path for the metadata files
   Path metaDataLocation(String file);
   // generate the absolute path for the data files
   Path dataLocation(String file);
   
   // generate the value we put into the avro files - absolute or relative path
   String metaDataKey(Path metadataFile);
   // generate the value we put into the avro files - absolute or relative path
   String dataKey(Path dataFile);
   ```
   
   


[GitHub] [iceberg] vamen removed a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
vamen removed a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-924888360


   Is anybody working on this?
   I would love to pick this up otherwise.


[GitHub] [iceberg] vamen commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
vamen commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-924888360


   Is anybody working on this?
   


[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878426784


   @pvary, I'm talking only about the metadata.json files. I agree that using relative paths for manifest lists, manifests, and data files is a good idea. But we keep the table location in the root metadata file and so it is difficult to make its location relative.
   
   I think we could do this for Hadoop tables by ignoring the location in metadata.json because it must match the table location. But Hadoop tables are mostly for testing and I don't recommend using them in practice. The bigger issue is how to track the table location for Hive or other metastore tables. In that case, I don't think it is unreasonable to use `updateLocation` to change the location of the table, but it is definitely inconvenient.


[GitHub] [iceberg] jackye1995 commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878856794


   Just trying to catch up with the thread, and reread the design doc. 
   
   For the design doc, I think we should not make `HiveTable` a separate case to discuss. All tables that are retrieved through a `Catalog` implementation should operate the same way if we introduce relative paths; we should not try to break the abstraction layer and treat Hive tables separately.
   
   To give a brief summary (and also potentially answer Peter's question), from my understanding we are leaning towards a solution in which the following items keep track of the true table roots and should be absolute, while all other paths can be relative:
   1. `location`
   2. `write.metadata.path`
   3. `write.folder-storage.path`
   4. `write.object-storage.path`
   
   Then these 4 properties are used to get 2 absolute root paths:
   1. root path to write data: the `LocationProvider` would read some fields in this list to determine the base root path, and generate absolute paths to write data files. 
   2. root path to write metadata: similarly, the `TableOperations.metadataFileLocation` reads some fields above to determine the base root path, and generate paths for metadata files.
   
   These probably should be exposed as new methods to make sure the same location derivation logic is used everywhere.
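   
   As a rough sketch of what such a shared derivation could look like, the following standard-library Java derives the two roots from the table location and the three properties listed above. `TableRoots` and its methods are hypothetical, not existing Iceberg API. The `metadata` and `data` defaults follow what was described earlier in this thread; the `objectStorage` flag and the fallback used when `write.object-storage.path` is unset are assumptions made for illustration.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;

   final class TableRoots {
     private TableRoots() {}

     // Root for metadata files: `write.metadata.path` if set, else "<location>/metadata".
     static String metadataRoot(String location, Map<String, String> props) {
       String fallback = stripTrailingSlash(location) + "/metadata";
       return stripTrailingSlash(props.getOrDefault("write.metadata.path", fallback));
     }

     // Root for data files: `write.folder-storage.path` if set, else "<location>/data".
     // Assumption: when the object storage provider is in use, `write.object-storage.path`
     // takes precedence and falls back to the folder root when it is not set.
     static String dataRoot(String location, Map<String, String> props, boolean objectStorage) {
       String folderRoot =
           props.getOrDefault("write.folder-storage.path", stripTrailingSlash(location) + "/data");
       String root =
           objectStorage ? props.getOrDefault("write.object-storage.path", folderRoot) : folderRoot;
       return stripTrailingSlash(root);
     }

     private static String stripTrailingSlash(String s) {
       return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
     }

     public static void main(String[] args) {
       Map<String, String> props = new HashMap<>();
       props.put("write.object-storage.path", "s3://other-bucket/objects"); // hypothetical value
       System.out.println(metadataRoot("s3://bucket/db/tbl", props));
       System.out.println(dataRoot("s3://bucket/db/tbl", props, true));
     }
   }
   ```
   
   Keeping the precedence rules in one place means readers and writers cannot disagree about which root a relative path is resolved against.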
   
   Once the content of a file is written, the paths are written to the metadata layer one level above, as absolute or relative paths depending on the flag proposed in the design doc, and the writer of this path needs to know what the true root paths used in the 2 places above are. On the read side, the absolute path is created based on the 2 absolute root paths. Those operations are done inside each specific IO method by passing the relative/absolute path, the correct absolute root path, and the boolean flag, to make sure operations are transparent above the IO level.
   
   All updates to the 4 locations on the top can be done in a UpdateProperties + UpdateLocation transaction, and when the metadata path is updated, the `metadata_location` in the catalog will be automatically updated with the new file root path to write and then read the latest metadata json file that is now in another root path. Therefore, we need to make sure that when this value is updated, the metadata json file writer uses the value in the update instead of the existing value. This should make it possible to switch a reader of an Iceberg table from one replica to another explicitly, or to implement trial-and-error logic that switches to the replica after a certain number of retries. I am not sure if there is a use case such as choosing a random replica to read files from, but I think that should be solved by directly registering all replicas as different tables and doing round-robin on the reader side.
   
   I think if we go with this route then it sounds good to me, and we can try to hash out some more details in PRs. Please let me know if this summary is accurate, or if I have any misunderstanding.


[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-870115508


   Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include
   
   - For HadoopTables
      - The location passed to `HadoopTableOperations` will be used as base table location.
      - If metadata.json file contains a different `location`, we will ignore it. (Do you see any issues?)
   
   - For metastore tables
      - For `HiveTable`
          - We can use the location property tracked in the metastore as the base table location.
          OR
          - We could just use the `updateLocation()` API to create a new metadata.json file and use the `location` attribute in that as the base table location.
      - For other metastore tables
          - For metastore tables that don't track the table location, we can use the `location` attribute of the metadata.json file as the base table location.
   
   - Path conversion is transparent to APIs. I removed sections of the doc that refer to APIs that do not change.
   
   What are your thoughts?


[GitHub] [iceberg] flyrain commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-879620722


   @pvary, yes, the relative path approach looks more elegant than the metadata rebuild one. As I mentioned, it is close to the ideal case, which doesn't rewrite the metadata files.
   > What do we want to change in the json? Is it only the path of the table, or we have to rewrite something else as well? Could we use something like the LocationProvider (which generates new datafile locations) to make the path resolution pluggable and store only the config in the table?
   
   We need to change the location in metadata.json, or a combination of the three table properties mentioned by @rdblue and @jackye1995.
   
   Thanks for the proposal, @jackye1995. It makes sense to have a centralized place to get the root for data and metadata since there are multiple ways to specify their locations. We can also hide the logic of how to prioritize them. cc @anuragmantri.


[GitHub] [iceberg] flyrain commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878457209


   > Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica? 
   
   @pvary , ideally, table replication doesn't involve data file rewrite and metadata(manifest-list, manifest, metadata.json) rewrite. The process would be as simple as that user copys all files needed, then changes the target table properties to get the new status. It isn't the case in reality though.
   
   In this issue thread, we have been talking about two ways to replicate a table: 1. relative paths, and 2. rebuilding the metadata files. Neither of them requires a data file rewrite. However, the relative-path approach requires only a minimal metadata file rewrite, probably only metadata.json per our discussion, while the metadata-rebuild approach involves rewriting all three types of metadata files: metadata.json, manifest-list, and manifest. Each type of file stores table information that cannot be recreated just by looking at the data files, for example, the partition spec in metadata.json, its id in the manifest file, and the snapshot-related metadata.
   
   To your question, both source and target tables should be able to live independently after the replication. That's relatively easy to achieve. The hard part is enabling incremental sync-up between them and bidirectional replication, which are quite common DR (disaster recovery) use cases.




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-870115508


   Thanks @rdblue for explaining how locations are tracked in Iceberg tables. Based on your and @pvary's comments, I updated the design doc to include:
   
   - For HadoopTables
      - The location passed to `HadoopTableOperations` will be used as base table location.
      - If metadata.json file contains a different `location`, we will ignore it. (Do you see any issues?)
   
   - For metastore tables
      -  For `HiveTable`
          - We can use location property tracked in metastore as the base table location.
          OR
          - We could just use `updateLocation()` API to create a new metadata.json file and use the `location` attribute in that as base table location.
      - For other metastore tables
          - For metastore tables that don't track table location, we can use the `location` attribute of the metadata.json file as the base table location.
   
   - Path conversion is transparent to APIs. I removed sections of the doc that refer to APIs that do not change.
   
   What are your thoughts?
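
The selection rules in the list above can be sketched roughly as follows. This is only an illustration of the proposed precedence, not Iceberg code: `base_table_location`, the `table_kind` labels, and the parameter names are all hypothetical, and for Hive tables it assumes the first of the two options (the location tracked in the metastore).

```python
from typing import Optional


def base_table_location(
    table_kind: str,
    operations_location: Optional[str] = None,
    metastore_location: Optional[str] = None,
    metadata_json_location: Optional[str] = None,
) -> str:
    """Pick the base location used to resolve relative paths.

    table_kind is a hypothetical label ("hadoop", "hive", or anything else),
    not an Iceberg API value.
    """
    if table_kind == "hadoop":
        # The location handed to the table operations wins; any `location`
        # stored in metadata.json is ignored.
        candidate = operations_location
    elif table_kind == "hive":
        # Assumed option: use the location property tracked in the metastore.
        candidate = metastore_location
    else:
        # Metastores that do not track a table location fall back to the
        # `location` attribute of metadata.json.
        candidate = metadata_json_location
    if not candidate:
        raise ValueError(f"no base location available for {table_kind!r} table")
    return candidate


print(base_table_location(
    "hadoop",
    operations_location="hdfs://nn1/warehouse/db/tbl",
    metadata_json_location="hdfs://old/warehouse/db/tbl",
))
```

Keeping the choice in one function mirrors the idea that path conversion stays transparent to the rest of the API.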




[GitHub] [iceberg] anuragmantri commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-764855499


   Here is a design doc for this proposal: https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing




[GitHub] [iceberg] anuragmantri edited a comment on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
anuragmantri edited a comment on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-767157268


   Thanks for the initial feedback @rdblue. I rephrased the proposal and approach sections. Could you please take another look at the [document](https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit?usp=sharing)?
   
   Thanks,
   Anurag




[GitHub] [iceberg] rdblue commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-869880358


   @pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files. My main point is that we need to have a well-documented plan for tracking the table location.




[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-878245863


   > @pvary, we can explore that, but it is fairly easy to write a new metadata file and use that for a table. At least that's a small operation, compared to rewriting all metadata and position delete files.
   
   @rdblue: I am not sure that I understand what you are proposing here. I might be mistaken, but AFAIK we store the locations of the files in the `snapshot` file, in the `manifest` files, and in v2 also in the `position delete` files. If we want to create a usable full copy of the table, then currently we need to do the following:
   - Copy the data files
   - Recreate or parse and replace the path in all of the files mentioned above (`snapshot`, `manifest` files and `position delete` files)
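
A minimal sketch of the "parse and replace the path" step, assuming every stored reference is an absolute path under the source table location. The function name `rewrite_prefix` and the example paths are hypothetical; reading and writing the actual metadata files is left out.

```python
def rewrite_prefix(path: str, source_prefix: str, target_prefix: str) -> str:
    """Swap the source table prefix of an absolute path for the target prefix."""
    src = source_prefix.rstrip("/") + "/"
    if not path.startswith(src):
        raise ValueError(f"{path!r} is not under {source_prefix!r}")
    return target_prefix.rstrip("/") + "/" + path[len(src):]


# Hypothetical path references, standing in for the entries that would be
# read from the metadata files of the source table.
source_paths = [
    "hdfs://nn1/warehouse/db/tbl/metadata/snap-1.avro",
    "hdfs://nn1/warehouse/db/tbl/data/part-00000.parquet",
]

target_paths = [
    rewrite_prefix(p, "hdfs://nn1/warehouse/db/tbl", "s3://bucket/db/tbl")
    for p in source_paths
]
print(target_paths)
```

Every file that embeds such a path has to be rewritten this way, which is why the cost grows with the amount of metadata rather than staying constant.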
   
   Is there an easier way to create a full working replica of an Iceberg table where we do not use any files/data from the original table and the 2 tables (original and the new) can live independently after the creation of the replica?
   
   > My main point is that we need to have a well-documented plan for tracking the table location.
   
   Totally agree with this 👍 
   
   Thanks,
   Peter




[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-885497730


   One more use case that also requires manipulating paths in the snapshot/metadata files is enabling temporary access to Iceberg tables stored on blob stores. In this case we have to create presigned URLs to access the specific files, and we have to use these URLs in the metadata of the table.




[GitHub] [iceberg] pvary commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-869020125


   @rdblue: We would like to use relative paths for replicating Iceberg tables between on-prem Hive instances, for moving on-prem Iceberg tables to the cloud, and sometimes even for moving data from the cloud to on-prem.
   
   While we can do this by rewriting the manifest and snapshot files, it would be much easier to synchronize only the files themselves and update the metadata_location / previous_metadata_location for the target table and be done with it.
   
   Because of the use cases above, it would be useful to have relative paths in the Iceberg metadata for Metastore tables as well, so that the absolute path could be generated using the Metastore table location. Maybe even the `metadata_location` / `previous_metadata_location` could be generated this way, so the tables on both ends of the replication could have the same data and metadata too.
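
As a rough sketch of that idea: if references are stored relative to the table location, both replicas can hold byte-identical metadata and still resolve to different absolute paths. The helper names `relativize` and `resolve`, the example locations, and the metadata file name are hypothetical; treating anything with a URI scheme or a leading "/" as absolute is also an assumption.

```python
from urllib.parse import urlparse


def relativize(table_location: str, absolute_path: str) -> str:
    """Strip the table location from an absolute path."""
    base = table_location.rstrip("/") + "/"
    if not absolute_path.startswith(base):
        raise ValueError(f"{absolute_path!r} is not under {table_location!r}")
    return absolute_path[len(base):]


def resolve(table_location: str, path: str) -> str:
    """Return an absolute path, resolving relative ones against the table location."""
    # Assumption: a URI scheme (s3://, hdfs://, ...) or a leading "/" means absolute.
    if urlparse(path).scheme or path.startswith("/"):
        return path
    return table_location.rstrip("/") + "/" + path


on_prem = "hdfs://nn1/warehouse/db/tbl"  # location tracked by the source metastore
cloud = "s3://bucket/warehouse/db/tbl"   # location tracked by the target metastore

# The same stored value can be used on both ends of the replication.
stored = relativize(on_prem, on_prem + "/metadata/00003-abc.metadata.json")
print(stored)
print(resolve(on_prem, stored))
print(resolve(cloud, stored))
```

With this scheme, replication reduces to copying files and pointing the target metastore entry at its own table location.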




[GitHub] [iceberg] aokolnychyi commented on issue #1617: Support relative paths in Table Metadata

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1617:
URL: https://github.com/apache/iceberg/issues/1617#issuecomment-833009883


   @flyrain, could you take a look at what can be done here?
   
   

