Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/31 18:35:32 UTC

[GitHub] [iceberg] GrigorievNick opened a new issue #2903: Does it possible to write all data files in one folder and do not create folder per partition?

GrigorievNick opened a new issue #2903:
URL: https://github.com/apache/iceberg/issues/2903


   Iceberg tracks file locations in metadata, so there is no need to keep the Hive-style directory layout.
   But Iceberg still writes data into one folder per partition.
   In my case the partitions are organized as ranges, and my storage is S3.
   One of my main problems is that I sometimes need to split a range into two, or coalesce ranges.
   Because the partitions are ranges, I really only need to split the one or two files on the partition border.
   But because S3 does not support rename, if the partition is part of the prefix, I have to copy all the data in the partition.
   
   Iceberg is a great tool for managing files, and its architecture does not seem to require a strict folder hierarchy.
   So I wonder: is there a way to tell Iceberg to always write all files to the same folder?
   
   ```
   /tmp/iceberg_cdc_test/iceberg_catalog/hdl/enrichment_table/data/idRange=0-50/ts_day=2021-07-30/00000-8-8b701a28-8a19-4b57-a84e-f2ff5b12bbb6-00001.orc 
   ```
   This is an example of a file written by Iceberg into a partition folder.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] GrigorievNick closed issue #2903: Does it possible to write all data files in one folder and do not create folder per partition?

Posted by GitBox <gi...@apache.org>.
GrigorievNick closed issue #2903:
URL: https://github.com/apache/iceberg/issues/2903


   




[GitHub] [iceberg] GrigorievNick edited a comment on issue #2903: Does it possible to write all data files in one folder and do not create folder per partition?

Posted by GitBox <gi...@apache.org>.
GrigorievNick edited a comment on issue #2903:
URL: https://github.com/apache/iceberg/issues/2903#issuecomment-891647675


   >You can set 'write.storage-object.enabled'=true in the table properties, which will append some randomness to the beginning of the path for the data. This would still keep the folder hierarchy but would allow you to spread data files across multiple partitions.
   
   If every file gets a random prefix, that will satisfy my requirements. (As I understand it, S3 creates its internal partitions based on the first 3 characters of the key.)
   Unfortunately, I don't see any change when I specify `'write.storage-object.enabled'=true`.
   @kbendick which Iceberg version has it? I can't even find this string in the [code](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java).
   
   >To be able to have full control over the data layout, you can implement your own LocationProvider and then set write.location-provider.impl. Here is a test that uses a custom location provider: https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/test/java/org/apache/iceberg/TestLocationProvider.java
   
   Thank you for your answer, I will try to do it.
   
   >But coalescing the files after the fact is something that you'd need to do with some sort of Iceberg API (e.g. maybe read in the files to coalesce, create a temporary view of that data, and then use MERGE INTO or INSERT to handle the write), so that appropriate metadata is written.
   
   I think I can write my own rewrite based on the current rewrite actions.
   The list of actions for splitting a partition:
   1. Scan the statistics and find which partition is too big.
   2. Calculate the new ranges.
   3. Taking into account that files are sorted by the column used to build the ranges, find the specific files that must be split between the new partitions.
   4. Split each such file into two.
   5. Take all files from the previous partition and create two new manifests with the new data.
   6. Mark all files in the previous manifest as deleted.
   
   But this is possible only if the data is physically stored in one folder and only the metadata controls the partition layout.
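
The file-selection part of the steps above (step 3) can be sketched as follows. This is only a minimal illustration under assumptions: `FileStats`, `plan_split`, and the integer `min_id`/`max_id` bounds are hypothetical names, and the per-file bounds are assumed to come from the column statistics kept in the manifests. It does not call any Iceberg API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for the range column."""
    path: str
    min_id: int  # smallest value of the range column in the file (inclusive)
    max_id: int  # largest value of the range column in the file (inclusive)


def plan_split(
    files: List[FileStats], boundary: int
) -> Tuple[List[FileStats], List[FileStats], List[FileStats]]:
    """Classify files for a split at `boundary`.

    The left range takes values < boundary, the right range takes
    values >= boundary. Returns (left, right, straddling).
    """
    left: List[FileStats] = []
    right: List[FileStats] = []
    straddling: List[FileStats] = []
    for f in files:
        if f.max_id < boundary:
            left.append(f)        # entirely below the border: metadata-only move
        elif f.min_id >= boundary:
            right.append(f)       # entirely above the border: metadata-only move
        else:
            straddling.append(f)  # crosses the border: must be rewritten as two files
    return left, right, straddling
```

Only the files in the third list would need to be physically rewritten; the others could in principle be reassigned in metadata alone, which is the point of wanting a flat data folder.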




[GitHub] [iceberg] GrigorievNick commented on issue #2903: Does it possible to write all data files in one folder and do not create folder per partition?

Posted by GitBox <gi...@apache.org>.
GrigorievNick commented on issue #2903:
URL: https://github.com/apache/iceberg/issues/2903#issuecomment-891647675


   >To be able to have full control over the data layout, you can implement your own LocationProvider and then set write.location-provider.impl. Here is a test that uses a custom location provider: https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/test/java/org/apache/iceberg/TestLocationProvider.java
   
   Thank you for your answer, I will try to do it.
   
   >But coalescing the files after the fact is something that you'd need to do with some sort of Iceberg API (e.g. maybe read in the files to coalesce, create a temporary view of that data, and then use MERGE INTO or INSERT to handle the write), so that appropriate metadata is written.
   
   I think I can write my own rewrite based on the current rewrite actions.
   The list of actions for splitting a partition:
   1. Scan the statistics and find which partition is too big.
   2. Calculate the new ranges.
   3. Taking into account that files are sorted by the column used to build the ranges, find the specific files that must be split between the new partitions.
   4. Split each such file into two.
   5. Take all files from the previous partition and create two new manifests with the new data.
   6. Mark all files in the previous manifest as deleted.
   
   But this is possible only if the data is physically stored in one folder and only the metadata controls the partition layout.








[GitHub] [iceberg] kbendick commented on issue #2903: Does it possible to write all data files in one folder and do not create folder per partition?

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #2903:
URL: https://github.com/apache/iceberg/issues/2903#issuecomment-891459560


   @GrigorievNick You have a few options.
   
   You can set `'write.storage-object.enabled'=true` in the table properties, which will append some randomness to the beginning of the path for the data. This would still keep the folder hierarchy, but would allow you to spread data files across multiple partitions.
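
A rough sketch of what such a prefix could look like, written in Python for brevity. The function name `object_store_path`, the SHA-1 hash, and the 8-character prefix length are illustrative assumptions, not Iceberg's actual implementation. Also note that in `TableProperties.java` the property appears to be spelled `write.object-storage.enabled`, so it is worth checking the exact name against the Iceberg version in use.

```python
import hashlib
import posixpath


def object_store_path(data_root: str, partition_path: str, filename: str) -> str:
    """Place a short hash of the file name ahead of the partition path."""
    # Deterministic 8-hex-character prefix derived from the file name only,
    # so the same file name always maps to the same prefix.
    prefix = hashlib.sha1(filename.encode("utf-8")).hexdigest()[:8]
    # The partition folders are kept, but they now sit under the hash prefix.
    return posixpath.join(data_root, prefix, partition_path, filename)
```

In this sketch the partition folders still exist under each prefix, so it spreads keys across the store but does not by itself give a flat layout.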
   
   To be able to have full control over the data layout, you can implement your own LocationProvider and then set `write.location-provider.impl`. Here is a test that uses a custom location provider: https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/test/java/org/apache/iceberg/TestLocationProvider.java
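
For illustration, the path logic of a flat provider could look like the sketch below. It is written in Python only to keep it short; the real `LocationProvider` is a Java interface, and the class and method names here (`FlatLocationProvider`, `new_data_location`) are hypothetical stand-ins, not Iceberg API.

```python
from typing import Mapping, Optional


class FlatLocationProvider:
    """Hypothetical provider that puts every data file in one folder."""

    def __init__(self, table_location: str, properties: Mapping[str, str]):
        # All data files go directly under <table location>/data.
        self.data_dir = table_location.rstrip("/") + "/data"
        self.properties = dict(properties)

    def new_data_location(
        self, filename: str, partition_path: Optional[str] = None
    ) -> str:
        # The partition path is deliberately ignored: partitioning would
        # then live only in the table metadata, not in the file path.
        return f"{self.data_dir}/{filename}"
```

With a provider like this, the example file from the issue would land at `.../enrichment_table/data/<file name>.orc` instead of under `idRange=0-50/ts_day=2021-07-30/`.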
   
   See also the documentation on writing your own file io implementation, location provider, or catalog: https://iceberg.apache.org/custom-catalog/#custom-file-io-implementation
   
   However, altering the files outside of Iceberg is going to cause issues, because the data files are tracked in the table metadata. So I don't know how achievable your ultimate goal is. Perhaps others can weigh in, but a custom location provider (or, more likely, a custom FileIO plus LocationProvider, both of which are relatively small interfaces) could be used to at least get all of the data into a flat folder structure. But coalescing the files after the fact is something that you'd need to do through some sort of Iceberg API (e.g. read in the files to coalesce, create a temporary view of that data, and then use `MERGE INTO` or `INSERT` to handle the write), so that the appropriate metadata is written.
   
   Hope that helps.



