You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/07 11:06:01 UTC

[GitHub] [iceberg] boroknagyz opened a new issue #2043: Partition data to path, and path to data conversions are not consistent with each other

boroknagyz opened a new issue #2043:
URL: https://github.com/apache/iceberg/issues/2043


   PartitionSpec.partitionToPath() method creates a human-readable partition path:
   https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L173
   
   However, DataFiles.fillFromPath() invokes Conversions.fromPartitionString()
   https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/core/src/main/java/org/apache/iceberg/DataFiles.java#L84
   
   Which interprets the values via a simple string parsing:
   https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/api/src/main/java/org/apache/iceberg/types/Conversions.java#L44-L78
   
   This causes problems with partition transforms. E.g., partition transform YEAR stores the passed years since 1970: https://iceberg.apache.org/spec/#partition-transforms
   
   Therefore if we store the timestamp '2021-01-07 11:58:19.523065', and our table is using the YEAR partition transform, then the partition path will be ts_year=2021 (human-readable). Now, if we set the partition data using Datafiles.fillFromPath() (or, with Builder.withPartitionPath()), partition data will have the value of 2021, and not the passed years since 1970.
   
   I think besides toHumanString(), partition transforms should have a fromHumanString() method as well, and this method should be used when parsing the partition path.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-758281167


   Conversion to and from string for partition data has caused a lot of bugs in Hive and other projects that use Hive tables. I do not think that it is a good idea to do it in general. Avoiding those bugs is the reason why Iceberg always keeps track of the partition values as a struct or row and serializes it as data. I think it is a bad idea to add this to the Iceberg library because it would encourage a bad practice.
   
   If you need to do that for Impala in the short term, then let's keep the implementation there; hopefully with a note to fix it later.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] boroknagyz commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other

Posted by GitBox <gi...@apache.org>.
boroknagyz commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757885743


   Thanks for your reply, @rdblue.
   In Impala we also use Builder.withPartitionPath, because we generate the partition path already in our C++ backend, so this was the easiest way to transfer partition data from C++ to Java.
   
   I see that Iceberg  unit test code also uses withPartitionPath to conveniently create partition data, so for Iceberg this conversion could be useful for unit tests at least.
   
   With that said, if you think two-way conversion could be useful for Iceberg, then I'll happily contribute code. If not, that's also OK for me, I can add partition path -> partition data conversion to Impala's code base.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757058872


   `DataFiles.fillFromPath` is only available as a convenience for importing Hive data files. It does not parse values from Iceberg paths on purpose because paths should not be parsed in Iceberg. Iceberg metadata contains the actual values that should be used instead of parsing from a String. Converting a partition data tuple to a path component is considered one-way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757058872


   `DataFiles.fillFromPath` is only available as a convenience for importing Hive data files. It does not parse values from Iceberg paths on purpose because paths should not be parsed in Iceberg. Iceberg metadata contains the actual values that should be used instead of parsing from a String. Converting a partition data tuple to a path component is considered one-way.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org