You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/07 11:06:01 UTC
[GitHub] [iceberg] boroknagyz opened a new issue #2043: Partition data to path, and path to data conversions are not consistent with each other
boroknagyz opened a new issue #2043:
URL: https://github.com/apache/iceberg/issues/2043
PartitionSpec.partitionToPath() method creates a human-readable partition path:
https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L173
However, DataFiles.fillFromPath() invokes Conversions.fromPartitionString()
https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/core/src/main/java/org/apache/iceberg/DataFiles.java#L84
Which interprets the values via a simple string parsing:
https://github.com/apache/iceberg/blob/083e70bd0e9be39c25170aee2ddb7e084527fed6/api/src/main/java/org/apache/iceberg/types/Conversions.java#L44-L78
This causes problems with partition transforms. E.g., partition transform YEAR stores the passed years since 1970: https://iceberg.apache.org/spec/#partition-transforms
Therefore if we store the timestamp '2021-01-07 11:58:19.523065', and our table is using the YEAR partition transform, then the partition path will be ts_year=2021 (human-readable). Now, if we set the partition data using Datafiles.fillFromPath() (or, with Builder.withPartitionPath()), partition data will have the value of 2021, and not the passed years since 1970.
I think besides toHumanString(), partition transforms should have a fromHumanString() method as well, and this method should be used when parsing the partition path.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other
Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-758281167
Conversion to and from string for partition data has caused a lot of bugs in Hive and other projects that use Hive tables. I do not think that it is a good idea to do it in general. Avoiding those bugs is the reason why Iceberg always keeps track of the partition values as a struct or row and serializes it as data. I think it is a bad idea to add this to the Iceberg library because it would encourage a bad practice.
If you need to do that for Impala in the short term, then let's keep the implementation there; hopefully with a note to fix it later.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] boroknagyz commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other
Posted by GitBox <gi...@apache.org>.
boroknagyz commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757885743
Thanks for your reply, @rdblue.
In Impala we also use Builder.withPartitionPath, because we generate the partition path already in our C++ backend, so this was the easiest way to transfer partition data from C++ to Java.
I see that Iceberg unit test code also uses withPartitionPath to conveniently create partition data, so for Iceberg this conversion could be useful for unit tests at least.
With that said, if you think two-way conversion could be useful for Iceberg, then I'll happily contribute code. If not, that's also OK for me, I can add partition path -> partition data conversion to Impala's code base.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other
Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757058872
`DataFiles.fillFromPath` is only available as a convenience for importing Hive data files. It does not parse values from Iceberg paths on purpose because paths should not be parsed in Iceberg. Iceberg metadata contains the actual values that should be used instead of parsing from a String. Converting a partition data tuple to a path component is considered one-way.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on issue #2043: Partition data to path, and path to data conversions are not consistent with each other
Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #2043:
URL: https://github.com/apache/iceberg/issues/2043#issuecomment-757058872
`DataFiles.fillFromPath` is only available as a convenience for importing Hive data files. It does not parse values from Iceberg paths on purpose because paths should not be parsed in Iceberg. Iceberg metadata contains the actual values that should be used instead of parsing from a String. Converting a partition data tuple to a path component is considered one-way.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org