You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "gaborkaszab (via GitHub)" <gi...@apache.org> on 2023/05/15 13:52:54 UTC

[GitHub] [iceberg] gaborkaszab opened a new issue, #7612: DataFile creation by file path seems wrong

gaborkaszab opened a new issue, #7612:
URL: https://github.com/apache/iceberg/issues/7612

   ### Apache Iceberg version
   
   1.2.1 (latest release)
   
   ### Query engine
   
   Impala
   
   ### Please describe the bug 🐞
   
   Using [DataFiles.Builder.withPartitionPath()](https://github.com/apache/iceberg/blob/81af3f76abe98419495bd4d18440a14bfaae2e0e/core/src/main/java/org/apache/iceberg/DataFiles.java#L241) may lead to issues when you have string partition columns that contain special characters.
   
   One example: our table is partitioned by a string column 's' and this column has a value of "11/12/13". As a file path this would look like the following:
   `warehouse_name/tbl_name/s=11%2F12%2F13`
   This is because the partition value is URL-encoded for paths.
   
   So when we create a DataFile using the withPartitionPath() approach thi URL encoded '11%2F12%2F13' value is going to be written into PartitionData even though the original partition value was '11/12/13'
   
   I think this is incorrect because for instance if we use this DataFile to append to the table by AppendFiles then the URL-encoded value of the partition is going to be saved for this particular partition of the table that can confuse readers. One example: Impala reads the value of the partition columns from metadata and not from the file and in this scenario gives a wrong result after adding a file (with no URL-encoded values in the partition col) using the Datafile.Builder.withPartitionPath() function to the table and then queries the table then the table shows URL-encoded values because the metadata is incorrect.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on issue #7612: DataFile creation by file path seems wrong

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1549679202

   cc @deniskuzZ for Hive


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] DataFile creation by file path seems wrong [iceberg]

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab closed issue #7612: DataFile creation by file path seems wrong
URL: https://github.com/apache/iceberg/issues/7612


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on issue #7612: DataFile creation by file path seems wrong

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1569486257

   @wypoon Thanks for the analysis, I got to the same conclusion. I also published a fix for this yesterday. I don't think it's necessary to make changes in DataFiles.Builder, though.
   https://github.com/apache/iceberg/pull/7738


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on issue #7612: DataFile creation by file path seems wrong

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1548033388

   I did a search and apparently almost all the usages for this withPartitionPath(path) is for tests, that is fine. However, TableMigrationUtil also uses this, and I think this causes a bug on the clients side. I checked and for instance Hive uses this code path and after table migration the resulting table will be incorrect if the string partition cols have these special chars.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on issue #7612: DataFile creation by file path seems wrong

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1549577560

   Hey @JonasJ-ap, @ericlgoodman ,
   I saw you implemented the Delta Lake table to Iceberg table migration [here](https://github.com/apache/iceberg/commit/eeb055a37705b8c8ff1e13acefe0ba24382c66fe). I also see that you used DataFiles.builder(spec).withPartitionPath(path) when you create data files. I believe that this approach when you use the partition path to determine the partition values is incorrect. See the description above.
   
   Hive also has an issue with this approach, hence I reach out to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] gaborkaszab commented on issue #7612: DataFile creation by file path seems wrong

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1548063277

   I think we should restrict the use of DataFiles.Builder.withPartitionPath(part) for testing and remove the usage from 'prod' code such as TableMigrationUtil.
   
   $ find ../iceberg -name *.java -exec grep -Hn "withPartitionPath(" {} \; | grep -v Test
   ../iceberg/data/src/main/java/org/apache/iceberg/data/TableMigrationUtil.java:190:        .withPartitionPath(partitionKey)
   ../iceberg/core/src/main/java/org/apache/iceberg/FileMetadata.java:171:    public Builder withPartitionPath(String newPartitionPath) {
   ../iceberg/core/src/main/java/org/apache/iceberg/DataFiles.java:236:    public Builder withPartitionPath(String newPartitionPath) {
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on issue #7612: DataFile creation by file path seems wrong

Posted by "wypoon (via GitHub)" <gi...@apache.org>.
wypoon commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1569311184

   @gaborkaszab there is indeed a problem with `TableMigrationUtil.listPartition` because it uses `DataFiles.Builder#withPartitionPath`. However, the problem is not with the partition URI being URL encoded. That is fine. The problem is with `TableMigrationUtil.listPartition` building a partition key of the form `a=1/b=2` and then `DataFiles.fillFromPath` taking this String and trying to recover the partition values by splitting it along the character '/'.
   In `DataFiles.Builder` we have the partition URI, which is correct; we just need to get all the partition values so we can set them in the `DataFile` we're building. `withPartitionPath` is not able to do it.
   I have a fix for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] DataFile creation by file path seems wrong [iceberg]

Posted by "gaborkaszab (via GitHub)" <gi...@apache.org>.
gaborkaszab commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1831734715

   https://github.com/apache/iceberg/pull/7744 solves this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] DataFile creation by file path seems wrong [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7612:
URL: https://github.com/apache/iceberg/issues/7612#issuecomment-1828847207

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org