Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/07 05:56:56 UTC

[GitHub] [iceberg] the-other-tim-brown opened a new pull request, #5932: Parquet: support nested fields when assigning fallback ids

the-other-tim-brown opened a new pull request, #5932:
URL: https://github.com/apache/iceberg/pull/5932

   The current implementation of `addFallbackIds` only sets IDs on top-level fields. If a Parquet file has nested fields and no field IDs, it cannot be read correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] the-other-tim-brown commented on pull request #5932: Parquet: support nested fields when assigning fallback ids

Posted by GitBox <gi...@apache.org>.

the-other-tim-brown commented on PR #5932:
URL: https://github.com/apache/iceberg/pull/5932#issuecomment-1272101180

   @rdblue to add some more context: I'm trying to bootstrap an Iceberg table from existing Parquet files, and when reading these files from Spark I hit this [line](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L91) in `ReadConf`. The only way I found to make the Spark reader work was to add this code. I see that the line above it references the name mapping, but I'm not sure how to pass that in.



[GitHub] [iceberg] rdblue commented on pull request #5932: Parquet: support nested fields when assigning fallback ids

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #5932:
URL: https://github.com/apache/iceberg/pull/5932#issuecomment-1272077852

Thanks for taking a look at this, @the-other-tim-brown, but I'm not sure that this is the right way to do what you want to accomplish.

The fallback ID assignment is really old and makes assumptions about how the data has evolved, specifically that position-based column resolution is valid (just like CSV with no header). This works for top-level columns, but it is not correct for nested fields. With position-based column resolution, you can safely add columns to the end of the schema. So you can have some files with columns `1: a, 2: b` and some with columns `1: a, 2: b, 3: c` (note that ID assignment is consistent). The problem with nested columns is that adding a top-level field can change the IDs assigned to nested fields. For example: `1: a struct<3: x, 4: y>, 2: b` and `1: a struct<4: x, 5: y>, 2: b, 3: c`. Here the IDs of `x` and `y` shift from 3 and 4 to 4 and 5 once `c` is added.
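
To make the ID shift concrete, here is a minimal Java sketch of position-based assignment. It is a simplified model, not Iceberg's `addFallbackIds`: the class `FallbackIdSketch`, the method `assignByPosition`, and the rule that all top-level columns are numbered before any nested field are assumptions chosen to reproduce the IDs in the example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FallbackIdSketch {

    // Hypothetical model: a schema maps each top-level column name, in file
    // order, to the ordered names of its nested fields (empty for primitives).
    // IDs are handed out by position: top-level columns first, then nested fields.
    static Map<String, Integer> assignByPosition(Map<String, List<String>> schema) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        int next = 1;
        for (String column : schema.keySet()) {
            ids.put(column, next++);
        }
        for (Map.Entry<String, List<String>> column : schema.entrySet()) {
            for (String child : column.getValue()) {
                ids.put(column.getKey() + "." + child, next++);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Older files: a struct<x, y>, b
        Map<String, List<String>> older = new LinkedHashMap<>();
        older.put("a", List.of("x", "y"));
        older.put("b", List.of());

        // Newer files: same schema with top-level column c appended
        Map<String, List<String>> newer = new LinkedHashMap<>(older);
        newer.put("c", List.of());

        System.out.println(assignByPosition(older)); // {a=1, b=2, a.x=3, a.y=4}
        System.out.println(assignByPosition(newer)); // {a=1, b=2, c=3, a.x=4, a.y=5}
    }
}
```

In this model the top-level IDs stay stable when `c` is appended, but `a.x` and `a.y` receive different IDs in the two files, so a reader that projects by ID would pick up the wrong nested column.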

I think what you probably want is to use a name mapping, which is more flexible and can probably handle what you want to do.
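
As a rough illustration of why resolving by name avoids the problem, the sketch below looks up each column path in a fixed name-to-ID table instead of counting positions. This is a toy model and not Iceberg's name mapping implementation: the class `NameMappingSketch`, the method `resolveByName`, the dotted-path keys, and the specific IDs are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameMappingSketch {

    // Hypothetical model: the mapping is defined once (for example, from the
    // table schema) and assigns an ID to each column path. Every file is then
    // resolved by name, so column order in the file does not matter.
    // Paths that are absent from the mapping are skipped.
    static Map<String, Integer> resolveByName(Map<String, Integer> mapping, List<String> filePaths) {
        Map<String, Integer> resolved = new LinkedHashMap<>();
        for (String path : filePaths) {
            Integer id = mapping.get(path);
            if (id != null) {
                resolved.put(path, id);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Assumed mapping for a table schema: a struct<x, y>, b, c
        Map<String, Integer> mapping = new LinkedHashMap<>();
        mapping.put("a", 1);
        mapping.put("b", 2);
        mapping.put("c", 3);
        mapping.put("a.x", 4);
        mapping.put("a.y", 5);

        // Column paths found in an older file (no c) and a newer file (with c)
        List<String> olderFile = List.of("a", "a.x", "a.y", "b");
        List<String> newerFile = List.of("a", "a.x", "a.y", "b", "c");

        System.out.println(resolveByName(mapping, olderFile)); // {a=1, a.x=4, a.y=5, b=2}
        System.out.println(resolveByName(mapping, newerFile)); // {a=1, a.x=4, a.y=5, b=2, c=3}
    }
}
```

Both files resolve `a.x` and `a.y` to the same IDs, which is the consistency that position-based assignment cannot guarantee for nested fields.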


[GitHub] [iceberg] the-other-tim-brown commented on pull request #5932: Parquet: support nested fields when assigning fallback ids

Posted by GitBox <gi...@apache.org>.

the-other-tim-brown commented on PR #5932:
URL: https://github.com/apache/iceberg/pull/5932#issuecomment-1273885025

   I see that I just need to set `TableProperties.DEFAULT_NAME_MAPPING` to `NameMappingParser.toJson(MappingUtil.create(transaction.table().schema()))` in the properties when creating the table, and that fixed my reader issues. This is definitely what I should have been doing in the first place. Thanks for pointing me in the right direction!



[GitHub] [iceberg] rdblue closed pull request #5932: Parquet: support nested fields when assigning fallback ids

Posted by GitBox <gi...@apache.org>.

rdblue closed pull request #5932: Parquet: support nested fields when assigning fallback ids
URL: https://github.com/apache/iceberg/pull/5932

