You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/03/19 06:46:43 UTC

[GitHub] [iceberg] rzhang10 opened a new pull request #2352: make GetProjectedIds able to handle empty struct

rzhang10 opened a new pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352


   In our production, we encounter a table with an empty struct in its schema (the table is ORC format but I think both iceberg and ORC somehow doesn't forbid the creation of table/schema with empty struct), and when we try to read this table, iceberg internally throws away the empty struct in the expected schema, thus the read builder doesn't build the correct number of readers (basically it lacks the reader for the empty struct field).
   
   This fix is to let iceberg not throw away the empty struct field, making it able to read tables with such corner case schemas.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue closed pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rzhang10 edited a comment on pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rzhang10 edited a comment on pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352#issuecomment-823634268


   @rdblue Hey Ryan, I feel what causes the confusion here is the class `GetProjectedIds` 's name, if dig into the code of this class you'll find it actually doesn't do field projection, what it does is it tries to get extract all the internal ids from complex types (struct, map, list) out, such that the returned set of ids only maps to primitive type fields. However, the corner case it is missing is exactly the empty struct case, where the struct field itself should be kept. If you take a look at the code, the `GetProjectedIds` only gets used by `TypeUtil.selectNot` function which calls `select()` to really gets the projection, but via another visitor called `PruneColumns`.
   
   So I believe my code is handling the corner case correctly. cc @shardulm94 @wmoustafa


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rzhang10 edited a comment on pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rzhang10 edited a comment on pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352#issuecomment-823634268


   @rdblue Hey Ryan, I feel what causes the confusion here is the class `GetProjectedIds` 's name, if you go to this class actually doesn't do field projection, what it does is it tries to get extract all the internal ids from complex types (struct, map, list) out, such that the returned set of ids only maps to primitive type fields. However, the corner case it is missing is exactly the empty struct case, where the struct field itself should be kept. If you take a look at the code, the `GetProjectedIds` only gets used by `TypeUtil.selectNot` function which calls `select()` to really gets the projection, but via another visitor called `PruneColumns`.
   
   So I believe my code is handling the corner case correctly. cc @shardulm94 @wmoustafa


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352#issuecomment-806113814


   @rzhang10, I think the problem with this is that it will cause more columns than needed to be projected. `GetProjectedIds` is used to get the IDs from an expected schema. If that expected schema contains an empty struct, then the struct itself will be selected. But when we apply that projection to a file schema using `select`, the entire struct from the file will be selected. That's because you might select a struct using syntax like `SELECT location FROM table`, where `location` is `struct<latitude: double, longitude: double>`.
   
   I think there might be a different way to solve the problem you're hitting. What method is used to produce the read schema from the expected schema and a file schema? I suspect that's what we need to update.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352#issuecomment-830870448


   @rzhang10, this isn't safe to commit because it will cause more columns than intended to be projected when reading files. I'm -1 for committing this. I think we can find a solution to the underlying problem, but for the reason I stated above I don't think that this is the fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rzhang10 commented on pull request #2352: make GetProjectedIds able to handle empty struct

Posted by GitBox <gi...@apache.org>.
rzhang10 commented on pull request #2352:
URL: https://github.com/apache/iceberg/pull/2352#issuecomment-823634268


   @rdblue Hey Ryan, I feel what causes the confusion here is the class `GetProjectedIds` 's name, if you go to this class actually doesn't do field projection, what it does is it tries to get extract all the internal ids from complex types (struct, map, list) out, such that the returned set of ids only maps to primitive type fields. However, the corner case it is missing is exactly the empty struct case, where the struct field itself should be kept. If you take a look at the code, the `GetProjectedIds` only gets used by `TypeUtil.selectNot` function which calls `select()` to really gets the projection, but via another visitor called `PruneColumns`.
   
   So I believe my code is handling the corner case correctly. cc @shardulm94 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org