Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/22 17:10:45 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #4196: Parquet: Enable vectorized reads by default

aokolnychyi opened a new pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196


   This PR enables vectorized Parquet reads by default. This feature has been available for quite some time and is being used by multiple companies in production. I do anticipate that more bugs will be found once we enable this by default, but I think there is sufficient confidence that it will perform reasonably well in most cases.
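
To illustrate what "enabled by default" means for a reader, here is a minimal sketch of resolving the flag from table properties. It assumes the property key is `read.parquet.vectorization.enabled`; the class name and helper method are hypothetical and are not taken from the Iceberg code base.

```java
import java.util.HashMap;
import java.util.Map;

public class VectorizationDefaultSketch {
  // Assumed property key; verify it against the Iceberg release in use.
  static final String VECTORIZATION_ENABLED = "read.parquet.vectorization.enabled";

  // The default this PR proposes: vectorized reads are on unless disabled.
  static final boolean VECTORIZATION_ENABLED_DEFAULT = true;

  // Hypothetical helper: an explicit property value wins, otherwise use the default.
  static boolean vectorizationEnabled(Map<String, String> tableProperties) {
    String value = tableProperties.get(VECTORIZATION_ENABLED);
    return value == null ? VECTORIZATION_ENABLED_DEFAULT : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    // No explicit setting: falls back to the new default.
    System.out.println(vectorizationEnabled(props));

    // A table that must keep the row-based reader can opt out explicitly.
    props.put(VECTORIZATION_ENABLED, "false");
    System.out.println(vectorizationEnabled(props));
  }
}
```

With a default like this, tables that cannot use the vectorized path would need an explicit opt-out rather than an explicit opt-in.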


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048226126


   @gustavoatt, it looks like you implemented the initial support for INT96. Would you be interested in adding that support to the vectorized path? We are considering enabling vectorized reads by default, and that is going to cause failures for INT96 timestamps.


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048229617


   I've adapted the failing test for now.


[GitHub] [iceberg] rdblue merged pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196


   


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048349269


   Thanks, @rdblue! Created #4200 to discuss adding support for INT96 to the vectorized path.


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048057115


   I think we know there is an INT96 column only after we opened the file, which is already too late.
   That means we should probably wait until INT96 columns are supported.


[GitHub] [iceberg] rdblue commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048338998


   Thanks, @aokolnychyi!


[GitHub] [iceberg] rdblue commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048053512


   It looks like we should either support INT96 vectorized reads, or turn off vectorization when we see there is an INT96 column.
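
The second option, turning off vectorization when an INT96 column is seen, could be sketched roughly as follows. The types here (`PhysicalType`, `Column`) are hypothetical stand-ins rather than the parquet-mr or Iceberg APIs, and the sketch assumes the file schema is already available when the reader is chosen, which, as noted elsewhere in this thread, is only true after the file has been opened.

```java
import java.util.Arrays;
import java.util.List;

public class Int96FallbackSketch {
  // Hypothetical stand-in for Parquet physical types; not the parquet-mr enum.
  enum PhysicalType { BOOLEAN, INT32, INT64, INT96, FLOAT, DOUBLE, BINARY, FIXED_LEN_BYTE_ARRAY }

  // Hypothetical minimal column descriptor.
  static final class Column {
    final String name;
    final PhysicalType type;

    Column(String name, PhysicalType type) {
      this.name = name;
      this.type = type;
    }
  }

  // The vectorized path is assumed to support everything except INT96.
  static boolean canVectorize(List<Column> fileSchema) {
    for (Column column : fileSchema) {
      if (column.type == PhysicalType.INT96) {
        return false;
      }
    }
    return true;
  }

  // Use the vectorized reader only when it is enabled and the file allows it.
  static String chooseReader(boolean vectorizationEnabled, List<Column> fileSchema) {
    return vectorizationEnabled && canVectorize(fileSchema) ? "vectorized" : "row-based";
  }

  public static void main(String[] args) {
    List<Column> legacyFile = Arrays.asList(
        new Column("id", PhysicalType.INT64),
        new Column("tmp_col", PhysicalType.INT96));
    List<Column> nativeFile = Arrays.asList(
        new Column("id", PhysicalType.INT64),
        new Column("ts", PhysicalType.INT64));

    System.out.println(chooseReader(true, legacyFile));
    System.out.println(chooseReader(true, nativeFile));
  }
}
```

A per-file check like this only helps if the reader type can still be changed once the schema is known.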


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048067179


   Vectorization is a big deal and helps not only queries but also row-level operations. It would be a bit unfortunate to be blocked by this. I hope someone can work on supporting INT96. Let me see who did the original implementation. Maybe they can work on the vectorized path too.
   
   That being said, I am also inclined to still enable vectorization. At least, this is what we did internally.


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048067179


   Vectorization is a big deal and helps not only queries but also row-level operations. It would be a bit unfortunate to be blocked by this. I hope someone can work on supporting INT96. Let me see who did the original implementation. Maybe they can work on the vectorized path too.


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048050639


   I think the test fails because we don't support vectorized reads of INT96 timestamps (for legacy imported files).
   
   ```
   java.lang.UnsupportedOperationException: Unsupported type: required int96 tmp_col = 2
   ```


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048052262


   cc @rdblue @RussellSpitzer @jackye1995 @szehon-ho @flyrain @karuppayya
   
   What do you think? Do we have to support INT96 in the vectorized path before enabling it by default?


[GitHub] [iceberg] rdblue edited a comment on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048060777


   > I think we know there is an INT96 column only after we opened the file, which is already too late. That means we should probably wait until INT96 columns are supported.
   
   Ah, you're right. I'd opt to ignore this, then. INT96 timestamps are not in the Iceberg spec, for exactly this reason. Iceberg progress shouldn't be held up by them.


[GitHub] [iceberg] rdblue commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048060777


   > I think we know there is an INT96 column only after we opened the file, which is already too late. That means we should probably wait until INT96 columns are supported.
   
   Ah, you're right. I'd opt to ignore this, then. INT96 timestamps are not in the Iceberg spec...


[GitHub] [iceberg] aokolnychyi commented on pull request #4196: Parquet: Enable vectorized reads by default

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #4196:
URL: https://github.com/apache/iceberg/pull/4196#issuecomment-1048057115


   I think we know there is an INT96 column only after we opened the file, which is already too late.

