You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/19 22:16:28 UTC

[GitHub] [iceberg] RussellSpitzer opened a new issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

RussellSpitzer opened a new issue #1635:
URL: https://github.com/apache/iceberg/issues/1635


   When older files are read with Iceberg the metrics are improperly evaluated. We incorrectly assumes that there are 0 nonNull values in every row group which leads to the reader skipping all row groups. 
   
   https://github.com/RussellSpitzer/iceberg/blob/4d1fc91d6528db88548c506a0b110b50121d67ff/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L299-L302
   
   In a Parquet 1.5.0 file we were working with the stats are reported as 
   
   [num_nulls: 0, min/max not defined]
   
   This causes us to pass `colStats.isEmpty` because these are valid settings but `hasNonNullValues` is still false.
   
   This leads us to returned `ROWS_CANNOT_MATCH` and skip the row group.
   
   
   These failures make any filter pushdown to the files return 0 records regardless of whether or not the filter actually matches anything.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712918616


   Ok just fixing the nonnull check is not sufficient because the min and max values cannot be serialized correctly either. I wrote a quick test to just manual check for nonnull s and when you do that it crashes attempting to check the min and max bounds


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712551444


   All I know for sure is we had a user with 1.5.0-Cloudera build of Parquet
   Files. I haven't actually explored past the point where we figured out this
   was the issue. I know in 1.11 there is no problem since when I initially
   tried to repo this issue by just faking their data it was fine. The issue
   only repo'd when I took an actual file they had and imported them into a
   new Iceberg table.
   
   On Mon, Oct 19, 2020 at 8:43 PM Kyle Bendickson <no...@github.com>
   wrote:
   
   > Do you know if there's a version number at which this is fixed? Are we
   > able to tell which version the files were written with and possibly check
   > that? I imagine not, as they usually say the writer type and I don't think
   > parquet files usually have the parquet version in them, more the spark
   > version etc. But a thought.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/iceberg/issues/1635#issuecomment-712535246>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADE2YMXFAVJ374XGLM2IXTSLTTN3ANCNFSM4SW2V5ZQ>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712492648


   Are there non-null values in the row group?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713011906


   @kbendick I think it's probably best I take this one since I have the active repo, Should be ready soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712497137


   Yes, there are only non null values in the row group. The correct action is to not skip any row groups


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712535246


   Do you know if there's a version number at which this is fixed? Are we able to tell which version the files were written with and possibly check that? I imagine not, as they usually say the writer type and I don't think parquet files usually have the parquet version in them, more the spark version etc. But a thought.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712497137


   Yes, there are only non null values in the row group


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712525898


   I can take a stab at this if you dont have time @RussellSpitzer 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712918616


   Ok just fixing the nonnull check is not sufficient because the min and max values cannot be serialized correctly either. I wrote a quick test to just manual check for nonnull s and when you do that it crashes attempting to check the min and max bounds.
   
   This makes me think we need to make a backup isEmpty check. This would see whether nullcount is 0 and min/max are undefined. In this case we should also treat the stats as empty


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712898607


   The issue is "hasNonNull" was only added in 1.7 
   https://github.com/apache/parquet-mr/commit/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74
   
   
   Prior to this "isEmpty" had a different definition as well
   
   https://github.com/apache/parquet-mr/blame/d70fdbc40195077057a1edb14ccd16a26435d007/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L222-L224
   
   So I think out best thing to do is again to not rely only on "hasNonNull"


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713145173


   So this is a bit more complicated. This bug only effects files made by parquet versions not effected by
   
   https://github.com/apache/parquet-mr/pull/197/files
   
   And also made before 1.7
   
   This means it should only effect Parquet Files made by Cloudera Parquet builds since they are the only pre 1.7.0 builds which also have valid statistics which can be read


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick removed a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
kbendick removed a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712525898


   I can take a stab at this if you dont have time @RussellSpitzer 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712475344


   I *think* the right thing to do here is to not rely on hasNonNullValues and also check if rowCount > numNull. I'll check other readers tomorrow to see how they deal with the issue


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713145173


   So this is a bit more complicated. This bug only effects files made by parquet versions not effected by
   
   https://github.com/apache/parquet-mr/pull/197/files
   
   And also made after 1.7
   
   This means it should only effect Parquet Files made by Cloudera Parquet builds since they are the only pre 1.7.0 builds which also have valid statistics which can be read


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713063712


   Ok I can get repos on Hive 1.Latest and Hadoop 2.Latest. The default config there produces Parquet-mr 1.6 which has the same issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713247329


   PR up https://github.com/apache/iceberg/pull/1638


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer closed issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files

Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed issue #1635:
URL: https://github.com/apache/iceberg/issues/1635


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org