You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/19 22:16:28 UTC
[GitHub] [iceberg] RussellSpitzer opened a new issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
RussellSpitzer opened a new issue #1635:
URL: https://github.com/apache/iceberg/issues/1635
When older files are read with Iceberg the metrics are improperly evaluated. We incorrectly assumes that there are 0 nonNull values in every row group which leads to the reader skipping all row groups.
https://github.com/RussellSpitzer/iceberg/blob/4d1fc91d6528db88548c506a0b110b50121d67ff/parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetricsRowGroupFilter.java#L299-L302
In a Parquet 1.5.0 file we were working with the stats are reported as
[num_nulls: 0, min/max not defined]
This causes us to pass `colStats.isEmpty` because these are valid settings but `hasNonNullValues` is still false.
This leads us to returned `ROWS_CANNOT_MATCH` and skip the row group.
These failures make any filter pushdown to the files return 0 records regardless of whether or not the filter actually matches anything.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712918616
Ok just fixing the nonnull check is not sufficient because the min and max values cannot be serialized correctly either. I wrote a quick test to just manual check for nonnull s and when you do that it crashes attempting to check the min and max bounds
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712551444
All I know for sure is we had a user with 1.5.0-Cloudera build of Parquet
Files. I haven't actually explored past the point where we figured out this
was the issue. I know in 1.11 there is no problem since when I initially
tried to repo this issue by just faking their data it was fine. The issue
only repo'd when I took an actual file they had and imported them into a
new Iceberg table.
On Mon, Oct 19, 2020 at 8:43 PM Kyle Bendickson <no...@github.com>
wrote:
> Do you know if there's a version number at which this is fixed? Are we
> able to tell which version the files were written with and possibly check
> that? I imagine not, as they usually say the writer type and I don't think
> parquet files usually have the parquet version in them, more the spark
> version etc. But a thought.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/iceberg/issues/1635#issuecomment-712535246>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AADE2YMXFAVJ374XGLM2IXTSLTTN3ANCNFSM4SW2V5ZQ>
> .
>
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712492648
Are there non-null values in the row group?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713011906
@kbendick I think it's probably best I take this one since I have the active repo, Should be ready soon.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712497137
Yes, there are only non null values in the row group. The correct action is to not skip any row groups
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712535246
Do you know if there's a version number at which this is fixed? Are we able to tell which version the files were written with and possibly check that? I imagine not, as they usually say the writer type and I don't think parquet files usually have the parquet version in them, more the spark version etc. But a thought.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712497137
Yes, there are only non null values in the row group
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712525898
I can take a stab at this if you dont have time @RussellSpitzer
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712918616
Ok just fixing the nonnull check is not sufficient because the min and max values cannot be serialized correctly either. I wrote a quick test to just manual check for nonnull s and when you do that it crashes attempting to check the min and max bounds.
This makes me think we need to make a backup isEmpty check. This would see whether nullcount is 0 and min/max are undefined. In this case we should also treat the stats as empty
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712898607
The issue is "hasNonNull" was only added in 1.7
https://github.com/apache/parquet-mr/commit/4bf9be34a87b51d07e0b0c9e74831bbcdbce0f74
Prior to this "isEmpty" had a different definition as well
https://github.com/apache/parquet-mr/blame/d70fdbc40195077057a1edb14ccd16a26435d007/parquet-column/src/main/java/parquet/column/statistics/Statistics.java#L222-L224
So I think out best thing to do is again to not rely only on "hasNonNull"
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer edited a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713145173
So this is a bit more complicated. This bug only effects files made by parquet versions not effected by
https://github.com/apache/parquet-mr/pull/197/files
And also made before 1.7
This means it should only effect Parquet Files made by Cloudera Parquet builds since they are the only pre 1.7.0 builds which also have valid statistics which can be read
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] kbendick removed a comment on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
kbendick removed a comment on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712525898
I can take a stab at this if you dont have time @RussellSpitzer
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-712475344
I *think* the right thing to do here is to not rely on hasNonNullValues and also check if rowCount > numNull. I'll check other readers tomorrow to see how they deal with the issue
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713145173
So this is a bit more complicated. This bug only effects files made by parquet versions not effected by
https://github.com/apache/parquet-mr/pull/197/files
And also made after 1.7
This means it should only effect Parquet Files made by Cloudera Parquet builds since they are the only pre 1.7.0 builds which also have valid statistics which can be read
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713063712
Ok I can get repos on Hive 1.Latest and Hadoop 2.Latest. The default config there produces Parquet-mr 1.6 which has the same issue.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer commented on issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #1635:
URL: https://github.com/apache/iceberg/issues/1635#issuecomment-713247329
PR up https://github.com/apache/iceberg/pull/1638
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] RussellSpitzer closed issue #1635: [Bug] Metric filtering skips row groups incorrectly on older Parquet files
Posted by GitBox <gi...@apache.org>.
RussellSpitzer closed issue #1635:
URL: https://github.com/apache/iceberg/issues/1635
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org