Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/08/27 18:02:04 UTC

[jira] [Updated] (IMPALA-10112) Consider skipping FpRateTooHigh() check for bloom filters

     [ https://issues.apache.org/jira/browse/IMPALA-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-10112:
-----------------------------------
    Description: 
The FpRateTooHigh() check disables bloom filters on the sender side when their expected false-positive rate is too high.

It is inaccurate when the build side contains duplicate values of the filter key, e.g. in a many-to-many join or a join with multiple keys. This could be fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false positives, because evaluating them is cheap and eliminating a partition is still beneficial.
* Runtime filters are dynamically disabled on the scan side if they are ineffective. I think we still "evaluate" the always-true filter, which is cheaper than doing the hashing and bloom evaluation, but still not entirely free.
* The disabling is fairly unlikely to kick in for partitioned joins because the check is only applied to a small subset of the filter, before the Or() operation.

So the check is potentially harmful and likely beneficial only for broadcast join filters, where it saves a small amount of scan CPU and, for global filters, coordinator RPCs and broadcasting. It's unclear that the complexity is worth it for this relatively small and uncertain benefit.
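
To make the inaccuracy concrete, here is a minimal sketch of how such a check might behave. It is not the Impala implementation: the helper names (EstimatedFpRate, FpRateTooHighSketch, CountDistinct), the probe count of 8, the 64 KiB filter size and the 0.75 threshold are all assumptions chosen for illustration. It uses the textbook bloom filter estimate (1 - e^(-k*n/m))^k and shows what happens when n is the number of inserted rows rather than the number of distinct keys.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Assumed values for illustration only; not taken from the Impala source.
constexpr double kNumProbes = 8.0;             // hash probes per key (k)
constexpr int64_t kFilterBits = 64 * 1024 * 8; // 64 KiB filter, in bits (m)
constexpr double kMaxFpRate = 0.75;            // hypothetical cutoff

// Textbook bloom filter false-positive estimate: (1 - e^(-k*n/m))^k,
// where n is the number of keys assumed to be in the filter.
double EstimatedFpRate(int64_t num_keys, int64_t filter_bits) {
  if (num_keys <= 0) return 0.0;
  const double fill = 1.0 - std::exp(-kNumProbes * num_keys / filter_bits);
  return std::pow(fill, kNumProbes);
}

// Hypothetical stand-in for the sender-side check: true means the filter
// would be disabled.
bool FpRateTooHighSketch(int64_t num_keys, int64_t filter_bits,
                         double max_fp_rate) {
  return EstimatedFpRate(num_keys, filter_bits) > max_fp_rate;
}

// Exact distinct count, used here only to contrast with the raw row count.
int64_t CountDistinct(const std::vector<int64_t>& keys) {
  std::unordered_set<int64_t> distinct(keys.begin(), keys.end());
  return static_cast<int64_t>(distinct.size());
}
```

Under these assumptions, a build side with 1,000,000 rows but only 1,000 distinct keys is flagged as too high when the row count is used as n, yet passes easily when the distinct count is used. The same arithmetic illustrates the partitioned-join point: ten instances each holding 60,000 distinct keys pass the check individually, although a filter holding all 600,000 keys would not.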



  was:
This check disables bloom filters on the sender side.

It is inaccurate in cases where there are duplicate values of the filter key on the build side. E.g. many-to-many join or a join with multiple keys. This could be fixed with some effort, but is probably not worth it, because:
* Partition filters are probably still worth evaluating even if there are false positives, because it's cheap and eliminating a partition is still beneficial.
* Runtime filters are dynamically disabled on the scan side if they are ineffective.
* The disabling is fairly unlikely to kick in for partitioned joins because it's only applied to a small subset of the filter, before the Or() operation.

So it's potentially harmful and only likely beneficial for broadcast join filters, in which case it saves a small amount of scan CPU and, for global filters, coordinator RPCs and broadcasting. It's unclear that the complexity is worth it for this relatively small and uncertain benefit.




> Consider skipping FpRateTooHigh() check for bloom filters
> ---------------------------------------------------------
>
>                 Key: IMPALA-10112
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10112
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>              Labels: performance


