You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/06 19:23:01 UTC

[GitHub] [arrow-datafusion] pjmore opened a new pull request #278: In list performance improvements

pjmore opened a new pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278


   Added short circuiting extreme values check on list.
   
   # Which issue does this PR close?
   
   fixes #145 
   
   # What changes are included in this PR?
   Various changes to the InList physical expression.
   
   Macro based solution was replaced with generic functions, which ended up being a little symbol soupy. If the macro based solution is preferred I'd be happy to switch it to use a macro instead. 
   
   List of optimizations:
   - Using a HashSet to check for values in the list for large lists
   - For numeric lists sort the list of values:
       - Check that value falls within extreme values of list to enable 
       - Short circuits the linear search
   - Explicitly lift check for negated and contains_null to outside of the loop. It probably doesn't matter and if it did the compiler probably already does this,  but I thought it made sense to be explicit. 
   
   The thresholds for selecting whether to do a scan or use a HashSet could  be tuned more.
   
   # Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278#issuecomment-833808961


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#278](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c052c14) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/e271e4d480b17ca36f39c165661c1dfc8022c63f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e271e4d) will **decrease** coverage by `0.74%`.
   > The diff coverage is `59.31%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/278/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #278      +/-   ##
   ==========================================
   - Coverage   76.80%   76.06%   -0.75%     
   ==========================================
     Files         133      140       +7     
     Lines       23284    23736     +452     
   ==========================================
   + Hits        17884    18055     +171     
   - Misses       5400     5681     +281     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...atafusion/src/physical\_plan/expressions/in\_list.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9pbl9saXN0LnJz) | `68.52% <59.31%> (-9.87%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jb2FsZXNjZV9iYXRjaGVzLnJz) | `83.18% <0.00%> (-1.77%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/distinct\_expressions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9kaXN0aW5jdF9leHByZXNzaW9ucy5ycw==) | `90.34% <0.00%> (-0.57%)` | :arrow_down: |
   | [datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvZGF0YXNvdXJjZS9jc3YucnM=) | `73.07% <0.00%> (-0.26%)` | :arrow_down: |
   | [...lista/rust/core/src/serde/logical\_plan/to\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vdG9fcHJvdG8ucnM=) | `68.05% <0.00%> (-0.10%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/csv.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9jc3YucnM=) | `79.01% <0.00%> (-0.09%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9mdW5jdGlvbnMucnM=) | `89.42% <0.00%> (-0.06%)` | :arrow_down: |
   | [...sta/rust/core/src/serde/logical\_plan/from\_proto.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9sb2dpY2FsX3BsYW4vZnJvbV9wcm90by5ycw==) | `40.51% <0.00%> (-0.06%)` | :arrow_down: |
   | [datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvc3FsL3BsYW5uZXIucnM=) | `83.62% <0.00%> (-0.03%)` | :arrow_down: |
   | [datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9wYXJxdWV0LnJz) | `87.55% <0.00%> (-0.02%)` | :arrow_down: |
   | ... and [14 more](https://codecov.io/gh/apache/arrow-datafusion/pull/278/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [e271e4d...c052c14](https://codecov.io/gh/apache/arrow-datafusion/pull/278?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore edited a comment on pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
pjmore edited a comment on pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278#issuecomment-834634744


   After doing more benchmarking there's more performance degradation for small lists than performance gains on large ones so I'm closing the pr for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore commented on pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
pjmore commented on pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278#issuecomment-834634744


   After doing more benchmarking there's more performance degradation for small small lists than performance gains on large ones so I'm closing the pr for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278#issuecomment-835264780


   @pjmore I think for this optimization to make sense, it needs a certain threshold for where it moves to the different implementation (e.g. more than 500 values (or whenever it is faster) -> use hash set ).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278#issuecomment-833869330


   Cool @pjmore !
   I think these kind of improvements could really benefit from some (micro) benchmarks - do you feel like adding that in this PR so we can see the change in performance? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore closed pull request #278: In list performance improvements

Posted by GitBox <gi...@apache.org>.
pjmore closed pull request #278:
URL: https://github.com/apache/arrow-datafusion/pull/278


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org