You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/13 14:33:57 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #4407: Remove Binary Dictionary Arithmetic Support

tustvold opened a new pull request, #4407:
URL: https://github.com/apache/arrow-rs/pull/4407

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Relates to #3999 
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   As part of #3999 I'm trying to improve the consistency and correctness of the arithmetic kernels, however, I am repeatedly bashing my head against the dictionary support and therefore wanted to float this idea to see what people think.
   
   My major rationale is:
   
   * Currently calling a kernel with a DictionaryArray and a scalar returns a DictionaryArray, however, calling a kernel with two DictionaryArray returns a PrimitiveArray, the latter feels strange to me
   * Huge amount of code complexity, and code generation to support this use-case
   * Difficult to keep the arithmetic logic for PrimitiveArray values and DictionaryArray values consistent
   * We currently don't support operations between PrimitiveArray and DictionaryArray, should we??
   * The performance of operating directly on the dictionaries, vs casting first, is broadly in the same ballpark of ~10s of ns per row
   * I honestly don't really understand the use-case for a DictionaryArray of primitives, they will be significantly slower to process than the corresponding PrimitiveArray, orders of magnitude in some case, and will likely take up more memory (especially given #3837 and similar)
   
   I think what would help me be less frustrated bashing my head against this would be some motivating use-case for this functionality, as far as I can tell I can't see a compelling reason to ever use a DictionaryArray of primitives for query computation, they're almost always just worse
   
   Performance of arithmetic using this feature, vs just casting first, run using (#4405)
   
   ```
   dict_add(0)             time:   [354.31 µs 354.55 µs 354.84 µs]
                           change: [-1.1077% -0.7157% -0.2919%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   
   dict_add_checked(0)     time:   [31.384 µs 31.392 µs 31.401 µs]
                           change: [-1.3918% -0.7952% -0.4529%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   dict_add_cast(0)        time:   [44.593 µs 44.622 µs 44.657 µs]
                           change: [-3.3883% -3.3001% -3.2035%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   dict_add_cast_checked(0)
                           time:   [44.130 µs 44.160 µs 44.192 µs]
                           change: [-1.6532% -1.2188% -0.8736%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 4 outliers among 100 measurements (4.00%)
     4 (4.00%) high mild
   
   dict_add(0.1)           time:   [411.69 µs 411.94 µs 412.28 µs]
                           change: [-0.8818% -0.7008% -0.5335%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     5 (5.00%) high mild
     1 (1.00%) high severe
   
   dict_add_checked(0.1)   time:   [19.859 µs 19.872 µs 19.885 µs]
                           change: [+3.1645% +3.8146% +4.4437%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 13 outliers among 100 measurements (13.00%)
     9 (9.00%) high mild
     4 (4.00%) high severe
   
   dict_add_cast(0.1)      time:   [67.510 µs 67.682 µs 67.866 µs]
                           change: [-32.706% -32.439% -32.166%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 16 outliers among 100 measurements (16.00%)
     3 (3.00%) low severe
     9 (9.00%) low mild
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   dict_add_cast_checked(0.1)
                           time:   [78.234 µs 78.265 µs 78.299 µs]
                           change: [-1.1505% -1.0897% -1.0254%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   dict_add(0.5)           time:   [687.92 µs 688.56 µs 689.45 µs]
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   dict_add_checked(0.5)   time:   [72.906 µs 72.921 µs 72.939 µs]
   Found 8 outliers among 100 measurements (8.00%)
     7 (7.00%) high mild
     1 (1.00%) high severe
   
   dict_add_cast(0.5)      time:   [68.336 µs 68.367 µs 68.399 µs]
   Found 6 outliers among 100 measurements (6.00%)
     5 (5.00%) high mild
     1 (1.00%) high severe
   
   dict_add_cast_checked(0.5)
                           time:   [126.81 µs 126.89 µs 126.97 µs]
   Found 7 outliers among 100 measurements (7.00%)
     6 (6.00%) high mild
     1 (1.00%) high severe
   
   dict_add(0.9)           time:   [498.11 µs 498.35 µs 498.60 µs]
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   dict_add_checked(0.9)   time:   [92.705 µs 95.673 µs 99.419 µs]
   
   dict_add_cast(0.9)      time:   [69.080 µs 69.248 µs 69.383 µs]
   
   dict_add_cast_checked(0.9)
                           time:   [171.86 µs 171.97 µs 172.08 µs]
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   dict_add(1)             time:   [370.66 µs 370.83 µs 371.02 µs]
   Found 21 outliers among 100 measurements (21.00%)
     12 (12.00%) low severe
     9 (9.00%) high mild
   
   dict_add_checked(1)     time:   [31.390 µs 31.402 µs 31.414 µs]
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   dict_add_cast(1)        time:   [43.996 µs 44.022 µs 44.048 µs]
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   dict_add_cast_checked(1)
                           time:   [45.406 µs 45.439 µs 45.476 µs]
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   ```
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   Yes, I suspect this will have downstream implications. Tagging @alamb @viirya @wjones127 @jhorstmann 
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on pull request #4407: Remove Binary Dictionary Arithmetic Support

Posted by "jhorstmann (via GitHub)" <gi...@apache.org>.
jhorstmann commented on PR #4407:
URL: https://github.com/apache/arrow-rs/pull/4407#issuecomment-1589549258

   Agree that the performance benefit of specialized kernels is probably not worth the complexity and added code.
   
   > Currently calling a kernel with a DictionaryArray and a scalar returns a DictionaryArray, however, calling a kernel with two DictionaryArray returns a PrimitiveArray, the latter feels strange to me
   
   This kind of makes sense to me, for many operations involving scalars, the dictionary would still be unique afterwards, while an operation with two dictionaries would lead to combinatoric explosion and no longer is beneficial to dictionary encode the results. Operations like `array * 0` would of course lead to all duplicated values in the dictionary, so always returning a `PrimitiveArray` could be more consistent.
   
   In our engine we had a similar issue with string replace or concat operations, where we decided that such operations on two dictionary arrays would always result in a string array, but with dictionary array and literal string it would be beneficial to build a new dictionary.
   
   I did not review the code in detail, maybe this is already happening, but could the dyn kernels automatically downcast/materialize dictionary arrays so that dictionary arrays are still supported as inputs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #4407: Remove Binary Dictionary Arithmetic Support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #4407:
URL: https://github.com/apache/arrow-rs/pull/4407#issuecomment-1613532324

   I think @viirya  should also have a chance to review / comment on this prior to merge.
   
   The justification as i understand it is that the primitive kernels are much faster anyways for this kind of operation and so including native dictionary creation is both slower as well as larger code and harder to work with


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #4407: Remove Binary Dictionary Arithmetic Support

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #4407:
URL: https://github.com/apache/arrow-rs/pull/4407#issuecomment-1611054056

   > Could the dyn kernels automatically downcast/materialize dictionary arrays so that dictionary arrays are still supported as inputs 
   
   I think I would prefer that this was delegated to the query engines type coercion machinery, to ensure this is visible to the planner and avoid unnecessary casts back and forth. A PR to do this in DataFusion can be found - https://github.com/apache/arrow-datafusion/pull/6785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #4407: Remove Binary Dictionary Arithmetic Support

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #4407:
URL: https://github.com/apache/arrow-rs/pull/4407#issuecomment-1613662342

   I will find some time looking at this soon if you are not hurry to merge this in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #4407: Remove Binary Dictionary Arithmetic Support

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #4407:
URL: https://github.com/apache/arrow-rs/pull/4407


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org