You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/28 09:03:53 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #6785: Coerce dictionaries for arithmetic

tustvold opened a new pull request, #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - https://github.com/apache/arrow-rs/pull/4407
   
   As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1612957624

   > but we might need to look at performance impact if any.
   
   The benchmarks on https://github.com/apache/arrow-rs/pull/4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary. It should be noted this is in part because of the inefficiency of the current way this is implemented in arrow-rs, with the checked kernel variant performing better, however, as DataFusion does not use this, **the change in this PR should result in a performance improvement for DF**.
   
   However, consistency is imo the biggest justification for this change, aside from the issues with arithmetic on dictionaries not supporting the full set of types (e.g. temporal), the coercion behaviour is rather inconsistent. **In particular if the left argument is a dictionary and the right argument is not both will be coerced to a dictionary and a dictionary kernel used, in all other cases both arguments will be coerced to primitives**. This makes me think it unlikely this code is actually being used in practice.
   
   ```
   ❯ create table foo as select (arrow_cast('1', 'Dictionary(Int32, Int32)')) as foo;
   0 rows in set. Query took 0.003 seconds.
   ❯ select arrow_typeof(foo) from foo;
   +--------------------------+
   | arrow_typeof(foo.foo)    |
   +--------------------------+
   | Dictionary(Int32, Int32) |
   +--------------------------+
   1 row in set. Query took 0.002 seconds.
   ❯ explain select foo / 23 from foo;
   +---------------+-------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                        |
   +---------------+-------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: CAST(foo.foo AS Dictionary(Int32, Int64)) / Dictionary(Int32, Int64(23)) AS foo.foo / Int64(23) |
   |               |   TableScan: foo projection=[foo]                                                                           |
   | physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Dictionary(Int32, Int64)) / 23 as foo.foo / Int64(23)]                  |
   |               |   MemoryExec: partitions=1, partition_sizes=[1]                                                             |
   |               |                                                                                                             |
   +---------------+-------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.004 seconds.
   ❯ explain select foo / foo from foo;
   +---------------+-------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                          |
   +---------------+-------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo / CAST(foo.foo AS Int32) AS foo.foo AS foo.foo AS foo.foo |
   |               |   TableScan: foo projection=[foo]                                                                                             |
   | physical_plan | ProjectionExec: expr=[CAST(foo@0 AS Int32) / CAST(foo@0 AS Int32) as foo.foo / foo.foo]                                       |
   |               |   MemoryExec: partitions=1, partition_sizes=[1]                                                                               |
   |               |                                                                                                                               |
   +---------------+-------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.006 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #6785: Consistently coerce dictionaries for arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1614429330

   >  but also possible impact to other operations in the query engine
   
   To clarify this PR **does not alter the return type** of the evaluation, the output will still always be primitive. I therefore don't foresee this having downstream implications.
   
   This PR **only** changes the way that DF **evaluates** `dictionary op primitive`. Now instead of coercing the arguments to dictionaries it coerces them to primitives. This is now consistent with the existing behaviour for all other combinations of dictionaries and primitives,  `dictionary op dictionary`, `primitive op dictionary` and `primitive op primitive`, which coerce to primitives on main.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1611139452

   > The support for arithmetic on dictionaries is inconsistent, results in a huge amount of code gen, and doesn't yield significant performance savings. There is an upstream PR proposing removing this functionality - https://github.com/apache/arrow-rs/pull/4407
   
   Do we have any numbers that illustrate this claim?
   
   > As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger
   
   I think @viirya added this feature so perhaps he could answer
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on pull request #6785: Consistently coerce dictionaries for arithmetic

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1613667880

   >  The benchmarks on https://github.com/apache/arrow-rs/pull/4407 show that for the wrapping addition used by DataFusion casting first is faster than the logic for operating on the dictionary ..., the change in this PR should result in a performance improvement for DF.
   
   For performance, I'd talk more about end-to-end query benchmark instead of kernel benchmark only. Arithmetic operations are only small pieces in the query execution time (ms/s v.s. minute/hour), I think. We will also do benchmark on new DataFusion release when we are going to upgrade internal integration.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1611687217

   > Hmm, I don't get the question.
   
   I would like to remove the special case support for dictionaries as it already causes an inordinate amount of code complexity, and that is without it properly covering cases like temporal arithmetic. The question is therefore **can we remove the explicit support in favor of coercion or is it really important to some use-case**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on pull request #6785: Consistently coerce dictionaries for arithmetic

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1613664348

   > Therefore since this PR still supports the calculation (via casting to primitive and calling the primitive kernels) I think it doesn't undo any of that work
   
   Yea, it just casts to primitive before arithmetic operations so dictionary arrays are still going through them. From end-to-end perspective, it might be not noticeable if you don't look at the output array type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1611876074

   I think this change makes sense, but I defer to @viirya  -- if he is ok with it in principle I will review the PR carefully


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1611140067

   > Do we have any numbers that illustrate this claim?
   
   They are on the linked ticket - https://github.com/apache/arrow-rs/pull/4407


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1611682160

   >> As an aside it is unclear why you would ever use a primitive dictionary, they will almost always be slower to process and especially once you factor in dictionary sparsity, may be significantly larger
   > I think @viirya added this feature so perhaps he has more context
   
   Hmm, I don't get the question. For the dictionary behavior, it follows original behavior. It basically follows kernel behavior too.  Previously I cleaned up the decimal type part, but for dictionary behavior, I think it is unchanged.
   
   I think for dictionary, it still has storage advantage in memory and disk (e.g. shuffle), otherwise it has no reason to have it. It might be true that dictionary is slower to process in computation kernel, but for overall cost I'm not sure it is always a win to get rid of dictionary. Except we have something that can automatically detect dictionary sparsity and convert to dictionary before outputting to storage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on pull request #6785: Coerce dictionaries for arithmetic

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1612052971

   I think I agree with this can largely reduce code complexity. My little concern might be if this is a clear win? Or it is probably a win for some cases but not for all? Dictionary process might be slower for some cases, but for some I think it might be faster, e.g., dictionary + scalar which I remember is directly computing on dictionary values. Not to mention there is extra coercion (casting) operation.
   
   Also I'm not sure about selection kernels, do we have them in DataFusion?
   
   I guess I'm okay with this (+0?), but we might need to look at performance impact if any.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #6785: Consistently coerce dictionaries for arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6785: Consistently coerce dictionaries for arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6785:
URL: https://github.com/apache/arrow-datafusion/pull/6785#discussion_r1246892438


##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -2968,16 +2962,10 @@ mod tests {
         op: Operator,
         expected: BooleanArray,
     ) -> Result<()> {
-        let left_type = left.data_type();
-        let right_type = right.data_type();
-        let (lhs, rhs) = get_input_types(left_type, &op, right_type)?;
-
-        let left_expr = try_cast(col("a", schema)?, schema, lhs)?;
-        let right_expr = try_cast(col("b", schema)?, schema, rhs)?;
-        let arithmetic_op = binary_simple(left_expr, op, right_expr, schema);
+        let op = binary_op(col("a", schema)?, op, col("b", schema)?, schema)?;

Review Comment:
   this is a much nicer formulation



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -226,13 +226,13 @@ fn math_decimal_coercion(
     use arrow::datatypes::DataType::*;
 
     match (lhs_type, rhs_type) {
-        (Dictionary(key_type, value_type), _) => {
+        (Dictionary(_, value_type), _) => {

Review Comment:
   maybe it is worth a comment here explaining we are expanding out dictionaries to the underlying type and leave a comment pointing back at this PR / a summary of https://github.com/apache/arrow-datafusion/pull/6785#issuecomment-1612957624
   
   I think that is critical context



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org