You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/16 07:06:37 UTC

[GitHub] [arrow-datafusion] HaoYang670 opened a new pull request, #4234: [WIP] Unfold the `round` function in logical plan if it has 2 arguments

HaoYang670 opened a new pull request, #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234

   Signed-off-by: remzi <13...@gmail.com>
   
   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #4191.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1023682648


##########
datafusion/optimizer/tests/integration-test.rs:
##########
@@ -226,6 +226,39 @@ fn concat_ws_literals() -> Result<()> {
     Ok(())
 }
 
+#[test]
+fn unfold_round() -> Result<()> {
+    let sql = "SELECT round(col_f32, col_int32) as col from test";
+    let plan = test_sql(sql)?;
+    let expected =
+        "Projection: round(CAST(test.col_f32 AS Float64) * power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32)) / power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32) AS col\
+        \n  Projection: power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64)) AS power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10), test.col_f32\
+        \n    TableScan: test projection=[col_int32, col_f32]";
+    assert_eq!(expected, format!("{:?}", plan));
+    Ok(())
+}
+
+#[test]
+fn round_partially_execute() -> Result<()> {
+    let sql = "SELECT round(col_f32, 2) as col from test";
+    let plan = test_sql(sql)?;
+    let expected =
+        "Projection: round(CAST(test.col_f32 AS Float64) * Float64(100)) / Float64(100) AS col\

Review Comment:
   We can benefit from the expression simplification when one of the argument is a constant



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1024818973


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   > This makes the logical plan harder to use for other query engines
   
   Why is that? Could you please give an example?
   
   > We can implement the functionality in the physical plan in a separate PR.
   
   The advantages of implementing in physical plan are:
   1. much straight forward and easy to debug.
   2. we would benefit from arrow-rs, if arrow-rs could support `round(source, n)` in the future.
   
   The disadvantages of implementing in physical plan are:
   1. Rust only support the `round` function with one argument. If we want to implement in physical plan, we have to assemble some arrow compute kernel together, which the logic is same as what I've done in the physical plan.
   3. We can't optimize the expression if the second function is a constant, which we have to add additional optimize rules.
   
   The advantages of implementing in logical plan are:
   1. Reuse other code to achieve the functionality of `round(source, n)`, so the implementation could be much easier.
   2. We can reuse our expression optimizer to do optimizations, for example when the 2nd argument is a constant.
   3. Flexible. If we plan to change the parameter type of `round` from `Float`, to `Decimal`, the change can be very easy.
   4. Make logical stage and physical stage independent. We can achieve complex functionality in the SQL query by unfolding and rewriting expressions (or plans) in the logical stage and we only need to maintain a reduced set of physical expressions (or plans).  
   
   The disadvantages of implementing in logical plan are:
   1. Put the implementation in the logical optimizer isn't a good solution, we should unfold the expression before doing any optimization.
   1. the query plan isn't user-friendly to read because `round` is unfolded. But I think this is not a big deal, as the optimized code is always ugly to read. Users can also use `explain verbose` to find what has happened.
   
   After weighing the pros and cons, I prefer to implement in logical plan. If the rust lib or arrow-rs could support `round(source, n)`, in the future, we can change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#issuecomment-1316612723

   @andygrove Could you please help to review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1024818973


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   > This makes the logical plan harder to use for other query engines
   
   Why is that? Could you please give an example?
   
   > We can implement the functionality in the physical plan in a separate PR.
   
   The reason that I don't implement in physical plan is that the implementation isn't straightforward.
   Rust only support the `round` function with one argument. 
   If we want to implement in physical plan, we have to assemble some arrow compute kernel together, which the logic is same as what I've done in the physical plan. However, the drawback is that, we can't optimize the expression if the second function is a constant, which we have to add additional optimize rules.
   
   The disadvantages of implementing in logical plan are:
   1. Put the implementation in the logical optimizer isn't a good solution, we should unfold the expression before doing any optimization.
   1. the query plan isn't user-friendly to read because `round` is unfolded. But I think this is not a big deal, as the optimized code is always ugly to read. Users can also use `explain verbose` to find what has happened.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1026180988


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   >With this PR, the logical plan now has a complex expression that is specific to the DataFusion implementation. Dask SQL would have to undo the pow logic here to get back to the original parameters to round so that it can pass them to numpy.
   
   
   If the dask-sql has a physical execution to support `round(source, n)`, this could be a problem. 🤔
   
   > Dask SQL would have to undo the pow logic 
   
   They don't have to, I guess. 😅 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Unfold the `round` function in logical plan if it has 2 arguments [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#issuecomment-1830643522

   Closing as this PR is over a year old. Please feel free to reopen it / rebase it if you plan to keep working on it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: [WIP] Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1023593430


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -728,6 +729,30 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,
+            } => match &args[..] {
+                [source, n] => {
+                    println!("I am here, source = {:?}, n = {:?}", source, n);
+                    let exp = Box::new(power(
+                        lit(10.0),
+                        Expr::Cast(Cast::new(n.clone().into(), DataType::Float64)),

Review Comment:
   One problem here is that we will always convert the data type to `Float64` to adapt the `power` function



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1026180988


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   >With this PR, the logical plan now has a complex expression that is specific to the DataFusion implementation. Dask SQL would have to undo the pow logic here to get back to the original parameters to round so that it can pass them to numpy.
   
   
   If the dask-sql has a physical execution to support `round(source, n)`, this could be a problem. 🤔
   
   
   
   > the logical plan now has a complex expression that is specific to the DataFusion implementation
   
   This is a little weird for me that the Datafusion logical plan should not be specific to Datafusion physical plan. 
   If this is the truth, why don't we seperate the Datafusion to be two projects?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1025837127


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   > Why is that? Could you please give an example?
   
   Sure. Taking Dask SQL as an example, we would want to translate `round(a, b)` to a call to a Python function to perform the rounding (perhaps in numpy). With this PR, the logical plan now has a complex expression that is specific to the DataFusion implementation. Dask SQL would have to undo the `pow` logic here to get back to the original parameters to `round` so that it can pass them to numpy.
   
   ```
   Projection: round(CAST(test.column_1 AS Float64) * Float64(0.01)) / Float64(0.01) AS round(test.column_1,Int64(-2))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1024289092


##########
datafusion/optimizer/tests/integration-test.rs:
##########
@@ -226,6 +226,39 @@ fn concat_ws_literals() -> Result<()> {
     Ok(())
 }
 
+#[test]
+fn unfold_round() -> Result<()> {
+    let sql = "SELECT round(col_f32, col_int32) as col from test";
+    let plan = test_sql(sql)?;
+    let expected =
+        "Projection: round(CAST(test.col_f32 AS Float64) * power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32)) / power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32) AS col\

Review Comment:
   Possibly related to https://github.com/apache/arrow-datafusion/issues/3786 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Unfold the `round` function in logical plan if it has 2 arguments [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments
URL: https://github.com/apache/arrow-datafusion/pull/4234


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1026180988


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   >With this PR, the logical plan now has a complex expression that is specific to the DataFusion implementation. Dask SQL would have to undo the pow logic here to get back to the original parameters to round so that it can pass them to numpy.
   
   
   If dask-sql has a physical execution to support `round(source, n)`, this could be a problem. 🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on pull request #4234: [WIP] Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#issuecomment-1316559665

   Also, I didn't want to put this logic in the optimizer. Because it doesn't optimize the plan, but convert some syntax sugar. I plan to create a new stage calling "desugar_expression" and put it in front of the optimizer, if it is necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1023681564


##########
datafusion/optimizer/tests/integration-test.rs:
##########
@@ -226,6 +226,39 @@ fn concat_ws_literals() -> Result<()> {
     Ok(())
 }
 
+#[test]
+fn unfold_round() -> Result<()> {
+    let sql = "SELECT round(col_f32, col_int32) as col from test";
+    let plan = test_sql(sql)?;
+    let expected =
+        "Projection: round(CAST(test.col_f32 AS Float64) * power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32)) / power(Float64(10), CAST(CAST(test.col_int32 AS Int64) AS Float64))CAST(CAST(test.col_int32 AS Int64) AS Float64)CAST(test.col_int32 AS Int64)test.col_int32Float64(10) AS power(Float64(10),test.col_int32) AS col\

Review Comment:
   The lots of `Cast` here is a little weird, I am looking into it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #4234: Unfold the `round` function in logical plan if it has 2 arguments

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #4234:
URL: https://github.com/apache/arrow-datafusion/pull/4234#discussion_r1024287522


##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -760,6 +766,29 @@ impl<'a, S: SimplifyInfo> ExprRewriter for Simplifier<'a, S> {
                 },
             },
 
+            // unfold the expression round(source, n)
+            Expr::ScalarFunction {
+                fun: BuiltinScalarFunction::Round,
+                args,

Review Comment:
   I would prefer that we don't re-write this in the logical plan to work around the fact that we don't currently support this in the physical plan. This makes the logical plan harder to use for other query engines, such as Dask SQL, that are using DataFusion for SQL query planning.
   
   I suggest for this PR that we throw an error in the physical query planner if we see `round` with two arguments. This would be better than producing incorrect results. We can implement the functionality in the physical plan in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org