You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jonahgao (via GitHub)" <gi...@apache.org> on 2024/03/25 14:53:10 UTC

[PR] fix: ensure mutual compatibility of the two input schemas from recursive CTEs [arrow-datafusion]

jonahgao opened a new pull request, #9795:
URL: https://github.com/apache/arrow-datafusion/pull/9795

   ## Which issue does this PR close?
   Closes #9794.
   
   ## Rationale for this change
   Make the schema of the recursive term compatible with  the static term.
   
   This differs from the handling of the ordinary [union](https://github.com/apache/arrow-datafusion/blob/ad89ff82421e9c4670f4440dd5a6fa6fb55c40c3/datafusion/expr/src/logical_plan/builder.rs#L1323) operator. We can't change the static term's schema, otherwise we would have to recreate the logic plan of the recursive term, as it depends on the static term through the work table.
   
   That is to say, we use the schema of the static term as the final schema for CTE. 
   DuckDB also adopted a similar approach.
   ```sh
   D WITH RECURSIVE my_cte AS(
           SELECT 1::int AS a
           UNION ALL
           SELECT a::bigint+2 FROM my_cte WHERE a<3
       ) SELECT a,typeof(a) FROM my_cte;
   ┌───────┬───────────┐
   │   a   │ typeof(a) │
   │ int32 │  varchar  │
   ├───────┼───────────┤
   │     1 │ INTEGER   │
   │     3 │ INTEGER   │
   └───────┴───────────┘
   ```
   The output type is `INTEGER` from the static term, not `BIGINT`.
   
   ## What changes are included in this PR?
   Check and coerce the schemas of recursive CTE's inputs.
   
   ## Are these changes tested?
   Yes
   
   ## Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix: ensure mutual compatibility of the two input schemas from recursive CTEs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #9795:
URL: https://github.com/apache/arrow-datafusion/pull/9795


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix: ensure mutual compatibility of the two input schemas from recursive CTEs [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #9795:
URL: https://github.com/apache/arrow-datafusion/pull/9795#discussion_r1540405151


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -132,14 +132,26 @@ impl LogicalPlanBuilder {
     ) -> Result<Self> {
         // TODO: we need to do a bunch of validation here. Maybe more.
         if is_distinct {
-            return Err(DataFusionError::NotImplemented(
-                "Recursive queries with a distinct 'UNION' (in which the previous iteration's results will be de-duplicated) is not supported".to_string(),
-            ));
+            return not_impl_err!(
+                "Recursive queries with a distinct 'UNION' (in which the previous iteration's results will be de-duplicated) is not supported"
+            );
+        }
+        // Ensure that the static term and the recursive term have the same number of fields
+        let static_fields_len = self.plan.schema().fields().len();
+        let recurive_fields_len = recursive_term.schema().fields().len();
+        if static_fields_len != recurive_fields_len {
+            return plan_err!(
+                "Non-recursive term and recursive term must have the same number of columns ({} != {})",
+                static_fields_len, recurive_fields_len
+            );
         }
+        // Ensure that the recursive term has the same field types as the static term

Review Comment:
   Given the tests all pass this seems like a good plan to me. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] fix: ensure mutual compatibility of the two input schemas from recursive CTEs [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on code in PR #9795:
URL: https://github.com/apache/arrow-datafusion/pull/9795#discussion_r1537745876


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -132,14 +132,26 @@ impl LogicalPlanBuilder {
     ) -> Result<Self> {
         // TODO: we need to do a bunch of validation here. Maybe more.
         if is_distinct {
-            return Err(DataFusionError::NotImplemented(
-                "Recursive queries with a distinct 'UNION' (in which the previous iteration's results will be de-duplicated) is not supported".to_string(),
-            ));
+            return not_impl_err!(
+                "Recursive queries with a distinct 'UNION' (in which the previous iteration's results will be de-duplicated) is not supported"
+            );
+        }
+        // Ensure that the static term and the recursive term have the same number of fields
+        let static_fields_len = self.plan.schema().fields().len();
+        let recurive_fields_len = recursive_term.schema().fields().len();
+        if static_fields_len != recurive_fields_len {
+            return plan_err!(
+                "Non-recursive term and recursive term must have the same number of columns ({} != {})",
+                static_fields_len, recurive_fields_len
+            );
         }
+        // Ensure that the recursive term has the same field types as the static term

Review Comment:
   Not sure if it's necessary to ensure that the field names are the same.I tested some queries, and none of them failed.
   
   The union operator uses [project_with_column_index](https://github.com/apache/arrow-datafusion/blob/ad89ff82421e9c4670f4440dd5a6fa6fb55c40c3/datafusion/expr/src/logical_plan/builder.rs#L1350) only ensures the field names of projection plans are the same.
   ```sh
   DataFusion CLI v36.0.0
   ❯ -- The field names of the two input are the same.
   explain select 1 union all select 2;
   +--------------+------------------------------------+
   | plan_type    | plan                               |
   +--------------+------------------------------------+
   | logical_plan | Union                              |
   |              |   Projection: Int64(1) AS Int64(1) |
   |              |     EmptyRelation                  |
   |              |   Projection: Int64(2) AS Int64(1) |
   |              |     EmptyRelation                  |
   +--------------+------------------------------------+
   
   ❯ -- The field names of the two input are different.
   explain select 1 union all values(2);
   +--------------+------------------------------------+
   | plan_type    | plan                               |
   +--------------+------------------------------------+
   | logical_plan | Union                              |
   |              |   Projection: Int64(1) AS Int64(1) |
   |              |     EmptyRelation                  |
   |              |   Values: (Int64(2))               |
   +--------------+------------------------------------+
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org