You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jayzhan211 (via GitHub)" <gi...@apache.org> on 2023/06/29 05:44:53 UTC

[GitHub] [arrow-datafusion] jayzhan211 opened a new pull request, #6796: Implement unnest function

jayzhan211 opened a new pull request, #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796

   # Which issue does this PR close?
   
   Ref https://github.com/apache/arrow-datafusion/issues/6555
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Support `unnest` for arrays
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   - [x] `select unnest()`
   - [X] `select * from unnest()`
   - [ ] unnest args more than 2
   - [ ] unnest alias
   - [ ] deterministic result for `datafusion.execution.target_partitions > 1` (We might not need this)
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] vincev commented on a diff in pull request #6796: Implement basic unnest function

Posted by "vincev (via GitHub)" <gi...@apache.org>.
vincev commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278314487


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I find having `unnest` in the `from` clause a bit strange, [DuckDB](https://duckdb.org/docs/sql/query_syntax/unnest.html) syntax looks intuitive, the above would look like:
   
   ```
   D SELECT * FROM (VALUES ([1]), ([2, 3]), ([4,5,6]), ([7,8]), ([9])) tbl(l);
   ┌───────────┐
   │     l     │
   │  int32[]  │
   ├───────────┤
   │ [1]       │
   │ [2, 3]    │
   │ [4, 5, 6] │
   │ [7, 8]    │
   │ [9]       │
   └───────────┘
   ```
   
   and then unnesting:
   
   ```
   D SELECT unnest(l) FROM (VALUES ([1]), ([2, 3]), ([4,5,6]), ([7,8]), ([9])) tbl(l);
   ┌───────────┐
   │ unnest(l) │
   │   int32   │
   ├───────────┤
   │         1 │
   │         2 │
   │         3 │
   │         4 │
   │         5 │
   │         6 │
   │         7 │
   │         8 │
   │         9 │
   └───────────┘
   ```
   
   This would also work if we mix array and non array values:
   
   ```
   D SELECT * FROM (VALUES ([11],1), ([21, 22],2), ([31,32,33],3), ([41,42],4), ([51],5)) tbl(l, m);
   ┌──────────────┬───────┐
   │      l       │   m   │
   │   int32[]    │ int32 │
   ├──────────────┼───────┤
   │ [11]         │     1 │
   │ [21, 22]     │     2 │
   │ [31, 32, 33] │     3 │
   │ [41, 42]     │     4 │
   │ [51]         │     5 │
   └──────────────┴───────┘
   ```
   
   unnesting them:
   
   ```
   D SELECT unnest(l), m FROM (VALUES ([11],1), ([21, 22],2), ([31,32,33],3), ([41,42],4), ([51],5)) tbl(l, m);
   ┌───────────┬───────┐
   │ unnest(l) │   m   │
   │   int32   │ int32 │
   ├───────────┼───────┤
   │        11 │     1 │
   │        21 │     2 │
   │        22 │     2 │
   │        31 │     3 │
   │        32 │     3 │
   │        33 │     3 │
   │        41 │     4 │
   │        42 │     4 │
   │        51 │     5 │
   └───────────┴───────┘
   ```
   



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I find having `unnest` in the `from` clause a bit strange, [DuckDB](https://duckdb.org/docs/sql/query_syntax/unnest.html) syntax looks intuitive, the above would look like:
   
   ```
   D SELECT * FROM (VALUES ([1]), ([2, 3]), ([4,5,6]), ([7,8]), ([9])) tbl(l);
   ┌───────────┐
   │     l     │
   │  int32[]  │
   ├───────────┤
   │ [1]       │
   │ [2, 3]    │
   │ [4, 5, 6] │
   │ [7, 8]    │
   │ [9]       │
   └───────────┘
   ```
   
   and then unnesting:
   
   ```
   D SELECT unnest(l) FROM (VALUES ([1]), ([2, 3]), ([4,5,6]), ([7,8]), ([9])) tbl(l);
   ┌───────────┐
   │ unnest(l) │
   │   int32   │
   ├───────────┤
   │         1 │
   │         2 │
   │         3 │
   │         4 │
   │         5 │
   │         6 │
   │         7 │
   │         8 │
   │         9 │
   └───────────┘
   ```
   
   This would also work if we mix array and non array values:
   
   ```
   D SELECT * FROM (VALUES ([11],1), ([21, 22],2), ([31,32,33],3), ([41,42],4), ([51],5)) tbl(l, m);
   ┌──────────────┬───────┐
   │      l       │   m   │
   │   int32[]    │ int32 │
   ├──────────────┼───────┤
   │ [11]         │     1 │
   │ [21, 22]     │     2 │
   │ [31, 32, 33] │     3 │
   │ [41, 42]     │     4 │
   │ [51]         │     5 │
   └──────────────┴───────┘
   ```
   
   unnesting them:
   
   ```
   D SELECT unnest(l), m FROM (VALUES ([11],1), ([21, 22],2), ([31,32,33],3), ([41,42],4), ([51],5)) tbl(l, m);
   ┌───────────┬───────┐
   │ unnest(l) │   m   │
   │   int32   │ int32 │
   ├───────────┼───────┤
   │        11 │     1 │
   │        21 │     2 │
   │        22 │     2 │
   │        31 │     3 │
   │        32 │     3 │
   │        33 │     3 │
   │        41 │     4 │
   │        42 │     4 │
   │        51 │     5 │
   └───────────┴───────┘
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278475130


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I think we can only keep `select * from unnest syntax`, since it is equivalent to `select unnest(a), unnest(b)...unnest(z) from table` but simplified



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I think we can keep `select * from unnest syntax`, since it is equivalent to `select unnest(a), unnest(b)...unnest(z) from table` but simplified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1612470826

   Current implementation.
   
   There are two workflows for unnest
   `select unnest()`, we will get statements from `select.projection`. `Unnest` is transformed to `Expr::Unnest`. We will then build the LogicalPlan in optimize phase.
   
   `select * from unnest()`, we can get the table relation Unnest directly. In this case, we can build LogicalPlan without Expr::Unnest involved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1715666841

   The Unnest function is now processing in the planning step, Expr::Unnest to LogicalPlan::Unnest. Nothing to do in the optimize phase and only the existing UnnestExec is processed in the PhysicalPlan phase. Not sure if this is a good design but it works :)
   
   Maybe we can review the overall design, and then I will address the comments and resolve conflicts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-2043642467

   I think DataFusion now supports UNNEST -- is this PR still being worked on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-2044001515

   > > I think DataFusion now supports UNNEST -- is this PR still being worked on?
   > 
   > We support single columns only, the challenging one multiple columns has not yet been supported, I remember @jonahgao is working on it. I forgot which PR you mentioned.
   
   Not started yet. I just discovered that it hasn't been implemented in https://github.com/apache/arrow-datafusion/pull/9592#discussion_r1522777692,  but I can start working on it from now on.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1286376608


##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])
+    pub fn flatten(&self) -> Self {
+        self.try_flatten().unwrap_or_else(|_| self.clone())
+    }
+
+    pub fn try_flatten(&self) -> Result<Self> {
+        match self {
+            Self::ScalarFunction(ScalarFunction {
+                fun: built_in_function::BuiltinScalarFunction::MakeArray,
+                args,
+            }) => {
+                let flatten_args: Vec<Expr> =
+                    args.iter().flat_map(Self::flatten_internal).collect();
+                Ok(Self::ScalarFunction(ScalarFunction {
+                    fun: built_in_function::BuiltinScalarFunction::MakeArray,
+                    args: flatten_args,
+                }))
+            }
+            _ => Err(DataFusionError::Internal(format!(

Review Comment:
   ```suggestion
               _ => Err(DataFusionError::NotYetImplemented(format!(
   ```



##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])

Review Comment:
   this example is in terms of `unnest` but the function is for flatten.



##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])
+    pub fn flatten(&self) -> Self {
+        self.try_flatten().unwrap_or_else(|_| self.clone())
+    }
+
+    pub fn try_flatten(&self) -> Result<Self> {

Review Comment:
   I don't understand what this is doing -- why is it checking for make_array during `Expr` creation? I would expect the Expr to always be created and then perhaps an error thrown during analyze or expression simplifcation



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1038,6 +1042,114 @@ impl LogicalPlanBuilder {
     pub fn unnest_column(self, column: impl Into<Column>) -> Result<Self> {
         Ok(Self::from(unnest(self.plan, column.into())?))
     }
+
+    pub fn join_unnest_plans(

Review Comment:
   Is it possible to use https://docs.rs/datafusion/latest/datafusion/physical_plan/unnest/struct.UnnestExec.html directly rather than building up a special LogicalPlan? It seems like `UnnestExec` already does this



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -86,6 +87,32 @@ impl ExprSchemable for Expr {
                     .collect::<Result<Vec<_>>>()?;
                 Ok((fun.return_type)(&data_types)?.as_ref().clone())
             }
+            Expr::Unnest(Unnest { array_exprs }) => {
+                let data_types = array_exprs
+                    .iter()
+                    .map(|e| e.get_type(schema))
+                    .collect::<Result<Vec<_>>>()?;
+
+                if data_types.is_empty() {
+                    return Err(DataFusionError::NotImplemented(
+                        "Unnest does not support empty data types".to_string(),
+                    ));
+                }
+
+                // Use a HashSet to efficiently check for unique data types
+                let unique_data_types: HashSet<_> = data_types.iter().collect();
+
+                // If there is more than one unique data type, return an error
+                if unique_data_types.len() > 1 {
+                    return Err(DataFusionError::NotImplemented(
+                        "Unnest does not support inconsistent data types".to_string(),

Review Comment:
   It would be nice to include the types in this error message as well (so it is clear if for example the mismatch is `Int32` and `Int64`



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1426,6 +1542,141 @@ pub fn unnest(input: LogicalPlan, column: Column) -> Result<LogicalPlan> {
     }))
 }
 
+// Create function name for unnest
+// Different from create_function_physical_name, so we create a customize one.
+// We need fun(arg1,arg2,arg3) format, but not fun(arg1, arg2, arg3).
+// TODO: We dont need this if we can customize format in impl::Display for Expr
+fn create_unnest_arguments_name(array_exprs: Vec<Expr>) -> Result<Vec<String>> {
+    array_exprs
+        .iter()
+        .map(|e| match e {
+            Expr::ScalarFunction(ScalarFunction { fun, args, .. }) => {
+                let arg_str = args
+                    .iter()
+                    .map(|e| e.to_string())
+                    .collect::<Vec<String>>()
+                    .join(",");
+
+                Ok(format!("{fun}({arg_str})"))
+            }
+            Expr::Literal(sv) => {
+                let name = format!("{sv:?}");
+                Ok(name)
+            }
+            e => Err(DataFusionError::NotImplemented(format!(
+                "Expr {e} is not supported in UNNEST"
+            ))),
+        })
+        .collect::<Result<Vec<String>>>()
+}
+
+/// Create unnest plan from arrays.
+fn build_unnest_plans(

Review Comment:
   I am surprised this can't use https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html#variant.Unnest 



##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])
+    pub fn flatten(&self) -> Self {
+        self.try_flatten().unwrap_or_else(|_| self.clone())
+    }
+
+    pub fn try_flatten(&self) -> Result<Self> {

Review Comment:
   Can we please add documentation about what `try_flatten` does and how it is different than `flatten`?



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----
+1 2 4 7 9
+NULL 3 5 8 NULL
+NULL NULL 6 NULL NULL
+
+query I
+select * from unnest(make_array(1,2,3)) as data
+----
+1
+2
+3
+
+query II
+select * from unnest(make_array(1,2,3),make_array(7,6,5,4)) as data(a,b) order by b
+----
+NULL 4
+3 5
+2 6
+1 7
+
+query ?T?I
+select * from arrays_unnest;
+----
+[1, 2] A [1, 2] 3
+NULL B [4] 5
+[3] C [6, 7, 8] 9
+
+# TODO: Unnest columns fails
+query error DataFusion error: SQL error: TokenizerError\("Unterminated string literal at Line: 2, Column 95"\)
+caused by
+Internal error: UNNEST only supports list type\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Review Comment:
   this appears still to be NotImplemented



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1666829532

   I will review this PR tomorrow, Monday


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1668756534

   > As this PR is large and I would like to move it along, I suggest
   > 
   > 1. Remove `flatten` (we can put that into another PR)
   > 2. Make a PR with just the `Expr::Unnest` -- maybe this could simply create a `LogicalPlan::Unnest`?
   > 3. Make a PR to add any additional features needed for unnest (like multiple column support, for example)
   
   Let me work on https://github.com/apache/arrow-datafusion/issues/6995 first since I don't think simply creating LogicalPlan from Expr::Unnest is trivial.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1283874181


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   Internal error: Optimizer rule 'unnest_expressions' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(1),Int64(2),Int64(3)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(4),Int64(5)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: "make_array(Int64(1),Int64(2),Int64(3))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
  }, DFField { qualifier: None, field: Field { name: "make_array(Int64(4),Int64(5),NULL)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }
   
   I don't know how to fix these schema changes
   
   name changes: unnest(make_array(Int64(1),Int64(2),Int64(3))) -> make_array(Int64(1),Int64(2),Int64(3))
   data_type changes: List(int64) -> int64



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280737663


##########
datafusion/sql/tests/sql_integration.rs:
##########
@@ -2962,36 +2962,37 @@ fn cte_with_no_column_names() {
 
 #[test]
 fn cte_with_column_names() {
-    let sql = "WITH \
-        numbers(a, b, c) AS ( \
-            SELECT 1, 2, 3 \
-        ) \
-        SELECT * FROM numbers;";
+    let sql = r#"WITH
+        numbers(a, b, c) AS (
+            SELECT 1, 2, 3
+        )
+        SELECT * FROM numbers;"#;
 
-    let expected = "Projection: numbers.a, numbers.b, numbers.c\
-        \n  SubqueryAlias: numbers\
-        \n    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c\
-        \n      Projection: Int64(1), Int64(2), Int64(3)\
-        \n        EmptyRelation";
+    let expected = r#"Projection: numbers.a, numbers.b, numbers.c
+  SubqueryAlias: numbers
+    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c
+      Projection: Int64(1), Int64(2), Int64(3)
+        EmptyRelation"#;
 
-    quick_test(sql, expected)
+    quick_test(sql, expected);
 }
 
 #[test]
 fn cte_with_column_aliases_precedence() {

Review Comment:
   This change is not related to unnest directly, but they are affected by adding the `UnnestExpression` layer, so they are better changed together in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1699328788

   > 2. Make a PR with just the Expr::Unnest -- maybe this could simply create a LogicalPlan::Unnest?
   
   To create LogicalPlan::Unnest, it seems we can only do that in the optimization step. But this can't avoid schema change, due to `name` unnest([1,2,3]) -> [1,2,3] and `type` list(i64) -> i64.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280733598


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   There is the schema mismatch bug, I thought it is fine for `unnest`, let me take a look carefully.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #6796: Implement basic unnest function
URL: https://github.com/apache/arrow-datafusion/pull/6796


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6796: Implement unnest function

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1246664186


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok

Review Comment:
   Can you create a test with multidimensional arrays.
   The result should be:
   ```
   postgres=# select unnest(array[array[1, 2, 3], array[4, 5, 6]]);
    unnest 
   --------
         1
         2
         3
         4
         5
         6
   (6 rows)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1249292027


##########
datafusion/sql/src/planner.rs:
##########
@@ -228,12 +228,10 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         plan: LogicalPlan,
         alias: TableAlias,
     ) -> Result<LogicalPlan> {
-        let apply_name_plan = LogicalPlan::SubqueryAlias(SubqueryAlias::try_new(
-            plan,
-            self.normalizer.normalize(alias.name),
-        )?);
-
-        self.apply_expr_alias(apply_name_plan, alias.columns)
+        let plan = self.apply_expr_alias(plan, alias.columns)?;

Review Comment:
   The order is changed so that we can get either `expr` or `table. expr`.
   
   Example in array.slt
   table.expr: `SELECT sum(data.a), sum(data.b) FROM unnest(make_array(1, 2, 3), make_array(4, 5, 6, 7)) as data(a, b);`
   expr: `SELECT sum(a), sum(b) FROM unnest(make_array(1, 2, 3), make_array(4, 5, 6, 7)) as data(a, b);`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1655483678

   @izveigor @alamb Ready for review, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] vincev commented on pull request #6796: Implement basic unnest function

Posted by "vincev (via GitHub)" <gi...@apache.org>.
vincev commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1656762357

   I had quick look through the code to see if we can apply `unnest_column`, if I understand correctly we create a `LogicalPlan` for each array column then apply a row number window function to each plan and then join all these plans with a `Full` join. 
   
   I guess this makes sense when we create arrays in the select expression like:
   
   ```
   select unnest(make_array(1,2,3,4,5)), unnest(make_array(6,7));
   ```
   
   `unnest_column` is useful when we work on a full table that contains a arrays and non arrays like:
   
   ```
   ┌──────────────┬───────┐
   │      l       │   m   │
   │   int32[]    │ int32 │
   ├──────────────┼───────┤
   │ [11]         │     1 │
   │ [21, 22]     │     2 │
   │ [31, 32, 33] │     3 │
   │ [41, 42]     │     4 │
   │ [51]         │     5 │
   └──────────────┴───────┘
   ```
   
   Then a statement like:
   
   ```
   select unnest(l), m from table;
   ```
   
   should call `unnest_column` on column `l` that will take take care of unnesting and null handling (see https://github.com/apache/arrow-datafusion/pull/7088). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1281762658


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   let me check if those two are necessary or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1637109129

   @alamb Kindly remind you. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1618221578

   > I was planning to release sqlparser in a few weeks -- but if it is blocking this PR I can find time to make a release sooner
   
   I hope sqlparser can be released soon! Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278474676


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I agree this syntax is better if we like partial unnest, e.g. `select unnest(a), b, unnest(c) from table`.
   Clickhouse has a similar syntax https://clickhouse.com/docs/en/sql-reference/functions/array-join.
   
   `select * from unnest()` is equivalent to `select unnest(a),unnest(b),unnest(c) from table` but with simplified syntax, so it is nice to keep this syntax.
   https://www.postgresql.org/docs/current/functions-array.html
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1283877277


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1418,6 +1534,141 @@ pub fn unnest(input: LogicalPlan, column: Column) -> Result<LogicalPlan> {
     }))
 }
 
+// Create function name for unnest
+// Different from create_function_physical_name, so we create a customize one.
+// We need fun(arg1,arg2,arg3) format, but not fun(arg1, arg2, arg3).
+// TODO: We dont need this if we can customize format in impl::Display for Expr
+fn create_unnest_arguments_name(array_exprs: Vec<Expr>) -> Result<Vec<String>> {

Review Comment:
   New function to review



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6796: Implement basic unnest function

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1656098986

   I will review this PR tomorrow. But I think @vincev @smiklos opinions are more important due to their great interest in the topic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280740817


##########
datafusion/optimizer/src/unnest_expressions.rs:
##########
@@ -0,0 +1,80 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr::expr::Alias;
+use datafusion_expr::expr::Unnest;
+use datafusion_expr::Expr;
+use datafusion_expr::LogicalPlan;
+use datafusion_expr::LogicalPlanBuilder;
+
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+
+#[derive(Default)]
+pub struct UnnestExpressions {}
+
+impl OptimizerRule for UnnestExpressions {
+    fn name(&self) -> &str {
+        "unnest_expressions"
+    }
+
+    fn try_optimize(
+        &self,
+        plan: &LogicalPlan,
+        _config: &dyn OptimizerConfig,
+    ) -> Result<Option<LogicalPlan>> {
+        Ok(Some(Self::optimize_internal(plan)?))
+    }
+}
+
+impl UnnestExpressions {
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    fn optimize_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
+        if let LogicalPlan::Projection(_) = plan {

Review Comment:
   Recursively visiting means there might have multiple layers of unnest expressions, like `unnest( unnest () )`. I think we might not need them, at least in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1281272125


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   Two differences in the schema.
   1. qualifier with sq0, sq1..., since I wrap each unnest column with subquery
   2. column name, I rename them to col0, col1..., just for simplicity.
   
   @alamb Does it makes sense to skip these two checks and only check the other?
   If not, is there an existing API to change the schema for LogicalPlan, so I can set None to qualifier and create a new field with the original name?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1346614902


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -88,6 +88,32 @@ impl ExprSchemable for Expr {
                     .collect::<Result<Vec<_>>>()?;
                 Ok((fun.return_type)(&data_types)?.as_ref().clone())
             }
+            Expr::Unnest(Unnest { array_exprs, .. }) => {
+                let data_types = array_exprs
+                    .iter()
+                    .map(|e| e.get_type(schema))
+                    .collect::<Result<Vec<_>>>()?;
+
+                if data_types.is_empty() {
+                    return Err(DataFusionError::NotImplemented(

Review Comment:
   TODO: use macro



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278474676


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I agree this syntax is better if we like partial unnest, e.g. select unnest(a), b, unnest(c) from table.
   Clickhouse has the similar syntax https://clickhouse.com/docs/en/sql-reference/functions/array-join.
   
   `select * from unnest()` is equivalent to `select unnest(a),unnest(b),unnest(c) from table`.
   https://www.postgresql.org/docs/current/functions-array.html
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278475130


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I think we can keep `select * from unnest()` syntax, since it is equivalent to `select unnest(a), unnest(b)...unnest(z) from table` but simplified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1657101599

   This looks awesome -- thank you @jayzhan211  -- I plan to take a look later today or tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1283874181


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   After removing alias, I still get the schema mismatch error
   
   Internal error: Optimizer rule 'unnest_expressions' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(1),Int64(2),Int64(3)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(4),Int64(5)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: "make_array(Int64(1),Int64(2),Int64(3))", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
  }, DFField { qualifier: None, field: Field { name: "make_array(Int64(4),Int64(5),NULL)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }
   
   I don't know how to fix these schema changes
   
   name changes: unnest(make_array(Int64(1),Int64(2),Int64(3))) -> make_array(Int64(1),Int64(2),Int64(3))
   data_type changes: List(int64) -> int64



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1619040806

   Next release is tracked by https://github.com/sqlparser-rs/sqlparser-rs/issues/905
   
   I will try and prioritize it but this week is short for me because of the US holiday (and I am spending most of my DataFusion time on #6800)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1247850168


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok

Review Comment:
   This does not work in this PR, I plan to complete these in another PR. I will add it to the task list.



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ?
+select unnest(make_array(1,2,3,4,5))

Review Comment:
   This one also.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1617191608

   Wait for sqlparser 0.36


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1346620948


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1038,6 +1042,114 @@ impl LogicalPlanBuilder {
     pub fn unnest_column(self, column: impl Into<Column>) -> Result<Self> {
         Ok(Self::from(unnest(self.plan, column.into())?))
     }
+
+    pub fn join_unnest_plans(

Review Comment:
   We need `join` function for each unnested column. Otherwise, we get something like (1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, null).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1666000777

   I have this on my list for review, but may not get to it today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1286491902


##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])
+    pub fn flatten(&self) -> Self {
+        self.try_flatten().unwrap_or_else(|_| self.clone())
+    }
+
+    pub fn try_flatten(&self) -> Result<Self> {

Review Comment:
   Since we only support flatten on array now so I check if it is make_array.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280733598


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   There is the schema mismatch bug, I thought it is fine for `unnest`, so I just bypass it, let me take a look carefully.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1283877560


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   Need to review again



##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   Need to review again



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1346614106


##########
datafusion/expr/src/expr.rs:
##########
@@ -970,6 +987,43 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    /// Flatten the nested array expressions until the base array is reached.
+    /// For example:
+    ///  unnest([[1, 2, 3], [4, 5, 6]]) => unnest([1, 2, 3, 4, 5, 6])
+    pub fn flatten(&self) -> Self {
+        self.try_flatten().unwrap_or_else(|_| self.clone())
+    }
+
+    pub fn try_flatten(&self) -> Result<Self> {
+        match self {
+            Self::ScalarFunction(ScalarFunction {
+                fun: built_in_function::BuiltinScalarFunction::MakeArray,
+                args,
+            }) => {
+                let flatten_args: Vec<Expr> =
+                    args.iter().flat_map(Self::flatten_internal).collect();
+                Ok(Self::ScalarFunction(ScalarFunction {
+                    fun: built_in_function::BuiltinScalarFunction::MakeArray,
+                    args: flatten_args,
+                }))
+            }
+            _ => Err(DataFusionError::Internal(format!(

Review Comment:
   The updated version is in https://github.com/apache/arrow-datafusion/pull/7461/files



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1655485755

   @vincev @smiklos You might be interested about Unnest function.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278474676


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I agree this syntax is better if we like partial unnest, e.g. `select unnest(a), b, unnest(c) from table`.
   Clickhouse has a similar syntax https://clickhouse.com/docs/en/sql-reference/functions/array-join.
   
   `select * from unnest()` is equivalent to `select unnest(a),unnest(b),unnest(c) from table` but with simplified syntax, so I think keeping it is also nice.
   https://www.postgresql.org/docs/current/functions-array.html
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1655153526

   Possible improvement:
   Utilize the existing unnest_columns function instead building a new one.
   
   TODO in another PR
   - [ ] Support column-wise unnest
   - [ ] Able to preserve nulls
   - [ ] Select partial columns
   - [ ] Subquery


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-2043864051

   > I think DataFusion now supports UNNEST -- is this PR still being worked on?
   
   We support single columns only, the challenging one multiple columns has not yet been supported, I remember @jonahgao is working on it. I forgot which PR you mentioned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1247850168


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok

Review Comment:
   This does not work in this PR, I plan to complete it in another PR. I will add it to the task list.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1310386655


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   For `select unnest([1,2,3]), unnest(4,5)`,
   
   I found that if I do the `unnest` inside optimization step, I can not avoid the schema change like List(i64) -> i64.
   Therefore, I tried to do the `unnest` for in PhysicalExpr evaluation but realized that we could only convert unnest expr one by one. This means we need to `align the size` of them after `evaluation`. To align the result, we need to add nulls in `ProjectionStream.batch_project`. Not sure if is it a good place to add nulls for unnested_array, any good idea?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1283876832


##########
datafusion/expr/src/expr.rs:
##########
@@ -931,6 +948,40 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    pub fn flatten(&self) -> Self {

Review Comment:
   Not exactly the opposite of unnest, we expect to get the 1D list of elements in the nested array in this function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280747311


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   External error: query failed: DataFusion error: unnest_expressions
   caused by
   Internal error: Optimizer rule 'unnest_expressions' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(1),Int64(2),Int64(3)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "unnest(make_array(Int64(4),Int64(5)))", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { fields: [DFField { qualifier: Some(Bare { table: "sq0" }), field: Field { name: "col0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField
  { qualifier: Some(Bare { table: "sq1" }), field: Field { name: "col1", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
   [SQL] select unnest(make_array(1,2,3)),
          unnest(make_array(4,5))
   ;



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280693923


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;

Review Comment:
   Unless the order is not important for the function, but I think we should check the order for `Unnest`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280737663


##########
datafusion/sql/tests/sql_integration.rs:
##########
@@ -2962,36 +2962,37 @@ fn cte_with_no_column_names() {
 
 #[test]
 fn cte_with_column_names() {
-    let sql = "WITH \
-        numbers(a, b, c) AS ( \
-            SELECT 1, 2, 3 \
-        ) \
-        SELECT * FROM numbers;";
+    let sql = r#"WITH
+        numbers(a, b, c) AS (
+            SELECT 1, 2, 3
+        )
+        SELECT * FROM numbers;"#;
 
-    let expected = "Projection: numbers.a, numbers.b, numbers.c\
-        \n  SubqueryAlias: numbers\
-        \n    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c\
-        \n      Projection: Int64(1), Int64(2), Int64(3)\
-        \n        EmptyRelation";
+    let expected = r#"Projection: numbers.a, numbers.b, numbers.c
+  SubqueryAlias: numbers
+    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c
+      Projection: Int64(1), Int64(2), Int64(3)
+        EmptyRelation"#;
 
-    quick_test(sql, expected)
+    quick_test(sql, expected);
 }
 
 #[test]
 fn cte_with_column_aliases_precedence() {

Review Comment:
   This change is not related to unnest directly, but they are affected by adding the `UnnestExpression` layer, so they are better changed together in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1617786945

   I was planning to release sqlparser in a few weeks -- but if it is blocking this PR I can find time to make a release sooner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1715821967

   I will try and review this more carefully later today and provide feedback. Thank you @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1681117382

   Marking as draft to signify it is not waiting on review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278474676


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I also agree this syntax is better if we like partial unnest, e.g. select unnest(a), b, unnest(c) from table.
   Clickhouse has the similar syntax https://clickhouse.com/docs/en/sql-reference/functions/array-join



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I agree this syntax is better if we like partial unnest, e.g. select unnest(a), b, unnest(c) from table.
   Clickhouse has the similar syntax https://clickhouse.com/docs/en/sql-reference/functions/array-join



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6796: Implement unnest function

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1246674117


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ?
+select unnest(make_array(1,2,3,4,5))

Review Comment:
   I also recommend to create test with one of aggregate functions as stated in the description of the issue: https://github.com/apache/arrow-datafusion/issues/6555
   ```
   SELECT sum(a) AS total FROM (SELECT unnest(make_array(3, 5, 6) AS a) AS b;
   ----
   14
   ``` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1612589514

   Cleanup done, ready for review.
   @izveigor @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1247850285


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -531,3 +531,49 @@ SELECT
 FROM t
 ----
 true true
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ?
+select unnest(make_array(1,2,3,4,5))

Review Comment:
   This one also. Add to task list



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] vincev commented on pull request #6796: Implement basic unnest function

Posted by "vincev (via GitHub)" <gi...@apache.org>.
vincev commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1656123080

   I'll try to take a look tomorrow, yes I agree  maybe we can simplify the code if we can use `unnest_column` somehow.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1278475130


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----

Review Comment:
   I think we can keep `select * from unnest()` syntax, since it is equivalent to `select unnest(a), unnest(b)...unnest(z) from table` but simplified



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6796: Implement basic unnest function

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1279366021


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;

Review Comment:
   Another way to get deterministic results rather than changing `target_partitions` here is to use [`rowsort`](https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/README.md#sqllogictests) sort_mode
   
   So instead of
   
   ```
   query ??? 
   ```
   Use
   
   ```
   query ??? rowsort
   ```
   
   Which will sort the output after execution
   



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;
+
+query ??
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5))
+;
+----
+1 4
+2 5
+3 NULL
+
+query ???
+select unnest(make_array(1,2,3)),
+       unnest(make_array(4,5)),
+       unnest(make_array(6,7,8,9))
+;
+----
+1 4 6
+2 5 7
+3 NULL 8
+NULL NULL 9
+
+query ???
+select unnest(make_array(1,2,3,4,5)),
+       unnest(make_array(6,7)),
+       unnest(make_array(8,9,10,11,22,33))
+;
+----
+1 6 8
+2 7 9
+3 NULL 10
+4 NULL 11
+5 NULL 22
+NULL NULL 33
+
+# Select From
+
+query IIIII
+select * from unnest(
+  make_array(1), 
+  make_array(2,3), 
+  make_array(4,5,6),
+  make_array(7,8),
+  make_array(9)
+);
+----
+1 2 4 7 9
+NULL 3 5 8 NULL
+NULL NULL 6 NULL NULL
+
+query I
+select * from unnest(make_array(1,2,3)) as data
+----
+1
+2
+3
+
+query II
+select * from unnest(make_array(1,2,3),make_array(7,6,5,4)) as data(a,b) order by b
+----
+NULL 4
+3 5
+2 6
+1 7
+
+query ?T?I
+select * from arrays_unnest;
+----
+[1, 2] A [1, 2] 3
+NULL B [4] 5
+[3] C [6, 7, 8] 9
+
+# TODO: Unnest columns fails
+query error DataFusion error: SQL error: TokenizerError\("Unterminated string literal at Line: 2, Column 95"\)
+caused by
+Internal error: UNNEST only supports list type\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Review Comment:
   Can we please change this to `NotImplemented` rather than `InternalError` (so people don't file bug reports :) )



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -84,6 +85,16 @@ impl ExprSchemable for Expr {
                     .collect::<Result<Vec<_>>>()?;
                 Ok((fun.return_type)(&data_types)?.as_ref().clone())
             }
+            Expr::Unnest(Unnest { array_exprs }) => {
+                let data_types = array_exprs
+                    .iter()
+                    .map(|e| e.get_type(schema))
+                    .collect::<Result<Vec<_>>>()?;
+                // Consider data types are the same for now
+                // TODO: Return correct data type for inconsistent data types

Review Comment:
   Does this mean ensure all arguments should be the same type? Maybe we can return an "Not yet implemented" error if they are inconsistent



##########
datafusion/core/src/datasource/listing/helpers.rs:
##########
@@ -90,6 +90,7 @@ pub fn expr_applicable_for_cols(col_names: &[String], expr: &Expr) -> bool {
             | Expr::GroupingSet(_)
             | Expr::Case { .. } => VisitRecursion::Continue,
 
+            Expr::Unnest { .. } => todo!("Unnest not implemented yet"),

Review Comment:
   Can we please change this to return `DataFusionError::NotImplemented` rather than `todo()` (which will panic)?



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -84,6 +85,16 @@ impl ExprSchemable for Expr {
                     .collect::<Result<Vec<_>>>()?;
                 Ok((fun.return_type)(&data_types)?.as_ref().clone())
             }
+            Expr::Unnest(Unnest { array_exprs }) => {
+                let data_types = array_exprs
+                    .iter()
+                    .map(|e| e.get_type(schema))
+                    .collect::<Result<Vec<_>>>()?;
+                // Consider data types are the same for now
+                // TODO: Return correct data type for inconsistent data types
+                let return_type = data_types.first().unwrap();

Review Comment:
   This panic's when there are no arguments -- can we please make it return a real error instead?
   
   ```rust
   DataFusion CLI v28.0.0
   ❯ select unnest();
   thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/alamb/Software/arrow-datafusion2/datafusion/expr/src/expr_schema.rs:95:54
   stack backtrace:
      0: rust_begin_unwind
   ```



##########
datafusion/expr/src/expr.rs:
##########
@@ -931,6 +948,40 @@ impl Expr {
     pub fn contains_outer(&self) -> bool {
         !find_out_reference_exprs(self).is_empty()
     }
+
+    pub fn flatten(&self) -> Self {

Review Comment:
   Can we please add doc comments about what this function does? Bonus points for an example.
   
   Perhaps also point out it is the opposite of `unnest`?



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1038,6 +1042,114 @@ impl LogicalPlanBuilder {
     pub fn unnest_column(self, column: impl Into<Column>) -> Result<Self> {
         Ok(Self::from(unnest(self.plan, column.into())?))
     }
+
+    pub fn join_unnest_plans(
+        unnest_plans: Vec<LogicalPlan>,
+        columns_name: Vec<String>,
+    ) -> Result<LogicalPlan> {
+        // Add row_number for each unnested array
+        let window_func_expr = Expr::WindowFunction(expr::WindowFunction::new(
+            WindowFunction::BuiltInWindowFunction(BuiltInWindowFunction::RowNumber),
+            vec![],
+            vec![],
+            vec![],
+            WindowFrame::new(false),
+        ));
+        let window_func_exprs = vec![window_func_expr.clone()];
+
+        let window_plans = unnest_plans
+            .into_iter()
+            .map(|plan| LogicalPlanBuilder::window_plan(plan, window_func_exprs.clone()))
+            .collect::<Result<Vec<LogicalPlan>>>()?;
+
+        // Create alias for row number
+        let row_numbers_name: Vec<String> = (0..columns_name.len())
+            .map(|idx| format!("rn{}", idx))
+            .collect();
+        let project_exprs: Vec<Vec<Expr>> = columns_name
+            .iter()
+            .zip(row_numbers_name.iter())
+            .map(|(col_name, row_number_name)| {
+                vec![
+                    ident(col_name),
+                    window_func_expr.clone().alias(row_number_name),
+                ]
+            })
+            .collect();
+        let project_plans = window_plans
+            .iter()
+            .zip(project_exprs.into_iter())
+            .map(|(plan, expr)| {
+                LogicalPlanBuilder::from(plan.clone())
+                    .project(expr)?
+                    .build()
+            })
+            .collect::<Result<Vec<LogicalPlan>>>()?;
+
+        // Wrap each unnested array into a subquery
+        let subqueries_alias: Vec<String> = (0..project_plans.len())
+            .map(|idx| format!("sq{}", idx))
+            .collect();
+        let subqueries_alias_plan = project_plans
+            .into_iter()
+            .zip(subqueries_alias.iter())
+            .map(|(plan, alias)| {
+                LogicalPlanBuilder::from(plan).alias(alias.clone())?.build()
+            })
+            .collect::<Result<Vec<LogicalPlan>>>()?;
+
+        // Create alias for columns to apply join
+        let columns_to_join_on = subqueries_alias
+            .iter()
+            .zip(row_numbers_name.iter())
+            .map(|(alias, rn)| col(format!("{}.{}", alias, rn)))
+            .collect::<Vec<Expr>>();
+
+        let (join_plan, _) = subqueries_alias_plan
+            .iter()
+            .zip(columns_to_join_on.iter())
+            .skip(1)
+            .fold(
+                Ok((
+                    subqueries_alias_plan[0].clone(),
+                    columns_to_join_on[0].clone(),
+                )),
+                |result: Result<(LogicalPlan, Expr)>, (right_plan, right_column)| {
+                    result.and_then(|(left_plan, left_column)| {
+                        let plan = LogicalPlanBuilder::from(left_plan)
+                            .join(
+                                right_plan.clone(),
+                                JoinType::Full,
+                                (Vec::<Column>::new(), Vec::<Column>::new()),
+                                Some(left_column.eq(right_column.clone())),
+                            )?
+                            .build()?;
+                        Ok((plan, right_column.clone()))
+                    })
+                },
+            )?;
+
+        // Select unnested array only, row_number is not needed
+        let selected_exprs: Vec<Expr> = subqueries_alias
+            .into_iter()
+            .zip(columns_name.into_iter())
+            .map(|(sq, col_name)| col(format!("{sq}.{col_name}")))
+            .collect();
+
+        LogicalPlanBuilder::from(join_plan)
+            .project(selected_exprs)?
+            .build()
+    }
+
+    pub fn unnest_arrays(self, array_exprs: Vec<Expr>) -> Result<Self> {

Review Comment:
   likewise this public API should also have documentation I think 



##########
datafusion/core/src/datasource/listing/helpers.rs:
##########
@@ -90,6 +90,7 @@ pub fn expr_applicable_for_cols(col_names: &[String], expr: &Expr) -> bool {
             | Expr::GroupingSet(_)
             | Expr::Case { .. } => VisitRecursion::Continue,
 
+            Expr::Unnest { .. } => todo!("Unnest not implemented yet"),

Review Comment:
   It would also be fine and conservative to set `is_applicable = false` for unnest and return `Stop`



##########
datafusion/sql/tests/sql_integration.rs:
##########
@@ -2962,36 +2962,37 @@ fn cte_with_no_column_names() {
 
 #[test]
 fn cte_with_column_names() {
-    let sql = "WITH \
-        numbers(a, b, c) AS ( \
-            SELECT 1, 2, 3 \
-        ) \
-        SELECT * FROM numbers;";
+    let sql = r#"WITH
+        numbers(a, b, c) AS (
+            SELECT 1, 2, 3
+        )
+        SELECT * FROM numbers;"#;
 
-    let expected = "Projection: numbers.a, numbers.b, numbers.c\
-        \n  SubqueryAlias: numbers\
-        \n    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c\
-        \n      Projection: Int64(1), Int64(2), Int64(3)\
-        \n        EmptyRelation";
+    let expected = r#"Projection: numbers.a, numbers.b, numbers.c
+  SubqueryAlias: numbers
+    Projection: Int64(1) AS a, Int64(2) AS b, Int64(3) AS c
+      Projection: Int64(1), Int64(2), Int64(3)
+        EmptyRelation"#;
 
-    quick_test(sql, expected)
+    quick_test(sql, expected);
 }
 
 #[test]
 fn cte_with_column_aliases_precedence() {

Review Comment:
   are the changes in this test related to unnest? Maybe we could break them out into their own PR for review



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1431,6 +1543,122 @@ pub fn unnest(input: LogicalPlan, column: Column) -> Result<LogicalPlan> {
     }))
 }
 
+/// Create unnest plan from arrays.
+pub fn build_unnest_plans(

Review Comment:
   SHould this be public?



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1038,6 +1042,114 @@ impl LogicalPlanBuilder {
     pub fn unnest_column(self, column: impl Into<Column>) -> Result<Self> {
         Ok(Self::from(unnest(self.plan, column.into())?))
     }
+
+    pub fn join_unnest_plans(

Review Comment:
   Can we please add some doc comments about what this function does and what its arguments are?



##########
datafusion/proto/src/logical_plan/to_proto.rs:
##########
@@ -717,6 +717,7 @@ impl TryFrom<&Expr> for protobuf::LogicalExprNode {
                         .to_string(),
                 ))
             }
+            Expr::Unnest(..) => todo!("Unnest not supported"),

Review Comment:
   Also should return an error rather than panic



##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   this doesn't see right to me -- maybe there is a bug in the output schema that `Unnest` reports?



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -84,6 +85,16 @@ impl ExprSchemable for Expr {
                     .collect::<Result<Vec<_>>>()?;
                 Ok((fun.return_type)(&data_types)?.as_ref().clone())
             }
+            Expr::Unnest(Unnest { array_exprs }) => {
+                let data_types = array_exprs
+                    .iter()
+                    .map(|e| e.get_type(schema))
+                    .collect::<Result<Vec<_>>>()?;
+                // Consider data types are the same for now
+                // TODO: Return correct data type for inconsistent data types
+                let return_type = data_types.first().unwrap();

Review Comment:
   This panic's when there are no arguments -- can we please make it return a real error instead?
   
   ```rust
   DataFusion CLI v28.0.0
   ❯ select unnest();
   thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /Users/alamb/Software/arrow-datafusion2/datafusion/expr/src/expr_schema.rs:95:54
   stack backtrace:
      0: rust_begin_unwind
   ```



##########
datafusion/optimizer/src/unnest_expressions.rs:
##########
@@ -0,0 +1,80 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr::expr::Alias;
+use datafusion_expr::expr::Unnest;
+use datafusion_expr::Expr;
+use datafusion_expr::LogicalPlan;
+use datafusion_expr::LogicalPlanBuilder;
+
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+
+#[derive(Default)]
+pub struct UnnestExpressions {}
+
+impl OptimizerRule for UnnestExpressions {
+    fn name(&self) -> &str {
+        "unnest_expressions"
+    }
+
+    fn try_optimize(
+        &self,
+        plan: &LogicalPlan,
+        _config: &dyn OptimizerConfig,
+    ) -> Result<Option<LogicalPlan>> {
+        Ok(Some(Self::optimize_internal(plan)?))
+    }
+}
+
+impl UnnestExpressions {
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    fn optimize_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
+        if let LogicalPlan::Projection(_) = plan {

Review Comment:
   You might be able to use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.inspect_expr_pre.html or some other TreeVisit methods to recursively walk the expression trees -- or maybe you only want to walk the top level ones 🤔 



##########
datafusion/optimizer/src/unnest_expressions.rs:
##########
@@ -0,0 +1,80 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr::expr::Alias;
+use datafusion_expr::expr::Unnest;
+use datafusion_expr::Expr;
+use datafusion_expr::LogicalPlan;
+use datafusion_expr::LogicalPlanBuilder;
+
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+
+#[derive(Default)]
+pub struct UnnestExpressions {}
+
+impl OptimizerRule for UnnestExpressions {
+    fn name(&self) -> &str {
+        "unnest_expressions"
+    }
+
+    fn try_optimize(
+        &self,
+        plan: &LogicalPlan,
+        _config: &dyn OptimizerConfig,
+    ) -> Result<Option<LogicalPlan>> {
+        Ok(Some(Self::optimize_internal(plan)?))
+    }
+}
+
+impl UnnestExpressions {
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    fn optimize_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
+        if let LogicalPlan::Projection(_) = plan {

Review Comment:
   You might be able to use https://docs.rs/datafusion/latest/datafusion/logical_expr/utils/fn.inspect_expr_pre.html or some other TreeVisit methods to recursively walk the expression trees -- or maybe you only want to walk the top level ones 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280740817


##########
datafusion/optimizer/src/unnest_expressions.rs:
##########
@@ -0,0 +1,80 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion_expr::expr::Alias;
+use datafusion_expr::expr::Unnest;
+use datafusion_expr::Expr;
+use datafusion_expr::LogicalPlan;
+use datafusion_expr::LogicalPlanBuilder;
+
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+
+#[derive(Default)]
+pub struct UnnestExpressions {}
+
+impl OptimizerRule for UnnestExpressions {
+    fn name(&self) -> &str {
+        "unnest_expressions"
+    }
+
+    fn try_optimize(
+        &self,
+        plan: &LogicalPlan,
+        _config: &dyn OptimizerConfig,
+    ) -> Result<Option<LogicalPlan>> {
+        Ok(Some(Self::optimize_internal(plan)?))
+    }
+}
+
+impl UnnestExpressions {
+    pub fn new() -> Self {
+        Self::default()
+    }
+
+    fn optimize_internal(plan: &LogicalPlan) -> Result<LogicalPlan> {
+        if let LogicalPlan::Projection(_) = plan {

Review Comment:
   Recursively visiting means there might have multiple layers of unnest expressions, like `unnest( unnest () )`? I think we might not need them, at least in this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1310391846


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -422,6 +424,11 @@ fn assert_schema_is_the_same(
     prev_plan: &LogicalPlan,
     new_plan: &LogicalPlan,
 ) -> Result<()> {
+    if rule_name == "unnest_expressions" {

Review Comment:
   https://github.com/apache/arrow-datafusion/blob/85ec31433615735a05d332f87cd0bdfc11aac663/datafusion/core/src/physical_plan/projection.rs#L423-L440
   
   All nulls to align the row size at L432.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on a diff in pull request #6796: Implement basic unnest function

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on code in PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#discussion_r1280692321


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -1794,9 +1802,150 @@ select make_array(f0) from fixed_size_list_array
 ----
 [[1, 2], [3, 4]]
 
+## Unnest
 
-### Delete tables
+# Set target partitions to 1 for deterministic results
+statement ok
+set datafusion.execution.target_partitions = 1;

Review Comment:
   `rowsort` might not be a good idea, if the result is in descending order, `rowsort` convert it to ascending forcibly, but if the code contains a bug that does not return in descending order, then we might miss it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-1785990023

   Marking this one as draft as it has bitrotted for a while and I think we need to work on simplifing the array code we have before we add more 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Implement basic unnest function [arrow-datafusion]

Posted by "jonahgao (via GitHub)" <gi...@apache.org>.
jonahgao commented on PR #6796:
URL: https://github.com/apache/arrow-datafusion/pull/6796#issuecomment-2044019394

   I think I might have misunderstood. Neither of the two queries below has been implemented, the one discussed in this PR seems to be the latter.
   ```sh
   DataFusion CLI v37.0.0
   ❯ select unnest(ARRAY[1,2]), unnest(ARRAY[3,4,5]);
   This feature is not implemented: Only support single unnest expression for now
   
   ❯ select * from unnest(ARRAY[1,2], ARRAY[2,3]);
   This feature is not implemented: unnest() does not support multiple arguments yet
   ```
   I plan to try to implement them both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org