You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "izveigor (via GitHub)" <gi...@apache.org> on 2023/06/13 18:50:51 UTC

[GitHub] [arrow-datafusion] izveigor opened a new pull request, #6662: feat: supports NULL in arrays

izveigor opened a new pull request, #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes https://github.com/apache/arrow-datafusion/issues/6556
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   Yes
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   Yes
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1604650455

   Hello, @alamb!
   I am well aware of your arguments about NULL handling in arrays and initially agreed with it.
   But I found a serious problem that only initial approach can solve.
   As you know arrays in PostgreSQL and other SQL database are not data types for arithmetic operations.
   They are rather data types that help you look up information and form new columns.
   For example, if we want to create table with `unnest` function (which I hope will soon be implemented in Arrow Datafusion)
   ```
   postgres=# SELECT * FROM
     unnest(
       ARRAY[1, 2, 3, 4, 5],
       ARRAY['Ringo', 'George', 'Paul', 'John', NULL],
       ARRAY[29, 27, 27, 29, NULL]
     ) AS data(id,name,age);
    id |  name  | age 
   ----+--------+-----
     1 | Ringo  |  29
     2 | George |  27
     3 | Paul   |  27
     4 | John   |  29
     5 |        |    
   (5 rows)
   ```
   Or find the information:
   ```
   postgres=# SELECT array_positions(array[1, 2, 3, 4, 5], 1);
    array_positions 
   -----------------
    {1}
   (1 row)
   ```
   As I know we can create column with NULLS values.
   So, I think NULLS must exist in their usual understanding (without casting and passes).
   I hope I explained well.
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237566894


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -99,9 +116,9 @@ select array_prepend(1, make_array(2, 3, 4)), array_prepend(1.0, make_array(2.0,
 [1, 2, 3, 4] [1.0, 2.0, 3.0, 4.0] [h, e, l, l, o]
 
 # array_fill scalar function #1
-query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
+query error DataFusion error: SQL error: TokenizerError\("Unterminated string literal at Line: 2, Column 856"\)
 caused by
-Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
+Internal error: Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}, new schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Review Comment:
   Weird, there seems to be a merge problem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1243830545


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(
+                                "The {self} function can only accept list as the args."
+                            )))
+                        }
+                    }
+                }
+
+                Ok(List(Arc::new(Field::new("item", expr_type, true))))
+            }
+            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
+            BuiltinScalarFunction::ArrayFill => Ok(Null),

Review Comment:
   https://github.com/apache/arrow-datafusion/issues/6779



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1611899986

   It would be desirable to quickly decide the fate of this PR @alamb.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1612025762

   > @alamb I think it's worth creating a separate ticket for the problem.
   
   Can you please file this ticket?
   
   >  Therefore, if there are no serious claims, I think it is worth merging this PR.
   
   Let's do it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6662: feat: supports NULL in arrays

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1597626578

   @alamb can you review this PR if you have time.
   I have some problems with `ArrayFill` function. To get its return type, we should know the length of its second element (`List`). So, I think we should create a separate ticket to discuss possible solutions..


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237573873


##########
datafusion/optimizer/src/analyzer/type_coercion.rs:
##########
@@ -635,6 +635,20 @@ fn cast_expr(expr: &Expr, to_type: &DataType, schema: &DFSchema) -> Result<Expr>
     expr.clone().cast_to(to_type, schema)
 }
 
+/// Cast array `expr` to the specified type, if possible
+fn cast_array_expr(
+    expr: &Expr,
+    from_type: &DataType,
+    to_type: &DataType,
+    schema: &DFSchema,
+) -> Result<Expr> {
+    if from_type.equals_datatype(&DataType::Null) {

Review Comment:
   For reference, this is the first way to solve the problem (see below).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1244313363


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(

Review Comment:
   Indeed: #6108 



##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -56,20 +67,29 @@ macro_rules! new_builder {
 
 macro_rules! array {
     ($ARGS:expr, $ARRAY_TYPE:ident, $BUILDER_TYPE:ident) => {{
-        // downcast all arguments to their common format
-        let args =
-            downcast_vec!($ARGS, $ARRAY_TYPE).collect::<Result<Vec<&$ARRAY_TYPE>>>()?;
-
-        let builder = new_builder!($BUILDER_TYPE, args[0].len());
+        let builder = new_builder!($BUILDER_TYPE, $ARGS[0].len());
         let mut builder =
-            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, args.len());
+            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, $ARGS.len());
+
         // for each entry in the array
-        for index in 0..args[0].len() {
-            for arg in &args {
-                if arg.is_null(index) {
-                    builder.values().append_null();
-                } else {
-                    builder.values().append_value(arg.value(index));
+        for index in 0..$ARGS[0].len() {

Review Comment:
   I would expect that the array contains nulls values 
   
   Like
   ```
   select make_array(1, Null, 2 Null);
   ---- 
   1, NULL, 2, NULL
   ```
   
   Where the element of the ListArray are marked as null 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1245649783


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -61,18 +61,35 @@ select make_array(make_array()), make_array(make_array(make_array()))
 ----
 [[]] [[[]]]
 
-# TODO issue: https://github.com/apache/arrow-datafusion/issues/6596
-# array_append scalar function #1
-query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
-caused by
-Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
+# array scalar function with nulls

Review Comment:
   these examples look much better to me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1612132299

   I merged up to resolve conflicts in `array.slt`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1625845877

   I believe this PR introduced a regression: https://github.com/apache/arrow-datafusion/issues/6887 cc @izveigor 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1610171065

   > For example, if we want to create table with unnest function (which I hope will soon be implemented in Arrow Datafusion)
   
   
   
   That would be awesome! I found you have filed https://github.com/apache/arrow-datafusion/issues/6555 to track it 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: supports NULL in arrays

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1599591121

   Hi @izveigor  -- I will review this tomorrow. Sorry I am behind


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237569510


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(

Review Comment:
   I think I should consider other functions as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237559594


##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -56,20 +67,29 @@ macro_rules! new_builder {
 
 macro_rules! array {
     ($ARGS:expr, $ARRAY_TYPE:ident, $BUILDER_TYPE:ident) => {{
-        // downcast all arguments to their common format
-        let args =
-            downcast_vec!($ARGS, $ARRAY_TYPE).collect::<Result<Vec<&$ARRAY_TYPE>>>()?;
-
-        let builder = new_builder!($BUILDER_TYPE, args[0].len());
+        let builder = new_builder!($BUILDER_TYPE, $ARGS[0].len());
         let mut builder =
-            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, args.len());
+            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, $ARGS.len());
+
         // for each entry in the array
-        for index in 0..args[0].len() {
-            for arg in &args {
-                if arg.is_null(index) {
-                    builder.values().append_null();
-                } else {
-                    builder.values().append_value(arg.value(index));
+        for index in 0..$ARGS[0].len() {

Review Comment:
   Your arguments seem reasonable.
   I have two ideas regarding the introduction of this feature:
   1) Cast all nulls to other data types:
   ```
   select make_array(1, Null, 2 Null);
   ----
   1, 0, 2, 0
   ```
   2) Ignore nulls:
   ```
   select make_array(1, Null, 2, Null);
   ----
   1, 2
   ```
   
   What do you think of these approaches @alamb?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1610170338

   > As I know we can create column with NULLS values.
   > So, I think NULLS must exist in their usual understanding (without casting and passes).
   
   Yes, I agree with this sentiment -- I think I am confused. I was imagning that the `ARRAY` would be a `ListArray` with null elements that represented NULL.
   
   My comments about `NULL`s above I think refer to how coercion works (it should be done early in the query rather than in the physical expr). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237532955


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(
+                                "The {self} function can only accept list as the args."
+                            )))
+                        }
+                    }
+                }
+
+                Ok(List(Arc::new(Field::new("item", expr_type, true))))
+            }
+            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
+            BuiltinScalarFunction::ArrayFill => Ok(Null),

Review Comment:
   The function returns nested list:
   ```
   postgres=# select array_fill(3, array[2, 3, 2]);
                   array_fill                 
   -------------------------------------------
    {{{3,3},{3,3},{3,3}},{{3,3},{3,3},{3,3}}}
   (1 row)
   ```
   Therefore it should return nested list. I think this is the serious problem because to return nested list with right dimensions we should know the length of the second argument (list).
   P.S. after the changes: https://github.com/apache/arrow-datafusion/pull/6595 it does not work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#discussion_r1237411053


##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(
+                                "The {self} function can only accept list as the args."
+                            )))
+                        }
+                    }
+                }
+
+                Ok(List(Arc::new(Field::new("item", expr_type, true))))
+            }
+            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
+            BuiltinScalarFunction::ArrayFill => Ok(Null),

Review Comment:
   I don't understand why ArrayFill always returns `Null` as its data type
   
   My reading of https://www.postgresql.org/docs/9.1/functions-array.html suggests that it should return something like `List(args[0].type)`
   
   



##########
datafusion/optimizer/src/analyzer/type_coercion.rs:
##########
@@ -635,6 +635,20 @@ fn cast_expr(expr: &Expr, to_type: &DataType, schema: &DFSchema) -> Result<Expr>
     expr.clone().cast_to(to_type, schema)
 }
 
+/// Cast array `expr` to the specified type, if possible
+fn cast_array_expr(
+    expr: &Expr,
+    from_type: &DataType,
+    to_type: &DataType,
+    schema: &DFSchema,
+) -> Result<Expr> {
+    if from_type.equals_datatype(&DataType::Null) {

Review Comment:
   I would have thought we should be casting all the arguments that are `null` to the specific type of the rest of the arguments...



##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -99,9 +116,9 @@ select array_prepend(1, make_array(2, 3, 4)), array_prepend(1.0, make_array(2.0,
 [1, 2, 3, 4] [1.0, 2.0, 3.0, 4.0] [h, e, l, l, o]
 
 # array_fill scalar function #1
-query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
+query error DataFusion error: SQL error: TokenizerError\("Unterminated string literal at Line: 2, Column 856"\)
 caused by
-Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
+Internal error: Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}, new schema: DFSchema \{ fields: \[DFField \{ qualifier: None, field: Field \{ name: "array_fill\(Int64\(1\),make_array\(\)\)", data_type: List\(Field \{ name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: \{\} \} \}\], metadata: \{\} \}\. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Review Comment:
   something seems wrong with this test



##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -22,12 +22,23 @@ use arrow::buffer::Buffer;
 use arrow::compute;
 use arrow::datatypes::{DataType, Field};
 use core::any::type_name;
-use datafusion_common::cast::as_list_array;
+use datafusion_common::cast::{as_generic_string_array, as_int64_array, as_list_array};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 use std::sync::Arc;
 
+macro_rules! downcast_arg {

Review Comment:
   I think this is the same as `downcast_value`: https://docs.rs/datafusion-common/26.0.0/datafusion_common/macro.downcast_value.html



##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -56,20 +67,29 @@ macro_rules! new_builder {
 
 macro_rules! array {
     ($ARGS:expr, $ARRAY_TYPE:ident, $BUILDER_TYPE:ident) => {{
-        // downcast all arguments to their common format
-        let args =
-            downcast_vec!($ARGS, $ARRAY_TYPE).collect::<Result<Vec<&$ARRAY_TYPE>>>()?;
-
-        let builder = new_builder!($BUILDER_TYPE, args[0].len());
+        let builder = new_builder!($BUILDER_TYPE, $ARGS[0].len());
         let mut builder =
-            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, args.len());
+            ListBuilder::<$BUILDER_TYPE>::with_capacity(builder, $ARGS.len());
+
         // for each entry in the array
-        for index in 0..args[0].len() {
-            for arg in &args {
-                if arg.is_null(index) {
-                    builder.values().append_null();
-                } else {
-                    builder.values().append_value(arg.value(index));
+        for index in 0..$ARGS[0].len() {

Review Comment:
   I am not sure about this approach of taking either a `ListArray` or a `NullArray`
   
   In the other functions, the way NULL is treated is that the input types are always the same (in this case ListArray) and the values would be `null` (aka `array.is_valid(i)` would return false for rows that are null.
   
   Complicating matters is if you type a literal `null` in sql like:
   
   ```sql
   select array_concat([1,2], null)
   ```
   
   That comes to DataFusion as a `null` literal (with DataType::Null). The coercion / casting logic normally will coerce this to the appropriate type. 
   
   For example, here is how I think arithmetic works with null:
   
   ```sql
   select 1 + NULL
   ```
   
   Arrives like 
   ```sql
   ScalarValue::Int32(Some(1)) + ScalarValue::Null
   ```
   
   And then the coercion logic will add a cast to Int32:
   
   ```sql
   ScalarValue::Int32(Some(1)) + CAST(ScalarValue::Null, DataType::Int32)
   ```
   
   And then the constant folder will collapse this into:
   
   
   ```sql
   ScalarValue::Int32(Some(1)) + ScalarValue::Int32(None)
   ```
   
   So by the time the arithmetic kernel sees it, it only has to deal with arguments of `Int32`
   



##########
datafusion/expr/src/built_in_function.rs:
##########
@@ -440,46 +440,44 @@ impl BuiltinScalarFunction {
         // the return type of the built in function.
         // Some built-in functions' return type depends on the incoming type.
         match self {
-            BuiltinScalarFunction::ArrayAppend => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept list as the first argument"
-                ))),
-            },
-            BuiltinScalarFunction::ArrayConcat => match &input_expr_types[0] {
-                List(field) => Ok(List(Arc::new(Field::new(
-                    "item",
-                    field.data_type().clone(),
-                    true,
-                )))),
-                _ => Err(DataFusionError::Internal(format!(
-                    "The {self} function can only accept fixed size list as the args."
-                ))),
-            },
-            BuiltinScalarFunction::ArrayDims => Ok(UInt8),
-            BuiltinScalarFunction::ArrayFill => Ok(List(Arc::new(Field::new(
+            BuiltinScalarFunction::ArrayAppend => Ok(List(Arc::new(Field::new(
                 "item",
-                input_expr_types[0].clone(),
+                input_expr_types[1].clone(),
                 true,
             )))),
+            BuiltinScalarFunction::ArrayConcat => {
+                let mut expr_type = Null;
+                for input_expr_type in input_expr_types {
+                    match input_expr_type {
+                        List(field) => {
+                            if !field.data_type().equals_datatype(&Null) {
+                                expr_type = field.data_type().clone();
+                                break;
+                            }
+                        }
+                        _ => {
+                            return Err(DataFusionError::Internal(format!(

Review Comment:
   I think Internal errors are only intended for bugs in DataFusion -- this error seems like it could come from bad user input too
   
   ```suggestion
                               return Err(DataFusionError::Plan(format!(
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1605454655

   Thanks @izveigor  -- I will try and find time to review / comment on this PR in more detail


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6662: feat: support `NULL` in array functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1611907438

   I will review it again shortly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6662: feat: support `NULL` in array functions

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6662:
URL: https://github.com/apache/arrow-datafusion/pull/6662#issuecomment-1612021441

   @alamb I think it's worth creating a separate ticket for the problem. Therefore, if there are no serious claims, I think it is worth merging this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org