You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jayzhan211 (via GitHub)" <gi...@apache.org> on 2023/07/13 02:24:21 UTC

[GitHub] [arrow-datafusion] jayzhan211 opened a new pull request, #6940: Column support for array_to_string

jayzhan211 opened a new pull request, #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   Ref https://github.com/apache/arrow-datafusion/issues/6804
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   # Note
   Revert array.slt changed by #6595 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6940: Column support for array_to_string

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6940: Column support for array_to_string

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#discussion_r1264065890


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -562,10 +612,10 @@ select trim_array(make_array(1, 2, 3, 4, 5), 2), trim_array(['h', 'e', 'l', 'l',
 [1, 2, 3] [h, e] [1.0]
 
 # trim_array scalar function #2
-query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
-caused by
-Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
+query ??
 select trim_array([[1, 2], [3, 4], [5, 6]], 2), trim_array(array_fill(4, [3, 4, 2]), 2);

Review Comment:
   `trim_array` too (See: https://github.com/apache/arrow-datafusion/discussions/6855#discussioncomment-6429023) (it will be replaced with `array_slice`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on pull request #6940: Column support for array_to_string

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#issuecomment-1636292633

   I think this is the last function that needs column support.
   The remaining functions will already be carried out in accordance with the new concept:
   Thank you for taking on this issue, @jayzhan211!
   P.S. Tomorrow I will create new tickets according to the new concept.
   cc @alamb


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jayzhan211 commented on pull request #6940: Column support for array_to_string

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#issuecomment-1636567035

   I think those are good suggestions but as you said we could improve them
   iteratively. I think we can merge this PR for now! 🙂
   
   On Sat, Jul 15, 2023, 4:58 AM Andrew Lamb ***@***.***> wrote:
   
   > ***@***.**** approved this pull request.
   >
   > Thank you @jayzhan211 <https://github.com/jayzhan211> and @izveigor
   > <https://github.com/izveigor>
   >
   > I think there are some improvements / issues with the code in this PR.
   > However, I think it is a step forward for datafusion and I think we can
   > merge it and keep iterating on master.
   >
   > I defer to your preference. Let me know what you want to do
   > ------------------------------
   >
   > In datafusion/physical-expr/src/array_expressions.rs
   > <https://github.com/apache/arrow-datafusion/pull/6940#discussion_r1264167397>
   > :
   >
   > > +                let s = s.strip_suffix(delimeter).unwrap().to_string();
   > +                res.push(Some(s));
   >
   > I am not quite sure if it is sure that delimiter is always present. If it
   > is not, this code will panic.
   >
   > Perhaps it could be something like the following to aavoid the unwrap?
   > ⬇️ Suggested change
   >
   > -                let s = s.strip_suffix(delimeter).unwrap().to_string();
   > -                res.push(Some(s));
   > +                let s = s.strip_suffix(delimeter).unwrap_or(s).to_string();
   > +                res.push(Some(s));
   >
   > ------------------------------
   >
   > In datafusion/physical-expr/src/array_expressions.rs
   > <https://github.com/apache/arrow-datafusion/pull/6940#discussion_r1264170324>
   > :
   >
   > > @@ -1195,21 +1195,56 @@ pub fn array_to_string(args: &[ArrayRef]) -> Result<ArrayRef> {
   >      }
   >
   >      let mut arg = String::from("");
   > -    let mut res = compute_array_to_string(
   > -        &mut arg,
   > -        arr.clone(),
   > -        delimeter.clone(),
   > -        null_string,
   > -        with_null_string,
   > -    )?
   > -    .clone();
   > -    match res.as_str() {
   > -        "" => Ok(Arc::new(StringArray::from(vec![Some(res)]))),
   > +    let mut res: Vec<Option<String>> = Vec::new();
   > +
   > +    match arr.data_type() {
   > +        DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _) => {
   > +            let list_array = arr.as_list::<i32>();
   >
   > I don't think this is going to work (will panic) for LargeList as it
   > should be i64 -- we would probably have to make a generic version of this
   > function.
   >
   > However, for now I think it looks reasonable and a step forward to me
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow-datafusion/pull/6940#pullrequestreview-1531032030>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/ADZCLRZDJ4OSGW3BQA6Y27TXQGXITANCNFSM6AAAAAA2IIUBZI>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6940: Column support for array_to_string

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#issuecomment-1637082612

   Thanks @izveigor  and @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] izveigor commented on a diff in pull request #6940: Column support for array_to_string

Posted by "izveigor (via GitHub)" <gi...@apache.org>.
izveigor commented on code in PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#discussion_r1264064741


##########
datafusion/core/tests/sqllogictests/test_files/array.slt:
##########
@@ -344,22 +344,22 @@ select array_prepend(100.1, column2), array_prepend('.', column3) from arrays;
 ## array_fill
 
 # array_fill scalar function #1
-query error DataFusion error: SQL error: ParserError\("Expected an SQL statement, found: caused"\)
-caused by
-Error during planning: Cannot automatically convert List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\) to List\(Field \{ name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: \{\} \}\)
+query ???
 select array_fill(11, make_array(1, 2, 3)), array_fill(3, make_array(2, 3)), array_fill(2, make_array(2));

Review Comment:
   `array_fill` will not be used in future versions (due to the inability to properly set the data type)
   It will be replaced with `array_repeat`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6940: Column support for array_to_string

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6940:
URL: https://github.com/apache/arrow-datafusion/pull/6940#discussion_r1264167397


##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -1195,21 +1195,56 @@ pub fn array_to_string(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 
     let mut arg = String::from("");
-    let mut res = compute_array_to_string(
-        &mut arg,
-        arr.clone(),
-        delimeter.clone(),
-        null_string,
-        with_null_string,
-    )?
-    .clone();
-    match res.as_str() {
-        "" => Ok(Arc::new(StringArray::from(vec![Some(res)]))),
+    let mut res: Vec<Option<String>> = Vec::new();
+
+    match arr.data_type() {
+        DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _) => {
+            let list_array = arr.as_list::<i32>();
+            for (arr, &delimeter) in list_array.iter().zip(delimeters.iter()) {
+                if let (Some(arr), Some(delimeter)) = (arr, delimeter) {
+                    arg = String::from("");
+                    let s = compute_array_to_string(
+                        &mut arg,
+                        arr,
+                        delimeter.to_string(),
+                        null_string.clone(),
+                        with_null_string,
+                    )?
+                    .clone();
+
+                    if let Some(s) = s.strip_suffix(delimeter) {
+                        res.push(Some(s.to_string()));
+                    } else {
+                        res.push(Some(s));
+                    }
+                } else {
+                    res.push(None);
+                }
+            }
+        }
         _ => {
-            res.truncate(res.len() - delimeter.len());
-            Ok(Arc::new(StringArray::from(vec![Some(res)])))
+            // delimeter length is 1
+            assert_eq!(delimeters.len(), 1);
+            let delimeter = delimeters[0].unwrap();
+            let s = compute_array_to_string(
+                &mut arg,
+                arr.clone(),
+                delimeter.to_string(),
+                null_string,
+                with_null_string,
+            )?
+            .clone();
+
+            if !s.is_empty() {
+                let s = s.strip_suffix(delimeter).unwrap().to_string();
+                res.push(Some(s));

Review Comment:
   I am not quite sure if it is sure that `delimiter` is always present. If it is not, this code will panic.
   
   Perhaps it could be something like the following to aavoid the unwrap?
   
   ```suggestion
                   let s = s.strip_suffix(delimeter).unwrap_or(s).to_string();
                   res.push(Some(s));
   ```



##########
datafusion/physical-expr/src/array_expressions.rs:
##########
@@ -1195,21 +1195,56 @@ pub fn array_to_string(args: &[ArrayRef]) -> Result<ArrayRef> {
     }
 
     let mut arg = String::from("");
-    let mut res = compute_array_to_string(
-        &mut arg,
-        arr.clone(),
-        delimeter.clone(),
-        null_string,
-        with_null_string,
-    )?
-    .clone();
-    match res.as_str() {
-        "" => Ok(Arc::new(StringArray::from(vec![Some(res)]))),
+    let mut res: Vec<Option<String>> = Vec::new();
+
+    match arr.data_type() {
+        DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _) => {
+            let list_array = arr.as_list::<i32>();

Review Comment:
   I don't think this is going to work (will `panic`) for `LargeList` as it should be `i64` -- we would probably have to make a generic version of this function. 
   
   However, for now I think it looks reasonable and a step forward to me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org