You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/02/06 12:50:52 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #3668: Use ArrayFormatter in Cast Kernel

tustvold opened a new pull request, #3668:
URL: https://github.com/apache/arrow-rs/pull/3668

   _Draft as builds on #3647_ 
   
   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   Using the ArrayFormatter is less code, more consistent, and is ~10% faster
   
   ```
   cast i64 to string 512  time:   [10.723 µs 10.728 µs 10.733 µs]
                           change: [-11.208% -11.131% -11.052%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   cast f32 to string 512  time:   [16.210 µs 16.222 µs 16.233 µs]
                           change: [-7.7917% -7.6542% -7.5083%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   ```
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668#discussion_r1100716690


##########
arrow-cast/src/cast.rs:
##########
@@ -155,13 +154,12 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
         (_, Boolean) => DataType::is_numeric(from_type) || from_type == &Utf8 || from_type == &LargeUtf8,
         (Boolean, _) => DataType::is_numeric(to_type) || to_type == &Utf8 || to_type == &LargeUtf8,
 
-        (Utf8, LargeUtf8) => true,

Review Comment:
   for anyone else reviewing, these case got folded down into the other Utf8 and LargeUtf cases



##########
arrow-cast/src/cast.rs:
##########
@@ -2171,172 +2111,25 @@ where
     from.unary_opt::<_, R>(num::cast::cast::<T::Native, R::Native>)
 }
 
-fn as_time_with_string_op<
-    A: ArrayAccessor<Item = T::Native>,
-    OffsetSize,
-    T: ArrowTemporalType,
-    F,
->(
-    iter: ArrayIter<A>,
-    mut builder: GenericStringBuilder<OffsetSize>,
-    op: F,
-) -> ArrayRef
-where
-    OffsetSize: OffsetSizeTrait,
-    F: Fn(NaiveDateTime) -> String,
-    i64: From<T::Native>,
-{
-    iter.into_iter().for_each(|value| {
-        if let Some(value) = value {
-            match as_datetime::<T>(<i64 as From<_>>::from(value)) {
-                Some(dt) => builder.append_value(op(dt)),
-                None => builder.append_null(),
+fn value_to_string<O: OffsetSizeTrait>(
+    array: &dyn Array,
+) -> Result<ArrayRef, ArrowError> {
+    let mut builder = GenericStringBuilder::<O>::new();
+    let options = FormatOptions::default();

Review Comment:
   eventually we could even pass in `FormatOptions` to the cast kernel (definitely not this PR)!



##########
arrow-cast/src/cast.rs:
##########
@@ -2171,172 +2111,25 @@ where
     from.unary_opt::<_, R>(num::cast::cast::<T::Native, R::Native>)
 }
 
-fn as_time_with_string_op<
-    A: ArrayAccessor<Item = T::Native>,
-    OffsetSize,
-    T: ArrowTemporalType,
-    F,
->(
-    iter: ArrayIter<A>,
-    mut builder: GenericStringBuilder<OffsetSize>,
-    op: F,
-) -> ArrayRef
-where
-    OffsetSize: OffsetSizeTrait,
-    F: Fn(NaiveDateTime) -> String,
-    i64: From<T::Native>,
-{
-    iter.into_iter().for_each(|value| {
-        if let Some(value) = value {
-            match as_datetime::<T>(<i64 as From<_>>::from(value)) {
-                Some(dt) => builder.append_value(op(dt)),
-                None => builder.append_null(),
+fn value_to_string<O: OffsetSizeTrait>(
+    array: &dyn Array,
+) -> Result<ArrayRef, ArrowError> {
+    let mut builder = GenericStringBuilder::<O>::new();
+    let options = FormatOptions::default();
+    let formatter = ArrayFormatter::try_new(array, &options)?;
+    let data = array.data();
+    for i in 0..data.len() {
+        match data.is_null(i) {
+            true => builder.append_null(),
+            false => {
+                formatter.value(i).write(&mut builder)?;
+                builder.append_value("");

Review Comment:
   ```suggestion
                   // tell the builder the row is finished
                   builder.append_value("");
   ```



##########
arrow-cast/src/cast.rs:
##########
@@ -182,11 +181,8 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
             | Time64(TimeUnit::Nanosecond)
             | Timestamp(TimeUnit::Nanosecond, None)
         ) => true,
-        (LargeUtf8, _) => DataType::is_numeric(to_type) && to_type != &Float16,
-        (Timestamp(_, _), Utf8) | (Timestamp(_, _), LargeUtf8) => true,

Review Comment:
   What is the rationale for this change? Is the Timestamp nanosecond with no timezone seems to be covered by the cases above, but timestamp with other units and Non null timezone is not (obviously) covered



##########
arrow-cast/src/cast.rs:
##########
@@ -5521,8 +5237,8 @@ mod tests {
         let b = cast(&array, &DataType::Utf8).unwrap();
         let c = b.as_any().downcast_ref::<StringArray>().unwrap();
         assert_eq!(&DataType::Utf8, c.data_type());
-        assert_eq!("1997-05-19 00:00:00", c.value(0));
-        assert_eq!("2018-12-25 00:00:00", c.value(1));
+        assert_eq!("1997-05-19T00:00:00", c.value(0));

Review Comment:
   Why did these change? This seems like a non trivial change, potentially, for downstream crates, right? Is there some way to get the old behavior back (like maybe passing FormatOptions to the cast kernel) 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668#discussion_r1100729748


##########
arrow-cast/src/cast.rs:
##########
@@ -182,11 +181,8 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
             | Time64(TimeUnit::Nanosecond)
             | Timestamp(TimeUnit::Nanosecond, None)
         ) => true,
-        (LargeUtf8, _) => DataType::is_numeric(to_type) && to_type != &Float16,
-        (Timestamp(_, _), Utf8) | (Timestamp(_, _), LargeUtf8) => true,

Review Comment:
   They're rolled into
   
   ```
           (_, Utf8 | LargeUtf8) => from_type.is_primitive(),
   ```
   
   A few lines down



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668#discussion_r1100728551


##########
arrow-cast/src/cast.rs:
##########
@@ -5521,8 +5237,8 @@ mod tests {
         let b = cast(&array, &DataType::Utf8).unwrap();
         let c = b.as_any().downcast_ref::<StringArray>().unwrap();
         assert_eq!(&DataType::Utf8, c.data_type());
-        assert_eq!("1997-05-19 00:00:00", c.value(0));
-        assert_eq!("2018-12-25 00:00:00", c.value(1));
+        assert_eq!("1997-05-19T00:00:00", c.value(0));

Review Comment:
   They change to be RFC3339, given we've made the same change to CSV, JSON and pretty output and this is the last holdout I figured I'd just change it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668#discussion_r1100506255


##########
arrow-cast/src/cast.rs:
##########
@@ -3457,86 +3247,12 @@ where
     }
 }
 
-/// Helper function to cast from `GenericBinaryArray` to `GenericStringArray`. This function performs
-/// UTF8 validation during casting. For invalid UTF8 value, it could be Null or returning `Err` depending
-/// `CastOptions`.
-fn cast_binary_to_generic_string<I, O>(

Review Comment:
   I implemented this by converting the offsets and data separately, the performance is the same



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #3668: Use ArrayFormatter in Cast Kernel

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #3668:
URL: https://github.com/apache/arrow-rs/pull/3668#issuecomment-1424197255

   Benchmark runs are scheduled for baseline = a3b344de39dd5652f1216b0497e15ca263b7d648 and contender = 0b8c003d03deb3590fcce560effbbc1534be826a. 0b8c003d03deb3590fcce560effbbc1534be826a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/76c65eb066694a01931fbe5c75620885...cd58c957dd484b9a92057487f17a666d/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5f97e50e1ec24b079650eed8832c3b22...402cdd096bec432e92cf5e003c228168/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/91b5f99977a1409c8fb94fe3ac3db1ce...f20d0fc110784881baf8e139e1aba6f2/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/35ca7ca45ebe423d9c12c937d6fbdd5f...9d10d102ebb24e1fb5789112c822ca12/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org