You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "spebern (via GitHub)" <gi...@apache.org> on 2023/03/13 20:12:07 UTC

[GitHub] [arrow-rs] spebern opened a new pull request, #3854: Improve decimal parsing performance

spebern opened a new pull request, #3854:
URL: https://github.com/apache/arrow-rs/pull/3854

   # Which issue does this PR close?
   
   -
   
   # Rationale for this change
    
   Improve decimal parsing performance
   
   # What changes are included in this PR?
   
   # Are there any user-facing changes?
   
   -
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on pull request #3854: Improve decimal parsing performance

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1474114459

   > I think if we can avoid a dependency, especially one that doesn't have a clear maintenance story (it is a personal crate), that is better. Let me see what it looks like, I think it should be straightforward.
   
   Related https://github.com/apache/arrow-rs/pull/3884


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1474025505

   > Does it make sense to have a look at ethnum as a replacement
   
   I think if we can avoid a dependency, especially one that doesn't have a clear maintenance story (it is a personal crate), that is better. Let me see what it looks like, I think it should be straightforward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on pull request #3854: Improve decimal parsing performance

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1476833116

   @spebern  would it be possible to update this PR to use the operations added in https://github.com/apache/arrow-rs/pull/3884 rather than a new crate?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] spebern commented on pull request #3854: Improve decimal parsing performance

Posted by "spebern (via GitHub)" <gi...@apache.org>.

spebern commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1474022115

   Does it make sense to have a look at `ethnum` as a replacement? `arrow2`'s `i256` is a simple wrapper around it: https://github.com/jorgecarleitao/arrow2/blob/db87f71f3f673b240de80d5a7592d45c0beae805/src/types/native.rs#LL519-L519C35
   
   Would probably save some work, since all the ops from the rust std lib seem to be implemented there.
   
   I will have a look at `lexical` support this weekend, not quite sure for now how much is missing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1467053149

   Thank you for this, will review fully tomorrow. Just some ideas that might make this even faster
   
   * Handle the sign, if any, outside the loop
   * Split the string on a decimal point, if any, so the two parts can be processed using the optimised lexical_core integer parsing routines


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1474014542

   > basic integer operations such as bit shifting
   
   These should be relatively straightforward to implement, I will see what I can do


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] spebern commented on pull request #3854: Improve decimal parsing performance

Posted by "spebern (via GitHub)" <gi...@apache.org>.

spebern commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1473989094

   I had a look at what @tustvold suggested and in some cases we'd get another ~40% of performance. However, I only tested this for the 128 bit types, because `lexical` doesn't support 256 bit. I haven't fully grasped yet how to implement `lexical` support for the 256 bit types here yet, but basic things such as shift operations are missing, so it will be a little more work.
   
   What do you think of replacing the usage of `BigInt` by https://docs.rs/ethnum/latest/ethnum/ ? `ethnum` is also used by arrow2 and has a more complete implementation when it comes to basic integer operations such as bit shifting.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold merged pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold merged PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] spebern commented on pull request #3854: Improve decimal parsing performance

Posted by "spebern (via GitHub)" <gi...@apache.org>.

spebern commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1473979567

   ```
      Compiling arrow-cast v35.0.0 (/home/ben/workspace/arrow-rs/arrow-cast)
       Finished bench [optimized] target(s) in 12.64s
        Running benches/parse_decimal.rs (/home/ben/workspace/arrow-rs/target/release/deps/parse_decimal-ed9b69f81fe2bf18)
   123.123                 time:   [104.52 ns 104.72 ns 104.95 ns]
                           change: [-52.561% -52.471% -52.383%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   123.1234                time:   [104.07 ns 104.20 ns 104.36 ns]
                           change: [-53.401% -53.291% -53.181%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     7 (7.00%) high mild
     3 (3.00%) high severe
   
   123.1                   time:   [98.928 ns 99.199 ns 99.470 ns]
                           change: [-45.234% -45.059% -44.861%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   123                     time:   [91.060 ns 91.155 ns 91.230 ns]
                           change: [-39.558% -39.512% -39.470%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low severe
     3 (3.00%) low mild
     2 (2.00%) high mild
   
   -123.123                time:   [105.71 ns 105.81 ns 105.91 ns]
                           change: [-52.052% -51.944% -51.796%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   -123.1234               time:   [106.38 ns 106.46 ns 106.53 ns]
                           change: [-51.522% -51.477% -51.428%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   -123.1                  time:   [101.36 ns 101.52 ns 101.70 ns]
                           change: [-44.430% -44.366% -44.301%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     2 (2.00%) low severe
     14 (14.00%) low mild
     1 (1.00%) high mild
   
   -123                    time:   [93.470 ns 93.481 ns 93.493 ns]
                           change: [-38.446% -38.425% -38.404%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   0.0000123               time:   [76.150 ns 76.335 ns 76.565 ns]
                           change: [-50.658% -50.451% -50.165%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     2 (2.00%) low mild
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   12.                     time:   [75.427 ns 75.511 ns 75.606 ns]
                           change: [-38.211% -38.128% -38.050%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     7 (7.00%) high mild
     8 (8.00%) high severe
   
   -12.                    time:   [77.339 ns 77.507 ns 77.684 ns]
                           change: [-36.242% -36.182% -36.113%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 11 outliers among 100 measurements (11.00%)
     5 (5.00%) high mild
     6 (6.00%) high severe
   
   00.1                    time:   [84.054 ns 84.180 ns 84.291 ns]
                           change: [-42.095% -42.014% -41.947%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 23 outliers among 100 measurements (23.00%)
     12 (12.00%) low severe
     4 (4.00%) low mild
     7 (7.00%) high mild
   
   -00.1                   time:   [85.438 ns 85.652 ns 85.869 ns]
                           change: [-43.774% -43.677% -43.580%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 32 outliers among 100 measurements (32.00%)
     8 (8.00%) low severe
     5 (5.00%) low mild
     5 (5.00%) high mild
     14 (14.00%) high severe
   
   12345678912345678.1234  time:   [320.02 ns 320.41 ns 320.75 ns]
                           change: [-52.900% -52.848% -52.797%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 13 outliers among 100 measurements (13.00%)
     4 (4.00%) low severe
     8 (8.00%) low mild
     1 (1.00%) high mild
   
   -12345678912345678.1234 time:   [321.51 ns 321.92 ns 322.31 ns]
                           change: [-52.910% -52.852% -52.796%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) low mild
     1 (1.00%) high severe
   
   99999999999999999.999   time:   [321.10 ns 321.52 ns 321.86 ns]
                           change: [-53.848% -53.781% -53.718%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   -99999999999999999.999  time:   [321.45 ns 321.78 ns 322.10 ns]
                           change: [-54.293% -54.246% -54.201%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     7 (7.00%) low mild
     1 (1.00%) high mild
   
   .123                    time:   [59.960 ns 60.020 ns 60.086 ns]
                           change: [-50.726% -50.672% -50.624%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   -.123                   time:   [60.894 ns 60.979 ns 61.080 ns]
                           change: [-50.659% -50.555% -50.477%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     3 (3.00%) low severe
     2 (2.00%) low mild
     3 (3.00%) high mild
     9 (9.00%) high severe
   
   123.                    time:   [90.291 ns 90.504 ns 90.719 ns]
                           change: [-41.874% -41.739% -41.605%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   -123.                   time:   [92.033 ns 92.224 ns 92.414 ns]
                           change: [-40.311% -40.201% -40.085%] (p = 0.00 < 0.05)
                           Performance has improved.
   
        Running benches/parse_timestamp.rs (/home/ben/workspace/arrow-rs/target/release/deps/parse_timestamp-4e4733f613306eeb)
   
   running 0 tests
   
   test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] spebern commented on pull request #3854: Improve decimal parsing performance

Posted by "spebern (via GitHub)" <gi...@apache.org>.

spebern commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1476949887

   I had a closer look at `lexical` and to me, it seems that not enough of the
   functions are public to use the optimized methods.
   Sorry, I was a little too optimistic, that implementing a couple of
   trait methods might make it possible to use `lexical`'s underlying machinery.
   The macros that derive the impls are almost all private.
   
   I opened https://github.com/Alexhuszagh/rust-lexical/issues/93, because as
   it is right now, it doesn't seem to be feasible for me to implement it.
   
   For now, I see three options for continuing here:
   
   1. Merge as is
   
   2. Implement `FromLexical` for i256 without the actual optimizations from `lexical`
   
      Here some benchmark results for Decimal128 when using `lexical`. Note, that
      not everything is getting faster, but more digits definitely profit.
      
   ```
   123.123                 time:   [17.647 ns 17.713 ns 17.814 ns]
                           change: [+11.220% +11.737% +12.425%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high severe
   
   123.1234                time:   [17.927 ns 17.977 ns 18.048 ns]
                           change: [+8.2108% +8.5561% +8.8512%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high severe
   
   123.1                   time:   [23.494 ns 23.526 ns 23.560 ns]
                           change: [+26.116% +26.332% +26.549%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   123                     time:   [12.598 ns 12.611 ns 12.624 ns]
                           change: [-19.372% -19.212% -19.064%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     4 (4.00%) high mild
     2 (2.00%) high severe
   
   -123.123                time:   [17.898 ns 17.917 ns 17.936 ns]
                           change: [+2.3288% +2.4573% +2.6023%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   -123.1234               time:   [18.256 ns 18.277 ns 18.299 ns]
                           change: [+0.2064% +0.5108% +0.7597%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   -123.1                  time:   [23.471 ns 23.511 ns 23.552 ns]
                           change: [+16.213% +16.474% +16.731%] (p = 0.00 < 0.05)
                           Performance has regressed.
   
   -123                    time:   [13.065 ns 13.080 ns 13.095 ns]
                           change: [-26.644% -26.545% -26.438%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   0.0000123               time:   [14.956 ns 14.975 ns 14.994 ns]
                           change: [+4.5860% +4.7599% +4.9359%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   12.                     time:   [13.072 ns 13.090 ns 13.109 ns]
                           change: [-13.966% -13.509% -13.053%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   -12.                    time:   [13.506 ns 13.525 ns 13.545 ns]
                           change: [-14.013% -13.878% -13.746%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   00.1                    time:   [20.984 ns 21.023 ns 21.062 ns]
                           change: [+26.903% +27.158% +27.407%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   -00.1                   time:   [21.674 ns 21.692 ns 21.712 ns]
                           change: [+21.888% +22.153% +22.414%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low severe
     4 (4.00%) high mild
   
   12345678912345678.1234  time:   [25.908 ns 25.932 ns 25.957 ns]
                           change: [-42.285% -42.218% -42.149%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 15 outliers among 100 measurements (15.00%)
     3 (3.00%) low mild
     5 (5.00%) high mild
     7 (7.00%) high severe
   
   -12345678912345678.1234 time:   [25.995 ns 26.030 ns 26.064 ns]
                           change: [-44.608% -44.413% -44.258%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   99999999999999999.999   time:   [24.779 ns 24.802 ns 24.828 ns]
                           change: [-44.160% -44.036% -43.927%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     8 (8.00%) high mild
     1 (1.00%) high severe
   
   -99999999999999999.999  time:   [21.675 ns 21.707 ns 21.741 ns]
                           change: [-52.810% -52.745% -52.682%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   .123                    time:   [13.940 ns 13.956 ns 13.973 ns]
                           change: [+19.897% +20.075% +20.245%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   -.123                   time:   [14.584 ns 14.598 ns 14.613 ns]
                           change: [+16.487% +16.619% +16.761%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
   
   123.                    time:   [13.525 ns 13.546 ns 13.569 ns]
                           change: [-17.893% -17.772% -17.644%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 1 outliers among 100 measurements (1.00%)
     1 (1.00%) high mild
   
   -123.                   time:   [14.290 ns 14.319 ns 14.346 ns]
                           change: [-17.726% -17.552% -17.384%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     6 (6.00%) low mild
     1 (1.00%) high mild
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on code in PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#discussion_r1143176177


##########
arrow-cast/src/parse.rs:
##########
@@ -613,100 +613,93 @@ pub fn parse_decimal<T: DecimalType>(
     precision: u8,
     scale: i8,
 ) -> Result<T::Native, ArrowError> {
-    if !is_valid_decimal(s) {
-        return Err(ArrowError::ParseError(format!(
-            "can't parse the string value {s} to decimal"
-        )));
-    }
-    let mut offset = s.len();
-    let len = s.len();
-    let mut base = T::Native::usize_as(1);
-    let scale_usize = usize::from(scale as u8);
-
-    // handle the value after the '.' and meet the scale
-    let delimiter_position = s.find('.');
-    match delimiter_position {
-        None => {
-            // there is no '.'
-            base = T::Native::usize_as(10).pow_checked(scale as u32)?;
-        }
-        Some(mid) => {
-            // there is the '.'
-            if len - mid >= scale_usize + 1 {
-                // If the string value is "123.12345" and the scale is 2, we should just remain '.12' and drop the '345' value.
-                offset -= len - mid - 1 - scale_usize;
-            } else {
-                // If the string value is "123.12" and the scale is 4, we should append '00' to the tail.
-                base = T::Native::usize_as(10)
-                    .pow_checked((scale_usize + 1 + mid - len) as u32)?;
-            }
-        }
-    };
-
-    // each byte is digit、'-' or '.'
-    let bytes = s.as_bytes();
-    let mut negative = false;
-    let mut result = T::Native::usize_as(0);
-
-    bytes[0..offset]
-        .iter()
-        .rev()
-        .try_for_each::<_, Result<(), ArrowError>>(|&byte| {
-            match byte {
-                b'-' => {
-                    negative = true;
-                }
-                b'0'..=b'9' => {
-                    let add =
-                        T::Native::usize_as((byte - b'0') as usize).mul_checked(base)?;
-                    result = result.add_checked(add)?;
-                    base = base.mul_checked(T::Native::usize_as(10))?;
-                }
-                // because we have checked the string value
-                _ => (),
-            }
-            Ok(())
-        })?;
-
-    if negative {
-        result = result.neg_checked()?;
-    }
-
-    match T::validate_decimal_precision(result, precision) {
-        Ok(_) => Ok(result),
-        Err(e) => Err(ArrowError::ParseError(format!(
-            "parse decimal overflow: {e}"
-        ))),
-    }
-}
-
-fn is_valid_decimal(s: &str) -> bool {
     let mut seen_dot = false;
-    let mut seen_digit = false;
     let mut seen_sign = false;
+    let mut negative = false;
 
-    for c in s.as_bytes() {
-        match c {
-            b'-' | b'+' => {
-                if seen_digit || seen_dot || seen_sign {
-                    return false;
+    let mut result = T::Native::usize_as(0);
+    let mut fractionals = 0;
+    let mut digits = 0;
+    let base = T::Native::usize_as(10);
+    let mut bs = s.as_bytes().iter();
+    while let Some(b) = bs.next() {
+        match b {
+            b'0'..=b'9' => {
+                if seen_dot {
+                    if fractionals == scale {
+                        // We have processed and validated the whole part of our decimal (including sign and dot).
+                        // All that is left is to validate the fractional part.
+                        if bs.any(|b| !b.is_ascii_digit()) {
+                            return Err(ArrowError::ParseError(format!(
+                                "can't parse the string value {s} to decimal"
+                            )));
+                        }
+                        break;
+                    }
+                    fractionals += 1;
                 }
-                seen_sign = true;
+                digits += 1;
+                if digits > precision {
+                    return Err(ArrowError::ParseError(
+                        "parse decimal overflow".to_string(),
+                    ));
+                }
+                result = result.mul_checked(base)?;

Review Comment:
   It occurs to me that the precision check above should mean this can't overflow



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] spebern commented on pull request #3854: Improve decimal parsing performance

Posted by "spebern (via GitHub)" <gi...@apache.org>.

spebern commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1467517027

   Thanks for the suggestions. I will have a look the next few days!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3854: Improve decimal parsing performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on code in PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#discussion_r1143176177


##########
arrow-cast/src/parse.rs:
##########
@@ -613,100 +613,93 @@ pub fn parse_decimal<T: DecimalType>(
     precision: u8,
     scale: i8,
 ) -> Result<T::Native, ArrowError> {
-    if !is_valid_decimal(s) {
-        return Err(ArrowError::ParseError(format!(
-            "can't parse the string value {s} to decimal"
-        )));
-    }
-    let mut offset = s.len();
-    let len = s.len();
-    let mut base = T::Native::usize_as(1);
-    let scale_usize = usize::from(scale as u8);
-
-    // handle the value after the '.' and meet the scale
-    let delimiter_position = s.find('.');
-    match delimiter_position {
-        None => {
-            // there is no '.'
-            base = T::Native::usize_as(10).pow_checked(scale as u32)?;
-        }
-        Some(mid) => {
-            // there is the '.'
-            if len - mid >= scale_usize + 1 {
-                // If the string value is "123.12345" and the scale is 2, we should just remain '.12' and drop the '345' value.
-                offset -= len - mid - 1 - scale_usize;
-            } else {
-                // If the string value is "123.12" and the scale is 4, we should append '00' to the tail.
-                base = T::Native::usize_as(10)
-                    .pow_checked((scale_usize + 1 + mid - len) as u32)?;
-            }
-        }
-    };
-
-    // each byte is digit、'-' or '.'
-    let bytes = s.as_bytes();
-    let mut negative = false;
-    let mut result = T::Native::usize_as(0);
-
-    bytes[0..offset]
-        .iter()
-        .rev()
-        .try_for_each::<_, Result<(), ArrowError>>(|&byte| {
-            match byte {
-                b'-' => {
-                    negative = true;
-                }
-                b'0'..=b'9' => {
-                    let add =
-                        T::Native::usize_as((byte - b'0') as usize).mul_checked(base)?;
-                    result = result.add_checked(add)?;
-                    base = base.mul_checked(T::Native::usize_as(10))?;
-                }
-                // because we have checked the string value
-                _ => (),
-            }
-            Ok(())
-        })?;
-
-    if negative {
-        result = result.neg_checked()?;
-    }
-
-    match T::validate_decimal_precision(result, precision) {
-        Ok(_) => Ok(result),
-        Err(e) => Err(ArrowError::ParseError(format!(
-            "parse decimal overflow: {e}"
-        ))),
-    }
-}
-
-fn is_valid_decimal(s: &str) -> bool {
     let mut seen_dot = false;
-    let mut seen_digit = false;
     let mut seen_sign = false;
+    let mut negative = false;
 
-    for c in s.as_bytes() {
-        match c {
-            b'-' | b'+' => {
-                if seen_digit || seen_dot || seen_sign {
-                    return false;
+    let mut result = T::Native::usize_as(0);
+    let mut fractionals = 0;
+    let mut digits = 0;
+    let base = T::Native::usize_as(10);
+    let mut bs = s.as_bytes().iter();
+    while let Some(b) = bs.next() {
+        match b {
+            b'0'..=b'9' => {
+                if seen_dot {
+                    if fractionals == scale {
+                        // We have processed and validated the whole part of our decimal (including sign and dot).
+                        // All that is left is to validate the fractional part.
+                        if bs.any(|b| !b.is_ascii_digit()) {
+                            return Err(ArrowError::ParseError(format!(
+                                "can't parse the string value {s} to decimal"
+                            )));
+                        }
+                        break;
+                    }
+                    fractionals += 1;
                 }
-                seen_sign = true;
+                digits += 1;
+                if digits > precision {
+                    return Err(ArrowError::ParseError(
+                        "parse decimal overflow".to_string(),
+                    ));
+                }
+                result = result.mul_checked(base)?;

Review Comment:
   It occurs to me that the precision check above should mean neither of these can overflow



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on pull request #3854: Improve decimal parsing performance

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #3854:
URL: https://github.com/apache/arrow-rs/pull/3854#issuecomment-1473912363

   Thank you @spebern  -- this looks quite exciting. Do you have any benchmarks you can share for this branch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org