You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/08 08:38:11 UTC

[GitHub] [arrow-rs] liukun4515 opened a new pull request, #2360: [WIP]speed up the decimal validation and add benchmark test

liukun4515 opened a new pull request, #2360:
URL: https://github.com/apache/arrow-rs/pull/2360

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes https://github.com/apache/arrow-rs/issues/2320
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1212752633

   ```
   [05:20:36] 'compile:es2015:umd' errored after 36 s
   [05:20:36] Error: gulp-google-closure-compiler: java.util.zip.ZipException: invalid entry CRC (expected 0x4e1f14a4 but got 0xb1e0eb5b)
   	at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:410)
   	at java.util.zip.ZipInputStream.read(ZipInputStream.java:199)
   	at java.util.zip.ZipInputStream.closeEntry(ZipInputStream.java:143)
   	at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:121)
   	at com.google.javascript.jscomp.AbstractCommandLineRunner.getBuiltinExterns(AbstractCommandLineRunner.java:500)
   	at com.google.javascript.jscomp.CommandLineRunner.createExterns(CommandLineRunner.java:2084)
   	at com.google.javascript.jscomp.AbstractCommandLineRunner.doRun(AbstractCommandLineRunner.java:1187)
   	at com.google.javascript.jscomp.AbstractCommandLineRunner.run(AbstractCommandLineRunner.java:551)
   	at com.google.javascript.jscomp.CommandLineRunner.main(CommandLineRunner.java:2246)
   Error writing to stdin of the compiler. write EPIPE
   
   CustomError: gulp-google-closure-compiler: Compilation errors occurred
       at CompilationStream._compilationComplete (/arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:238:28)
       at /arrow/js/node_modules/google-closure-compiler/lib/gulp/index.js:208:14
       at processTicksAndRejections (node:internal/process/task_queues:96:5)
   
       at formatError (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/format-error.js:21:10)
       at Gulp.<anonymous> (/arrow/js/node_modules/gulp-cli/lib/versioned/^4.0.0/log/events.js:33:15)
       at Gulp.emit (node:events:538:35)
       at Gulp.emit (node:domain:475:12)
       at Object.error (/arrow/js/node_modules/undertaker/lib/helpers/createExtensions.js:61:10)
       at handler (/arrow/js/node_modules/now-and-later/lib/mapSeries.js:47:14)
       at f (/arrow/js/node_modules/once/once.js:25:25)
       at f (/arrow/js/node_modules/once/once.js:25:25)
       at tryCatch (/arrow/js/node_modules/bach/node_modules/async-done/index.js:24:15)
       at done (/arrow/js/node_modules/bach/node_modules/async-done/index.js:40:12)
   [05:20:36] 'build:es2015:umd' errored after 2.35 min
   [05:20:36] 'build:apache-arrow' errored after 2.35 min
   [05:20:36] 'build' errored after 2.37 min
   error Command failed with exit code 1.
   info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
   1
   Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.
   ```
   The it failed, but nothing about this pr.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r941303072


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -263,6 +265,706 @@ impl fmt::Display for DataType {
     }
 }
 
+// Max decimal128 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    9_i128.to_le_bytes(),
+    99_i128.to_le_bytes(),
+    999_i128.to_le_bytes(),
+    9999_i128.to_le_bytes(),
+    99999_i128.to_le_bytes(),
+    999999_i128.to_le_bytes(),
+    9999999_i128.to_le_bytes(),
+    99999999_i128.to_le_bytes(),
+    999999999_i128.to_le_bytes(),
+    9999999999_i128.to_le_bytes(),
+    99999999999_i128.to_le_bytes(),
+    999999999999_i128.to_le_bytes(),
+    9999999999999_i128.to_le_bytes(),
+    99999999999999_i128.to_le_bytes(),
+    999999999999999_i128.to_le_bytes(),
+    9999999999999999_i128.to_le_bytes(),
+    99999999999999999_i128.to_le_bytes(),
+    999999999999999999_i128.to_le_bytes(),
+    9999999999999999999_i128.to_le_bytes(),
+    99999999999999999999_i128.to_le_bytes(),
+    999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999999_i128.to_le_bytes(),
+];
+
+// Min decimal128 value of little-endian format for each precision.
+pub(crate) const MIN_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    (-9_i128).to_le_bytes(),
+    (-99_i128).to_le_bytes(),
+    (-999_i128).to_le_bytes(),
+    (-9999_i128).to_le_bytes(),
+    (-99999_i128).to_le_bytes(),
+    (-999999_i128).to_le_bytes(),
+    (-9999999_i128).to_le_bytes(),
+    (-99999999_i128).to_le_bytes(),
+    (-999999999_i128).to_le_bytes(),
+    (-9999999999_i128).to_le_bytes(),
+    (-99999999999_i128).to_le_bytes(),
+    (-999999999999_i128).to_le_bytes(),
+    (-9999999999999_i128).to_le_bytes(),
+    (-99999999999999_i128).to_le_bytes(),
+    (-999999999999999_i128).to_le_bytes(),
+    (-9999999999999999_i128).to_le_bytes(),
+    (-99999999999999999_i128).to_le_bytes(),
+    (-999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999999_i128).to_le_bytes(),
+];
+
+// MAX decimal256 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_LARGER_EACH_PRECISION: [[u8; 32]; 76] = [

Review Comment:
   Could you please add more docs, as there are lots of magical values ?
   
   Besides, is there any way to check these numbers are right?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940118590


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   > yes, i think the overhead is from the xxDecimal::new() and convert to i128.
   
   LLVM should be smart enough to elide this, it likely isn't because there are bound checks on the byte array length (which are completely unnecessary) and it isn't smart enough to elide these, which prevents eliding the rest.
   
   It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940113059


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   yes, i think the overhead is from the `xxDecimal::new()` and convert to `i128`.
   
   Using this pr method, we can improve the performance from reducing above overhead.
   
   In current `iter` or `index ` decimal array, the result of value is Decimal128 and Decimal256.
   In many case, we don't need the struct of decimal, and just need the little-endian bytes.
   
   Can we add more method to `index` or `iter` decimal array, and the result type of value is `&[u8]`?
   Do you mean this?
   @tustvold 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940096750


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   Unless I'm missing something, the performance benefit from this is derived from the additional work being performed by the iterator. In particular `BasicDecimal::new`. Perhaps we should fix this, as this will have benefits all over the codebase? For one thing `BasicDecimal` could easily take a slice with a known size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r941927655


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -263,6 +265,706 @@ impl fmt::Display for DataType {
     }
 }
 
+// Max decimal128 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    9_i128.to_le_bytes(),
+    99_i128.to_le_bytes(),
+    999_i128.to_le_bytes(),
+    9999_i128.to_le_bytes(),
+    99999_i128.to_le_bytes(),
+    999999_i128.to_le_bytes(),
+    9999999_i128.to_le_bytes(),
+    99999999_i128.to_le_bytes(),
+    999999999_i128.to_le_bytes(),
+    9999999999_i128.to_le_bytes(),
+    99999999999_i128.to_le_bytes(),
+    999999999999_i128.to_le_bytes(),
+    9999999999999_i128.to_le_bytes(),
+    99999999999999_i128.to_le_bytes(),
+    999999999999999_i128.to_le_bytes(),
+    9999999999999999_i128.to_le_bytes(),
+    99999999999999999_i128.to_le_bytes(),
+    999999999999999999_i128.to_le_bytes(),
+    9999999999999999999_i128.to_le_bytes(),
+    99999999999999999999_i128.to_le_bytes(),
+    999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999999_i128.to_le_bytes(),
+];
+
+// Min decimal128 value of little-endian format for each precision.
+pub(crate) const MIN_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    (-9_i128).to_le_bytes(),
+    (-99_i128).to_le_bytes(),
+    (-999_i128).to_le_bytes(),
+    (-9999_i128).to_le_bytes(),
+    (-99999_i128).to_le_bytes(),
+    (-999999_i128).to_le_bytes(),
+    (-9999999_i128).to_le_bytes(),
+    (-99999999_i128).to_le_bytes(),
+    (-999999999_i128).to_le_bytes(),
+    (-9999999999_i128).to_le_bytes(),
+    (-99999999999_i128).to_le_bytes(),
+    (-999999999999_i128).to_le_bytes(),
+    (-9999999999999_i128).to_le_bytes(),
+    (-99999999999999_i128).to_le_bytes(),
+    (-999999999999999_i128).to_le_bytes(),
+    (-9999999999999999_i128).to_le_bytes(),
+    (-99999999999999999_i128).to_le_bytes(),
+    (-999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999999_i128).to_le_bytes(),
+];
+
+// MAX decimal256 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_LARGER_EACH_PRECISION: [[u8; 32]; 76] = [

Review Comment:
   From your suggestion, I will add test to check the const table with the raw format.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #2360: speed up the decimal256 validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1211510499

   >I find some wired things about decimal128/i128, So I remove the refactor of decimal128 and just left decimal256. PTAL
   
   What's weird thing did you meet? @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944062951


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {

Review Comment:
   `value_unchecked` also use it.
   Maybe you can extract them with follow-up pr



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944075821


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   Could you please try 
   ```rust
     (0..self.len())
         .try_for_each(|idx| {
               if self.is_valid(idx) { ... }
   }
   ```
   ?
   I am afraid the filter function introduce some overhead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r943962006


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -479,57 +1015,38 @@ pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Resul
     }
 }
 
-/// Validates that the specified string value can be properly
-/// interpreted as a Decimal256 number with precision `precision`
+/// Validates that the specified `byte_array` of little-endian format
+/// value can be properly interpreted as a Decimal256 number with precision `precision`
 #[inline]
-pub(crate) fn validate_decimal256_precision(
-    value: &BigInt,
+pub(crate) fn validate_decimal256_precision_with_lt_bytes(
+    lt_value: &[u8],
     precision: usize,
-) -> Result<&BigInt> {
+) -> Result<()> {
     if precision > DECIMAL256_MAX_PRECISION {
         return Err(ArrowError::InvalidArgumentError(format!(
-            "Max precision of a Decimal256 is {}, but got {}",
+            "Max precision of a Decima256 is {}, but got {}",

Review Comment:
   Original is correct, why change it?
   
   ```suggestion
               "Max precision of a Decimal256 is {}, but got {}",
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1213297941

   Thanks everyone for getting this through. Good team work 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944087109


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   I think this change make sense.
   Will try it and post the benchmark result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1210052129

   > > refactor the validation for decimal256 @viirya
   > > almost 50x faster than before. You can get the benchmark code in the `decimal_validate.rs` file.
   > 
   > Yea, I think for 256-bit decimal, current value validation is somehow inefficient. Comparing directly on bytes should be more efficient on 256-bit decimal. For 128-bit, I think that it won't be too much different than i128 comparison, maybe some marginal number.
   
   agree with your thought about 128-bit, but you miss a point.
   When we validate the decimal128array with 128-bit/i128, we should convert the [u8;16] to i128 first with the struct`decimal128`. This will cause some unnecessary overhea.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r942124548


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -941,3 +1457,36 @@ impl DataType {
         }
     }
 }
+
+#[cfg(test)]
+mod test {

Review Comment:
   test for the magic table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1207981500

   I guess that we can get more improvement for decimal256array type


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940121711


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   > > yes, i think the overhead is from the xxDecimal::new() and convert to i128.
   > 
   > LLVM should be smart enough to elide this, it likely isn't because there are bound checks on the byte array length (which are completely unnecessary) and it isn't smart enough to elide these, which prevents eliding the rest.
   > 
   > It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator
   
   Do you have some suggestions or plan for that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940329929


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   If you don't mind, I would like to take some time at some point in the next couple of days to take a holistic look at what is going on with decimals, and then get back to you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1209180343

   refactor the validation for decimal256 @viirya 
   ```
   validate_decimal256_array_slow 2000
                           time:   [160.63 ms 161.90 ms 163.14 ms]
                           change: [-27.571% -25.268% -23.024%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   validate_decimal256_array_fast 2000                                                                             ^@
                           time:   [3.0104 ms 3.0427 ms 3.0766 ms]
   ```
   
   You can get the benchmark code in the `decimal_validate.rs` file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: speed up the decimal256 validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1210323616

   > For 128-bit, I think that it won't be too much different than i128 comparison, maybe some marginal number.
   
   @HaoYang670 @viirya  I find some wired things about decimal128/i128, So I remove the refactor of decimal128 and just left decimal256.
   PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1212703837

   If the changes looks good to you, please approve it. @HaoYang670 
   After the ci is passed, I will merge it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 merged pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 merged PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944061681


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   > Personally, I prefer this function to be an independent `fn`, but not a method of `Decimal256Array`, because
   > 
   
   If this is method is only for `Decimal256Array`, why we should move it out of the `impl`? I am confused about this suggestion.
   Besides, It's a private method.
   If move it out of the `impl Decimal256Array`, how to get the `data` of the `Self`?
   
   > 1. DecimalArray has its own precision, providing another `precision` to its method is somewhat weird.
   
   `with_precision_and_scale` also provide the `precision` and `scale`, is this also weird?
   
   > 2. This method is just for `Decimal256Array`, but not the generic decimal array. Moving it out the `impl Decimal256Array` can make the code cleaner.
   
   I feel confused about this reason. The method is only used to `Decimal256Array`, why not treat it as private function? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r943961064


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -263,6 +265,626 @@ impl fmt::Display for DataType {
     }
 }
 
+// MAX decimal256 value of little-endian format for each precision.
+// Each element is the max value of signed 256-bit integer for the specified precision which
+// is encoded to the 76-byte width format of little-endian.

Review Comment:
   ```suggestion
   // is encoded to the 32-byte width format of little-endian.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: speed up the decimal256 validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1211577166

   > > I find some wired things about decimal128/i128, So I remove the refactor of decimal128 and just left decimal256. PTAL
   > 
   > What's weird thing did you meet? @liukun4515
   
   Diff sizes of Decimal128Array have diff benchmark result.
   For example, when the array size is 2000,  the performance of comparison using the `bytes array: [u8;16]` format is better than using  `i128` directly.
   But when the array size is 20000, the result if the opposite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1210093333

   > > > refactor the validation for decimal256 @viirya
   > > > almost 50x faster than before. You can get the benchmark code in the `decimal_validate.rs` file.
   > > 
   > > 
   > > Yea, I think for 256-bit decimal, current value validation is somehow inefficient. Comparing directly on bytes should be more efficient on 256-bit decimal. For 128-bit, I think that it won't be too much different than i128 comparison, maybe some marginal number.
   > 
   > agree with your thought about 128-bit, but you miss a point. When we validate the decimal128array with 128-bit/i128, we should convert the [u8;16] to i128 first with the struct`decimal128`. This will cause some unnecessary overhead.
   > 
   > @HaoYang670 @viirya I think this can explains why the validation of decimal128array got the performance improvement
   
   I add some benchmarks for above explanation.
   Got the result:
   ```
   validate_decimal128_bytes 2000
                           time:   [46.438 us 46.965 us 47.578 us]
   Found 7 outliers among 100 measurements (7.00%)
     5 (5.00%) high mild
     2 (2.00%) high severe
   
   validate_decimal128_i128 2000
                           time:   [69.251 us 69.591 us 69.959 us]
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     4 (4.00%) high mild
     1 (1.00%) high severe
   ```
   
   The performance using the bytes is better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r942051076


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -263,6 +265,706 @@ impl fmt::Display for DataType {
     }
 }
 
+// Max decimal128 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    9_i128.to_le_bytes(),
+    99_i128.to_le_bytes(),
+    999_i128.to_le_bytes(),
+    9999_i128.to_le_bytes(),
+    99999_i128.to_le_bytes(),
+    999999_i128.to_le_bytes(),
+    9999999_i128.to_le_bytes(),
+    99999999_i128.to_le_bytes(),
+    999999999_i128.to_le_bytes(),
+    9999999999_i128.to_le_bytes(),
+    99999999999_i128.to_le_bytes(),
+    999999999999_i128.to_le_bytes(),
+    9999999999999_i128.to_le_bytes(),
+    99999999999999_i128.to_le_bytes(),
+    999999999999999_i128.to_le_bytes(),
+    9999999999999999_i128.to_le_bytes(),
+    99999999999999999_i128.to_le_bytes(),
+    999999999999999999_i128.to_le_bytes(),
+    9999999999999999999_i128.to_le_bytes(),
+    99999999999999999999_i128.to_le_bytes(),
+    999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999_i128.to_le_bytes(),
+    999999999999999999999999999999999999_i128.to_le_bytes(),
+    9999999999999999999999999999999999999_i128.to_le_bytes(),
+    99999999999999999999999999999999999999_i128.to_le_bytes(),
+];
+
+// Min decimal128 value of little-endian format for each precision.
+pub(crate) const MIN_DECIMAL_BYTES_FOR_EACH_PRECISION: [[u8; 16]; 38] = [
+    (-9_i128).to_le_bytes(),
+    (-99_i128).to_le_bytes(),
+    (-999_i128).to_le_bytes(),
+    (-9999_i128).to_le_bytes(),
+    (-99999_i128).to_le_bytes(),
+    (-999999_i128).to_le_bytes(),
+    (-9999999_i128).to_le_bytes(),
+    (-99999999_i128).to_le_bytes(),
+    (-999999999_i128).to_le_bytes(),
+    (-9999999999_i128).to_le_bytes(),
+    (-99999999999_i128).to_le_bytes(),
+    (-999999999999_i128).to_le_bytes(),
+    (-9999999999999_i128).to_le_bytes(),
+    (-99999999999999_i128).to_le_bytes(),
+    (-999999999999999_i128).to_le_bytes(),
+    (-9999999999999999_i128).to_le_bytes(),
+    (-99999999999999999_i128).to_le_bytes(),
+    (-999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999_i128).to_le_bytes(),
+    (-999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-9999999999999999999999999999999999999_i128).to_le_bytes(),
+    (-99999999999999999999999999999999999999_i128).to_le_bytes(),
+];
+
+// MAX decimal256 value of little-endian format for each precision.
+pub(crate) const MAX_DECIMAL_BYTES_FOR_LARGER_EACH_PRECISION: [[u8; 32]; 76] = [

Review Comment:
   add the test `test_decimal256_min_max_for_precision` to verify the table  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940328294


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   > I can add it to my list of things to do this week
   
   Do you think this PR is on the right path, If so, I will going on this work.
   Refactor the decimal256array validation, and add iter bytes array from the decimal array.
   @tustvold 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1209354316

   As this is a large change and improvement, I expect a design doc (or add a simple example in the code) to explain your method to other users and developers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944061681


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   > Personally, I prefer this function to be an independent `fn`, but not a method of `Decimal256Array`, because
   > 
   
   If this is method is just used to `Decimal256Array`, why we should move it out of the `impl`? I am confused about this suggestion.
   Besides, It's a private method.
   If move it out of the `impl Decimal256Array`, how to get the `data` of the `Self`?
   
   > 1. DecimalArray has its own precision, providing another `precision` to its method is somewhat weird.
   
   `with_precision_and_scale` also provide the `precision` and `scale`, is this also weird?
   
   > 2. This method is just for `Decimal256Array`, but not the generic decimal array. Moving it out the `impl Decimal256Array` can make the code cleaner.
   
   I feel confused about this reason. The method is only used to `Decimal256Array`, why not we use it  private function? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944071411


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   Not apply the suggestion, because of the performance regression.
   The performance of your version:
   ```
   validate_decimal256_array 20000
                           time:   [393.73 us 402.59 us 412.64 us]
                           change: [+0.0864% +1.9579% +3.8640%] (p = 0.05 < 0.05)
                           Change within noise threshold.
   ```
   My version:
   ```
   validate_decimal256_array 20000
                           time:   [282.54 us 289.68 us 297.22 us]
                           change: [-29.976% -27.993% -25.897%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   I guess the reason is loop twice and create an intermediate Iter
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944074522


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   @HaoYang670 I think this is an interesting result.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: speed up the decimal256 validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r942188946


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   @tustvold I remove the changes for validation for decimal128 and just refactor the logic of validation decimal256.
   PTAL



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r941305601


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +341,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   Actually, I'd like more docs about how we do the validation in this function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940329626


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   @tustvold @alamb From the result of benchmark, I think this refactor is workable and reasonable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940628901


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   > It should definitely be possible to make these methods perform the same, it just might require a bit of hand-holding to help LLVM. This would then benefit all call-sites that use the iterator
   
   This sounds promising.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944054617


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   ```suggestion
           (0..self.len())
               .filter(|idx| self.is_valid(*idx))
               .try_for_each(|idx| {
                   let raw_val = unsafe {
                       let pos = self.value_offset(idx);
                       std::slice::from_raw_parts(
                           self.raw_value_data_ptr().offset(pos as isize),
                           Self::VALUE_LENGTH as usize,
                       )
                   };
                   validate_decimal256_precision_with_lt_bytes(raw_val, precision)
               })
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1212757578

   Benchmark runs are scheduled for baseline = e7dcfbc5e99e62f52c2fe1eed0934ab2abb6c447 and contender = b235173b210dd74307a005e96633bae50fd5798d. b235173b210dd74307a005e96633bae50fd5798d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1a736d836b1341789bea4e8898eccaa7...60211ba453cd47d6881d3fb81baeb082/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/0fa87866705746cd8f2d06da569f4b34...3a97542395e643b397c367d5864dd284/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/80d0a80664a144bdb67b558b1d418c70...97800a8991674097be37178ba0b76b46/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8958164fe14e424f98262fccb912582e...c0e7f3a9571a42e7939775b24e4714b5/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r939977289


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -479,6 +563,76 @@ pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Resul
     }
 }
 
+// duplicate code

Review Comment:
   copy from decimal.rs file



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940096750


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   Unless I'm missing something, the performance benefit from this is derived from the additional work being performed by the iterator. In particular `BasicDecimal::new`. Perhaps we should fix this, as this will have benefits all over the codebase?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1207851116

   I just add benchmark for validation for decimal128array. Got the result below:
   ```
   validate_decimal128_array_slow 20000
                           time:   [125.08 us 125.62 us 126.22 us]
                           change: [-7.0125% -6.4283% -5.8267%] (p = 0.00 < 0.05)
                           Performance has improved.
   
   validate_decimal128_array_fast 20000
                           time:   [107.54 us 107.72 us 107.88 us]
                           change: [-13.320% -13.002% -12.692%] (p = 0.00 < 0.05)
                           Performance has improved.
   ```
   From the current code, there is a 10~20% performance improvement.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1210044915

   > Yea, I think for 256-bit decimal, current value validation is somehow inefficient. Comparing directly on bytes should be more efficient on 256-bit decimal. For 128-bit, I think that it won't be too much different than i128 comparison, maybe some marginal number.
   
   Another benefit we could get from this is that we can make the fn `validate_decimal_precision` more general. 
   Currently, I put this function in `Decimal128Array` and `Decimal256Array` but not ``DecimalArray<BYTE_WIDTH>` because they use different `validate_decimal_precision` functions and have different APIs: https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array_decimal.rs#L276-L283


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1209616992

   > refactor the validation for decimal256 @viirya
   > almost 50x faster than before. You can get the benchmark code in the `decimal_validate.rs` file.
   
   Yea, I think for 256-bit decimal, current value validation is somehow inefficient. Comparing directly on bytes should be more efficient on 256-bit decimal. For 128-bit, I think that it won't be too much different than i128 comparison, maybe some marginal number.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: speed up the decimal256 validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r942160358


##########
arrow/benches/decimal_validate.rs:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#[macro_use]
+extern crate criterion;
+
+use arrow::array::{
+    Array, BasicDecimalArray, Decimal128Array, Decimal256Array, Decimal256Builder,
+};
+use arrow::datatypes::{validate_decimal_precision, DataType};
+use criterion::Criterion;
+
+extern crate arrow;
+
+use arrow::util::decimal::{BasicDecimal, Decimal128, Decimal256};
+
+fn validate_decimal128_array(array: Decimal128Array) {
+    array.with_precision_and_scale(35, 0).unwrap();
+}
+
+fn validate_decimal256_array(array: Decimal256Array) {
+    array.with_precision_and_scale(35, 0).unwrap();
+}
+
+fn validate_decimal128_benchmark(c: &mut Criterion) {
+    let decimal_array = Decimal128Array::from_iter_values(vec![12324; 20000]);
+    let data = decimal_array.into_data();
+    c.bench_function("validate_decimal128_array 2000", |b| {
+        b.iter(|| {
+            let array = Decimal128Array::from(data.clone());
+            validate_decimal128_array(array);
+        })
+    });
+}
+
+// fn validate_decimal256_benchmark(c: &mut Criterion) {

Review Comment:
   base on this pr: https://github.com/apache/arrow-rs/pull/2376



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: speed up the decimal256 validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r942193993


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -941,3 +1457,36 @@ impl DataType {
         }
     }
 }
+
+#[cfg(test)]
+mod test {

Review Comment:
   @HaoYang670 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1210054015

   There are some conflict, I will resolve the conflict.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r939977865


##########
arrow/src/datatypes/datatype.rs:
##########
@@ -479,6 +563,76 @@ pub(crate) fn validate_decimal_precision(value: i128, precision: usize) -> Resul
     }
 }
 
+// duplicate code
+#[inline]
+fn singed_cmp_le_bytes(left: &[u8], right: &[u8]) -> Ordering {
+    assert_eq!(
+        left.len(),
+        right.len(),
+        "Can't compare bytes array with different len: {}, {}",
+        left.len(),
+        right.len()
+    );
+    assert_ne!(left.len(), 0, "Can't compare bytes array of length 0");
+    let len = left.len();
+    // the sign bit is 1, the value is negative
+    let left_negative = left[len - 1] >= 0x80_u8;
+    let right_negative = right[len - 1] >= 0x80_u8;
+    if left_negative != right_negative {
+        return match left_negative {
+            true => {
+                // left is negative value
+                // right is positive value
+                Ordering::Less
+            }
+            false => Ordering::Greater,
+        };
+    }
+    for i in 0..len {
+        let l_byte = left[len - 1 - i];
+        let r_byte = right[len - 1 - i];
+        match l_byte.cmp(&r_byte) {
+            Ordering::Less => {
+                return Ordering::Less;
+            }
+            Ordering::Greater => {
+                return Ordering::Greater;
+            }
+            Ordering::Equal => {}
+        }
+    }
+    Ordering::Equal
+}
+
+pub(crate) fn validate_decimal_precision_with_bytes(lt_value: &[u8], precision: usize) -> Result<i128> {

Review Comment:
   will change to Result<()>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1207837113

   @alamb @viirya @tustvold  This is just a drat pr for bench my thoughts brought in https://github.com/apache/arrow-rs/issues/2320


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r940122823


##########
arrow/src/array/array_decimal.rs:
##########
@@ -336,6 +336,31 @@ impl BasicDecimalArray<Decimal128, Decimal128Array> for Decimal128Array {
         }
         Ok(())
     }
+
+    fn validate_decimal_with_bytes(&self, precision: usize) -> Result<()> {

Review Comment:
   I can add it to my list of things to do this week



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1208611061

   Hi @liukun4515  -- I don't fully understand this approach for validating Decimal values though I have not studied it as carefully as I would like to. Most other arrays have two paths:
   1. Create from known good data (used in Builders where each individual value was validated individually) -- `ArrayData::new_unchecked` typically
   2. Create from an array of arbitrary user input and then do a vectorized validation (ArrayData::validate)
   
   Since this PR seems to do something else a bit different I need to think about it some more, but I don't think I'll have time until later this week or this weekend to do so (I got caught up in https://github.com/apache/arrow-rs/pull/2369 for longer than I expected and now I am behind in some other areas)
   
   It sounds like @tustvold  has some thoughts about the "big picture" of Decimal in general so perhaps that will help
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1208631395

   Possibly also related: https://github.com/apache/arrow-rs/issues/2362


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on pull request #2360: [WIP]speed up the decimal validation and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#issuecomment-1209183075

   I will try to replace the validation base on the `i128` or `str` to `byte_array`. @alamb @viirya @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944046416


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   Personally, I prefer this function to be an independent `fn`, but not a method of `Decimal256Array`,
   because
   1. DecimalArray has its own precision, providing another `precision` to its method is somewhat weird.
   2. This method is just for `Decimal256Array`, but not the generic decimal array. Moving it out the `impl Decimal256Array` can make the code cleaner. 



##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   Do we need this check if `precision >= self.precision`?



##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {

Review Comment:
   This could be extracted as a separate method `DecimalArray::raw_value`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944061681


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   > Personally, I prefer this function to be an independent `fn`, but not a method of `Decimal256Array`, because
   > 
   
   If this is method is just used to `Decimal256Array`, why we should move it out of the `impl`? I am confused about this suggestion.
   Besides, It's a private method.
   If move it out of the `impl Decimal256Array`, how to get the `data` of the `Self`?
   
   > 1. DecimalArray has its own precision, providing another `precision` to its method is somewhat weird.
   
   `with_precision_and_scale` also provide the `precision` and `scale`, is this also weird?
   
   > 2. This method is just for `Decimal256Array`, but not the generic decimal array. Moving it out the `impl Decimal256Array` can make the code cleaner.
   
   I feel confused about this reason. The method is only used to `Decimal256Array`, why not treat it as private function? 
   
   



##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {

Review Comment:
   good catch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2360: Speed up `Decimal256` validation based on bytes comparison and add benchmark test

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2360:
URL: https://github.com/apache/arrow-rs/pull/2360#discussion_r944090481


##########
arrow/src/array/array_decimal.rs:
##########
@@ -332,15 +331,30 @@ impl Decimal128Array {
 impl Decimal256Array {
     /// Validates decimal values in this array can be properly interpreted
     /// with the specified precision.
-    pub fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
-        if precision < self.precision {
-            for v in self.iter().flatten() {
-                validate_decimal256_precision(&v.to_big_int(), precision)?;
+    fn validate_decimal_precision(&self, precision: usize) -> Result<()> {
+        let current_end = self.data.len();
+        let mut current: usize = 0;
+        let data = &self.data;
+
+        while current != current_end {
+            if self.is_null(current) {
+                current += 1;
+                continue;
+            } else {
+                let offset = current + data.offset();
+                current += 1;
+                let raw_val = unsafe {
+                    let pos = self.value_offset_at(offset);
+                    std::slice::from_raw_parts(
+                        self.raw_value_data_ptr().offset(pos as isize),
+                        Self::VALUE_LENGTH as usize,
+                    )
+                };
+                validate_decimal256_precision_with_lt_bytes(raw_val, precision)?;
             }
         }
         Ok(())

Review Comment:
   There is no performance regression without the `filter`, and I will apply the change to the codebase.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org