You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/13 14:25:17 UTC

[GitHub] [arrow-rs] jackwener opened a new pull request #1435: Improve performance of DictionaryArray::try_new()

jackwener opened a new pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435


   # Which issue does this PR close?
   
   Closes #1313.
   
   # What changes are included in this PR?
   
   extract the dictionary validation logic out of `ArrayData::validate_full` into a separate function. 
   
   `DictionaryArray::try_new`  use the `ArrayDataBuilder::build_unchecked` and afterwards call the new function which only validates that the keys are in bounds.
   
   # Are there any user-facing changes?
   
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1075352032


   lgtm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r827709336



##########
File path: arrow/src/array/data.rs
##########
@@ -936,68 +936,68 @@ impl ArrayData {
             )));
         }
 
+        self.validate_dictionary_offest()?;
+
+        // validate all children recursively
+        self.child_data
+            .iter()
+            .enumerate()
+            .try_for_each(|(i, child_data)| {
+                child_data.validate_full().map_err(|e| {
+                    ArrowError::InvalidArgumentError(format!(
+                        "{} child #{} invalid: {}",
+                        self.data_type, i, e
+                    ))
+                })
+            })?;
+
+        Ok(())
+    }
+
+    pub fn validate_dictionary_offest(&self) -> Result<()> {

Review comment:
       there is a typo here I think:
   
   ```suggestion
       pub fn validate_dictionary_offset(&self) -> Result<()> {
   ```

##########
File path: arrow/src/array/data.rs
##########
@@ -936,68 +936,68 @@ impl ArrayData {
             )));
         }
 
+        self.validate_dictionary_offest()?;
+
+        // validate all children recursively
+        self.child_data
+            .iter()
+            .enumerate()
+            .try_for_each(|(i, child_data)| {
+                child_data.validate_full().map_err(|e| {
+                    ArrowError::InvalidArgumentError(format!(
+                        "{} child #{} invalid: {}",
+                        self.data_type, i, e
+                    ))
+                })
+            })?;
+
+        Ok(())
+    }
+
+    pub fn validate_dictionary_offest(&self) -> Result<()> {

Review comment:
       It might be clearer to call this `validate_dictionary_offsets_full`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1066115420


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1435](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b9b79cf) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d9099c4a0045f761bfa491b940bb89134f64b9a1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d9099c4) will **decrease** coverage by `0.00%`.
   > The diff coverage is `89.28%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1435/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1435      +/-   ##
   ==========================================
   - Coverage   82.68%   82.67%   -0.01%     
   ==========================================
     Files         185      185              
     Lines       53822    53825       +3     
   ==========================================
   - Hits        44502    44501       -1     
   - Misses       9320     9324       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `82.82% <88.00%> (-0.34%)` | :arrow_down: |
   | [arrow/src/array/array\_dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2RpY3Rpb25hcnkucnM=) | `91.73% <100.00%> (+0.07%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.79% <0.00%> (-0.31%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.31% <0.00%> (-0.12%)` | :arrow_down: |
   | [arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=) | `95.52% <0.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.71% <0.00%> (+0.19%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1435/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.43% <0.00%> (+0.22%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d9099c4...b9b79cf](https://codecov.io/gh/apache/arrow-rs/pull/1435?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r827720977



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       Hmm, we only have a limit that `K: ArrowPrimitiveType`. But in `validate`, it will further check if `K` is dictionary key type by `DataType::is_dictionary_key_type`, so I think the check is missed after this change?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r827940618



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       That is an excellent point @viirya  -- perhaps we can add a test case that tries to do something crazy like use `Float64` keys to verify. 
   
   Perhaps the test could look like this (untested):
   
   
   ```rust
       #[test]
       #[should_panic(
           expected = "Type is not valid dictionary type"
       )]
       fn test_try_new_index_too_large() {
           let values: StringArray = [Some("foo"), Some("bar")].into_iter().collect();
           let keys: Float32Array = [Some(0), None, Some(3)].into_iter().collect();
           DictionaryArray::<Float32Type>::try_new(&keys, &values).unwrap();
       }
   ```
   
   

##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       @viirya  are you suggesting we call `array.validate()` in addition to `array.validate_dictionary_offest()`? If so that makes sense to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r825515485



##########
File path: arrow/src/array/data.rs
##########
@@ -936,68 +936,68 @@ impl ArrayData {
             )));
         }
 
+        self.validate_dictionary_offest()?;
+
+        // validate all children recursively
+        self.child_data
+            .iter()
+            .enumerate()
+            .try_for_each(|(i, child_data)| {
+                child_data.validate_full().map_err(|e| {
+                    ArrowError::InvalidArgumentError(format!(
+                        "{} child #{} invalid: {}",
+                        self.data_type, i, e
+                    ))
+                })
+            })?;
+
+        Ok(())
+    }
+
+    pub fn validate_dictionary_offest(&self) -> Result<()> {
         match &self.data_type {
-            DataType::Utf8 => {
-                self.validate_utf8::<i32>()?;
-            }
-            DataType::LargeUtf8 => {
-                self.validate_utf8::<i64>()?;
-            }
-            DataType::Binary => {
-                self.validate_offsets_full::<i32>(self.buffers[1].len())?;
-            }
+            DataType::Utf8 => self.validate_utf8::<i32>(),
+            DataType::LargeUtf8 => self.validate_utf8::<i64>(),
+            DataType::Binary => self.validate_offsets_full::<i32>(self.buffers[1].len()),
             DataType::LargeBinary => {
-                self.validate_offsets_full::<i64>(self.buffers[1].len())?;
+                self.validate_offsets_full::<i64>(self.buffers[1].len())
             }
             DataType::List(_) | DataType::Map(_, _) => {
                 let child = &self.child_data[0];
-                self.validate_offsets_full::<i32>(child.len + child.offset)?;
+                self.validate_offsets_full::<i32>(child.len + child.offset)
             }
             DataType::LargeList(_) => {
                 let child = &self.child_data[0];
-                self.validate_offsets_full::<i64>(child.len + child.offset)?;
+                self.validate_offsets_full::<i64>(child.len + child.offset)
             }
             DataType::Union(_, _) => {
                 // Validate Union Array as part of implementing new Union semantics
                 // See comments in `ArrayData::validate()`
                 // https://github.com/apache/arrow-rs/issues/85
                 //
                 // TODO file follow on ticket for full union validation
+                Ok(())
             }
             DataType::Dictionary(key_type, _value_type) => {
                 let dictionary_length: i64 = self.child_data[0].len.try_into().unwrap();
                 let max_value = dictionary_length - 1;
                 match key_type.as_ref() {
-                    DataType::UInt8 => self.check_bounds::<u8>(max_value)?,
-                    DataType::UInt16 => self.check_bounds::<u16>(max_value)?,
-                    DataType::UInt32 => self.check_bounds::<u32>(max_value)?,
-                    DataType::UInt64 => self.check_bounds::<u64>(max_value)?,
-                    DataType::Int8 => self.check_bounds::<i8>(max_value)?,
-                    DataType::Int16 => self.check_bounds::<i16>(max_value)?,
-                    DataType::Int32 => self.check_bounds::<i32>(max_value)?,
-                    DataType::Int64 => self.check_bounds::<i64>(max_value)?,
+                    DataType::UInt8 => self.check_bounds::<u8>(max_value),
+                    DataType::UInt16 => self.check_bounds::<u16>(max_value),
+                    DataType::UInt32 => self.check_bounds::<u32>(max_value),
+                    DataType::UInt64 => self.check_bounds::<u64>(max_value),
+                    DataType::Int8 => self.check_bounds::<i8>(max_value),
+                    DataType::Int16 => self.check_bounds::<i16>(max_value),
+                    DataType::Int32 => self.check_bounds::<i32>(max_value),
+                    DataType::Int64 => self.check_bounds::<i64>(max_value),
                     _ => unreachable!(),

Review comment:
       > I think a possible solution would be to extract the dictionary validation logic out of ArrayData::validate_full into a separate function. DictionaryArray::try_new could then use ArrayDataBuilder::build_unchecked and afterwards call the new function which only validates that the keys are in bounds.
   
   I think "the dictionary validation logic" is only for the logic inside `DataType::Dictionary` pattern branch.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jackwener commented on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
jackwener commented on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1074810995


   I have read the conversion above. thanks @viirya @alamb .
   I add the cheap `validate()` and a unit test for `wrong dictionary key`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1074992080


   Looks good -- thank you @jackwener  and @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r832005623



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +113,12 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };

Review comment:
       ```suggestion
           // Safety: `validate` ensures key type is correct, and
           //  `validate_dictionary_offset` ensures all offsets are within range
           let array = unsafe { data.build_unchecked() };
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r827712787



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       I think the idea is that since the input to this function is a valid array, then the array data itself is also valid. 
   
   Thus, we know that the dictionary keys are valid integers of type `K` but we don't know that they are all within the range `0..dictionary_values.len()` so that validation is still required. I think this PR makes this change




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r828196838



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       Yes, looks like we also need `validate` in additional to `validate_dictionary_offest`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r825515628



##########
File path: arrow/src/array/data.rs
##########
@@ -936,68 +936,68 @@ impl ArrayData {
             )));
         }
 
+        self.validate_dictionary_offest()?;
+
+        // validate all children recursively
+        self.child_data
+            .iter()
+            .enumerate()
+            .try_for_each(|(i, child_data)| {
+                child_data.validate_full().map_err(|e| {
+                    ArrowError::InvalidArgumentError(format!(
+                        "{} child #{} invalid: {}",
+                        self.data_type, i, e
+                    ))
+                })
+            })?;
+
+        Ok(())
+    }
+
+    pub fn validate_dictionary_offest(&self) -> Result<()> {
         match &self.data_type {
-            DataType::Utf8 => {
-                self.validate_utf8::<i32>()?;
-            }
-            DataType::LargeUtf8 => {
-                self.validate_utf8::<i64>()?;
-            }
-            DataType::Binary => {
-                self.validate_offsets_full::<i32>(self.buffers[1].len())?;
-            }
+            DataType::Utf8 => self.validate_utf8::<i32>(),
+            DataType::LargeUtf8 => self.validate_utf8::<i64>(),
+            DataType::Binary => self.validate_offsets_full::<i32>(self.buffers[1].len()),
             DataType::LargeBinary => {
-                self.validate_offsets_full::<i64>(self.buffers[1].len())?;
+                self.validate_offsets_full::<i64>(self.buffers[1].len())
             }
             DataType::List(_) | DataType::Map(_, _) => {
                 let child = &self.child_data[0];
-                self.validate_offsets_full::<i32>(child.len + child.offset)?;
+                self.validate_offsets_full::<i32>(child.len + child.offset)
             }
             DataType::LargeList(_) => {
                 let child = &self.child_data[0];
-                self.validate_offsets_full::<i64>(child.len + child.offset)?;
+                self.validate_offsets_full::<i64>(child.len + child.offset)
             }
             DataType::Union(_, _) => {
                 // Validate Union Array as part of implementing new Union semantics
                 // See comments in `ArrayData::validate()`
                 // https://github.com/apache/arrow-rs/issues/85
                 //
                 // TODO file follow on ticket for full union validation
+                Ok(())
             }
             DataType::Dictionary(key_type, _value_type) => {
                 let dictionary_length: i64 = self.child_data[0].len.try_into().unwrap();
                 let max_value = dictionary_length - 1;
                 match key_type.as_ref() {
-                    DataType::UInt8 => self.check_bounds::<u8>(max_value)?,
-                    DataType::UInt16 => self.check_bounds::<u16>(max_value)?,
-                    DataType::UInt32 => self.check_bounds::<u32>(max_value)?,
-                    DataType::UInt64 => self.check_bounds::<u64>(max_value)?,
-                    DataType::Int8 => self.check_bounds::<i8>(max_value)?,
-                    DataType::Int16 => self.check_bounds::<i16>(max_value)?,
-                    DataType::Int32 => self.check_bounds::<i32>(max_value)?,
-                    DataType::Int64 => self.check_bounds::<i64>(max_value)?,
+                    DataType::UInt8 => self.check_bounds::<u8>(max_value),
+                    DataType::UInt16 => self.check_bounds::<u16>(max_value),
+                    DataType::UInt32 => self.check_bounds::<u32>(max_value),
+                    DataType::UInt64 => self.check_bounds::<u64>(max_value),
+                    DataType::Int8 => self.check_bounds::<i8>(max_value),
+                    DataType::Int16 => self.check_bounds::<i16>(max_value),
+                    DataType::Int32 => self.check_bounds::<i32>(max_value),
+                    DataType::Int64 => self.check_bounds::<i64>(max_value),
                     _ => unreachable!(),

Review comment:
       The validation logic for other data types are not for dictionary offset.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jackwener edited a comment on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
jackwener edited a comment on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1074810995


   I have read the dialogue above. thanks @viirya @alamb ❤😃.
   I add the cheap `validate()` and a unit test for `wrong dictionary key`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jackwener edited a comment on pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
jackwener edited a comment on pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#issuecomment-1074810995


   I have read the conversion above. thanks @viirya @alamb ❤😃.
   I add the cheap `validate()` and a unit test for `wrong dictionary key`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1435: Improve performance of DictionaryArray::try_new()

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1435:
URL: https://github.com/apache/arrow-rs/pull/1435#discussion_r825516254



##########
File path: arrow/src/array/array_dictionary.rs
##########
@@ -112,7 +112,11 @@ impl<'a, K: ArrowPrimitiveType> DictionaryArray<K> {
             _ => data = data.null_count(0),
         }
 
-        Ok(data.build()?.into())
+        let array = unsafe { data.build_unchecked() };
+
+        array.validate_dictionary_offest()?;

Review comment:
       Actually `validate_full` also contains "cheap" validation (`validate`), like buffer length, key type, etc. I think they seems necessary still.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org