You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/15 20:11:55 UTC

[GitHub] [arrow-rs] alamb opened a new issue #1313: Improve performance of `DictionaryArray::try_new()`

alamb opened a new issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   https://github.com/apache/arrow-rs/issues/1300 adds `DictionaryArray::try_new()` for creating `DictionaryArrays` from existing keys and dictionaries. However, as noted by @jhorstmann in https://github.com/apache/arrow-rs/issues/1300#issuecomment-1037978864 this results in a double validation (that the keys and values are valid arrays, when we know they already are, in addition to validating that all the keys are valid indexes into the dictionary. 
   
   
   
   **Describe the solution you'd like**
   Minimize the checking in `DictionaryArray::try_new()` to only verify that all indexes are within the range of the dictionary, which is required for this to be a safe API.
   
   It also might be worth adding a `unsafe DictionaryArray::new_unchecked()` which builds the arrays from data the user asserts is valid. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb closed issue #1313: Improve performance of `DictionaryArray::try_new()`

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jackwener commented on issue #1313: Improve performance of `DictionaryArray::try_new()`

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313#issuecomment-1063605242


   I can take it up. Could you assign it me? @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jackwener commented on issue #1313: Improve performance of `DictionaryArray::try_new()`

Posted by GitBox <gi...@apache.org>.
jackwener commented on issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313#issuecomment-1065802115


   @jhorstmann Hello, I am newer for project, I am gradually trying to contribute.
   I can't get the point `double validation`. Could you give me some hint? Thanks !❤
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on issue #1313: Improve performance of `DictionaryArray::try_new()`

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313#issuecomment-1065844808


   `ArrayDataBuilder::build` calls `ArrayData::try_new`, which does a full validation of that struct, including all child ArrayData objects. For example building a DictionaryArray from Int32 keys to String values first validates that the child data containing these strings is valid. Since layout of a StringArray consists of an offset buffer and a data buffer, it has to check that the offsets are monotonically increasing and in bounds for the data buffer. When the child StringArray is valid then also the dictionary keys need to be validated to be in bounds of this StringArray.
   
   In `DictionaryArray::try_new`, we get the child array as a parameter and can assume that is is valid without having to validate it again. We only need to check that the given keys are in bounds.
   
   I think a possible solution would be to extract the dictionary validation logic out of `ArrayData::validate_full` into a separate function. `DictionaryArray::try_new` could then use `ArrayDataBuilder::build_unchecked` and afterwards call the new function which only validates that the keys are in bounds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on issue #1313: Improve performance of `DictionaryArray::try_new()`

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1313:
URL: https://github.com/apache/arrow-rs/issues/1313#issuecomment-1065368070


   @jackwener I just opened a PR that does a small change to `try_new` in #1430, shouldn't cause any conflicts I think


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org