You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/10 11:52:34 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new issue, #1531: Epic: enhance the `substring` kernel

HaoYang670 opened a new issue, #1531:
URL: https://github.com/apache/arrow-rs/issues/1531

   This is an epic issue to enhance the `substring` kernel from performance, safety and supported types. 
   You can find more info from the discussion: https://github.com/apache/arrow-rs/pull/1529#discussion_r846762535
   
   # List of individual issues is below:
   
   ## Performance
   - [ ] Improve the performance of `substring` (#1512)
   
   ## Safety
   - [ ] Update the document of `substring` (#1529)
   - [ ] Mark the current `substring` function as `unsafe`
   - [ ] Implement `safe` `substring` for string array
   - [ ] Implement `substring_by_char` for string array
   
   ## Support more array types
   - [ ] Add `substring` support for binary array
   - [ ] Add `substring` support for fixed size binary array
   - [ ] Add `substring` support for list array
   - [ ] Add `substring` support for fixed size list array
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098632899

   Hi @viirya. In general:
   Speed: `substring by byte unchecked` >= `substring by byte checked` >> `substring by char`
   Safety: `substring by char` > `substring by byte checked` > `substring by byte unchecked`. (If the input array has an invalid utf8 string, `substring by byte checked` might not find it, but `substring by char` can always find it)
   User-friendly: `substring by char` > `substring by byte checked` > `substring by byte unchecked`
   
   That's why we need 3 versions of `substring`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098005732

   Thank you for your thoughts @HaoYang670 
   
   I still think switching `substring` to use chars is preferable because:
   1. For ascii (single byte utf8) text, bytes and chars are equivalent
   2. If someone has  multi-byte utf8 string data the substring calculations are likely subtlety incorrect and they are in danger if creating invalid utf8 when using `substring`
   1. If we make `substring` (by bytes) safe, the performance will regress which some people will regard as  backwards incompatible as well
   
   Perhaps other maintainers such as @nevi-me  @viirya @sunchao, @tustvold  and @jhorstmann  have some thoughts about if/how we should change `substring` to handle utf8 ?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098017580

   Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:
   
   * [Bytes by default](https://doc.rust-lang.org/std/primitive.str.html#method.get), with validation that it isn't [splitting a codepoint](https://doc.rust-lang.org/std/primitive.str.html#method.is_char_boundary)
   * [Unsafe unchecked bytes APIs](https://doc.rust-lang.org/std/primitive.str.html#method.get_unchecked)
   * [Separate APIs](https://doc.rust-lang.org/std/primitive.str.html#method.chars) for operating on chars.
   
   That being said I don't really understand the use-cases of `substring` within a query engine context, i.e. if there is some SQL primitive it maps onto or something, and so if there is some standard practice here...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
viirya commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098300414

   Hmm, @tustvold and @HaoYang670 's proposal looks good. I think It is good idea to follow up the standard library.
   
   >[Bytes by default](https://doc.rust-lang.org/std/primitive.str.html#method.get), with validation that it isn't [splitting a codepoint](https://doc.rust-lang.org/std/primitive.str.html#method.is_char_boundary)
   >Additional [Unsafe unchecked bytes APIs](https://doc.rust-lang.org/std/primitive.str.html#method.get_unchecked)
   >[Separate APIs](https://doc.rust-lang.org/std/primitive.str.html#method.chars) for operating on chars.
   
   About the proposal, my question is, do we really need two APIs for `substring` on bytes? Is the validation is necessary? As we have separate API for operating on chars, users should use it for utf8 cases, I think. What use cases it could be that `substring` is called to operate on bytes, but it still needs to make sure not messing up utf8 content?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098026806

   I think @tustvold 's perspective matches @HaoYang670 's proposal https://github.com/apache/arrow-rs/issues/1531#issuecomment-1097613212 . I am convinced 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1095604139

   Thanks @HaoYang670 
   
   I suggest:
   
   * Mark the current substring function as unsafe *AND RENAME TO `substring_bytes_unchecked`*
   * Implement safe substring for string array *AND NAME `substring_bytes`*
   * Implement substring_by_char for string array *AND NAME `substring`*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1103334548

   We don't need to implement the `unsafe substring` based on the benchmark result of #1577. The utf-8 validation checking only decreases the speed by about 10%.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098033328

   So I think the first step is to add the utf-8 checking into the current `substring` function and test the performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098027612

   > there is a bit-twiddling trick you can do to make it relatively cheap - see [here](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/offset_buffer.rs#L66).
   
   This is a fantastic point!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
viirya commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098638044

   Thanks @HaoYang670 . It is clear from the above list, `substring by byte checked` is always in the middle in many aspects. So I'm wondering if it is really needed? For safety case, we want to make sure valid utf8 result, we definitely choose `substring by char`. If we don't care about it, we use `substring by byte unchecked`. I'm not sure what `substring by byte checked` can provide us.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1098747941

   We can design a situation where `substring by byte checked` is the best choice:
   
   Suppose we want to get the substrings with the **byte** length of the shortest string in the array. We could do in this way by using the `length` kernel and the `aggregation` kernel:
   ```rust
   substring(string_array, start = 0, length = Some(min(length(string_array))))
   ```
   
   Then, give an input `string_array` like this:
   ```rust
   let input  = [
       "14 bytes  🗻",
       "This string has 24 bytes",
       "🗻 13 bytes"
   ];
   ``` 
   We will get an invalid string array if using `substring by byte unchecked`, which is dangerous.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #1531: Epic: enhance the `substring` kernel

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #1531:
URL: https://github.com/apache/arrow-rs/issues/1531#issuecomment-1097613212

   Hi @alamb. After thinking twice about the renaming, I suggest we leave the `substring by byte` as the implicit version, and make `substring by char` as an explicit version, so that we can make minimize API change (because the current `substring` is `by byte`) and easily extend the function:
   
   | function name | Supported array|
   |-------|------------------|
   | substring| StringArray(by byte)(slow), BinaryArray, FixedSizeBinary, FixedSizedListArray, ListArray|
   | unsafe substring| StringArray(by byte)(fast), and more if we find invalid outputs in the future|
   | substring_by_char| StringArray|
   
   In this way, the back compatibility will not be broken. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org