You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 13:35:59 UTC
[GitHub] [arrow-rs] HaoYang670 opened a new pull request #1503: Support calculating number of chars for `StringArray`
HaoYang670 opened a new pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503
# Which issue does this PR close?
<!---
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
-->
Closes #1493 .
# Rationale for this change
We need a method to calculate the number of chars for `StringArray`
# What changes are included in this PR?
Provide a pub method `num_chars`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#issuecomment-1083436241
Thanks @HaoYang670 and @liukun4515 for the review
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837490317
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+ pub fn num_chars(&self, i: usize) -> usize {
+ let offsets = self.value_offsets();
+ let start = offsets[i].to_usize().unwrap();
+ let end = offsets[i + 1].to_usize().unwrap();
+ let chars = &self.data.buffers()[1].as_slice()[start..end];
+
+ let mut char_iter = chars.iter();
+ let mut length: usize = 0;
+ while let Some(prefix) = char_iter.next() {
+ let ones = prefix.leading_ones() as usize;
+ match ones {
Review comment:
We just check the validation of the first byte for every char
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#issuecomment-1082522921
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1503](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc9ef92) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1eba3c588f0ea63cce73c84188729a1d34cdc8d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f1eba3c) will **increase** coverage by `0.03%`.
> The diff coverage is `100.00%`.
```diff
@@ Coverage Diff @@
## master #1503 +/- ##
==========================================
+ Coverage 82.70% 82.73% +0.03%
==========================================
Files 188 188
Lines 54403 54359 -44
==========================================
- Hits 44993 44974 -19
+ Misses 9410 9385 -25
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X3N0cmluZy5ycw==) | `97.70% <100.00%> (+0.04%)` | :arrow_up: |
| [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
| [arrow/src/buffer/mutable.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2J1ZmZlci9tdXRhYmxlLnJz) | `90.53% <0.00%> (+0.02%)` | :arrow_up: |
| [arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy90YWtlLnJz) | `95.41% <0.00%> (+0.13%)` | :arrow_up: |
| [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `86.70% <0.00%> (+2.34%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [f1eba3c...bc9ef92](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837821883
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
Review comment:
I wonder if you have considered using `array.value(i).chars().len()` to count utf8 codepoints, as described in
https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838028156
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
Review comment:
Done!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838023117
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
Review comment:
> I wonder if you have considered using `array.value(i).chars().len()` to count utf8 codepoints, as described in https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?
Thank you for your helpful suggestion, I will have a try!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837486062
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+ pub fn num_chars(&self, i: usize) -> usize {
Review comment:
I don't find an elegant way to make the returned type as `OffsetSize`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837489063
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
Review comment:
`length of string` == `number of chars` when all chars are in 0000 ~ 007F
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] liukun4515 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838142364
##########
File path: arrow/src/array/array_string.rs
##########
@@ -377,9 +386,9 @@ mod tests {
#[test]
fn test_string_array_from_u8_slice() {
- let values: Vec<&str> = vec!["hello", "", "parquet"];
+ let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"];
Review comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837822214
##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
self.data.buffers()[1].clone()
}
+ /// Returns the number of chars in the string at index `i`.
+ /// # Panic
+ /// If an invalid utf-8 byte is found, the function will panic.
+ /// However, this function does not check every byte. So you might
+ /// get an unexpected result if the string is in invalid utf-8 format.
+ /// # Performance
+ /// This function has `O(n)` time complexity where `n` is the string length.
+ /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+ /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+ pub fn num_chars(&self, i: usize) -> usize {
Review comment:
I think returning the length as a `usize` is a good API and is what would be expected by rust programmers 👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb merged pull request #1503: Support calculating number of chars for `StringArray`
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org