You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 13:35:59 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new pull request #1503: Support calculating number of chars for `StringArray`

HaoYang670 opened a new pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503


   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1493 .
   
   # Rationale for this change
    We need a method to calculate the number of chars for `StringArray`
   
   # What changes are included in this PR?
   Provide a pub method `num_chars`.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#issuecomment-1083436241


   Thanks @HaoYang670  and @liukun4515  for the review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837490317



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+    pub fn num_chars(&self, i: usize) -> usize {
+        let offsets = self.value_offsets();
+        let start = offsets[i].to_usize().unwrap();
+        let end = offsets[i + 1].to_usize().unwrap();
+        let chars = &self.data.buffers()[1].as_slice()[start..end];
+
+        let mut char_iter = chars.iter();
+        let mut length: usize = 0;
+        while let Some(prefix) = char_iter.next() {
+            let ones = prefix.leading_ones() as usize;
+            match ones {

Review comment:
       We just check the validation of the first byte for every char




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#issuecomment-1082522921


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1503](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc9ef92) into [master](https://codecov.io/gh/apache/arrow-rs/commit/f1eba3c588f0ea63cce73c84188729a1d34cdc8d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f1eba3c) will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1503      +/-   ##
   ==========================================
   + Coverage   82.70%   82.73%   +0.03%     
   ==========================================
     Files         188      188              
     Lines       54403    54359      -44     
   ==========================================
   - Hits        44993    44974      -19     
   + Misses       9410     9385      -25     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X3N0cmluZy5ycw==) | `97.70% <100.00%> (+0.04%)` | :arrow_up: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
   | [arrow/src/buffer/mutable.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2J1ZmZlci9tdXRhYmxlLnJz) | `90.53% <0.00%> (+0.02%)` | :arrow_up: |
   | [arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy90YWtlLnJz) | `95.41% <0.00%> (+0.13%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1503/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `86.70% <0.00%> (+2.34%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [f1eba3c...bc9ef92](https://codecov.io/gh/apache/arrow-rs/pull/1503?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837821883



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.

Review comment:
       I wonder if you have considered using `array.value(i).chars().len()` to count utf8 codepoints, as described in 
   https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838028156



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.

Review comment:
       Done!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838023117



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.

Review comment:
       > I wonder if you have considered using `array.value(i).chars().len()` to count utf8 codepoints, as described in https://stackoverflow.com/questions/46290655/get-the-string-length-in-characters-in-rust?
   
   Thank you for your helpful suggestion, I will have a try!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837486062



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+    pub fn num_chars(&self, i: usize) -> usize {

Review comment:
       I don't find an elegant way to make the returned type as `OffsetSize`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] HaoYang670 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837489063



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.

Review comment:
       `length of string` == `number of chars` when all chars are in 0000 ~ 007F




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] liukun4515 commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r838142364



##########
File path: arrow/src/array/array_string.rs
##########
@@ -377,9 +386,9 @@ mod tests {
 
     #[test]
     fn test_string_array_from_u8_slice() {
-        let values: Vec<&str> = vec!["hello", "", "parquet"];
+        let values: Vec<&str> = vec!["hello", "", "A£ऀ𖼚𝌆৩ƐZ"];

Review comment:
       👍




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503#discussion_r837822214



##########
File path: arrow/src/array/array_string.rs
##########
@@ -78,6 +78,39 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
         self.data.buffers()[1].clone()
     }
 
+    /// Returns the number of chars in the string at index `i`.
+    /// # Panic
+    /// If an invalid utf-8 byte is found, the function will panic.
+    /// However, this function does not check every byte. So you might
+    /// get an unexpected result if the string is in invalid utf-8 format.
+    /// # Performance
+    /// This function has `O(n)` time complexity where `n` is the string length.
+    /// If you can make sure that all chars in the string are in the range `U+0x0000` ~ `U+0x007F`,
+    /// please use the function [`value_length`](#method.value_length) which has O(1) time complexity.
+    pub fn num_chars(&self, i: usize) -> usize {

Review comment:
       I think returning the length as a `usize` is a good API and is what would be expected by rust programmers 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1503: Support calculating number of chars for `StringArray`

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1503:
URL: https://github.com/apache/arrow-rs/pull/1503


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org