You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "sgilmore10 (via GitHub)" <gi...@apache.org> on 2023/06/19 17:21:51 UTC

[GitHub] [arrow] sgilmore10 opened a new pull request, #36167: GH-36166: [C++] [MATLAB]: Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

sgilmore10 opened a new pull request, #36167:
URL: https://github.com/apache/arrow/pull/36167

   ### Rationale for this change
   
   MATLAB uses UTF-16 encoded strings, but arrow uses UTF-8.  We need a way to convert between the two encodings. 
   
   ### What changes are included in this PR?
   
   Added two new utility functions:
   
   1. `std::string UTF16StringToUTF8(const std::basic_string<char16_t>& source)`
   2. `std::basic_string<char16_t> UTF8StringToUTF16(const std::string& source)` 
   
   ### Are these changes tested?
   
   Added two test cases to `utf8_util_test.cc`:
   
   1. `UTF16StringToUTF8`
   2. `UTF8StringToUTF16`
   
   
   ### Are there any user-facing changes?
   No, these APIs are intended for developers.
   
   ### Future Directions
   
   In a followup PR, we will update the MATLAB Interface source code to use these utilities when converting between UTF16 and UTF8 encoded strings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36167:
URL: https://github.com/apache/arrow/pull/36167#discussion_r1234855168


##########
cpp/src/arrow/util/utf8.cc:
##########
@@ -164,5 +176,21 @@ ARROW_EXPORT Result<std::string> WideStringToUTF8(const std::wstring& source) {
   }
 }
 
+ARROW_EXPORT Result<std::string> UTF16StringToUTF8(const std::u16string& source) {

Review Comment:
   For the record, `ARROW_EXPORT` is only useful on declarations, not definitions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sgilmore10 commented on a diff in pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "sgilmore10 (via GitHub)" <gi...@apache.org>.
sgilmore10 commented on code in PR #36167:
URL: https://github.com/apache/arrow/pull/36167#discussion_r1235253118


##########
cpp/src/arrow/util/utf8.cc:
##########
@@ -164,5 +176,21 @@ ARROW_EXPORT Result<std::string> WideStringToUTF8(const std::wstring& source) {
   }
 }
 
+ARROW_EXPORT Result<std::string> UTF16StringToUTF8(const std::u16string& source) {

Review Comment:
   Oh that's my bad. I copied the line from the header file but forgot to delete `ARROW_EXPORT`. I'll submit a followup pull request.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36167:
URL: https://github.com/apache/arrow/pull/36167#discussion_r1234856883


##########
cpp/src/arrow/util/utf8_util_test.cc:
##########
@@ -397,6 +397,49 @@ TEST(WideStringToUTF8, Basics) {
 #endif
 }
 
+TEST(UTF8StringToUTF16, Basics) {
+  auto CheckOk = [](const std::string& s, const std::u16string& expected) -> void {
+    ASSERT_OK_AND_ASSIGN(std::u16string u16s, UTF8StringToUTF16(s));
+    ASSERT_EQ(u16s, expected);
+  };
+
+  auto CheckInvalid = [](const std::string& s) -> void {
+    ASSERT_RAISES(Invalid, UTF8StringToUTF16(s));
+  };
+
+  CheckOk("", u"");
+  CheckOk("foo", u"foo");
+  CheckOk("h\xc3\xa9h\xc3\xa9", u"h\u00e9h\u00e9");
+  CheckOk("\xf0\x9f\x98\x80", u"\U0001F600");
+  CheckOk("\xf4\x8f\xbf\xbf", u"\U0010FFFF");
+  CheckOk({0, 'x'}, {0, u'x'});
+
+  CheckInvalid("\xff");
+  CheckInvalid("h\xc3");

Review Comment:
   Would have been nice to add a lone surrogate test here as well?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sgilmore10 commented on a diff in pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "sgilmore10 (via GitHub)" <gi...@apache.org>.
sgilmore10 commented on code in PR #36167:
URL: https://github.com/apache/arrow/pull/36167#discussion_r1235253664


##########
cpp/src/arrow/util/utf8_util_test.cc:
##########
@@ -397,6 +397,49 @@ TEST(WideStringToUTF8, Basics) {
 #endif
 }
 
+TEST(UTF8StringToUTF16, Basics) {
+  auto CheckOk = [](const std::string& s, const std::u16string& expected) -> void {
+    ASSERT_OK_AND_ASSIGN(std::u16string u16s, UTF8StringToUTF16(s));
+    ASSERT_EQ(u16s, expected);
+  };
+
+  auto CheckInvalid = [](const std::string& s) -> void {
+    ASSERT_RAISES(Invalid, UTF8StringToUTF16(s));
+  };
+
+  CheckOk("", u"");
+  CheckOk("foo", u"foo");
+  CheckOk("h\xc3\xa9h\xc3\xa9", u"h\u00e9h\u00e9");
+  CheckOk("\xf0\x9f\x98\x80", u"\U0001F600");
+  CheckOk("\xf4\x8f\xbf\xbf", u"\U0010FFFF");
+  CheckOk({0, 'x'}, {0, u'x'});
+
+  CheckInvalid("\xff");
+  CheckInvalid("h\xc3");

Review Comment:
   Definitely, I can add one in a followup pull request.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] sgilmore10 commented on pull request #36167: GH-36166: [C++] [MATLAB]: Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "sgilmore10 (via GitHub)" <gi...@apache.org>.
sgilmore10 commented on PR #36167:
URL: https://github.com/apache/arrow/pull/36167#issuecomment-1597672264

   The CI failures seem unrelated to these changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36167:
URL: https://github.com/apache/arrow/pull/36167#issuecomment-1598226285

   Conbench analyzed the 6 benchmark runs on commit `bd7455f0`.
   
   There were 4 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-06-19 22:49:03Z](http://conbench.ursa.dev/compare/runs/31640d56abd041829386122b747716e0...c53fdfaecd99421c9773e1e86f40a794/)
     - [source=cpp-micro, suite=parquet-bloom-filter-benchmark](http://conbench.ursa.dev/compare/benchmarks/064909150a1f7e0980006ea9e4bf1cee...06490dc01b1c770980000bf13777b07b)
     - [params=8192, source=cpp-micro, suite=arrow-bit-util-benchmark](http://conbench.ursa.dev/compare/benchmarks/0649090be4777cb18000d4b3873c9e7e...06490db7f22a7a8d80008016970e00e4)
   - and 2 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14394392197) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #36167:
URL: https://github.com/apache/arrow/pull/36167#discussion_r1234888379


##########
cpp/src/arrow/util/utf8.cc:
##########
@@ -164,5 +176,21 @@ ARROW_EXPORT Result<std::string> WideStringToUTF8(const std::wstring& source) {
   }
 }
 
+ARROW_EXPORT Result<std::string> UTF16StringToUTF8(const std::u16string& source) {

Review Comment:
   Oh, sorry. I missed this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou merged pull request #36167: GH-36166: [C++][MATLAB] Add utility to convert UTF-8 strings to UTF-16 and UTF-16 strings to UTF-8

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou merged PR #36167:
URL: https://github.com/apache/arrow/pull/36167


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org