You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/29 10:22:42 UTC

[GitHub] [arrow] milesgranger opened a new pull request, #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

milesgranger opened a new pull request, #14764:
URL: https://github.com/apache/arrow/pull/14764

   Will close [ARROW-18123](https://issues.apache.org/jira/browse/ARROW-18123)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #14764: ARROW-18123: [Python] Fix writing files with multi-byte characters in file name

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche merged PR #14764:
URL: https://github.com/apache/arrow/pull/14764


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14764:
URL: https://github.com/apache/arrow/pull/14764#discussion_r1034568038


##########
cpp/src/arrow/util/uri.cc:
##########
@@ -245,19 +256,25 @@ const std::string& Uri::ToString() const { return impl_->string_rep_; }
 Status Uri::Parse(const std::string& uri_string) {
   impl_->Reset();
 
-  const auto& s = impl_->KeepString(uri_string);
-  impl_->string_rep_ = s;
-  const char* error_pos;
-  if (uriParseSingleUriExA(&impl_->uri_, s.data(), s.data() + s.size(), &error_pos) !=
+  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
+  std::wstring wide_uri = converter.from_bytes(uri_string);
+  std::wcout << "Wide URI: " << wide_uri << std::endl;
+
+  const auto& s = impl_->KeepString(wide_uri);
+  impl_->string_rep_ = converter.to_bytes(s);
+
+  const wchar_t* error_pos;
+  if (uriParseSingleUriExW(&impl_->uri_, s.data(), s.data() + s.size(), &error_pos) !=
       URI_SUCCESS) {
-    return Status::Invalid("Cannot parse URI: '", uri_string, "'");
+    return Status::Invalid("Cannot parse URI: '", uri_string, "' at: '",

Review Comment:
   Hi @pitrou, I know most of the PR is a bit of a mess, but I was thinking this specific area needed to have `wstring` support to parse the example in the issue (`例.parquet`) but while it now compiles, it still fails to parse it. And I'm out of ideas. :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1331923976

   > I suppose that's because "foo.parquet" in that case is a valid (local) URL
   
   So is "foo.parquet" considered a valid local URL, but "例.parquet" not? (because URLs cannot have multi-byte characters?)
   
   But in any case, our LocalFileSystem works with local paths and does support such file names for example when inspecting with `get_file_info` (I created two empty files with those names):
   
   ```
   In [1]: from pyarrow.fs import LocalFileSystem
   
   In [2]: fs = LocalFileSystem()
   
   In [3]: fs.get_file_info("foo.parquet")
   Out[3]: <FileInfo for 'foo.parquet': type=FileType.File, size=0>
   
   In [4]: fs.get_file_info("例.parquet")
   Out[4]: <FileInfo for '例.parquet': type=FileType.File, size=0>
   ```
   
   So since that the filesystem object itself can handle it, it seems there is going something wrong with how this is called in `write_table`.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1330405974

   https://issues.apache.org/jira/browse/ARROW-18123


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1331933506

   OK, so the problem is that we use `_resolve_filesystem_and_path` to preprocess the path. And for handling relative paths, this helper function assumes that the file should exist (it was written for when _reading_ data ...):
   
   https://github.com/apache/arrow/blob/54e17920eee65e4227eba889aadbdfeb66c114cd/python/pyarrow/fs.py#L169-L185
   
   And so if the file doesn't exist as a local (potentially relative) path, we fall back to use `from_uri`. And for this case, that then fails because indeed for a URI, the multi-byte character is invalid .. 
   It actually also fails for "foo.parquet", because we don't support plain paths as URIs, but this gives:
   
   ```
   In [12]: LocalFileSystem.from_uri("foo.parquet")
   ...
   ArrowInvalid: URI has empty scheme: 'foo.parquet'
   ```
   
   And that error about "empty scheme" is specifically checked for, to ignore that in this case:
   
   https://github.com/apache/arrow/blob/54e17920eee65e4227eba889aadbdfeb66c114cd/python/pyarrow/fs.py#L184-L191
   
   So I think the gist is that `_resolve_filesystem_and_path` wasn't written initially with _writing_ in mind (when files don't (need to) exist yet), and so we should update that logic. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #14764:
URL: https://github.com/apache/arrow/pull/14764#discussion_r1036843440


##########
python/pyarrow/fs.py:
##########
@@ -183,11 +184,11 @@ def _resolve_filesystem_and_path(
     if not exists_locally:
         try:
             filesystem, path = FileSystem.from_uri(path)
-        except ValueError as e:
+        except pa.ArrowInvalid as e:
             # neither an URI nor a locally existing path, so assume that
             # local path was given and propagate a nicer file not found error
             # instead of a more confusing scheme parsing error
-            if "empty scheme" not in str(e):
+            if "URI" not in str(e):

Review Comment:
   This might be too broad. I think the "Unrecognized filesystem type in URI: .." error is certainly one we want to keep raising. 
   Otherwise we will assume a local file path, which will then error because of getting something that looks like a URI (and then the user might think URIs are not supported in general, while they maybe just made a typo in the URI type, or are using an unsupported file system)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
milesgranger commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1331185036

   Apologies, maybe I misunderstood. I was reading relative paths in those comments in the context of a mutli-byte character, because a relative path "foo.parquet" works fine:
   
   ```python
   In [1]: table = pa.table({'col1': pa.array([5, 6])})
   
   In [2]: pq.write_table(table, 'foo.parquet')
   
   In [3]: pq.write_table(table, '例.parquet')
   ---------------------------------------------------------------------------
   ArrowInvalid                              Traceback (most recent call last)
   Cell In [3], line 1
   ----> 1 pq.write_table(table, '例.parquet')
   ...
   ArrowInvalid: Cannot parse URI: '例.parquet'
   ```
   
   The and thus the error leading me to this attempt. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14764:
URL: https://github.com/apache/arrow/pull/14764#discussion_r1034568038


##########
cpp/src/arrow/util/uri.cc:
##########
@@ -245,19 +256,25 @@ const std::string& Uri::ToString() const { return impl_->string_rep_; }
 Status Uri::Parse(const std::string& uri_string) {
   impl_->Reset();
 
-  const auto& s = impl_->KeepString(uri_string);
-  impl_->string_rep_ = s;
-  const char* error_pos;
-  if (uriParseSingleUriExA(&impl_->uri_, s.data(), s.data() + s.size(), &error_pos) !=
+  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
+  std::wstring wide_uri = converter.from_bytes(uri_string);
+  std::wcout << "Wide URI: " << wide_uri << std::endl;
+
+  const auto& s = impl_->KeepString(wide_uri);
+  impl_->string_rep_ = converter.to_bytes(s);
+
+  const wchar_t* error_pos;
+  if (uriParseSingleUriExW(&impl_->uri_, s.data(), s.data() + s.size(), &error_pos) !=
       URI_SUCCESS) {
-    return Status::Invalid("Cannot parse URI: '", uri_string, "'");
+    return Status::Invalid("Cannot parse URI: '", uri_string, "' at: '",

Review Comment:
   Hi @pitrou, I know most of the PR is a bit of a mess, but I was thinking this specific area needed to have `wstring` support to parse the example in the issue (`例.parquet`) but while it now compiles, it still fails to parse it. And I'm out of ideas now. :sweat_smile: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1331193568

   I suppose that's because "foo.parquet" in that case is a valid (local) URL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1330941785

   As the comments on the JIRA diagnose, this is not a problem with URI parsing, but with the Python side not supporting relative paths.
   
   Encoding the example filename to be URI-compatible gives the following:
   ```py
   >>> from urllib.parse import quote
   >>> quote("例.parquet")
   '%E4%BE%8B.parquet'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14764: ARROW-18123: [Python] Fix writing files with multi-byte characters in file name

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1348983368

   Benchmark runs are scheduled for baseline = 17b90d1dd4cece2f10d862067d1131636a330867 and contender = 26a426f325256e260a15521d5097efffd2f1ceb1. 26a426f325256e260a15521d5097efffd2f1ceb1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/14a7467ef8e042f78b728f6c27274e4d...fd3308e6de8f4a22a9f1a779aa723b15/)
   [Finished :arrow_down:0.1% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/2a1e0be70fb64ad0b3b401243412d775...f6c91e5603514be2895296e694e8c0c2/)
   [Finished :arrow_down:0.27% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/c1e61bb8c36c4e7aa4e3ebc4f704f3c3...626fdc43dba9445bb4b150c6bd3b0dc5/)
   [Finished :arrow_down:0.38% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b6abec4685174520afe99a27ab6d335c...cd1d088026af4ec9ab5a14bec2e5f757/)
   Buildkite builds:
   [Finished] [`26a426f3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1985)
   [Finished] [`26a426f3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2007)
   [Finished] [`26a426f3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1977)
   [Finished] [`26a426f3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1999)
   [Finished] [`17b90d1d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1984)
   [Finished] [`17b90d1d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2006)
   [Finished] [`17b90d1d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1976)
   [Finished] [`17b90d1d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1998)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #14764: ARROW-18123: [C++] Multi-byte characters in file names write table

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14764:
URL: https://github.com/apache/arrow/pull/14764#issuecomment-1331942589

   (as a result of the above, you also have the very confusing situation were the writing actually works if the file already exists, and will happily overwrite it in that case ..)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org