Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/14 15:12:54 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9192: ARROW-10264: [Python] Fix failing hdfs test

jorisvandenbossche commented on a change in pull request #9192:
URL: https://github.com/apache/arrow/pull/9192#discussion_r557464597



##########
File path: cpp/src/arrow/filesystem/hdfs.cc
##########
@@ -69,6 +69,14 @@ class HadoopFileSystem::Impl {
   HdfsOptions options() const { return options_; }
 
   Result<FileInfo> GetFileInfo(const std::string& path) {
+    // It has unfortunately been a frequent logic error to pass URIs down
+    // to GetFileInfo (e.g. ARROW-10264).  Unlike other filesystems, HDFS
+    // silently accepts URIs but returns different results than if given the
+    // equivalent in-filesystem paths.  Instead of raising cryptic errors
+    // later, notify the underlying problem immediately.
+    if (path.substr(0, 5) == "hdfs:") {

Review comment:
       Should this also check for "viewfs"?
   (I am not familiar with it; I only know that some places in the Python/Cython code check for it as well.)
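
   To make the suggestion concrete, here is a minimal Python sketch of a guard that covers both schemes. It is not the Arrow implementation: the helper names and the scheme tuple are hypothetical, and whether "viewfs" URIs really need the same treatment in `GetFileInfo` is an assumption that would need checking against the C++ code.

```python
# Hypothetical helpers, not part of Arrow. They mirror the prefix check in
# the diff above, extended to the "viewfs" scheme raised in this comment.
HDFS_URI_SCHEMES = ("hdfs:", "viewfs:")  # assumed set of schemes to reject


def looks_like_hdfs_uri(path: str) -> bool:
    """Return True if `path` starts with one of the HDFS-style URI schemes."""
    # str.startswith accepts a tuple and matches any of its entries.
    return path.startswith(HDFS_URI_SCHEMES)


def check_in_filesystem_path(path: str) -> str:
    """Fail early if a URI is passed where an in-filesystem path is expected."""
    if looks_like_hdfs_uri(path):
        raise ValueError(
            f"expected an in-filesystem path, got a URI: {path!r}"
        )
    return path
```

   With this, `check_in_filesystem_path("/data/file.parquet")` returns the path unchanged, while a string starting with `hdfs:` or `viewfs:` raises immediately instead of producing different results later.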

##########
File path: python/pyarrow/parquet.py
##########
@@ -1493,15 +1493,16 @@ def __init__(self, path_or_paths, filesystem=None, filters=None,
                 single_file = path_or_paths[0]
         else:
             if _is_path_like(path_or_paths):
-                path = str(path_or_paths)
+                path_or_paths = str(path_or_paths)
                 if filesystem is None:
                     # path might be a URI describing the FileSystem as well
                     try:
-                        filesystem, path = FileSystem.from_uri(path)
+                        filesystem, path_or_paths = FileSystem.from_uri(
+                            path_or_paths)

Review comment:
       Ah, good catch. So below we were still passing the original `path_or_paths` URI to the dataset constructor (instead of the non-URI path returned by `from_uri`), while also passing the filesystem inferred from the URI here.
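
       The pattern being fixed can be sketched in isolation as follows. This is an illustration only: `split_uri` is a hypothetical stand-in for `FileSystem.from_uri` built on the standard library, and the "filesystem" it returns is just a label string, not a real filesystem object.

```python
from urllib.parse import urlsplit


def split_uri(uri: str):
    """Hypothetical stand-in for FileSystem.from_uri.

    Returns (filesystem_label, in_filesystem_path), or raises ValueError
    if `uri` has no scheme and is therefore a plain path.
    """
    parts = urlsplit(uri)
    if not parts.scheme:
        raise ValueError(f"not a URI: {uri!r}")
    return f"{parts.scheme}://{parts.netloc}", parts.path


def resolve(path_or_paths, filesystem=None):
    """Return a (filesystem, path) pair that is consistent with each other."""
    path_or_paths = str(path_or_paths)
    if filesystem is None:
        try:
            # Rebind the SAME variable that is used later on, so the URI
            # is not passed along together with the inferred filesystem.
            filesystem, path_or_paths = split_uri(path_or_paths)
        except ValueError:
            pass  # plain path: leave it as-is, no filesystem inferred
    return filesystem, path_or_paths
```

       Assigning the result to a separate `path` variable, as the old code did, leaves `path_or_paths` holding the URI, which is the mismatch described above.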



