You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/05 21:12:58 UTC

[GitHub] [arrow] bkietz opened a new pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

bkietz opened a new pull request #7907:
URL: https://github.com/apache/arrow/pull/7907


   I still apply ignore_prefixes to all segments of paths yielded by a selector which lie *outside* an explicit partition base directory.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-669950497


   (again, moot for python since the two directories are identical) Since the user's selection of paths is really based in `selector.base_dir`, I think it'd be more approprate to apply ignore_prefixes relative to that


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469348756



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       From a quick looks, it seems that their LocalFileSystem should handle incoming "windows" (non-normalized) paths (see eg https://github.com/intake/filesystem_spec/pull/65/files, where a bunch of `path = make_path_posix(path)` were added to the LocalFileSystem methods). So I think this should be fine.
   
   Only, do we rely on this method on the C++ side to actually return a normalized path? 
   Eg in the C++ dataset discovery code, that might be relying on being able to split the normalized path on `/` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7907:
URL: https://github.com/apache/arrow/pull/7907


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469901269



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       We indeed already do that:
   
   https://github.com/apache/arrow/blob/90d1ab73a132faaaa0a205a8640d78bf1c60005f/python/pyarrow/fs.py#L77-L79
   
   at least in cases where the `_ensure_filesystem` is used to process the user specified filesystem before passing it to the underlying functions (but which should in principle be done by all user facing functions).
   
   So the actual fsspec's LocalFileSystem is only used for testing, so I think this is certainly fine then (since other filesystems should not have this problem of windows paths)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469351892



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       When calling `GetFileInfo(selector)`, we need the return values to be a lexical child of the selector base_dir.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz edited a comment on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
bkietz edited a comment on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-669936148


   > And the "partition base directory" is automatically set if a user does something like ds.dataset("_shouldnt_be_ignored/dataset/") ?
   
   Currently in python the partition_base_dir is always identical to the base directory of the recursive selector used, so yes.
   
   In principle it would be possible to select `/mnt/data/**` and set `partition_base_dir=/mnt/data/partitioned`, in which case `/mnt/data/other/**` would be included in the dataset but would not have partition information attached. Alternatively, one could set `partition_base_dir=/mnt` and select `/mnt/year=2020/**`, in which case all fragments would include the `"year"_ == 2020` partition expression. I question the utility of these cases, honestly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-671512620


   Segfault in test_dataset.py?
   
   ```
   ============================= test session starts =============================
   platform win32 -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
   rootdir: C:\projects\arrow\python, configfile: setup.cfg
   plugins: hypothesis-5.24.0, lazy-fixture-0.6.3
   collected 3415 items / 3 skipped / 3412 selected
   pyarrow\tests\test_adhoc_memory_leak.py s                                [  0%]
   pyarrow\tests\test_array.py ......................s..................... [  1%]
   .....................................................................sss [  3%]
   sssssss...............................................................ss [  5%]
   .......                                                                  [  5%]
   pyarrow\tests\test_builder.py ....                                       [  5%]
   pyarrow\tests\test_cffi.py ....                                          [  5%]
   pyarrow\tests\test_compute.py .......................................... [  7%]
   .................................................................        [  9%]
   pyarrow\tests\test_convert_builtin.py .................................. [ 10%]
   ........................................................................ [ 12%]
   ........................................................................ [ 14%]
   ........................................................................ [ 16%]
   ............................x........................................... [ 18%]
   .......x....ssss......................................................   [ 20%]
   pyarrow\tests\test_csv.py .............................................. [ 21%]
   ..........................s.                                             [ 22%]
   pyarrow\tests\test_cython.py .                                           [ 22%]
   pyarrow\tests\test_dataset.py ....................
   (arrow) C:\projects\arrow\python>set lastexitcode=3 
   (arrow) C:\projects\arrow\python>set  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp728E.tmp 
   (arrow) C:\projects\arrow\python>echo C:\projects\arrow\python  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp728F.tmp 
   (arrow) C:\projects\arrow\python>exit /b 3 
   Command exited with code 3
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469418747



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       So if `GetFileInfo` normalizes, so should `NormalizePath`. Otherwise, not necessary.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469986621



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       Let's assume it's fine, then.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-669754731


   And the "partition base directory" is automatically set if a user does something like `ds.dataset("_shouldnt_be_ignored/dataset/")` ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-672900543


   I'm taking a look at the segfault.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-669516018


   https://issues.apache.org/jira/browse/ARROW-9644


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469330525



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       @jorisvandenbossche This seems to work for now. I'm not sure if `fsspec` normalizes paths internally (especially on Windows).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#discussion_r469419589



##########
File path: python/pyarrow/fs.py
##########
@@ -117,6 +116,9 @@ def get_type_name(self):
             protocol = protocol[0]
         return "fsspec+{0}".format(protocol)
 
+    def normalize_path(self, path):
+        return path

Review comment:
       That said, we can also special-case `fsspec`'s LocalFileSystem to use our own LocalFileSystem instead... do we already do that?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-670496983


   Added the specific case of a directory to the existing test about this


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-669936148


   > And the "partition base directory" is automatically set if a user does something like ds.dataset("_shouldnt_be_ignored/dataset/") ?
   
   Currently in python the partition_base_dir is always identical to the base directory of the recursive selector used, so yes. In principle it would be possible to select `/mnt/data/**` and set `partition_base_dir=/mnt/data/partitioned`, in which case `/mnt/data/other/**` would be included in the dataset but would not have partition information attached.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #7907: ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #7907:
URL: https://github.com/apache/arrow/pull/7907#issuecomment-671508052


   I didn't understand what the Appveyor failure was so I re-triggered the build to see if it was a flake.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org