Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/21 16:39:47 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

jorisvandenbossche opened a new pull request #9284:
URL: https://github.com/apache/arrow/pull/9284


   * This fixes the commented-out test mentioned in ARROW-10370.
   * Use the general filesystem-resolve utility code in `write_dataset`, and add a check that pathlib paths are only allowed for the local filesystem (the legacy filesystems already perform this check).
   * Add S3 tests for `write_dataset` to cover the new functionality, now that the path/URI is properly resolved.
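
To illustrate what "resolving the path/URI" means in the last bullet, here is a minimal sketch using only the Python standard library. It is not the pyarrow implementation: `split_filesystem_uri` is a hypothetical helper, and treating a scheme-less string as a local path is an assumption made for the example. It also ignores Windows drive letters, which `urlsplit` would misread as a scheme.

```python
from urllib.parse import urlsplit


def split_filesystem_uri(uri):
    """Hypothetical helper: split a URI into (scheme, path inside the filesystem)."""
    parts = urlsplit(uri)
    if not parts.scheme:
        # Assumption: a string without a scheme is a plain local path.
        return "file", uri
    # For "s3://bucket/key" the bucket is parsed as the netloc,
    # so it is joined back onto the path.
    return parts.scheme, parts.netloc + parts.path


print(split_filesystem_uri("s3://bucket/dataset"))
print(split_filesystem_uri("file:///tmp/data"))
print(split_filesystem_uri("/tmp/data"))
```

The scheme would then select the filesystem implementation, and the remaining path is what gets passed to it.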


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-764778634


   https://issues.apache.org/jira/browse/ARROW-10370





[GitHub] [arrow] jorisvandenbossche commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-766808103


   Writing datasets is not yet used by dask/pandas/kartothek, I think, but I also slightly changed `_resolve_filesystem_and_path` which is used in reading datasets, so yes, good idea, let's run them to be sure.
   
   





[GitHub] [arrow] pitrou commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-766878267


   Looking good, thank you.





[GitHub] [arrow] pitrou commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-766801684


   Does this need to be tested on some crossbow builds before merging?





[GitHub] [arrow] github-actions[bot] commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-766808833


   Revision: f855a3166addea9d615f1aa61ac5c5e090df3367
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-44](https://github.com/ursacomputing/crossbow/branches/all?query=actions-44)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.6-pandas-0.23|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.6-pandas-0.23)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.6-pandas-0.23)|
   |test-conda-python-3.7-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-dask-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-dask-latest)|
   |test-conda-python-3.7-hdfs-3.2|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-hdfs-3.2)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-hdfs-3.2)|
   |test-conda-python-3.7-kartothek-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-kartothek-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-kartothek-latest)|
   |test-conda-python-3.7-kartothek-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-kartothek-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-kartothek-master)|
   |test-conda-python-3.7-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-pandas-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-pandas-latest)|
   |test-conda-python-3.7-pandas-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-pandas-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-pandas-master)|
   |test-conda-python-3.7-spark-branch-3.0|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-spark-branch-3.0)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-spark-branch-3.0)|
   |test-conda-python-3.7-turbodbc-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-turbodbc-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-turbodbc-latest)|
   |test-conda-python-3.7-turbodbc-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.7-turbodbc-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.7-turbodbc-master)|
   |test-conda-python-3.8-dask-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.8-dask-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.8-dask-master)|
   |test-conda-python-3.8-jpype|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.8-jpype)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.8-jpype)|
   |test-conda-python-3.8-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.8-pandas-latest)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.8-pandas-latest)|
   |test-conda-python-3.8-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.8-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.8-pandas-nightly)|
   |test-conda-python-3.8-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-44-github-test-conda-python-3.8-spark-master)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-44-github-test-conda-python-3.8-spark-master)|





[GitHub] [arrow] pitrou closed pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #9284:
URL: https://github.com/apache/arrow/pull/9284


   





[GitHub] [arrow] pitrou commented on a change in pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#discussion_r563706429



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1337,7 +1337,7 @@ def test_construct_from_single_file(tempdir):
     # instantiate from a single file with a filesystem object
     d2 = ds.dataset(path, filesystem=fs.LocalFileSystem())
     # instantiate from a single file with prefixed filesystem URI
-    d3 = ds.dataset(relative_path, filesystem=_filesystem_uri(directory))
+    d3 = ds.dataset(str(relative_path), filesystem=_filesystem_uri(directory))

Review comment:
       Agreed.






[GitHub] [arrow] jorisvandenbossche commented on pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#issuecomment-766808321


   @github-actions crossbow submit -g integration



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9284: ARROW-10370: [Python] Clean-up filesystem handling in write_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9284:
URL: https://github.com/apache/arrow/pull/9284#discussion_r562029994



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1337,7 +1337,7 @@ def test_construct_from_single_file(tempdir):
     # instantiate from a single file with a filesystem object
     d2 = ds.dataset(path, filesystem=fs.LocalFileSystem())
     # instantiate from a single file with prefixed filesystem URI
-    d3 = ds.dataset(relative_path, filesystem=_filesystem_uri(directory))
+    d3 = ds.dataset(str(relative_path), filesystem=_filesystem_uri(directory))

Review comment:
       This change is needed because `relative_path` is a `pathlib.Path` object. With this PR, we only support pathlib paths for an actual `LocalFileSystem`, while here we have a `SubTreeFileSystem`.
   I suppose that in theory I could also check for a `SubTreeFileSystem` that wraps a `LocalFileSystem`, but I am not sure we want to encourage using `pathlib.Path` for relative paths, and with an absolute path you never need `SubTreeFileSystem`.
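
The rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the PR: `normalize_path` and its `filesystem_is_local` flag are hypothetical names, and the exact exception type and message are made up for the example.

```python
import pathlib


def normalize_path(path, filesystem_is_local):
    """Hypothetical check: accept pathlib paths only for a local filesystem."""
    if isinstance(path, pathlib.PurePath):
        if not filesystem_is_local:
            # Assumption: reject rather than silently stringify, since a
            # pathlib path only has a clear meaning on the local filesystem.
            raise TypeError(
                "pathlib.Path is only supported for the local filesystem; "
                "pass a string path instead"
            )
        return str(path)
    if isinstance(path, str):
        return path
    raise TypeError(f"expected str or pathlib.Path, got {type(path).__name__}")


# A string path is accepted for any filesystem.
print(normalize_path("data/part-0.parquet", filesystem_is_local=False))
```

Under such a rule, the test has to pass `str(relative_path)` because the filesystem it uses is not the plain local one.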



