Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/27 12:47:31 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

jorisvandenbossche opened a new pull request #8064:
URL: https://github.com/apache/arrow/pull/8064


   





[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485021894



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2170,7 +2170,10 @@ def s3_example_s3fs(s3_connection, s3_server, s3_bucket):
 
     fs.mkdir(bucket_uri)
     yield fs, bucket_uri
-    fs.rm(bucket_uri, recursive=True)
+    try:
+        fs.rm(bucket_uri, recursive=True)
+    except FileNotFoundError:
+        pass

Review comment:
       This is using `s3fs`. I don't think we care about such errors.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r484804628



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2170,7 +2170,10 @@ def s3_example_s3fs(s3_connection, s3_server, s3_bucket):
 
     fs.mkdir(bucket_uri)
     yield fs, bucket_uri
-    fs.rm(bucket_uri, recursive=True)
+    try:
+        fs.rm(bucket_uri, recursive=True)
+    except FileNotFoundError:
+        pass

Review comment:
       For some reason the teardown raised errors for the added test where nothing got written (because we already errored when checking the file path); I'm not sure why, though.
   
   It gave `FileNotFoundError: s3://test-s3fs/07f5f3460595484aae4482f19168c3cc`
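
   As an aside, the same tolerant teardown can also be written with `contextlib.suppress`; a minimal, self-contained sketch, using a hypothetical local temporary-directory fixture as a stand-in for the s3fs bucket fixture above:

       import contextlib
       import shutil

       import pytest


       @pytest.fixture
       def example_dir(tmp_path):
           # Hypothetical stand-in for the s3fs bucket fixture: create a
           # directory, hand it to the test, then clean it up afterwards.
           path = tmp_path / "bucket"
           path.mkdir()
           yield path
           # Tolerate the directory already being gone, e.g. when the test
           # errored out before anything got written (the case described above).
           with contextlib.suppress(FileNotFoundError):
               shutil.rmtree(path)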







[GitHub] [arrow] pitrou closed pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8064:
URL: https://github.com/apache/arrow/pull/8064


   





[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485030253



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2936,7 +2939,34 @@ def test_write_to_dataset_with_partitions_and_index_name(
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_write_to_dataset_no_partitions(tempdir, use_legacy_dataset):
-    _test_write_to_dataset_no_partitions(str(tempdir))
+    _test_write_to_dataset_no_partitions(str(tempdir), use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib(tempdir, use_legacy_dataset):
+    _test_write_to_dataset_with_partitions(
+        tempdir / "test1", use_legacy_dataset)
+    _test_write_to_dataset_no_partitions(
+        tempdir / "test2", use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib_nonlocal(
+    tempdir, s3_example_s3fs, use_legacy_dataset
+):
+    # pathlib paths are only accepted for local files
+    fs, _ = s3_example_s3fs

Review comment:
       That would sound good, but it's too late :-) We can probably revisit later.







[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485022899



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2936,7 +2939,34 @@ def test_write_to_dataset_with_partitions_and_index_name(
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_write_to_dataset_no_partitions(tempdir, use_legacy_dataset):
-    _test_write_to_dataset_no_partitions(str(tempdir))
+    _test_write_to_dataset_no_partitions(str(tempdir), use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib(tempdir, use_legacy_dataset):
+    _test_write_to_dataset_with_partitions(
+        tempdir / "test1", use_legacy_dataset)
+    _test_write_to_dataset_no_partitions(
+        tempdir / "test2", use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib_nonlocal(
+    tempdir, s3_example_s3fs, use_legacy_dataset
+):
+    # pathlib paths are only accepted for local files
+    fs, _ = s3_example_s3fs

Review comment:
       Hmm. Ideally we don't require S3 for such tests (because it's optional, and also because it makes tests much more expensive to run). Can we use a mock filesystem or something?







[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r482832927



##########
File path: python/pyarrow/parquet.py
##########
@@ -1740,7 +1740,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
     Parameters
     ----------
     table : pyarrow.Table
-    root_path : str,
+    root_path : str, pathlib.Path

Review comment:
       The same.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485028789



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2936,7 +2939,34 @@ def test_write_to_dataset_with_partitions_and_index_name(
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_write_to_dataset_no_partitions(tempdir, use_legacy_dataset):
-    _test_write_to_dataset_no_partitions(str(tempdir))
+    _test_write_to_dataset_no_partitions(str(tempdir), use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib(tempdir, use_legacy_dataset):
+    _test_write_to_dataset_with_partitions(
+        tempdir / "test1", use_legacy_dataset)
+    _test_write_to_dataset_no_partitions(
+        tempdir / "test2", use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib_nonlocal(
+    tempdir, s3_example_s3fs, use_legacy_dataset
+):
+    # pathlib paths are only accepted for local files
+    fs, _ = s3_example_s3fs

Review comment:
       The problem is that this is using the "legacy" filesystems (I am going to add support for the new filesystems once dataset writing with parquet is merged, and even then it might take a different code path), and I don't think we have a mock filesystem in pyarrow.filesystem? (only the local one, since we also can't use HDFS)







[GitHub] [arrow] github-actions[bot] commented on pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#issuecomment-681926850


   https://issues.apache.org/jira/browse/ARROW-9864





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r482759794



##########
File path: python/pyarrow/parquet.py
##########
@@ -1740,7 +1740,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
     Parameters
     ----------
     table : pyarrow.Table
-    root_path : str,
+    root_path : str, pathlib.Path

Review comment:
       OK, and what about objects implementing `__fspath__`? (The protocol doesn't seem to be explicitly restricted to local paths, but in practice it might be.)
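
   For reference, a minimal sketch of the `__fspath__` protocol being discussed here; `FakePath` is a made-up class purely for illustration:

       import os
       import pathlib


       class FakePath:
           """Illustrative object implementing the os.PathLike protocol."""

           def __init__(self, raw):
               self._raw = raw

           def __fspath__(self):
               # os.fspath() calls this to obtain the str (or bytes) path.
               return self._raw


       print(os.fspath(pathlib.Path("/tmp/dataset")))  # "/tmp/dataset" on POSIX
       print(os.fspath(FakePath("/tmp/dataset")))      # "/tmp/dataset"
       print(isinstance(FakePath("x"), os.PathLike))   # True, via __fspath__

   Nothing in the protocol itself restricts the returned string to local paths, which is what the question above is getting at.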







[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485029324



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2936,7 +2939,34 @@ def test_write_to_dataset_with_partitions_and_index_name(
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_write_to_dataset_no_partitions(tempdir, use_legacy_dataset):
-    _test_write_to_dataset_no_partitions(str(tempdir))
+    _test_write_to_dataset_no_partitions(str(tempdir), use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib(tempdir, use_legacy_dataset):
+    _test_write_to_dataset_with_partitions(
+        tempdir / "test1", use_legacy_dataset)
+    _test_write_to_dataset_no_partitions(
+        tempdir / "test2", use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib_nonlocal(
+    tempdir, s3_example_s3fs, use_legacy_dataset
+):
+    # pathlib paths are only accepted for local files
+    fs, _ = s3_example_s3fs

Review comment:
       Ah, I see. This is good enough then.







[GitHub] [arrow] jorisvandenbossche commented on pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#issuecomment-688766298


   @pitrou I added an additional check that the path is not a pathlib path when the filesystem is not a local filesystem.
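
   As an illustration only, a hedged sketch of what such a guard could look like; the helper name `_check_path_not_pathlib`, the error message, and the use of the legacy `pyarrow.filesystem.LocalFileSystem` class discussed in this thread are assumptions, not the code that was actually merged:

       import pathlib

       from pyarrow.filesystem import LocalFileSystem


       def _check_path_not_pathlib(root_path, filesystem):
           # Hypothetical guard: pathlib paths only make sense for local files,
           # so reject them when a non-local filesystem (e.g. s3fs) is passed.
           if isinstance(root_path, pathlib.Path) and not isinstance(
               filesystem, LocalFileSystem
           ):
               raise TypeError(
                   "root_path as a pathlib.Path is only supported with a "
                   "local filesystem; pass the path as a string instead"
               )
           return str(root_path)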





[GitHub] [arrow] pitrou commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r481104060



##########
File path: python/pyarrow/parquet.py
##########
@@ -1740,7 +1740,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
     Parameters
     ----------
     table : pyarrow.Table
-    root_path : str,
+    root_path : str, pathlib.Path

Review comment:
       Ok, but we should probably disallow passing a `Path` if the filesystem is not a local filesystem.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8064: ARROW-9864: [Python] Support pathlib.path in pq.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8064:
URL: https://github.com/apache/arrow/pull/8064#discussion_r485029384



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2936,7 +2939,34 @@ def test_write_to_dataset_with_partitions_and_index_name(
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_write_to_dataset_no_partitions(tempdir, use_legacy_dataset):
-    _test_write_to_dataset_no_partitions(str(tempdir))
+    _test_write_to_dataset_no_partitions(str(tempdir), use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib(tempdir, use_legacy_dataset):
+    _test_write_to_dataset_with_partitions(
+        tempdir / "test1", use_legacy_dataset)
+    _test_write_to_dataset_no_partitions(
+        tempdir / "test2", use_legacy_dataset)
+
+
+@pytest.mark.pandas
+@pytest.mark.s3
+@parametrize_legacy_dataset
+def test_write_to_dataset_pathlib_nonlocal(
+    tempdir, s3_example_s3fs, use_legacy_dataset
+):
+    # pathlib paths are only accepted for local files
+    fs, _ = s3_example_s3fs

Review comment:
       I could of course use a "mock" (e.g. in-memory) filesystem from fsspec. Then it still uses an optional dependency in this test, but at least it is not as slow as S3.
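
   For reference, fsspec's in-memory filesystem mentioned here can be spun up like this; a minimal sketch, assuming fsspec is installed (the bucket and file names are arbitrary, and the exact `ls` output format varies between fsspec versions):

       import fsspec

       # fsspec's "memory" protocol gives a fast, network-free stand-in for S3.
       fs = fsspec.filesystem("memory")
       fs.mkdir("/test-bucket")
       with fs.open("/test-bucket/data.txt", "wb") as f:
           f.write(b"hello")
       print(fs.ls("/test-bucket", detail=False))  # lists the file we just wrote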



