Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/13 08:42:36 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6822: Python: Set PyArrow as the default FileIO

Fokko opened a new pull request, #6822:
URL: https://github.com/apache/iceberg/pull/6822

   PyArrow is the most feature-complete FileIO, and it can be a bit confusing that we default to s3fs.
   
   Resolves #6820


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #6822:
URL: https://github.com/apache/iceberg/pull/6822#discussion_r1104878428


##########
python/tests/catalog/test_glue.py:
##########
@@ -52,7 +52,7 @@ def test_create_table_with_database_location(
     _bucket_initialize: None, _patch_aiobotocore: None, table_schema_nested: Schema, database_name: str, table_name: str
 ) -> None:
     identifier = (database_name, table_name)
-    test_catalog = GlueCatalog("glue")
+    test_catalog = GlueCatalog("glue", **{"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"})

Review Comment:
   Sorry, just to get some more context here: why do we need to use `FsspecFileIO` for the tests? Would `PyArrowFileIO` not work?





[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.
JonasJ-ap commented on code in PR #6822:
URL: https://github.com/apache/iceberg/pull/6822#discussion_r1104910238


##########
python/tests/catalog/test_glue.py:
##########
@@ -52,7 +52,7 @@ def test_create_table_with_database_location(
     _bucket_initialize: None, _patch_aiobotocore: None, table_schema_nested: Schema, database_name: str, table_name: str
 ) -> None:
     identifier = (database_name, table_name)
-    test_catalog = GlueCatalog("glue")
+    test_catalog = GlueCatalog("glue", **{"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"})

Review Comment:
   I think `PyArrowFileIO` is not compatible with the mocked S3 filesystem (handled by `moto` and `_patch_aiobotocore`) that we use in the unit tests. I received
   ```
   OSError: When getting information for key '...' in bucket 'test_bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
   ```
   when using `PyArrowFileIO` in the unit test. It seems that it still tries to interact with the real S3 bucket, not the mocked one.
   
   I also reran `integration_test_glue.py` with the new changes and verified that everything works fine with the IO defaulting to `PyArrowFileIO`.
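
For readers unfamiliar with the `**{"py-io-impl": ...}` form in the diff above: the property key contains hyphens, so it cannot be written as an ordinary keyword argument and is passed by unpacking a dict instead. The sketch below illustrates only that Python mechanism. `make_catalog` is a hypothetical stand-in, not the real `GlueCatalog` constructor, and it assumes the catalog simply collects extra keyword arguments as string properties.

```python
from typing import Any, Dict


def make_catalog(name: str, **properties: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a catalog constructor that accepts
    arbitrary string properties as keyword arguments."""
    return {"name": name, "properties": dict(properties)}


# "py-io-impl" is not a valid Python identifier, so writing
# make_catalog("glue", py-io-impl="...") would be a syntax error.
# Unpacking a dict with ** passes the key through unchanged.
props = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}
catalog = make_catalog("glue", **props)

print(catalog["properties"]["py-io-impl"])
```

Pinning the implementation this way keeps the unit tests on the fsspec-based FileIO regardless of which implementation the library prefers by default.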
   





[GitHub] [iceberg] rosner commented on a diff in pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "rosner (via GitHub)" <gi...@apache.org>.
rosner commented on code in PR #6822:
URL: https://github.com/apache/iceberg/pull/6822#discussion_r1112840867


##########
python/pyiceberg/io/__init__.py:
##########
@@ -254,10 +254,10 @@ def delete(self, location: Union[str, InputFile, OutputFile]) -> None:
 # Mappings from the Java FileIO impl to a Python one. The list is ordered by preference.
 # If an implementation isn't installed, it will fall back to the next one.
 SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
-    "s3": [FSSPEC_FILE_IO, ARROW_FILE_IO],
-    "s3a": [FSSPEC_FILE_IO, ARROW_FILE_IO],
-    "s3n": [FSSPEC_FILE_IO, ARROW_FILE_IO],
-    "gcs": [ARROW_FILE_IO],
+    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
+    "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
+    "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
+    "gs": [ARROW_FILE_IO],

Review Comment:
   👏 
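
The comments in the hunk above say the list is ordered by preference and that a missing implementation falls back to the next one. A minimal sketch of that lookup follows, under stated assumptions: the value of `ARROW_FILE_IO` is assumed to be `pyiceberg.io.pyarrow.PyArrowFileIO` (only the fsspec path appears in this thread), and `candidates_for` and `first_importable` are hypothetical helper names, not functions from the library. The actual resolution code in `pyiceberg/io/__init__.py` may differ.

```python
import importlib
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse

# Assumed value; only the fsspec path is quoted in the thread.
ARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"

# Mapping as it reads after the change in the diff above.
SCHEMA_TO_FILE_IO: Dict[str, List[str]] = {
    "s3": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3a": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "s3n": [ARROW_FILE_IO, FSSPEC_FILE_IO],
    "gs": [ARROW_FILE_IO],
}


def candidates_for(location: str) -> List[str]:
    """Return the preference-ordered implementations for a location's URI scheme."""
    scheme = urlparse(location).scheme
    return SCHEMA_TO_FILE_IO.get(scheme, [])


def first_importable(
    paths: List[str],
    importer: Callable[[str], object] = importlib.import_module,
) -> Optional[str]:
    """Return the first dotted class path whose module imports and exposes the class."""
    for path in paths:
        module_name, _, class_name = path.rpartition(".")
        try:
            module = importer(module_name)
        except ImportError:
            # Implementation not installed: fall back to the next candidate.
            continue
        if hasattr(module, class_name):
            return path
    return None


# With this ordering, PyArrow is tried first for S3 locations.
print(candidates_for("s3://bucket/path/to/file"))
```

The `importer` parameter exists only so the fallback can be exercised without either library installed.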





[GitHub] [iceberg] jackye1995 commented on a diff in pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "jackye1995 (via GitHub)" <gi...@apache.org>.
jackye1995 commented on code in PR #6822:
URL: https://github.com/apache/iceberg/pull/6822#discussion_r1104915741


##########
python/tests/catalog/test_glue.py:
##########
@@ -52,7 +52,7 @@ def test_create_table_with_database_location(
     _bucket_initialize: None, _patch_aiobotocore: None, table_schema_nested: Schema, database_name: str, table_name: str
 ) -> None:
     identifier = (database_name, table_name)
-    test_catalog = GlueCatalog("glue")
+    test_catalog = GlueCatalog("glue", **{"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"})

Review Comment:
   Thanks for the explanation!





[GitHub] [iceberg] Fokko commented on a diff in pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #6822:
URL: https://github.com/apache/iceberg/pull/6822#discussion_r1104933171


##########
python/tests/catalog/test_glue.py:
##########
@@ -52,7 +52,7 @@ def test_create_table_with_database_location(
     _bucket_initialize: None, _patch_aiobotocore: None, table_schema_nested: Schema, database_name: str, table_name: str
 ) -> None:
     identifier = (database_name, table_name)
-    test_catalog = GlueCatalog("glue")
+    test_catalog = GlueCatalog("glue", **{"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"})

Review Comment:
   Thanks for jumping in here @JonasJ-ap 👍🏻 





[GitHub] [iceberg] Fokko merged pull request #6822: Python: Set PyArrow as the default FileIO

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko merged PR #6822:
URL: https://github.com/apache/iceberg/pull/6822

