Posted to issues@iceberg.apache.org by "Guillem96 (via GitHub)" <gi...@apache.org> on 2023/02/13 07:44:42 UTC
[GitHub] [iceberg] Guillem96 opened a new issue, #6820: Cannot create a PyArrow dataframe after an scan
Guillem96 opened a new issue, #6820:
URL: https://github.com/apache/iceberg/issues/6820
### Apache Iceberg version
None
### Query engine
Athena
### Please describe the bug 🐞
When running this code snippet to query my Athena Iceberg table:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

catalog = load_catalog("default", warehouse="...")
table = catalog.load_table(("...", "..."))
df = (
    table.scan()
    .filter(EqualTo("uuid", "..."))
    .select("rt", "cs1", "in")
    .to_arrow()
)
print(df)
```
I get this error:
```
Traceback (most recent call last):
File "", line 12, in <module>
.to_arrow()
^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/table/__init__.py", line 367, in to_arrow
return project_table(
^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 541, in project_table
raise ValueError(f"Expected PyArrowFileIO, got: {table.io}")
ValueError: Expected PyArrowFileIO, got: <pyiceberg.io.fsspec.FsspecFileIO object at 0x12618e290>
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #6820: Cannot create a PyArrow dataframe after an scan
Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427528840
Hey @Guillem96, thanks for raising this issue. It looks like you're using `FsspecFileIO` to fetch the files. You can switch to PyArrow explicitly with:
```
rest:
  ...
  py-io-impl: pyiceberg.io.pyarrow.PyArrowFileIO
```
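The same setting can probably also be passed from Python instead of the YAML config, since `load_catalog` accepts catalog properties as keyword arguments. The following is a minimal sketch under that assumption; `catalog_properties` and `load_catalog_with_pyarrow` are hypothetical helper names, not part of pyiceberg, and the catalog name `"default"` is only a placeholder.

```python
# Sketch only: helper names are hypothetical; adjust the catalog name and
# any extra properties (e.g. type="glue") to your own setup.
PYARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"


def catalog_properties(**extra: str) -> dict:
    """Return catalog properties that select PyArrowFileIO as the FileIO."""
    # "py-io-impl" is the same key used in the YAML config above.
    return {**extra, "py-io-impl": PYARROW_FILE_IO}


def load_catalog_with_pyarrow(name: str = "default", **extra: str):
    """Load a catalog with PyArrowFileIO selected (requires pyiceberg)."""
    # Imported lazily so catalog_properties() works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **catalog_properties(**extra))
```

For example, `load_catalog_with_pyarrow("default", type="glue")` would be the programmatic counterpart of a Glue catalog entry that sets `py-io-impl` in the config file.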
[GitHub] [iceberg] Guillem96 closed issue #6820: Cannot create a PyArrow dataframe after an scan
Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.
Guillem96 closed issue #6820: Cannot create a PyArrow dataframe after an scan
URL: https://github.com/apache/iceberg/issues/6820
[GitHub] [iceberg] Guillem96 commented on issue #6820: Cannot create a PyArrow dataframe after an scan
Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.
Guillem96 commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427980266
Okay, I managed to solve it with your PR changes and by adding `AWS_DEFAULT_REGION=eu-west-1` to my environment.
But just to let you know, this is weird behaviour, because I already have my default region set up in `~/.aws/credentials`, so I was expecting pyiceberg (or the underlying PyArrow `S3FileSystem`) to fetch it from there.
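As a stopgap for the workaround above, the region can be copied from the local AWS files into `AWS_DEFAULT_REGION` before the catalog is loaded. This is a rough sketch using only the standard library; `region_from_aws_file` and `export_default_region` are hypothetical helpers, and the thread does not establish why the region was not picked up automatically. Note that the AWS CLI conventionally stores `region` in `~/.aws/config` rather than `~/.aws/credentials`, so the sketch accepts a list of files to check.

```python
import configparser
import os
from pathlib import Path
from typing import Iterable, Optional


def region_from_aws_file(path: Path, profile: str = "default") -> Optional[str]:
    """Return the `region` value for `profile` in an AWS-style INI file, if any."""
    # Interpolation is disabled so '%' characters in other values cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently ignored
    # ~/.aws/config names non-default sections "profile <name>";
    # ~/.aws/credentials uses the bare profile name.
    for section in (profile, f"profile {profile}"):
        if parser.has_section(section) and parser.has_option(section, "region"):
            return parser.get(section, "region")
    return None


def export_default_region(paths: Iterable[Path], profile: str = "default") -> Optional[str]:
    """Set AWS_DEFAULT_REGION from the first file that defines a region.

    A value that is already exported is left untouched.
    """
    if "AWS_DEFAULT_REGION" in os.environ:
        return os.environ["AWS_DEFAULT_REGION"]
    for path in paths:
        region = region_from_aws_file(Path(path), profile)
        if region:
            os.environ["AWS_DEFAULT_REGION"] = region
            return region
    return None
```

Calling `export_default_region([Path.home() / ".aws" / "config", Path.home() / ".aws" / "credentials"])` before `load_catalog(...)` would then have the same effect as exporting the variable in the shell.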
[GitHub] [iceberg] Guillem96 commented on issue #6820: Cannot create a PyArrow dataframe after an scan
Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.
Guillem96 commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427582965
Thank you for the quick reply.
If I update my catalog config to:
```
catalog:
  default:
    type: glue
    py-io-impl: pyiceberg.io.pyarrow.PyArrowFileIO
```
I'm getting this error when loading the table:
```
Traceback (most recent call last):
File "test_iceberg.py", line 6, in <module>
table = catalog.load_table(("...", "..."))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 278, in load_table
return self._convert_glue_to_iceberg(load_table_response.get(PROP_GLUE_TABLE, {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 180, in _convert_glue_to_iceberg
metadata = FromInputFile.table_metadata(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/serializers.py", line 56, in table_metadata
with input_file.open() as input_stream:
^^^^^^^^^^^^^^^^^
File ".../lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 197, in open
input_file = self._filesystem.open_input_file(self._path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When reading information for key 'iceberg_zstd_v3/metadata/00001-4c2dec16-77be-4a9d-8a47-00f74d2fd881.metadata.json' in bucket 'bucket-name': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
```
I have AWS admin rights, and I can confirm that the `HeadObject` API call works fine in the same environment:
```
$ aws s3api head-object --bucket bucket-name --key iceberg_zstd_v3/metadata/00001-4c2dec16-77be-4a9d-8a47-00f74d2fd881.metadata.json
{
"AcceptRanges": "bytes",
"LastModified": "2023-02-08T15:46:48+00:00",
"ContentLength": 2974,
"ContentType": "application/octet-stream",
"ServerSideEncryption": "aws:kms",
"Metadata": {},
"SSEKMSKeyId": "..."
}
```
Do you know what's going on? Why can't I read the table metadata once I change the `py-io-impl`?