Posted to issues@iceberg.apache.org by "Guillem96 (via GitHub)" <gi...@apache.org> on 2023/02/13 07:44:42 UTC

[GitHub] [iceberg] Guillem96 opened a new issue, #6820: Cannot create a PyArrow dataframe after an scan

Guillem96 opened a new issue, #6820:
URL: https://github.com/apache/iceberg/issues/6820

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   Athena
   
   ### Please describe the bug 🐞
   
   When running this code snippet to query my Athena Iceberg table:
   
   ```python
   from pyiceberg.catalog import load_catalog
   from pyiceberg.expressions import EqualTo
   
   catalog = load_catalog("default", warehouse="...")
   table = catalog.load_table(("...", "..."))
   
   df = (
       table.scan()
       .filter(EqualTo("uuid", "..."))
       .select("rt", "cs1", "in")
       .to_arrow()
   )
   
   print(df)
   ```
   
   I get this error:
   
   ```
   Traceback (most recent call last):
     File "", line 12, in <module>
       .to_arrow()
        ^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/table/__init__.py", line 367, in to_arrow
       return project_table(
              ^^^^^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 541, in project_table
       raise ValueError(f"Expected PyArrowFileIO, got: {table.io}")
   ValueError: Expected PyArrowFileIO, got: <pyiceberg.io.fsspec.FsspecFileIO object at 0x12618e290>
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on issue #6820: Cannot create a PyArrow dataframe after an scan

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427528840

   Hey @Guillem96 thanks for raising this issue. It looks like you're using the `FsspecFileIO` to fetch the files. You can set it to PyArrow explicitly using:
   ```
       rest:
           ...
           py-io-impl: pyiceberg.io.pyarrow.PyArrowFileIO
   ```
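
For readers who configure the catalog from Python rather than from a YAML file, the same property can be passed as a keyword argument, since `load_catalog` accepts catalog properties that way (the original snippet already passes `warehouse` like this). The following is a minimal sketch, not the maintainers' recommended setup: the helper names `catalog_properties` and `load_catalog_with_pyarrow` are hypothetical, and the `pyiceberg` import is deferred so the property-building part can be checked on its own.

```python
def catalog_properties(warehouse: str) -> dict[str, str]:
    """Build catalog properties that select the PyArrow FileIO (hypothetical helper)."""
    return {
        "warehouse": warehouse,
        # Same key/value as in the YAML configuration above.
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
    }


def load_catalog_with_pyarrow(name: str, warehouse: str):
    """Load a catalog with the PyArrow FileIO selected (hypothetical helper)."""
    # Imported lazily so that building the properties does not require pyiceberg.
    from pyiceberg.catalog import load_catalog

    # The property key contains hyphens, so it has to be passed via ** unpacking.
    return load_catalog(name, **catalog_properties(warehouse))
```

Calling `load_catalog_with_pyarrow("default", "...")` would then stand in for the `load_catalog("default", warehouse="...")` call in the original report.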



[GitHub] [iceberg] Guillem96 closed issue #6820: Cannot create a PyArrow dataframe after an scan

Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.

Guillem96 closed issue #6820: Cannot create a PyArrow dataframe after an scan
URL: https://github.com/apache/iceberg/issues/6820



[GitHub] [iceberg] Guillem96 commented on issue #6820: Cannot create a PyArrow dataframe after an scan

Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.

Guillem96 commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427980266

   Okay I managed to solve it with your PR changes and adding `AWS_DEFAULT_REGION=eu-west-1` to my environment.
   
   But just to let you know, this is weird behaviour because I already have my default region set up in `~/.aws/credentials`, so I'd expect pyiceberg (or the underlying PyArrow `S3FileSystem`) to pick it up from there.
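
The workaround described above (exporting `AWS_DEFAULT_REGION`) can also be applied from Python before the catalog is loaded. The sketch below is an illustration only, under the assumption that the region sits under a `region` key in an INI-style AWS file as the comment describes; the function names `region_from_aws_file` and `export_region` are hypothetical, and nothing here is part of pyiceberg or PyArrow.

```python
import configparser
import os
from pathlib import Path
from typing import Optional


def region_from_aws_file(path: Path, profile: str = "default") -> Optional[str]:
    """Return the `region` value of a profile in an INI-style AWS file, if present."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped silently
    # Note: in ~/.aws/config, non-default profiles use section names like
    # "profile NAME"; this sketch looks up the section name exactly as given.
    if parser.has_section(profile):
        return parser.get(profile, "region", fallback=None)
    return None


def export_region(path: Path, profile: str = "default") -> Optional[str]:
    """Set AWS_DEFAULT_REGION from the file unless the variable is already set."""
    region = region_from_aws_file(path, profile)
    if region is not None:
        # setdefault keeps an explicitly exported value untouched.
        os.environ.setdefault("AWS_DEFAULT_REGION", region)
    return os.environ.get("AWS_DEFAULT_REGION")
```

For example, `export_region(Path.home() / ".aws" / "credentials")` would be called once at the top of the script, before `load_catalog`.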



[GitHub] [iceberg] Guillem96 commented on issue #6820: Cannot create a PyArrow dataframe after an scan

Posted by "Guillem96 (via GitHub)" <gi...@apache.org>.

Guillem96 commented on issue #6820:
URL: https://github.com/apache/iceberg/issues/6820#issuecomment-1427582965

   Thank you for the quick reply.
   
   If I update my catalog config to:
   
   ```
   catalog:
     default:
       type: glue
       py-io-impl: pyiceberg.io.pyarrow.PyArrowFileIO
   ```
   
   I'm getting this error when loading the table:
   
   ```
   Traceback (most recent call last):
     File "test_iceberg.py", line 6, in <module>
       table = catalog.load_table(("...", "..."))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 278, in load_table
       return self._convert_glue_to_iceberg(load_table_response.get(PROP_GLUE_TABLE, {}))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/catalog/glue.py", line 180, in _convert_glue_to_iceberg
       metadata = FromInputFile.table_metadata(file)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/serializers.py", line 56, in table_metadata
       with input_file.open() as input_stream:
            ^^^^^^^^^^^^^^^^^
     File ".../lib/python3.11/site-packages/pyiceberg/io/pyarrow.py", line 197, in open
       input_file = self._filesystem.open_input_file(self._path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
     File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
   OSError: When reading information for key 'iceberg_zstd_v3/metadata/00001-4c2dec16-77be-4a9d-8a47-00f74d2fd881.metadata.json' in bucket 'bucket-name': AWS Error UNKNOWN (HTTP status 301) during HeadObject operation: No response body.
   
   ```
   
   I have AWS admin rights, and I can confirm that the `HeadObject` API call works fine in the same environment:
   
   ```
   $ aws s3api head-object --bucket bucket-name --key iceberg_zstd_v3/metadata/00001-4c2dec16-77be-4a9d-8a47-00f74d2fd881.metadata.json
   {
       "AcceptRanges": "bytes",
       "LastModified": "2023-02-08T15:46:48+00:00",
       "ContentLength": 2974,
       "ContentType": "application/octet-stream",
       "ServerSideEncryption": "aws:kms",
       "Metadata": {},
       "SSEKMSKeyId": "..."
   }
   ```
   
   Do you know what's going on? Why can't I read the table metadata once I change `py-io-impl`?

