Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/03/02 09:21:33 UTC
[GitHub] [iceberg] Fokko commented on issue #6973: PyIceberg: ORC file format support
Fokko commented on issue #6973:
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1451550761
@alaturqua thanks for raising this issue.
Here are some steps for how I would add ORC support. First, I would refactor this code:
https://github.com/apache/iceberg/blob/35692a360ffd2d1b9107ec6018f9fd97fda04671/python/pyiceberg/io/pyarrow.py#L484-L519
By removing the `pq` package from there and reading in a PyArrow fragment instead:
```python
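import pyarrow.dataset as ds
from pyarrow.fs import S3FileSystem

fs = S3FileSystem()  # assumes credentials and region come from the environment
parquet_format = ds.ParquetFileFormat()
path = "bucket/path/to/file.parquet"  # placeholder, point this at a real file

with fs.open_input_file(path) as fin:
    # First get the fragment
    fragment = parquet_format.make_fragment(fin, None)
    print(f"Schema: {fragment.physical_schema}")
    # Read the table while the input file is still open
    arrow_table = ds.Scanner.from_fragment(
        fragment=fragment
    ).to_table()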
```
We need to add logic to pick the `Format` based on the file extension, so that we can read in different fragments:
- https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html
- https://arrow.apache.org/docs/python/generated/pyarrow.dataset.OrcFileFormat.html
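A rough sketch of what that extension-based choice could look like. The helper names `format_name_for` and `file_format_for` and the mapping are hypothetical, not existing PyIceberg code, and this assumes data files carry a `.parquet` or `.orc` suffix and that the installed PyArrow was built with ORC support:

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the name of a
# pyarrow.dataset file format class.
_FORMAT_BY_EXTENSION = {
    ".parquet": "ParquetFileFormat",
    ".orc": "OrcFileFormat",
}


def format_name_for(path: str) -> str:
    """Return the pyarrow.dataset format class name for a data file path."""
    extension = PurePosixPath(path).suffix.lower()
    try:
        return _FORMAT_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"Unsupported file format: {path!r}") from None


def file_format_for(path: str):
    """Instantiate the matching pyarrow.dataset file format."""
    # Imported lazily so the mapping above can be used without PyArrow.
    import pyarrow.dataset as ds

    return getattr(ds, format_name_for(path))()
```

The object returned by `file_format_for(path)` would then take the place of `parquet_format` in the snippet above, so the rest of the read path does not have to care which format it is dealing with.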
The fragment has a `physical_schema` attribute that returns the file's schema as a PyArrow schema.
Note: For Parquet, we currently fetch the Iceberg schema in JSON format from the metadata. https://github.com/apache/iceberg/issues/6505 will provide the logic to convert a PyArrow schema into an Iceberg schema.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org