Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/03/02 09:21:33 UTC

[GitHub] [iceberg] Fokko commented on issue #6973: PyIceberg: ORC file format support

Fokko commented on issue #6973:
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1451550761

   @alaturqua thanks for raising this issue.
   
   Here are the steps I would take to add ORC support. First, I would refactor this code:
   https://github.com/apache/iceberg/blob/35692a360ffd2d1b9107ec6018f9fd97fda04671/python/pyiceberg/io/pyarrow.py#L484-L519
   
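   The refactor would remove the `pq` (`pyarrow.parquet`) package from that code and read in a PyArrow fragment instead:
   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import S3FileSystem

   # Placeholders for illustration: credentials and region are assumed to be
   # configured in the environment, and the path is not a real object.
   fs = S3FileSystem()
   path = "bucket/path/to/file.parquet"
   parquet_format = ds.ParquetFileFormat()

   with fs.open_input_file(path) as fin:
       # First get the fragment
       fragment = parquet_format.make_fragment(fin, None)
       print(f"Schema: {fragment.physical_schema}")
       # Then read the whole fragment into an Arrow table
       arrow_table = ds.Scanner.from_fragment(
           fragment=fragment
       ).to_table()
   ```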
   
   We need to add logic to decide the `Format` based on the extension of the file, so that we can read in different fragments:
   - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFileFormat.html
   - https://arrow.apache.org/docs/python/generated/pyarrow.dataset.OrcFileFormat.html
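
A minimal sketch of that extension-based selection could look like the following. The helper names `format_name_for_path` and `make_fragment_for_path` and the extension mapping are hypothetical and not part of PyIceberg; only `ParquetFileFormat`, `OrcFileFormat` and `make_fragment` come from PyArrow, and `OrcFileFormat` is assumed to be available in the installed PyArrow build.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a format name.
_EXTENSION_TO_FORMAT = {".parquet": "parquet", ".orc": "orc"}


def format_name_for_path(path: str) -> str:
    """Return the format name implied by the extension of a data file path."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return _EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension {suffix!r}: {path}") from None


def make_fragment_for_path(path: str, filesystem=None):
    """Build a PyArrow fragment using the format implied by the extension."""
    # Imported here so the extension mapping above can be used without PyArrow.
    import pyarrow.dataset as ds

    formats = {"parquet": ds.ParquetFileFormat, "orc": ds.OrcFileFormat}
    file_format = formats[format_name_for_path(path)]()
    # With filesystem=None, PyArrow resolves the path on the local filesystem.
    return file_format.make_fragment(path, filesystem)
```

With something like this in place, the reading code no longer needs to know the format up front: `make_fragment_for_path(path, fs)` would return a fragment for either file type, and the rest of the scan stays the same.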
   
   The fragment has a `physical_schema` property that returns the file's schema as a PyArrow schema.
   
   Note: For Parquet, we currently fetch the Iceberg schema in JSON format from the metadata. https://github.com/apache/iceberg/issues/6505 will provide the logic to convert a PyArrow schema into an Iceberg schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

