Posted to issues@iceberg.apache.org by "JonasJ-ap (via GitHub)" <gi...@apache.org> on 2023/03/12 04:31:51 UTC
[GitHub] [iceberg] JonasJ-ap commented on issue #6973: PyIceberg: ORC file format support
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1465090060
I would like to share a quick test of reading field metadata from an ORC file (from an Iceberg table created with the AWS Athena engine v3) using pyarrow. It seems that the `pa.Schema` of the ORC file does not contain any field ID information, even though the IDs do exist in the ORC file.
I first used `orc-tools` to inspect the metadata of the ORC file and found that the field ID is stored in the `iceberg.id` attribute:
```zsh
orc-tools meta /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc [length: 1104]
Structure for /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
File Version: 0.12 with TRINO_ORIGINAL by Trino
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<user_id:int,user_name:string,status:string,cost:double,quantity:int,quantity_big:int,creation_date:date,created_at:timestamp,inserted_at:timestamp>
Attributes on root.user_id
iceberg.id: 1
iceberg.required: false
Attributes on root.user_name
iceberg.id: 2
iceberg.required: false
Attributes on root.status
iceberg.id: 3
iceberg.required: false
Attributes on root.cost
iceberg.id: 4
iceberg.required: false
Attributes on root.quantity
iceberg.id: 5
iceberg.required: false
```
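The `iceberg.id` attributes in this output can be collected into a column-name-to-field-ID mapping with a few lines of Python. This is only a sketch: `parse_iceberg_ids` is a hypothetical helper, and it assumes the text layout shown above (an `Attributes on root.<name>` line followed by an `iceberg.id: <n>` line), which is not a stable interface of `orc-tools`.

```python
import re
from typing import Dict

# Assumed line shapes, taken from the `orc-tools meta` output shown above.
ATTR_HEADER = re.compile(r"^Attributes on root\.(?P<name>\S+)$")
ICEBERG_ID = re.compile(r"^iceberg\.id:\s*(?P<id>\d+)$")


def parse_iceberg_ids(meta_output: str) -> Dict[str, int]:
    """Hypothetical helper: map column name -> Iceberg field ID."""
    ids: Dict[str, int] = {}
    current = None  # column whose attributes are currently being read
    for raw_line in meta_output.splitlines():
        line = raw_line.strip()
        header = ATTR_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        field_id = ICEBERG_ID.match(line)
        if field_id and current is not None:
            ids[current] = int(field_id.group("id"))
    return ids


sample = """\
Attributes on root.user_id
iceberg.id: 1
iceberg.required: false
Attributes on root.user_name
iceberg.id: 2
iceberg.required: false
"""
print(parse_iceberg_ids(sample))
```

Nested columns, if any, would keep their dotted path after `root.` as the key; the sketch does not try to rebuild a nested structure.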
However, when I used an approach similar to the one in #7033 to get the pyarrow schema, the field ID information was lost:
```python
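import pyarrow.dataset as ds
import pyarrow.orc as po  # used only by the commented-out alternative
from pyarrow.fs import LocalFileSystem

orc_test_path = "/Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc"
fs = LocalFileSystem()
with fs.open_input_file(orc_test_path) as f:
    # print(po.ORCFile(f).read().schema)
    # Print the physical schema reported by the ORC dataset fragment
    print(ds.OrcFileFormat().make_fragment(f).physical_schema)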
```
```zsh
user_id: int32
user_name: string
status: string
cost: double
quantity: int32
quantity_big: int32
creation_date: date32[day]
created_at: timestamp[ns]
inserted_at: timestamp[ns]
-- schema metadata --
presto_query_id: '20230311_203248_00007_2vuin'
trino.writer.version: '0.215-16024-g6eec71f'
presto_version: '0.215-16024-g6eec71f'
```
A pyarrow schema that carries the field ID information should look like this:
```zsh
user_id: int32
-- field metadata --
PARQUET:field_id: '1'
user_name: string
-- field metadata --
PARQUET:field_id: '2'
status: string
-- field metadata --
PARQUET:field_id: '3'
...
```
It seems that even with #6997 merged, we still cannot fully support the ORC file format if we use pyarrow to read it, because the field IDs are not exposed in the schema. The name mapping feature may need to be implemented here.