Posted to issues@iceberg.apache.org by "JonasJ-ap (via GitHub)" <gi...@apache.org> on 2023/03/12 04:31:51 UTC

[GitHub] [iceberg] JonasJ-ap commented on issue #6973: PyIceberg: ORC file format support

JonasJ-ap commented on issue #6973:
URL: https://github.com/apache/iceberg/issues/6973#issuecomment-1465090060

   I would like to share a quick test of reading field metadata with pyarrow from an ORC file (taken from an Iceberg table created on AWS Athena engine v3). It seems that the `pa.Schema` of the ORC file does not contain any field id information, even though the field ids do exist in the ORC file:
   
   I first used `orc-tools` to inspect the metadata of the ORC file and found that the field id is stored in the `iceberg.id` attribute of each column:
   ```zsh
   orc-tools meta /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
   log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
   log4j:WARN Please initialize the log4j system properly.
   log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
   Processing data file /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc [length: 1104]
   Structure for /Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc
   File Version: 0.12 with TRINO_ORIGINAL by Trino 
   Rows: 1
   Compression: ZSTD
   Compression size: 262144
   Calendar: Julian/Gregorian
   Type: struct<user_id:int,user_name:string,status:string,cost:double,quantity:int,quantity_big:int,creation_date:date,created_at:timestamp,inserted_at:timestamp>
   Attributes on root.user_id
     iceberg.id: 1
     iceberg.required: false
   Attributes on root.user_name
     iceberg.id: 2
     iceberg.required: false
   Attributes on root.status
     iceberg.id: 3
     iceberg.required: false
   Attributes on root.cost
     iceberg.id: 4
     iceberg.required: false
   Attributes on root.quantity
     iceberg.id: 5
     iceberg.required: false
   ```
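To compare these ids against what a reader exposes, the `orc-tools meta` text above can be turned into a name-to-id lookup. This is only a minimal sketch: the helper name `parse_iceberg_ids` is hypothetical, and it assumes the text layout shown in the output above (an `Attributes on root.<column>` line followed by indented `iceberg.id: <n>` lines), which may differ between `orc-tools` versions.

```python
import re
from typing import Dict

# Patterns assume the layout printed by `orc-tools meta` in the output above.
ATTR_HEADER = re.compile(r"^\s*Attributes on root\.(?P<column>\S+)\s*$")
ICEBERG_ID = re.compile(r"^\s*iceberg\.id:\s*(?P<id>\d+)\s*$")


def parse_iceberg_ids(meta_output: str) -> Dict[str, int]:
    """Map each column path to the iceberg.id listed under its attributes."""
    ids: Dict[str, int] = {}
    current = None
    for line in meta_output.splitlines():
        header = ATTR_HEADER.match(line)
        if header:
            # Remember which column the following attribute lines belong to.
            current = header.group("column")
            continue
        id_match = ICEBERG_ID.match(line)
        if id_match and current is not None:
            ids[current] = int(id_match.group("id"))
    return ids


# A shortened copy of the attribute section shown above.
sample = """\
Attributes on root.user_id
  iceberg.id: 1
  iceberg.required: false
Attributes on root.user_name
  iceberg.id: 2
  iceberg.required: false
"""
print(parse_iceberg_ids(sample))  # {'user_id': 1, 'user_name': 2}
```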
   
   However, when I used an approach similar to the one in #7033 to get the pyarrow schema, the field id info was lost:
   ```python
   import pyarrow.dataset as ds
   from pyarrow.fs import LocalFileSystem

   orc_test_path = "/Users/jonasjiang/.CMVolumes/gluetestjonas/warehouse/iceberg_ref/athena_orc_test/data/30f712c8/creation_date=2023-03-11/user_id_bucket=2/20230311_203248_00007_2vuin-316269af-62cb-4bef-8f52-ca2d46bc6397.orc"
   fs = LocalFileSystem()
   with fs.open_input_file(orc_test_path) as f:
       # Alternative (needs `import pyarrow.orc as po`):
       # print(po.ORCFile(f).read().schema)
       # Physical schema of the ORC file as seen by the dataset API.
       print(ds.OrcFileFormat().make_fragment(f).physical_schema)
   ```
   ```zsh
   user_id: int32
   user_name: string
   status: string
   cost: double
   quantity: int32
   quantity_big: int32
   creation_date: date32[day]
   created_at: timestamp[ns]
   inserted_at: timestamp[ns]
   -- schema metadata --
   presto_query_id: '20230311_203248_00007_2vuin'
   trino.writer.version: '0.215-16024-g6eec71f'
   presto_version: '0.215-16024-g6eec71f'
   ```
   A pyarrow schema that carries the field id info should instead look like this, with the ids attached as field-level metadata:
   ```zsh
   user_id: int32
     -- field metadata --
     PARQUET:field_id: '1'
   user_name: string
     -- field metadata --
     PARQUET:field_id: '2'
   status: string
     -- field metadata --
     PARQUET:field_id: '3'
   ...
   ```
   
   It seems that even with #6997 merged, we still cannot fully support the ORC file format if we use pyarrow to read it, because pyarrow does not expose the field ids. The name mapping feature may need to be implemented here.
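As a rough sketch of what name mapping would provide: the Iceberg spec serializes a name mapping as a JSON list of entries with `field-id` and `names`, and a reader can use it to assign ids to columns that carry none. The helpers `ids_from_name_mapping` and `assign_field_ids` below are hypothetical, the sample mapping (including the `username` alias) is made up for illustration, and nested `fields` entries are ignored.

```python
import json
from typing import Dict, List, Optional


def ids_from_name_mapping(name_mapping_json: str) -> Dict[str, int]:
    """Flatten a top-level name mapping into a column-name -> field-id lookup."""
    lookup: Dict[str, int] = {}
    for entry in json.loads(name_mapping_json):
        # Every alias listed in "names" resolves to the same field id.
        for name in entry["names"]:
            lookup[name] = entry["field-id"]
    return lookup


def assign_field_ids(
    column_names: List[str], lookup: Dict[str, int]
) -> Dict[str, Optional[int]]:
    """Pair each file column with its mapped id, or None when unmapped."""
    return {name: lookup.get(name) for name in column_names}


# Made-up mapping for illustration; "username" is a sample alias.
mapping = json.dumps([
    {"field-id": 1, "names": ["user_id"]},
    {"field-id": 2, "names": ["user_name", "username"]},
])
lookup = ids_from_name_mapping(mapping)
assigned = assign_field_ids(["user_id", "user_name", "status"], lookup)
print(assigned)  # {'user_id': 1, 'user_name': 2, 'status': None}
```

Columns that come back as `None` have no mapping, so a reader would have to decide whether to skip them or fail.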


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

