You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "nicor88 (via GitHub)" <gi...@apache.org> on 2023/01/23 09:28:13 UTC

[GitHub] [iceberg] nicor88 opened a new issue, #6647: Issue reading iceberg tables written by Athena with pyiceberg

nicor88 opened a new issue, #6647:
URL: https://github.com/apache/iceberg/issues/6647

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   I'm trying to read an iceberg table written by Athena (engine v3), not sure which iceberg version it uses.
   
   When running this code:
   
   ```
   from pyiceberg import catalog
   from pyiceberg.expressions import GreaterThanOrEqual
   
   
   glue_catalog = catalog.load_glue(name='glue', conf={})
   
   glue_catalog.list_namespaces()
   
   
   glue_catalog.list_tables('silver_marketing')
   
   table = glue_catalog.load_table("silver_marketing.performance_kpis")
   
   scan = table.scan(
       row_filter=GreaterThanOrEqual("report_date", "2023-01-01")
   )
   
   files = [task.file.file_path for task in scan.plan_files()]
   print(files)
   df_iceberg = scan.to_pandas()
   print(len(df_iceberg))
   ```
   If fails on the `df_iceberg = scan.to_pandas()` (I tried also with `scan.to_arrow()`.
   
   The error is the following:
   ```
   Traceback (most recent call last):
     File "/Users/nicor88/deng-swiss-knife/icerberg/get_data.py", line 31, in <module>
       df_iceberg = scan.to_arrow()
     File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 341, in to_arrow
       return project_table(
     File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/io/pyarrow.py", line 508, in project_table
       schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
   AttributeError: 'NoneType' object has no attribute 'get'
   ```
   
   an example table can be created like that:
   
   ```
   create table
       data_engineering.iceberg_example_1
     with (
       table_type='iceberg',
       is_external=false,
       location='s3://xxxx/iceberg_1',
       partitioning=ARRAY['creation_date', 'bucket(user_id, 5)'],
       format='parquet',
       vacuum_max_snapshot_age_seconds=86400,
       optimize_rewrite_delete_file_threshold=2
     )
     as
       
   
   with data as (
       select
           1 as user_id,
           'pi' as user_name,
           'active' as status,
           17.89 as cost,
           1 as quantity,
           100000000 as quantity_big,
           cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
           cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
           cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
       union all
       select
           2 as user_id,
           'beta' as user_name,
           'inactive' as status,
           3 as cost,
           5 as quantity,
           100000000 as quantity_big,
           cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
           cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
           cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
   )
   
   select
       user_id,
       user_name,
       status,
       cost,
       quantity,
       quantity_big,
       creation_date,
       created_at,
       cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as inserted_at
   from data
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue closed issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue closed issue #6647: Issue reading iceberg tables written by Athena with pyiceberg
URL: https://github.com/apache/iceberg/issues/6647


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1435230249

   @Fokko, the solution here is to convert the Parquet schema to Iceberg and preserve the IDs right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1530630558

   @sheinbergon, we just merged #6505 which should solve this.
   
   I'm closing this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1527318632

   @sheinbergon We're wrapping up the release, and hope to start the release process soon. I would expect within 1-2 weeks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1401099271

   Need to investigate more deeply but based on the stack trace it seems like the metadata field on the parquet_schema is not even defined. Here's where pyarrow returns the schema https://github.com/apache/arrow/blob/master/python/pyarrow/parquet/core.py#L3656
   
   Not entirely sure what this metadata field is in parquet_schema and if every parquet writer is expected to write this out.
   
   I'll double check the Parquet spec, but If I had to hazard a guess, I would say it's probably not required to write out and we need to perform the same check done here https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L509
   and raise an error linking to this known issue https://github.com/apache/iceberg/issues/6505
   that PyIceberg should derive the Iceberg schema from the actual Parquet schema.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1436091339

   @rdblue that's correct.
   
   I took a shortcut and get the schema from the metadata. That should still be the preferred way because it is probably faster (Pydantic without any GIL just going through the JSON document).
   
   However, the fields in Parquet also have metadata, and those contain the `field_id`:
   ```python
   ➜  Desktop python3
   Python 3.10.10 (main, Feb  8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> from pyarrow.fs import LocalFileSystem
   >>> import pyarrow.parquet as pq
   >>> 
   >>> fs = LocalFileSystem()
   >>> 
   >>> path = "/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet"
   >>> 
   >>> with fs.open_input_file(path) as fout:
   ...     parquet_schema = pq.read_schema(fout)
   ... 
   >>> 
   >>> parquet_schema
   VendorID: int64
     -- field metadata --
     PARQUET:field_id: '1'
   tpep_pickup_datetime: timestamp[us, tz=UTC]
     -- field metadata --
     PARQUET:field_id: '2'
   tpep_dropoff_datetime: timestamp[us, tz=UTC]
     -- field metadata --
     PARQUET:field_id: '3'
   passenger_count: double
     -- field metadata --
     PARQUET:field_id: '4'
   trip_distance: double
     -- field metadata --
     PARQUET:field_id: '5'
   RatecodeID: double
     -- field metadata --
     PARQUET:field_id: '6'
   store_and_fwd_flag: string
     -- field metadata --
     PARQUET:field_id: '7'
   PULocationID: int64
     -- field metadata --
     PARQUET:field_id: '8'
   DOLocationID: int64
     -- field metadata --
     PARQUET:field_id: '9'
   payment_type: int64
     -- field metadata --
     PARQUET:field_id: '10'
   fare_amount: double
     -- field metadata --
     PARQUET:field_id: '11'
   extra: double
     -- field metadata --
     PARQUET:field_id: '12'
   mta_tax: double
     -- field metadata --
     PARQUET:field_id: '13'
   tip_amount: double
     -- field metadata --
     PARQUET:field_id: '14'
   tolls_amount: double
     -- field metadata --
     PARQUET:field_id: '15'
   improvement_surcharge: double
     -- field metadata --
     PARQUET:field_id: '16'
   total_amount: double
     -- field metadata --
     PARQUET:field_id: '17'
   congestion_surcharge: double
     -- field metadata --
     PARQUET:field_id: '18'
   airport_fee: double
     -- field metadata --
     PARQUET:field_id: '19'
   -- schema metadata --
   iceberg.schema: '{"type":"struct","schema-id":0,"fields":[{"id":1,"name":' + 1236
   ```
   
   @amogh-jahagirdar  We could implement a visitor to convert this Arrow schema  back to an Iceberg one with the correct IDs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] sheinbergon commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "sheinbergon (via GitHub)" <gi...@apache.org>.

sheinbergon commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1507699227

   @rdblue just wondering - when is this feature/fix due? I see work in #6505 had already taken place, but not sure where things stand. I care greatly about the ability to read externally generated iceberg tables natively via Python. Is there anyway I can contribute or help with making 0.4.0 happen and have the fix for this issue included?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on issue #6647: Issue reading iceberg tables written by Athena with pyiceberg

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on issue #6647:
URL: https://github.com/apache/iceberg/issues/6647#issuecomment-1401198333

   https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1022
   
   Yeah the file key value metadata (which after stepping through the code) is not required to be written based on the Parquet spec. 
   
   Ultimately, I think the proper solution is if k/v metadata is not defined or the iceberg schema property in the k/v metadata is not defined we derive the schema https://github.com/apache/iceberg/issues/6505 @Fokko @JonasJ-ap 
   
   That change maybe complex in the interim, so maybe we still want to raise a PR which surfaces the existing failure when metadata is not defined? I'l raise that and we can close it if we determine we want to just wait for the proper solution for the next release. I think it'll be helpful for users to at least see it's a known issue until it's fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org