You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "JonasJ-ap (via GitHub)" <gi...@apache.org> on 2023/05/04 04:45:43 UTC

[GitHub] [iceberg] JonasJ-ap opened a new pull request, #7523: Python: Support Inferring Iceberg UUID type from parquet files

JonasJ-ap opened a new pull request, #7523:
URL: https://github.com/apache/iceberg/pull/7523

   This PR follows #6997 
   
   In the previous PR, we supports infer iceberg schema from parquet files. However, the implemented visitor does not correctly differentiate between `UUIDType` and `FixedType(16)` since they both stored as `fixed_len_byte_array[16]` by [spec](https://iceberg.apache.org/spec/#primitive-types). This PR fix this issue by passing an optional `expected_schema` to the `pyarrow_to_iceberg` visitor so that the visitor can check if `UUIDType` is preferred during conversion.
   
   ## Example
   For table with schema like:
   ```
   Current schema        Schema, id=0
                         ├── 1: uuid: required uuid (c1)
                         ├── 2: c1: required string (c1)
                         ├── 3: struct1: required struct<5: c2: required uuid (c2), 6: c3: required string (c3)>
                         └── 4: list: required list<struct<8: c4: required uuid (c4), 9: c5: required uuid (c5)>>
   ```
   and optimized by AWS Athena's query:
   ```
   OPTIMIZE uuid_test REWRITE DATA USING BIN_PACK;
   ```
   Running the following query on the current master branch raises the following exception
   ```bash
   >>> table = catalog.load_table("iceberg_ref.uuid_test")
   >>> df = table. Scan().to_pandas()
   Error loading table uuid_test to pandas: Cannot promote fixed[16] to uuid
   ```
   After this PR fix, we can successfully read the table:
   ```bash
   ============Loading table iceberg_ref.uuid_test to pandas ===========
                                                   uuid           c1                                            struct1                                               list
   0     b'\xa8_\x9a\x0e\x14cM\x84\xb0\xe5\x83<!*W\xc4'        text2  {'c2': b'\xcb\xb6\xee\x04\x94\x10E\xfd\x97\x16...  [{'c4': b'\x07\x84\x05\xabv\xe5C\x1c\x86B\xe5F...
   1  b'\xf2\x9b\xe7M\xd3\xe5G&\xae\xbf\x90Q\x85!~\x82'        text2  {'c2': b"\xce\x87\xe7_''@\xd1\xa0{z\x1d\n\x1e\...  [{'c4': b'\xc8\xc6\xeaF\xb9\xadL,\x856\x05Uj\x...
   2     b'\xd1k^\xd0r\xe8J\xa4\xb4[\x16\x90\x88H\xb21'        text1  {'c2': b'\x0b1\xd1\x07\x89\x90C\xa9\xa7\xc1\xf...  [{'c4': b'\xb0\x1b\xce\x07\xa0\xe6DE\xae\x88O\...
   3        b'P\xd6`\x10\xc1\xecO\x17\x991\x19N\xbcI-4'        text1  {'c2': b'rC\x16\xf2\xbbYB\x93\xa1\x04\xa0\x99g...  [{'c4': b'K\xfb\xfa\xa1{\x9fB\x0e\xaeu\xe2\xbb...
   4  b'\x19R\xb9g"\xafH\x02\xbb\x1d\x14\xe9\xd5\xe9...  Sample Text  {'c2': b'\x8f\xf0z\xe0\x8b1N\x1a\xa7f\xb0BD\x8...  [{'c4': b"^-\xdd\x87\x0ffD\x8a\x88'I\xad\x87\x...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.

JonasJ-ap commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185209684


##########
python/pyiceberg/schema.py:
##########
@@ -1364,3 +1364,11 @@ def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
             raise ResolveError(f"Cannot reduce precision from {file_type} to {read_type}")
     else:
         raise ResolveError(f"Cannot promote an decimal to {read_type}")
+
+
+@promote.register(FixedType)
+def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, UUIDType) and len(file_type) == 16:

Review Comment:
   Added a comment below😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org