You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "JonasJ-ap (via GitHub)" <gi...@apache.org> on 2023/05/04 04:45:43 UTC

[GitHub] [iceberg] JonasJ-ap opened a new pull request, #7523: Python: Support Inferring Iceberg UUID type from parquet files

JonasJ-ap opened a new pull request, #7523:
URL: https://github.com/apache/iceberg/pull/7523

   This PR follows #6997 
   
   In the previous PR, we supports infer iceberg schema from parquet files. However, the implemented visitor does not correctly differentiate between `UUIDType` and `FixedType(16)` since they both stored as `fixed_len_byte_array[16]` by [spec](https://iceberg.apache.org/spec/#primitive-types). This PR fix this issue by passing an optional `expected_schema` to the `pyarrow_to_iceberg` visitor so that the visitor can check if `UUIDType` is preferred during conversion.
   
   ## Example
   For table with schema like:
   ```
   Current schema        Schema, id=0
                         ├── 1: uuid: required uuid (c1)
                         ├── 2: c1: required string (c1)
                         ├── 3: struct1: required struct<5: c2: required uuid (c2), 6: c3: required string (c3)>
                         └── 4: list: required list<struct<8: c4: required uuid (c4), 9: c5: required uuid (c5)>>
   ```
   and optimized by AWS Athena's query:
   ```
   OPTIMIZE uuid_test REWRITE DATA USING BIN_PACK;
   ```
   Running the following query on the current master branch raises the following exception
   ```bash
   >>> table = catalog.load_table("iceberg_ref.uuid_test")
   >>> df = table. Scan().to_pandas()
   Error loading table uuid_test to pandas: Cannot promote fixed[16] to uuid
   ```
   After this PR fix, we can successfully read the table:
   ```bash
   ============Loading table iceberg_ref.uuid_test to pandas ===========
                                                   uuid           c1                                            struct1                                               list
   0     b'\xa8_\x9a\x0e\x14cM\x84\xb0\xe5\x83<!*W\xc4'        text2  {'c2': b'\xcb\xb6\xee\x04\x94\x10E\xfd\x97\x16...  [{'c4': b'\x07\x84\x05\xabv\xe5C\x1c\x86B\xe5F...
   1  b'\xf2\x9b\xe7M\xd3\xe5G&\xae\xbf\x90Q\x85!~\x82'        text2  {'c2': b"\xce\x87\xe7_''@\xd1\xa0{z\x1d\n\x1e\...  [{'c4': b'\xc8\xc6\xeaF\xb9\xadL,\x856\x05Uj\x...
   2     b'\xd1k^\xd0r\xe8J\xa4\xb4[\x16\x90\x88H\xb21'        text1  {'c2': b'\x0b1\xd1\x07\x89\x90C\xa9\xa7\xc1\xf...  [{'c4': b'\xb0\x1b\xce\x07\xa0\xe6DE\xae\x88O\...
   3        b'P\xd6`\x10\xc1\xecO\x17\x991\x19N\xbcI-4'        text1  {'c2': b'rC\x16\xf2\xbbYB\x93\xa1\x04\xa0\x99g...  [{'c4': b'K\xfb\xfa\xa1{\x9fB\x0e\xaeu\xe2\xbb...
   4  b'\x19R\xb9g"\xafH\x02\xbb\x1d\x14\xe9\xd5\xe9...  Sample Text  {'c2': b'\x8f\xf0z\xe0\x8b1N\x1a\xa7f\xb0BD\x8...  [{'c4': b"^-\xdd\x87\x0ffD\x8a\x88'I\xad\x87\x...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.
JonasJ-ap commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185209684


##########
python/pyiceberg/schema.py:
##########
@@ -1364,3 +1364,11 @@ def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
             raise ResolveError(f"Cannot reduce precision from {file_type} to {read_type}")
     else:
         raise ResolveError(f"Cannot promote an decimal to {read_type}")
+
+
+@promote.register(FixedType)
+def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, UUIDType) and len(file_type) == 16:

Review Comment:
   Added a comment below😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUID type from parquet files

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.
JonasJ-ap commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1184538670


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -612,14 +612,40 @@ def _get_field_doc(field: pa.Field) -> Optional[str]:
 
 
 class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]]):
+    def __init__(self, expected_schema: Optional[Schema] = None):
+        self.expected_schema = expected_schema
+
+    def cast_if_needed(self, field_id: int, field_type: IcebergType) -> IcebergType:

Review Comment:
   An alternative solution (a simpler one) may be adding allowing promoting from `FixedType(16)` to `UUIDType` here:https://github.com/apache/iceberg/blob/3db4e896587d95318a690979e97bab55db250e23/python/pyiceberg/schema.py#L1308-L1322
   
   by adding 
   ```python
   @promote.register(FixedType)
   def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
       if isinstance(read_type, UUIDType) and len(file_type) == 16:
           return read_type
       else:
           raise ResolveError(f"Cannot promote an {file_type} to {read_type}")
   ```
   
   My concern here is that the added promotion is not listed as a valid type promotions in [spec](https://iceberg.apache.org/spec/#schema-evolution and the effect of the added promotion may be too large. So currently I chose to narrow the fix to the visitor. Looking forward to hearing more thoughts on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko merged pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko merged PR #7523:
URL: https://github.com/apache/iceberg/pull/7523


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1184766069


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -612,14 +612,40 @@ def _get_field_doc(field: pa.Field) -> Optional[str]:
 
 
 class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]]):
+    def __init__(self, expected_schema: Optional[Schema] = None):
+        self.expected_schema = expected_schema
+
+    def cast_if_needed(self, field_id: int, field_type: IcebergType) -> IcebergType:

Review Comment:
   Hey @JonasJ-ap, first of all; thanks for catching this! 👏🏻 
   
   Looking at it, I think adding promotions in there is the right thing to do. This way we have the correct reader three and don't have to check any types down the line.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#issuecomment-1535044492

   Thanks again @JonasJ-ap for fixing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.
JonasJ-ap commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185209684


##########
python/pyiceberg/schema.py:
##########
@@ -1364,3 +1364,11 @@ def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
             raise ResolveError(f"Cannot reduce precision from {file_type} to {read_type}")
     else:
         raise ResolveError(f"Cannot promote an decimal to {read_type}")
+
+
+@promote.register(FixedType)
+def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, UUIDType) and len(file_type) == 16:

Review Comment:
   Added a comment😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1184787197


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -612,14 +612,40 @@ def _get_field_doc(field: pa.Field) -> Optional[str]:
 
 
 class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]]):
+    def __init__(self, expected_schema: Optional[Schema] = None):
+        self.expected_schema = expected_schema
+
+    def cast_if_needed(self, field_id: int, field_type: IcebergType) -> IcebergType:

Review Comment:
   For context, I was afraid that this would also affect other logical types, such as dates. But that seems to work, and PyArrow directly reads it as a `date32`, and no promotion is needed. Digging into it, it looks like the source of the bug is upstream in Arrow, where it doesn't pick up the UUID logical type on the Parquet field.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1184772516


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -612,14 +612,40 @@ def _get_field_doc(field: pa.Field) -> Optional[str]:
 
 
 class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]]):
+    def __init__(self, expected_schema: Optional[Schema] = None):
+        self.expected_schema = expected_schema
+
+    def cast_if_needed(self, field_id: int, field_type: IcebergType) -> IcebergType:

Review Comment:
   I'm able to reproduce this with Trino as well:
   ![image](https://user-images.githubusercontent.com/1134248/236166112-da4551e2-6d0a-4d46-97c8-b2a71b964619.png)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "JonasJ-ap (via GitHub)" <gi...@apache.org>.
JonasJ-ap commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185038580


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -612,14 +612,40 @@ def _get_field_doc(field: pa.Field) -> Optional[str]:
 
 
 class _ConvertToIceberg(PyArrowSchemaVisitor[Union[IcebergType, Schema]]):
+    def __init__(self, expected_schema: Optional[Schema] = None):
+        self.expected_schema = expected_schema
+
+    def cast_if_needed(self, field_id: int, field_type: IcebergType) -> IcebergType:

Review Comment:
   Thank you so much for the explanation and test. I've changed the PR to add the promotion from FixedType[16] to UUIDType instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185245987


##########
python/pyiceberg/schema.py:
##########
@@ -1364,3 +1364,11 @@ def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
             raise ResolveError(f"Cannot reduce precision from {file_type} to {read_type}")
     else:
         raise ResolveError(f"Cannot promote an decimal to {read_type}")
+
+
+@promote.register(FixedType)
+def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, UUIDType) and len(file_type) == 16:

Review Comment:
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #7523: Python: Support Inferring Iceberg UUIDType from parquet files

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on code in PR #7523:
URL: https://github.com/apache/iceberg/pull/7523#discussion_r1185133380


##########
python/pyiceberg/schema.py:
##########
@@ -1364,3 +1364,11 @@ def _(file_type: DecimalType, read_type: IcebergType) -> IcebergType:
             raise ResolveError(f"Cannot reduce precision from {file_type} to {read_type}")
     else:
         raise ResolveError(f"Cannot promote an decimal to {read_type}")
+
+
+@promote.register(FixedType)
+def _(file_type: FixedType, read_type: IcebergType) -> IcebergType:
+    if isinstance(read_type, UUIDType) and len(file_type) == 16:

Review Comment:
   One minor request. Could you add a small comment here explaining why this promotion is there? That PyArrow reads a Parquet UUID as FixedBytes, and therefore we need to allow this promotion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org