You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/02 15:49:52 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6346: Python: Fix PyArrow Type conversion

Fokko opened a new pull request, #6346:
URL: https://github.com/apache/iceberg/pull/6346

   It turns out that we need to cast the literal to the right type:
   
   ```
   ArrowNotImplementedError: Function 'greater_equal' has no kernel matching input types (timestamp[us, tz=+00:00], int64)
   ```
   
   After casting the scalar, it works perfect. One awkward thing to note that we don't have to do this for the `isin` predicate. There this is done for us. Please check the Jupyter notebook.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038381242


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -387,6 +388,12 @@ def _(_: BinaryType) -> pa.DataType:
     return pa.binary()
 
 
+def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
+    if not isinstance(iceberg_type, PrimitiveType):

Review Comment:
   This is because there is some extra logic when passing in the values other than an PyArrow iterable:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/_compute.pyx#L2431-L2450
   
   When constructing the array, it will infer the type:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/array.pxi#L117-L129
   
   Which will go to the visitor:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/src/arrow/python/inference.cc#L352-L422
   
   To determine the type. Instead, we can also pass in the type when we manually construct the array, which is nicer:
   
   ```python
   pyarrow_literals = pa.array(literals, type=_iceberg_to_pyarrow_type(term.ref().field.field_type))
   ```
   
   I've updated the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038381242


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -387,6 +388,12 @@ def _(_: BinaryType) -> pa.DataType:
     return pa.binary()
 
 
+def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
+    if not isinstance(iceberg_type, PrimitiveType):

Review Comment:
   This is because there is some extra logic when passing in the values other than an PyArrow iterable:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/_compute.pyx#L2431-L2450
   
   When constructing the array, it will infer the type:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/array.pxi#L117-L129
   
   Which will go to the visitor:
   
   https://github.com/apache/arrow/blob/8a8999e94038aa9a60d3ac15741cf9c7abad0433/python/pyarrow/src/arrow/python/inference.cc#L352-L422
   
   To determine the type. Instead, we can also pass in the type when we manually construct the array, which is nicer. I've updated the code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038336833


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -387,6 +388,12 @@ def _(_: BinaryType) -> pa.DataType:
     return pa.binary()
 
 
+def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
+    if not isinstance(iceberg_type, PrimitiveType):

Review Comment:
   Why isn't this necessary for `In` and `NotIn`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038403903


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -409,22 +418,22 @@ def visit_not_null(self, term: BoundTerm[Any]) -> pc.Expression:
         return pc.field(term.ref().field.name).is_valid()
 
     def visit_equal(self, term: BoundTerm[Any], literal: Literal[Any]) -> pc.Expression:
-        return pc.field(term.ref().field.name) == literal.value
+        return pc.field(term.ref().field.name) == _convert_scalar(literal.value, term.ref().field.field_type)

Review Comment:
   We explicitly state micros in the schema, and we cast the PyArrow scalar to it:
   
   ```python
   @_iceberg_to_pyarrow_type.register
   def _(_: TimestampType) -> pa.DataType:
       return pa.timestamp(unit="us")
   
   @_iceberg_to_pyarrow_type.register
   def _(_: TimestamptzType) -> pa.DataType:
       return pa.timestamp(unit="us", tz="+00:00")
   ```
   https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L369-L376
   
   It also looks like PyArrow is doing the conversion from seconds and milliseconds to microseconds: https://github.com/apache/iceberg/pull/6070 We don't support nanoseconds because then we would lose precision.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038385171


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -409,22 +418,22 @@ def visit_not_null(self, term: BoundTerm[Any]) -> pc.Expression:
         return pc.field(term.ref().field.name).is_valid()
 
     def visit_equal(self, term: BoundTerm[Any], literal: Literal[Any]) -> pc.Expression:
-        return pc.field(term.ref().field.name) == literal.value
+        return pc.field(term.ref().field.name) == _convert_scalar(literal.value, term.ref().field.field_type)

Review Comment:
   How does this know what timestamp format our filter is in? Does it assume micros?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko merged pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
Fokko merged PR #6346:
URL: https://github.com/apache/iceberg/pull/6346


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6346: Python: Fix PyArrow Type conversion

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6346:
URL: https://github.com/apache/iceberg/pull/6346#discussion_r1038403903


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -409,22 +418,22 @@ def visit_not_null(self, term: BoundTerm[Any]) -> pc.Expression:
         return pc.field(term.ref().field.name).is_valid()
 
     def visit_equal(self, term: BoundTerm[Any], literal: Literal[Any]) -> pc.Expression:
-        return pc.field(term.ref().field.name) == literal.value
+        return pc.field(term.ref().field.name) == _convert_scalar(literal.value, term.ref().field.field_type)

Review Comment:
   We explicitly tell it to read `micros`:
   
   ```python
   @_iceberg_to_pyarrow_type.register
   def _(_: TimestampType) -> pa.DataType:
       return pa.timestamp(unit="us")
   
   @_iceberg_to_pyarrow_type.register
   def _(_: TimestamptzType) -> pa.DataType:
       return pa.timestamp(unit="us", tz="+00:00")
   ```
   https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L369-L376
   
   It also looks like PyArrow is doing the conversion from seconds and milliseconds to microseconds: https://github.com/apache/iceberg/pull/6070 We don't support nanoseconds because then we would lose precision.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org