Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/05 20:39:28 UTC

[GitHub] [iceberg] rdblue opened a new pull request, #6127: Python: Add expression evaluator

rdblue opened a new pull request, #6127:
URL: https://github.com/apache/iceberg/pull/6127

   Implement an expression evaluator in Python. This is needed for pruning data files based on the partition tuple in a manifest file.
   
   This also makes a couple of other changes:
   * Fixes types in `BoundBooleanExpressionVisitor`
   * Adds a `literals_set` to `BoundSetPredicate` for faster `In` and `NotIn` evaluation
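
To make the pruning use case concrete, here is a minimal, self-contained sketch of evaluating a bound expression against partition tuples. It does not use the pyiceberg classes from this PR: `BoundIn`, `BoundLessThan`, `BoundAnd`, `make_evaluator`, and the sample manifest entries are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class BoundIn:
    """Hypothetical bound set predicate: field position plus allowed values."""
    pos: int
    values: FrozenSet[Any]


@dataclass(frozen=True)
class BoundLessThan:
    """Hypothetical bound comparison predicate."""
    pos: int
    value: Any


@dataclass(frozen=True)
class BoundAnd:
    """Hypothetical conjunction of two bound expressions."""
    left: Any
    right: Any


def make_evaluator(expr: Any) -> Callable[[Sequence[Any]], bool]:
    """Return a function that evaluates `expr` against one partition tuple."""

    def evaluate(node: Any, row: Sequence[Any]) -> bool:
        if isinstance(node, BoundAnd):
            return evaluate(node.left, row) and evaluate(node.right, row)
        if isinstance(node, BoundIn):
            # Set membership is a hash lookup rather than a scan over literals.
            return row[node.pos] in node.values
        if isinstance(node, BoundLessThan):
            value = row[node.pos]
            # A null never satisfies a comparison.
            return value is not None and value < node.value
        raise TypeError(f"Unsupported expression node: {node!r}")

    return lambda row: evaluate(expr, row)


# Made-up (path, partition tuple) pairs; the tuple layout is (region, day).
manifest_entries = [
    ("data-001.parquet", ("eu", 10)),
    ("data-002.parquet", ("us", 12)),
    ("data-003.parquet", ("eu", 30)),
]

keep = make_evaluator(BoundAnd(BoundIn(0, frozenset({"eu", "ap"})), BoundLessThan(1, 20)))
selected = [path for path, partition in manifest_entries if keep(partition)]
```

Only the files whose partition tuple satisfies the expression remain in `selected`; the others can be skipped without being opened.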


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1014724722


##########
python/tests/expressions/test_evaluator.py:
##########
@@ -0,0 +1,203 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from typing import Any, List, TypeVar
+
+from pyiceberg.expressions import AlwaysTrue, AlwaysFalse, LessThan, Reference, LessThanOrEqual, GreaterThan, \
+    GreaterThanOrEqual, EqualTo, NotEqualTo, IsNaN, NotNaN, IsNull, NotNull, Not, And, Or, In, NotIn
+from pyiceberg.expressions.literals import literal
+from pyiceberg.expressions.visitors import evaluator
+from pyiceberg.files import StructProtocol
+from pyiceberg.schema import Schema
+from pyiceberg.types import NestedField, LongType, StringType, DoubleType
+
+V = TypeVar('V')
+
+
+class Column:

Review Comment:
   @Fokko, I found it difficult to create expressions because we are missing methods to create an expression from names and values -- our classes all require literals and references. This is a quick hack to make it easy. I'm not sure if we'd go with this, so I kept it in tests. We should talk more about how we want people to create expressions.





[GitHub] [iceberg] rdblue merged pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6127:
URL: https://github.com/apache/iceberg/pull/6127




[GitHub] [iceberg] rdblue commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1014890146


##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -281,6 +289,30 @@ def __invert__(self) -> BoundIsNull:
         return BoundIsNull(self.term)
 
 
+def coerce_unary_arguments(data_class: Union[type, None]) -> Union[type, Callable[[type], type]]:

Review Comment:
   @Fokko, this PR introduces 3 of these decorators to make it easy to construct predicates. This decorator replaces `__init__` with a version that converts a string passed as `term` into a `Reference`. This makes it much easier to create expressions, as you can see from the test cases.
   
   The other decorators handle set predicates and literal predicates.
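
For readers following along, a rough sketch of the idea: a class decorator that wraps the `__init__` generated for a frozen data class and converts a string `term` into a `Reference`. This is a simplified illustration and not the code from the PR; `coerce_term` and the stand-in `Reference` and `IsNull` classes are hypothetical, and unlike `coerce_unary_arguments` it does not support being called with parentheses.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Union


@dataclass(frozen=True)
class Reference:
    """Stand-in for an unbound reference to a column name."""
    name: str


def coerce_term(data_class: type) -> type:
    """Wrap the generated __init__ so a plain string `term` becomes a Reference."""
    original_init = data_class.__init__

    @wraps(original_init)
    def init(self, term: Union[str, Reference], *args, **kwargs) -> None:
        if isinstance(term, str):
            term = Reference(term)
        original_init(self, term, *args, **kwargs)

    # Replacing the method on the class is allowed; `frozen` only blocks
    # attribute assignment on instances.
    data_class.__init__ = init
    return data_class


@coerce_term
@dataclass(frozen=True)
class IsNull:
    term: Reference


by_name = IsNull("id")
by_reference = IsNull(term=Reference("id"))
```

Both constructions produce equal predicates, so test cases can pass column names directly.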





[GitHub] [iceberg] rdblue commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1014889665


##########
python/tests/expressions/test_evaluator.py:
##########
@@ -0,0 +1,203 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from typing import Any, List, TypeVar
+
+from pyiceberg.expressions import AlwaysTrue, AlwaysFalse, LessThan, Reference, LessThanOrEqual, GreaterThan, \
+    GreaterThanOrEqual, EqualTo, NotEqualTo, IsNaN, NotNaN, IsNull, NotNull, Not, And, Or, In, NotIn
+from pyiceberg.expressions.literals import literal
+from pyiceberg.expressions.visitors import evaluator
+from pyiceberg.files import StructProtocol
+from pyiceberg.schema import Schema
+from pyiceberg.types import NestedField, LongType, StringType, DoubleType
+
+V = TypeVar('V')
+
+
+class Column:

Review Comment:
   I fixed this by adding decorators that will coerce arguments before passing to the constructors of the frozen data classes.





[GitHub] [iceberg] rdblue commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1027340824


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -417,6 +422,75 @@ def visit_bound_predicate(self, predicate) -> BooleanExpression:
         return predicate
 
 
+def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive=True) -> Callable[[StructProtocol], bool]:
+    return _ExpressionEvaluator(schema, unbound, case_sensitive).eval
+
+
+class _ExpressionEvaluator(BoundBooleanExpressionVisitor[bool]):
+    bound: BooleanExpression
+    struct: StructProtocol
+
+    def __init__(self, schema: Schema, unbound: BooleanExpression, case_sensitive=True):

Review Comment:
   Fixed.



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -417,6 +422,75 @@ def visit_bound_predicate(self, predicate) -> BooleanExpression:
         return predicate
 
 
+def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive=True) -> Callable[[StructProtocol], bool]:
+    return _ExpressionEvaluator(schema, unbound, case_sensitive).eval
+
+
+class _ExpressionEvaluator(BoundBooleanExpressionVisitor[bool]):
+    bound: BooleanExpression
+    struct: StructProtocol
+
+    def __init__(self, schema: Schema, unbound: BooleanExpression, case_sensitive=True):
+        self.bound = bind(schema, unbound, case_sensitive)
+
+    def eval(self, struct: StructProtocol):

Review Comment:
   Added.





[GitHub] [iceberg] Fokko commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1015073328


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -417,6 +422,75 @@ def visit_bound_predicate(self, predicate) -> BooleanExpression:
         return predicate
 
 
+def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive=True) -> Callable[[StructProtocol], bool]:

Review Comment:
   ```suggestion
   def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive: bool = True) -> Callable[[StructProtocol], bool]:
   ```



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -417,6 +422,75 @@ def visit_bound_predicate(self, predicate) -> BooleanExpression:
         return predicate
 
 
+def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive=True) -> Callable[[StructProtocol], bool]:
+    return _ExpressionEvaluator(schema, unbound, case_sensitive).eval
+
+
+class _ExpressionEvaluator(BoundBooleanExpressionVisitor[bool]):
+    bound: BooleanExpression
+    struct: StructProtocol
+
+    def __init__(self, schema: Schema, unbound: BooleanExpression, case_sensitive=True):

Review Comment:
   ```suggestion
       def __init__(self, schema: Schema, unbound: BooleanExpression, case_sensitive: bool = True):
   ```



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -417,6 +422,75 @@ def visit_bound_predicate(self, predicate) -> BooleanExpression:
         return predicate
 
 
+def expression_evaluator(schema: Schema, unbound: BooleanExpression, case_sensitive=True) -> Callable[[StructProtocol], bool]:
+    return _ExpressionEvaluator(schema, unbound, case_sensitive).eval
+
+
+class _ExpressionEvaluator(BoundBooleanExpressionVisitor[bool]):
+    bound: BooleanExpression
+    struct: StructProtocol
+
+    def __init__(self, schema: Schema, unbound: BooleanExpression, case_sensitive=True):
+        self.bound = bind(schema, unbound, case_sensitive)
+
+    def eval(self, struct: StructProtocol):

Review Comment:
   ```suggestion
       def eval(self, struct: StructProtocol) -> bool:
   ```



##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -281,6 +289,30 @@ def __invert__(self) -> BoundIsNull:
         return BoundIsNull(self.term)
 
 
+def coerce_unary_arguments(data_class: Union[type, None]) -> Union[type, Callable[[type], type]]:

Review Comment:
   I'm not sold on this, mostly because it adds another layer of complexity. These things should normally just happen in the init. I've created a PR, https://github.com/apache/iceberg/pull/6139, that does exactly this and is well tested 👍🏻



##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -351,17 +386,56 @@ def bind(self, schema: Schema, case_sensitive: bool = True) -> BooleanExpression
 
 
 @dataclass(frozen=True)
-class BoundSetPredicate(BoundPredicate[T]):
-    literals: set[Literal[T]]
+class BoundSetPredicate(BoundPredicate[C]):
+    literals: set[Literal[C]]
+
+    @property

Review Comment:
   ```suggestion
       @cached_property
   ```
   This way we don't have to build the caching mechanism ourselves.
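
A small sketch of why this works on a frozen data class, assuming Python 3.8+: `functools.cached_property` stores the computed value in the instance `__dict__` directly, so it does not go through the `__setattr__` that `frozen=True` blocks. The `Literal` stand-in and the `value_set` name are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import Any, FrozenSet, Set


@dataclass(frozen=True)
class Literal:
    """Stand-in for a typed literal wrapper."""
    value: Any


@dataclass(frozen=True)
class BoundSetPredicate:
    literals: FrozenSet[Literal]

    @cached_property
    def value_set(self) -> Set[Any]:
        # Computed on first access, then served from the instance __dict__.
        return {lit.value for lit in self.literals}


pred = BoundSetPredicate(frozenset({Literal(1), Literal(2)}))
first = pred.value_set
second = pred.value_set
```

The second access returns the same set object, so the unwrapping cost is paid once per predicate rather than once per evaluated row.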





[GitHub] [iceberg] Fokko commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1015499369


##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -281,6 +289,30 @@ def __invert__(self) -> BoundIsNull:
         return BoundIsNull(self.term)
 
 
+def coerce_unary_arguments(data_class: Union[type, None]) -> Union[type, Callable[[type], type]]:

Review Comment:
   I'm not sold on this, mostly because it adds another layer of complexity. These things should normally just happen in the init. I've created a PR, https://github.com/apache/iceberg/pull/6139, that does exactly this and is well tested 👍🏻
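
As a point of comparison, coercing in the init of a frozen data class could look roughly like the sketch below. It is an assumption about the general approach, not the contents of the linked PR; the `Reference` and `IsNull` classes here are simplified stand-ins.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Union


@dataclass(frozen=True)
class Reference:
    """Stand-in for an unbound reference to a column name."""
    name: str


@dataclass(frozen=True, init=False)
class IsNull:
    term: Reference

    def __init__(self, term: Union[str, Reference]) -> None:
        # Frozen data classes block normal attribute assignment, so a
        # hand-written __init__ has to go through object.__setattr__.
        coerced = Reference(term) if isinstance(term, str) else term
        object.__setattr__(self, "term", coerced)


by_name = IsNull("id")
by_reference = IsNull(Reference("id"))

try:
    by_name.term = Reference("other")
    still_frozen = False
except FrozenInstanceError:
    still_frozen = True
```

The coercion lives next to the field it affects, at the cost of writing `__init__` by hand for each predicate class.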





[GitHub] [iceberg] rdblue commented on a diff in pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#discussion_r1014724824


##########
python/tests/expressions/test_evaluator.py:
##########
@@ -0,0 +1,203 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+from typing import Any, List, TypeVar
+
+from pyiceberg.expressions import AlwaysTrue, AlwaysFalse, LessThan, Reference, LessThanOrEqual, GreaterThan, \
+    GreaterThanOrEqual, EqualTo, NotEqualTo, IsNaN, NotNaN, IsNull, NotNull, Not, And, Or, In, NotIn
+from pyiceberg.expressions.literals import literal
+from pyiceberg.expressions.visitors import evaluator
+from pyiceberg.files import StructProtocol
+from pyiceberg.schema import Schema
+from pyiceberg.types import NestedField, LongType, StringType, DoubleType
+
+V = TypeVar('V')
+
+
+class Column:
+    """A helper class for creating expressions"""
+    name: str
+
+    def __init__(self, name):
+        self.name = name
+
+    def __lt__(self, other: V) -> LessThan:
+        return LessThan(term=Reference(self.name), literal=literal(other))
+
+    def __le__(self, other: Any) -> LessThanOrEqual:
+        return LessThanOrEqual(term=Reference(self.name), literal=literal(other))
+
+    def __gt__(self, other: Any) -> GreaterThan:
+        return GreaterThan(term=Reference(self.name), literal=literal(other))
+
+    def __ge__(self, other: Any) -> GreaterThanOrEqual:
+        return GreaterThanOrEqual(term=Reference(self.name), literal=literal(other))
+
+    def __eq__(self, other: Any) -> EqualTo:
+        return EqualTo(term=Reference(self.name), literal=literal(other))
+
+    def __ne__(self, other: Any) -> NotEqualTo:
+        return NotEqualTo(term=Reference(self.name), literal=literal(other))
+
+    def is_null(self) -> IsNull:
+        return IsNull(term=Reference(self.name))
+
+    def not_null(self) -> NotNull:
+        return NotNull(term=Reference(self.name))
+
+    def is_nan(self) -> IsNaN:
+        return IsNaN(term=Reference(self.name))
+
+    def not_nan(self) -> NotNaN:
+        return NotNaN(term=Reference(self.name))
+
+
+def col(name: str) -> Column:
+    return Column(name)
+
+
+class Record(StructProtocol):

Review Comment:
   @Fokko, I also had to add this class since we don't really have a generic record class anywhere. We should follow up and move this into a `data` package or something.
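
A generic record of this kind can be quite small. The sketch below assumes a positional `get`/`set` interface; `StructLike` is a hypothetical local protocol standing in for `StructProtocol`, and this `Record` is an illustration rather than the class from the test file.

```python
from typing import Any, List, Protocol, runtime_checkable


@runtime_checkable
class StructLike(Protocol):
    """Hypothetical stand-in for StructProtocol: positional get and set."""

    def get(self, pos: int) -> Any: ...

    def set(self, pos: int, value: Any) -> None: ...


class Record:
    """A minimal generic record backed by a list of values."""

    def __init__(self, *values: Any) -> None:
        self._values: List[Any] = list(values)

    def get(self, pos: int) -> Any:
        return self._values[pos]

    def set(self, pos: int, value: Any) -> None:
        self._values[pos] = value

    def __repr__(self) -> str:
        return f"Record({', '.join(repr(v) for v in self._values)})"


row = Record(1, "a", None)
row.set(2, 3.5)
```

Because the protocol is structural, `Record` satisfies it without inheriting from it, which would make it easy to move into a shared package later.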





[GitHub] [iceberg] rdblue commented on pull request #6127: Python: Add expression evaluator

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6127:
URL: https://github.com/apache/iceberg/pull/6127#issuecomment-1321223339

   Thanks for reviewing and fixing up the expressions, @Fokko!

