You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/10 12:50:52 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6555: Python: Expression to disjunctive normal form

Fokko opened a new pull request, #6555:
URL: https://github.com/apache/iceberg/pull/6555

   First time I needed this knowledge since University πŸ•ΊπŸ» 
   
   Adds a visitor to rewrite an expression to the DNF.
   
   This is required for filtering data in Dask:
   
   ![image](https://user-images.githubusercontent.com/1134248/211556028-4a3de683-0e2a-4ec9-a8ae-601705ad52c3.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] martindurant commented on pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
martindurant commented on PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#issuecomment-1377624143

   > The user is able to define the predicate as they like, I'm not sure if we should limit users to AND.
   
   Of course; this was just for historical context


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066183022


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):

Review Comment:
   I think this would be cleaner if it returns `Tuple[BooleanExpression, ...]`.
   
   Right now, the `and` and `or` cases need to look inside of child expressions to ensure that there are no nested `or` expressions. That works, but it ends up being complicated.
   
   If instead this used a tuple, then each level can assume children produce a tuple. So you get more generic logic that doesn't need to account for so many cases:
   
   ```python
   class RewriteToDNF(Tuple[BooleanExpression, ...]):
       def visit_and(left_result: Tuple[BooleanExpression, ...], right_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
           return tuple(tuple(And(le, re) for le in left_result) for re in right_result)
   
       def visit_or(left_result: Tuple[BooleanExpression, ...], right_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
           return left_result + right_result
   
       def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Tuple[BooleanExpression, ...]:
           return (predicate,)
   ```
   
   Then the conversion method would just need to `Or` the results together, or return them as a list for further conversion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066283072


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +843,41 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[Tuple[BooleanExpression, ...]]):
+    def visit_true(self) -> Tuple[BooleanExpression, ...]:
+        return (AlwaysTrue(),)
+
+    def visit_false(self) -> Tuple[BooleanExpression, ...]:
+        return (AlwaysFalse(),)
+
+    def visit_not(self, child_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
+        return tuple(~result for result in child_result)

Review Comment:
   Ah, thanks! Raising an error is indeed the best thing to do! πŸ‘πŸ» 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066189573


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:
+        return ~child_result
+
+    def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        if isinstance(left_result, Or) and isinstance(right_result, Or):
+            # Distributive law: ((P OR Q) AND (R OR S)) AND (((P AND R) OR (P AND S)) OR ((Q AND R) OR ((Q AND S)))
+            return Or(
+                And(left_result.left, right_result.left),
+                And(left_result.left, right_result.right),
+                And(left_result.right, right_result.left),
+                And(left_result.right, right_result.right),
+            )
+        if isinstance(right_result, Or):
+            # Distributive law: A AND (B OR C) = (A AND B) OR (A AND C)
+            return Or(And(left_result, right_result.left), And(left_result, right_result.right))
+        elif isinstance(left_result, Or):

Review Comment:
   What if there are more levels of nesting than this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066096838


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,59 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:

Review Comment:
   I've done some checks, and that works well! Added some more tests and the logic is indeed redundant here. The difference is that we don't have any `Not()` in the output, but that works well πŸ‘πŸ» 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066183022


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):

Review Comment:
   I think this would be cleaner if it returns `Tuple[BooleanExpression, ...]`.
   
   Right now, the `and` and `or` cases need to look inside of child expressions to ensure that there are no nested `or` expressions. That works, but it ends up being complicated.
   
   If instead this used a tuple, then each level can assume children produce a tuple. So you get more generic logic that doesn't need to account for so many cases:
   
   ```python
   class RewriteToDNF(Tuple[BooleanExpression, ...]):
       def visit_and(left_result: Tuple[BooleanExpression, ...], right_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
           return tuple(tuple(And(le, re) for le in left_result) for re in right_result)
   
       def visit_or(left_result: Tuple[BooleanExpression, ...], right_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
           return left_result + right_result
   
       def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Tuple[BooleanExpression, ...]:
           return (predicate,)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066271144


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:
+        return ~child_result
+
+    def visit_and(self, left_result: BooleanExpression, right_result: BooleanExpression) -> BooleanExpression:
+        if isinstance(left_result, Or) and isinstance(right_result, Or):
+            # Distributive law: ((P OR Q) AND (R OR S)) AND (((P AND R) OR (P AND S)) OR ((Q AND R) OR ((Q AND S)))
+            return Or(
+                And(left_result.left, right_result.left),
+                And(left_result.left, right_result.right),
+                And(left_result.right, right_result.left),
+                And(left_result.right, right_result.right),
+            )
+        if isinstance(right_result, Or):
+            # Distributive law: A AND (B OR C) = (A AND B) OR (A AND C)
+            return Or(And(left_result, right_result.left), And(left_result, right_result.right))
+        elif isinstance(left_result, Or):

Review Comment:
   Since we go from the leaves, this should go well. See:
   ```python
   def test_to_dnf_nested_or() -> None:
       expr = Or(EqualTo("P", "a"), And(EqualTo("Q", "b"), Or(EqualTo("R", "c"), EqualTo("S", "d"))))
       assert rewrite_to_dnf(expr) == (
           EqualTo("P", "a"),
           And(EqualTo("Q", "b"), EqualTo("R", "c")),
           And(EqualTo("Q", "b"), EqualTo("S", "d")),
       )
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066284722


##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -216,7 +216,8 @@ def __str__(self) -> str:
     def __repr__(self) -> str:
         return f"And(left={repr(self.left)}, right={repr(self.right)})"
 
-    def __invert__(self) -> Or:
+    def __invert__(self) -> BooleanExpression:

Review Comment:
   It doesn't matter, since the parent class is:
   ```
   class BooleanExpression(ABC):
       """An expression that evaluates to a boolean"""
   
       @abstractmethod
       def __invert__(self) -> BooleanExpression:
           """Transform the Expression into its negated version."""
   ```
   So we have to check anyway. This is because my PyCharm is complaining:
   <img width="770" alt="image" src="https://user-images.githubusercontent.com/1134248/211653685-0d6e6a20-2a70-44a4-a8e4-56e4a8a76b2e.png">
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] martindurant commented on pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
martindurant commented on PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#issuecomment-1377359789

   You need this to get pyarrow to filter *within* row-groups? When used with dask, I am surprised you would need this, because ice would be in charge of filtering row groups, and dask would pass the filters directly to pyarrow. That's me guessing without looking at the detail of the code.
   
   I must admit to having come up with the [original simple form](https://github.com/dask/fastparquet/commit/cba69795c0ee2c1ec9a4e276ec728f2b4ae6b2fc#diff-38b333399f04a93e200b35be164c2f66d8c1d99817ca5aff26b3fd01d5079fe9R194) of this, `("col", "==", 0)` and the AND form, but not the ANDOR extended form. 
   
   The docstring above is actually incorrect: fastparquet does also support row-wise filtering, but not via dask (and although you save on memory, it can be slower than loading everything and filtering after). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#issuecomment-1377577817

   Thanks @martindurant for the input, appreciate it.
   
   > You need this to get pyarrow to filter within row-groups? When used with dask, I am surprised you would need this, because ice would be in charge of filtering row groups, and dask would pass the filters directly to pyarrow. That's me guessing without looking at the detail of the code.
   
   So we have Iceberg that does the partition pruning, and the `plan_files`, then we need to filter down on a row level (including skipping row groups would be nice of course).
   
   > I must admit to having come up with the [original simple form](https://github.com/dask/fastparquet/commit/cba69795c0ee2c1ec9a4e276ec728f2b4ae6b2fc#diff-38b333399f04a93e200b35be164c2f66d8c1d99817ca5aff26b3fd01d5079fe9R194) of this, ("col", "==", 0) and the AND form, but not the ANDOR extended form.
   
   The user is able to define the predicate as they like, I'm not sure if we should limit users to `AND`.
   
   > The docstring above is actually incorrect: fastparquet does also support row-wise filtering, but not via dask (and although you save on memory, it can be slower than loading everything and filtering after).
   
   Ah, that's good to know. I went for PyArrow right now because it looks like the most complete in terms of filtering, and then we can add `fastparquet` later on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko merged pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko merged PR #6555:
URL: https://github.com/apache/iceberg/pull/6555


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066013273


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,59 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):
+    def visit_true(self) -> BooleanExpression:
+        return AlwaysTrue()
+
+    def visit_false(self) -> BooleanExpression:
+        return AlwaysFalse()
+
+    def visit_not(self, child_result: BooleanExpression) -> BooleanExpression:

Review Comment:
   @Fokko, can this be handled by running `rewriteNot` before this visitor is applied? That should avoid the need to have so much logic here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066271887


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):

Review Comment:
   I'm ambivalent about this one:
   ```
   def test_to_dnf_nested_or() -> None:
       expr = Or(EqualTo("P", "a"), And(EqualTo("Q", "b"), Or(EqualTo("R", "c"), EqualTo("S", "d"))))
       assert rewrite_to_dnf(expr) == Or(
           EqualTo("P", "a"), Or(And(EqualTo("Q", "b"), EqualTo("R", "c")), And(EqualTo("Q", "b"), EqualTo("S", "d")))
       )
   
   
   def test_to_dnf_nested_or_tuple() -> None:
       expr = Or(EqualTo("P", "a"), And(EqualTo("Q", "b"), Or(EqualTo("R", "c"), EqualTo("S", "d"))))
       assert rewrite_to_dnf2(expr) == (
           EqualTo('P', 'a'),
           And(EqualTo('Q', 'b'), EqualTo('R', 'c')),
           And(EqualTo('Q', 'b'), EqualTo('S', 'd'))
       )
   ```
   I think the first one is more clear. The next step is turning it into a predicate that we can feed into Dask, and that can be done using a visitor as well. With the `Tuple` approach we'll loop over the Tuple, and then feed that into the Visitor for each element.



##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +842,49 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[BooleanExpression]):

Review Comment:
   I'm ambivalent about this one:
   ```python
   def test_to_dnf_nested_or() -> None:
       expr = Or(EqualTo("P", "a"), And(EqualTo("Q", "b"), Or(EqualTo("R", "c"), EqualTo("S", "d"))))
       assert rewrite_to_dnf(expr) == Or(
           EqualTo("P", "a"), Or(And(EqualTo("Q", "b"), EqualTo("R", "c")), And(EqualTo("Q", "b"), EqualTo("S", "d")))
       )
   
   
   def test_to_dnf_nested_or_tuple() -> None:
       expr = Or(EqualTo("P", "a"), And(EqualTo("Q", "b"), Or(EqualTo("R", "c"), EqualTo("S", "d"))))
       assert rewrite_to_dnf2(expr) == (
           EqualTo('P', 'a'),
           And(EqualTo('Q', 'b'), EqualTo('R', 'c')),
           And(EqualTo('Q', 'b'), EqualTo('S', 'd'))
       )
   ```
   I think the first one is more clear. The next step is turning it into a predicate that we can feed into Dask, and that can be done using a visitor as well. With the `Tuple` approach we'll loop over the Tuple, and then feed that into the Visitor for each element.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066277900


##########
python/pyiceberg/expressions/visitors.py:
##########
@@ -842,3 +843,41 @@ def visit_bound_predicate(self, predicate: BoundPredicate[L]) -> Set[int]:
 
 def extract_field_ids(expr: BooleanExpression) -> Set[int]:
     return visit(expr, _ExpressionFieldIDs())
+
+
+class _RewriteToDNF(BooleanExpressionVisitor[Tuple[BooleanExpression, ...]]):
+    def visit_true(self) -> Tuple[BooleanExpression, ...]:
+        return (AlwaysTrue(),)
+
+    def visit_false(self) -> Tuple[BooleanExpression, ...]:
+        return (AlwaysFalse(),)
+
+    def visit_not(self, child_result: Tuple[BooleanExpression, ...]) -> Tuple[BooleanExpression, ...]:
+        return tuple(~result for result in child_result)

Review Comment:
   I think that this would violate the guarantee used by the visitor, that the child result is a tuple of expressions that contain no `Or` nodes, but may contain `And` nodes. If there is an `And(left, right)` then it is negated as `Or(Not(left), Not(right))`, which will violate the guarantee when returned.
   
   Luckily, this won't happen because the `Not` nodes are removed when using `rewrite_to_dnf`. But I think that this method should raise an error when it is used improperly (direct use of `_RewriteToDNF`) instead of returning something possibly incorrect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6555: Python: Expression to disjunctive normal form

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6555:
URL: https://github.com/apache/iceberg/pull/6555#discussion_r1066280116


##########
python/pyiceberg/expressions/__init__.py:
##########
@@ -216,7 +216,8 @@ def __str__(self) -> str:
     def __repr__(self) -> str:
         return f"And(left={repr(self.left)}, right={repr(self.right)})"
 
-    def __invert__(self) -> Or:
+    def __invert__(self) -> BooleanExpression:

Review Comment:
   Minor: Can't we guarantee that the result is an `Or`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org