You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/04 10:55:26 UTC

[GitHub] [iceberg] aokolnychyi commented on pull request #2017: Move RewriteDelete to the analyzer

aokolnychyi commented on pull request #2017:
URL: https://github.com/apache/iceberg/pull/2017#issuecomment-753906176

I have mixed feelings about this direction.

On one hand, it is great to rewrite this early in the analyzer and leave the optimizer untouched. On the other hand, it prohibits optimizations on delete/update/merge nodes and makes certain rewrites more complicated.

To give an example where having merge nodes in the optimizer may be helpful.

If we rewrite the following operation in the analyzer, we will have to use a full outer join.

```
MERGE target t USING source s
ON t.id = s.id
WHEN MATCHED AND exp_equivalent_to_false
UPDATE SET …
WHEN NOT MATCHED
THEN INSERT
```

However, the optimizer can detect that our UPDATE condition always evaluates to false, so it can simplify the plan:

```
MERGE target t USING source s
ON t.id = s.id
WHEN NOT MATCHED
THEN INSERT
```

The optimized merge operation can be executed using a left anti join instead of a full outer join. Similarly, if we can get rid of the NOT MATCHED clause, we can use a right outer join instead of a full outer join. There may be more cases like this that we don't know yet.

Also, there are a lot of rewrites that happen in the optimizer like @dilipbiswal mentioned so the current approach isn't that bad in my view. In addition, row-level ops are a bit different compared to the refresh of materialized views which is replaced with INSERTs. There is a clear translation for refresh statements and it does not require any optimization. Update/delete/merge commands are more complicated to me and I'll be fine addressing them differently.

If we rewrite the plans after operator optimizations as we planned earlier, we miss these rules:
- `PullupCorrelatedPredicates` (fixed in Spark 3.1 to cover delete/update/merge)
- `OptimizeSubqueries` (works fine now as it runs on `SubqueryExpression` which is a child of the row-level plans)
- Replacement of certain expressions with equivalent ones
- Operator optimization rules (we match the behavior for subqueries)

I think it is fine to run the rewrite after those rules.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org