You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/09 21:13:27 UTC
[GitHub] [spark] WweiL commented on pull request #38503: [SPARK-40940] Remove Multi-stateful operator checkers for streaming queries.

WweiL commented on PR #38503:
URL: https://github.com/apache/spark/pull/38503#issuecomment-1309381602

   [From this comment](https://github.com/apache/spark/pull/38503#discussion_r1013665318), Jungtaek suggested we check not only aggregates but also all stateful operators, with a kind intent to reduce engineering work of reasoning all tricky combinations of allowed / disallowed cases.
   
   I followed his idea and disallowed all stateful ops running on Complete and Update mode. Then there were some test cases failing, because the error message is changed. To name a few:
   ```
   test: streaming plan - flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation with aggregation in Update mode: not supported
   exception message: flatMapGroupsWithState in update mode is not supported with aggregation on a streaming DataFrame/Dataset
   
   test: streaming plan - flatMapGroupsWithState - flatMapGroupsWithState(Update) on streaming relation with aggregation in Complete mode: not supported
   message: flatMapGroupsWithState in update mode is not supported with aggregation on a streaming DataFrame/Dataset
   
   test: streaming plan - mapGroupsWithState - mapGroupsWithState on streaming relation with aggregation in Update mode: not supported
   message: mapGroupsWithState is not supported with aggregation on a streaming DataFrame/Dataset;
   ```
   
   I sense that, from here we could say, the tricky cases Jungtaek suggested might have already been handled by the checker. Or put it in another way, __if we only disable multiple aggregates (in Update and Complete mode) but not multiple stateful ops in line 186 of [UnsupportedOperationChecker.scala](https://github.com/apache/spark/pull/38503/files/c0381591fa955042c1cf64221a6f4c909143405f..3637b807d57ff5c534386132fb9d47c7cce72705#diff-7c879c08d2f379c139d5229a88857229ae69bb48f0a138a3d64e1b2dde3502fe), things that failed before will still fail (in Update and Complete mode), but we still loose the restriction by allowing some multiple stateful ops in append mode__. 
   
   I raised this to Alex and Jungtaek, but we finally agree to proceed this direction. But then we found that it will also change existing allowed cases, for example [here](https://github.com/apache/spark/pull/38503#discussion_r1017413562) and [here](https://github.com/apache/spark/pull/38503#discussion_r1017447511). If this is pushed, currently running pipelines will fail.
   
   We could definitely then change the code to specifically allow these cases, but I sense that, for least engineering effort, it's enough to just revert the change from disallowing multiple stateful operators to just multiple aggregates (in Update and Complete mode). Given that the tricky cases will be handled somewhere else in the checker.
   
   I believe only disabling multiple aggregates (in Update and Complete mode) looses the restriction in a proper way. As we will still loose the restriction by allow cases that were disallowed before. (Before all multiple agg is disallowed, but now with append mode it is allowed), and all tests that were passing before still passes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org