Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2019/10/02 05:03:56 UTC

[SS] Possible inconsistent semantics on metric "updated" between stateful operators

Hi devs,

I've noticed that the semantics of the "updated" metric differ between
(Flat)MapGroupsWithState and the other stateful operators.

* (Flat)MapGroupsWithState: removal is counted as updated
* others: removal is not counted as updated

Technically, the meaning of "removal" differs between the two:
(Flat)MapGroupsWithState relies on the state function to remove state
(removal via user logic), whereas the other operators evict state based on
the watermark. So it is removal via user logic vs. automatic removal by
Spark's own mechanism.

Even taking that difference into account, it may still be confusing. End
users would assume total state rows >= updated rows while they are working
with streaming aggregations or stream-stream joins, and when they start to
use (Flat)MapGroupsWithState, they would find that the assumption is
incorrect: it's possible for FlatMapGroupsWithState to report metrics (total
0, updated 1), which might look odd to them.
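
To make the (total 0, updated 1) case concrete, here is a minimal toy model
in plain Python. It is not Spark code: the dict-based store, the run_batch
function and the count_removal_as_updated flag are all hypothetical names
made up for illustration, and the only thing it shows is how the two
counting conventions described above diverge for the same batch.

```python
# Toy model only, NOT Spark code. All names here are hypothetical.
def run_batch(store, ops, count_removal_as_updated):
    """Apply (action, key, value) ops to a dict-based store and return metrics."""
    updated = 0
    for action, key, value in ops:
        if action == "put":
            store[key] = value
            updated += 1
        elif action == "remove":
            store.pop(key, None)
            # Assumed convention switch: does a removal bump "updated"?
            if count_removal_as_updated:
                updated += 1
    return {"total": len(store), "updated": updated}


# One key exists, and the batch removes its state.
ops = [("remove", "user-1", None)]

# Convention described for (Flat)MapGroupsWithState: removal counts as updated.
fmgws_like = run_batch({"user-1": 42}, ops, count_removal_as_updated=True)
# Convention described for the other operators: removal is not counted.
others_like = run_batch({"user-1": 42}, ops, count_removal_as_updated=False)

print(fmgws_like)   # {'total': 0, 'updated': 1}
print(others_like)  # {'total': 0, 'updated': 0}
```

Under the first convention the same batch reports updated > total, which is
exactly the reading that may surprise users.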

We have some options here:

1) It's intentional and works as expected. Leave it as it is.
2) Don't increase "updated" when state is removed in FlatMapGroupsWithState.
3) Add a new metric "removed" and apply it to all stateful operators
(covering both removal and eviction).
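
As a rough sketch of what option 3 could look like, the toy model below
keeps a separate "removed" counter that is incremented both for removal via
user logic and for eviction. This is plain Python, not Spark code;
StateMetrics, apply_ops and the "remove"/"evict" action names are
hypothetical and only illustrate the counting rule.

```python
from dataclasses import dataclass


@dataclass
class StateMetrics:
    total: int = 0
    updated: int = 0
    removed: int = 0  # hypothetical new metric proposed in option 3


def apply_ops(store, ops):
    """Apply (action, key, value) ops to a dict-based store; toy model only."""
    metrics = StateMetrics()
    for action, key, value in ops:
        if action == "put":
            store[key] = value
            metrics.updated += 1
        elif action in ("remove", "evict"):
            # Both user-driven removal and automatic eviction count as "removed",
            # and neither touches "updated".
            if key in store:
                del store[key]
                metrics.removed += 1
    metrics.total = len(store)
    return metrics


store = {"a": 1, "b": 2}
ops = [("put", "c", 3), ("remove", "a", None), ("evict", "b", None)]
metrics = apply_ops(store, ops)
print(metrics)  # StateMetrics(total=1, updated=1, removed=2)
```

With a dedicated counter, "updated" no longer has to carry removals, so the
same rule can apply to every stateful operator.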

I'd like to hear your opinions on this.

Thanks in advance,
Jungtaek Lim (HeartSaVioR)

* JIRA issue: https://issues.apache.org/jira/browse/SPARK-29312