You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2018/11/02 02:50:00 UTC

[jira] [Comment Edited] (SPARK-10816) EventTime based sessionization

    [ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672478#comment-16672478 ] 

Jungtaek Lim edited comment on SPARK-10816 at 11/2/18 2:49 AM:
---------------------------------------------------------------

UPDATE: I just discovered the performance critical issue on my patch (I guess same issue is occurring on Baidu's patch) and just fixed yesterday.

1. Plenty Of Sessions In Key / Append Mode (rate: 500,000)

* HWX, Linked List version: max around 250,000
* HWX, Latest: max around 235,000
* flatMapGroupsWithState (with my state func. implementation): max around 250,000

They're showing CPU being maxed out. (Ran with local[3], and 3 cores of CPU are all 100% user.)
When I increase rate to 1,000,000 I observed max rate would go up to around 350,000 for linked list version.

2. Plenty Of Keys / Append Mode (rate: 5,000,000)

* HWX, Linked List version: max around 3,700,000
* HWX, Latest: max around 4,000,000
* flatMapGroupsWithState (with my state func. implementation): max around 2,100,000

3. Plenty Of Rows In Session / Append Mode (rate: 50000)

* HWX, Linked List version: max around 27,000
* HWX, Latest: max around 36,000
* flatMapGroupsWithState (with my state func. implementation): max around 26,000

Please note that Linked list version doesn't materialize all sessions (even in given key) into memory, so it gets rid of concern on loading sessions in memory. It also shows close or even better performance than manual implementation of flatMapGroupsWithState. Linked list version sometimes faster than latest version (load all sessions in group memory) and sometimes slower.

I'll update my patch against linked list version. I guess there's no performance issue now, now waiting for committers' review.


was (Author: kabhwan):
UPDATE: I just discovered the performance critical issue on my patch (I guess same issue is occurring on Baidu's patch) and just fixed yesterday.

1. Plenty Of Sessions In Key / Append Mode (rate: 500,000)

* HWX, Linked List version: max around 250,000
* HWX, Latest: max around 235,000
* flatMapGroupsWithState (with my state func. implementation): max around 250,000

They're showing CPU being maxed out. (Ran with local[3], and 3 cores of CPU are all 100% user.)
When I increase rate to 1,000,000 I observed max rate would go up to around 350,000 for linked list version.

2. Plenty Of Keys / Append Mode (rate: 5,000,000)

* HWX, Linked List version: max around 3,700,000
* HWX, Latest: max around 4,000,000
* flatMapGroupsWithState (with my state func. implementation): max around 2,100,000

3. Plenty Of Rows In Session / Append Mode (rate: 50000)

* HWX, Linked List version: max around 27,000
* HWX, Latest: max around 36,000
* flatMapGroupsWithState (with my state func. implementation): max around 26,000

Please note that Linked list version doesn't materialize all sessions (even in given key) into memory, so it shows close or even better performance than manual implementation of flatMapGroupsWithState, as well as get rid of concern on loading sessions in memory. Linked list version sometimes faster than latest version (load all sessions in group memory) and sometimes slower.

I'll update my patch against linked list version. I guess there's no performance issue now, now waiting for committers' review.

> EventTime based sessionization
> ------------------------------
>
>                 Key: SPARK-10816
>                 URL: https://issues.apache.org/jira/browse/SPARK-10816
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>            Reporter: Reynold Xin
>            Priority: Major
>         Attachments: SPARK-10816 Support session window natively.pdf, Session Window Support For Structure Streaming.pdf
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org