You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Jungtaek Lim <ka...@gmail.com> on 2018/07/06 01:09:49 UTC

Re: Asking for reviewing PRs regarding structured streaming

Bump. I have been having hard time working on making additional PRs since
some of these rely on non-merged PRs, so spending additional time to
decouple these things if possible.

https://github.com/apache/spark/pulls/HeartSaVioR

Pending 5 PRs so far, and may add more sooner or later.

Thanks,
Jungtaek Lim (HeartSaVioR)

2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Kindly reminder since around 2 weeks passed. I've added more PR during 2
> weeks and even planning to do more.
>
> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Hi Spark devs,
>>
>> I have couple of pull requests for structured streaming which are getting
>> older and fading out from earlier pages in PR pages.
>>
>> https://github.com/apache/spark/pull/21469
>> https://github.com/apache/spark/pull/21357
>> https://github.com/apache/spark/pull/21222
>>
>> Two of them are in a kind of approval by couple of folks, but no approval
>> from committers yet.
>> One of them needs rebase and I would be happy to do it after reviewing or
>> in progress of reviewing.
>>
>> Getting reviewed in time would be critical for contributors to be honest,
>> so I'd like to ask dev mailing list to review my PRs.
>>
>> Thanks in advance,
>> Jungtaek Lim (HeartSaVioR)
>>
>

Re: Asking for reviewing PRs regarding structured streaming

Posted by Jungtaek Lim <ka...@gmail.com>.

I'd like to bump this again, since only one of 6 pull requests is merged (5
remaining), and others are not reviewed (non code style) from committers.

https://github.com/apache/spark/pulls/HeartSaVioR

All pull requests are related to Structured Streaming, and most of all are
already reviewed by couple of contributors. They're open for 17 days at
least and more than 2 months at most. I'm not persuading committers to
merge them in 2.4, but hope to see any reactions / reviews so that I can
hopefully reflect and take them forward to be ready to merge.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 16일 (월) 오후 1:04, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Bump. I got couple of review comments from contributors including soft
> LGTM, but still haven't got any (non code style) review from committers, so
> technically haven't have any progress to be merged.
>
> I'm planning to work on adding new feature as well, but it's not easy for
> me to concentrate on something with also concerning to maintain 6 existing
> pull requests. Merge conflicts would be matter on maintaining, especially
> other pull requests (submitted later than my pull requests) are getting
> reviewed and merged.
>
> I'd like to ask any structured streaming related committer to take a look
> at pull requests.
>
> - Jungtaek Lim (HeartSaVioR)
>
> 2018년 7월 12일 (목) 오후 10:41, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> I recently added more test results to SPARK-24763 [1] which shows that
>> the proposal reduces state size according to the ratio of key-value size,
>> whereas there's no performance hit and sometimes even slight boost.
>>
>> Please refer the latest comment in JIRA issue [2] to see the numbers from
>> perf. tests.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>> 2.
>> https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367
>>
>>
>> 2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
>>> option to enable optimization of state size in streaming aggregation
>>> without hurting performance.
>>>
>>> The idea is to remove data for key fields from value which is duplicated
>>> between key and value in state row. This requires additional operations
>>> like projection and join, but smaller state row would also give performance
>>> benefit, which can offset each other.
>>>
>>> Please refer the comment in JIRA issue [2] to see the numbers from
>>> simple perf. test.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>>>
>>>
>>> 2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>
>>>> Ted Yu suggested posting the improved numbers to this thread and I
>>>> think it's good idea, so also posting here, but I also think explaining
>>>> rationalization of my issues would help understanding why I'm submitting
>>>> couple of patches, so I'll explain it first. (Sorry to post a wall of text).
>>>>
>>>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS
>>>> state store provider from 10x~80x of size of state for a batch according to
>>>> various stateful workloads to less than or around 2x. The new option is
>>>> flexible so it can be even around 1x or even effectively disable cache.
>>>> Please refer the comment in the PR [2] to get more details. (hard to post
>>>> detailed numbers in mail format so link a Github comment instead)
>>>>
>>>> I have interest on stateful streaming processing on Structured
>>>> Streaming, and have been learning from codebase as well as analyzing memory
>>>> usage as well as latency (while I admit it is hard to measure latency
>>>> correctly...).
>>>>
>>>>
>>>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>>>
>>>> While took a look at HDFSBackedStateStoreProvider I indicated a kind of
>>>> excessive caching. As I described in section "The impact of storing
>>>> multiple versions from HDFSBackedStateStoreProvider" in above link, while
>>>> multiple versions share the same UnsafeRow unless there's a change on the
>>>> value which lessen the impact of caching multiple versions (credit to Jose
>>>> Torres since I realized it from his comment). But in some workloads which
>>>> lots of writes to state incurs in a batch, the overall memory usage of
>>>> state is going to be out of expectation.
>>>>
>>>> Related patch [3] is also submitted from other contributor (so I'm not
>>>> the one to notice this behavior), whereas the patch might not look enough
>>>> generalized to apply.
>>>>
>>>> First I decided to track the overall memory size of state provider
>>>> cache and expose to UI as well as query status (SPARK-24441 [4]). The
>>>> metric looked like critical and worth to monitor, so I thought it is better
>>>> to expose it (and watermark) to Dropwizard (SPARK-24637 [5]).
>>>>
>>>> Based on adoption of SPARK-24441, I could find more flexible way to
>>>> resolve the issue (SPARK-24717) what I've mentioned in tl;dr.
>>>>
>>>> So 3 of 5 issues are coupled so far to track and resolve one issue.
>>>> Hope that it helps explaining worth of reviews for these patches.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>>>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>>>> 3. https://github.com/apache/spark/pull/21500
>>>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>>>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>>>
>>>> ps. Before all mentioned issues I also submitted some other issues
>>>> regarding feature addition/refactor (2 of 5 issues).
>>>>
>>>>
>>>> 2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>
>>>>> Bump. I have been having hard time working on making additional PRs
>>>>> since some of these rely on non-merged PRs, so spending additional time to
>>>>> decouple these things if possible.
>>>>>
>>>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>>>
>>>>> Pending 5 PRs so far, and may add more sooner or later.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>>
>>>>>> Kindly reminder since around 2 weeks passed. I've added more PR
>>>>>> during 2 weeks and even planning to do more.
>>>>>>
>>>>>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>>>
>>>>>>> Hi Spark devs,
>>>>>>>
>>>>>>> I have couple of pull requests for structured streaming which are
>>>>>>> getting older and fading out from earlier pages in PR pages.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/21469
>>>>>>> https://github.com/apache/spark/pull/21357
>>>>>>> https://github.com/apache/spark/pull/21222
>>>>>>>
>>>>>>> Two of them are in a kind of approval by couple of folks, but no
>>>>>>> approval from committers yet.
>>>>>>> One of them needs rebase and I would be happy to do it after
>>>>>>> reviewing or in progress of reviewing.
>>>>>>>
>>>>>>> Getting reviewed in time would be critical for contributors to be
>>>>>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>

Re: Asking for reviewing PRs regarding structured streaming

Posted by Jungtaek Lim <ka...@gmail.com>.

Bump. I got couple of review comments from contributors including soft
LGTM, but still haven't got any (non code style) review from committers, so
technically haven't have any progress to be merged.

I'm planning to work on adding new feature as well, but it's not easy for
me to concentrate on something with also concerning to maintain 6 existing
pull requests. Merge conflicts would be matter on maintaining, especially
other pull requests (submitted later than my pull requests) are getting
reviewed and merged.

I'd like to ask any structured streaming related committer to take a look
at pull requests.

- Jungtaek Lim (HeartSaVioR)

2018년 7월 12일 (목) 오후 10:41, Jungtaek Lim <ka...@gmail.com>님이 작성:

> I recently added more test results to SPARK-24763 [1] which shows that the
> proposal reduces state size according to the ratio of key-value size,
> whereas there's no performance hit and sometimes even slight boost.
>
> Please refer the latest comment in JIRA issue [2] to see the numbers from
> perf. tests.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
> 2.
> https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367
>
>
> 2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
>> option to enable optimization of state size in streaming aggregation
>> without hurting performance.
>>
>> The idea is to remove data for key fields from value which is duplicated
>> between key and value in state row. This requires additional operations
>> like projection and join, but smaller state row would also give performance
>> benefit, which can offset each other.
>>
>> Please refer the comment in JIRA issue [2] to see the numbers from simple
>> perf. test.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24763
>>
>>
>> 2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Ted Yu suggested posting the improved numbers to this thread and I think
>>> it's good idea, so also posting here, but I also think explaining
>>> rationalization of my issues would help understanding why I'm submitting
>>> couple of patches, so I'll explain it first. (Sorry to post a wall of text).
>>>
>>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state
>>> store provider from 10x~80x of size of state for a batch according to
>>> various stateful workloads to less than or around 2x. The new option is
>>> flexible so it can be even around 1x or even effectively disable cache.
>>> Please refer the comment in the PR [2] to get more details. (hard to post
>>> detailed numbers in mail format so link a Github comment instead)
>>>
>>> I have interest on stateful streaming processing on Structured
>>> Streaming, and have been learning from codebase as well as analyzing memory
>>> usage as well as latency (while I admit it is hard to measure latency
>>> correctly...).
>>>
>>>
>>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>>
>>> While took a look at HDFSBackedStateStoreProvider I indicated a kind of
>>> excessive caching. As I described in section "The impact of storing
>>> multiple versions from HDFSBackedStateStoreProvider" in above link, while
>>> multiple versions share the same UnsafeRow unless there's a change on the
>>> value which lessen the impact of caching multiple versions (credit to Jose
>>> Torres since I realized it from his comment). But in some workloads which
>>> lots of writes to state incurs in a batch, the overall memory usage of
>>> state is going to be out of expectation.
>>>
>>> Related patch [3] is also submitted from other contributor (so I'm not
>>> the one to notice this behavior), whereas the patch might not look enough
>>> generalized to apply.
>>>
>>> First I decided to track the overall memory size of state provider cache
>>> and expose to UI as well as query status (SPARK-24441 [4]). The metric
>>> looked like critical and worth to monitor, so I thought it is better to
>>> expose it (and watermark) to Dropwizard (SPARK-24637 [5]).
>>>
>>> Based on adoption of SPARK-24441, I could find more flexible way to
>>> resolve the issue (SPARK-24717) what I've mentioned in tl;dr.
>>>
>>> So 3 of 5 issues are coupled so far to track and resolve one issue. Hope
>>> that it helps explaining worth of reviews for these patches.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>>> 3. https://github.com/apache/spark/pull/21500
>>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>>
>>> ps. Before all mentioned issues I also submitted some other issues
>>> regarding feature addition/refactor (2 of 5 issues).
>>>
>>>
>>> 2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>
>>>> Bump. I have been having hard time working on making additional PRs
>>>> since some of these rely on non-merged PRs, so spending additional time to
>>>> decouple these things if possible.
>>>>
>>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>>
>>>> Pending 5 PRs so far, and may add more sooner or later.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>
>>>>> Kindly reminder since around 2 weeks passed. I've added more PR during
>>>>> 2 weeks and even planning to do more.
>>>>>
>>>>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I have couple of pull requests for structured streaming which are
>>>>>> getting older and fading out from earlier pages in PR pages.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/21469
>>>>>> https://github.com/apache/spark/pull/21357
>>>>>> https://github.com/apache/spark/pull/21222
>>>>>>
>>>>>> Two of them are in a kind of approval by couple of folks, but no
>>>>>> approval from committers yet.
>>>>>> One of them needs rebase and I would be happy to do it after
>>>>>> reviewing or in progress of reviewing.
>>>>>>
>>>>>> Getting reviewed in time would be critical for contributors to be
>>>>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>

Re: Asking for reviewing PRs regarding structured streaming

Posted by Jungtaek Lim <ka...@gmail.com>.

I recently added more test results to SPARK-24763 [1] which shows that the
proposal reduces state size according to the ratio of key-value size,
whereas there's no performance hit and sometimes even slight boost.

Please refer the latest comment in JIRA issue [2] to see the numbers from
perf. tests.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763
2.
https://issues.apache.org/jira/browse/SPARK-24763?focusedCommentId=16541367&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16541367


2018년 7월 9일 (월) 오후 5:28, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
> option to enable optimization of state size in streaming aggregation
> without hurting performance.
>
> The idea is to remove data for key fields from value which is duplicated
> between key and value in state row. This requires additional operations
> like projection and join, but smaller state row would also give performance
> benefit, which can offset each other.
>
> Please refer the comment in JIRA issue [2] to see the numbers from simple
> perf. test.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24763
>
>
> 2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Ted Yu suggested posting the improved numbers to this thread and I think
>> it's good idea, so also posting here, but I also think explaining
>> rationalization of my issues would help understanding why I'm submitting
>> couple of patches, so I'll explain it first. (Sorry to post a wall of text).
>>
>> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state
>> store provider from 10x~80x of size of state for a batch according to
>> various stateful workloads to less than or around 2x. The new option is
>> flexible so it can be even around 1x or even effectively disable cache.
>> Please refer the comment in the PR [2] to get more details. (hard to post
>> detailed numbers in mail format so link a Github comment instead)
>>
>> I have interest on stateful streaming processing on Structured Streaming,
>> and have been learning from codebase as well as analyzing memory usage as
>> well as latency (while I admit it is hard to measure latency correctly...).
>>
>>
>> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>>
>> While took a look at HDFSBackedStateStoreProvider I indicated a kind of
>> excessive caching. As I described in section "The impact of storing
>> multiple versions from HDFSBackedStateStoreProvider" in above link, while
>> multiple versions share the same UnsafeRow unless there's a change on the
>> value which lessen the impact of caching multiple versions (credit to Jose
>> Torres since I realized it from his comment). But in some workloads which
>> lots of writes to state incurs in a batch, the overall memory usage of
>> state is going to be out of expectation.
>>
>> Related patch [3] is also submitted from other contributor (so I'm not
>> the one to notice this behavior), whereas the patch might not look enough
>> generalized to apply.
>>
>> First I decided to track the overall memory size of state provider cache
>> and expose to UI as well as query status (SPARK-24441 [4]). The metric
>> looked like critical and worth to monitor, so I thought it is better to
>> expose it (and watermark) to Dropwizard (SPARK-24637 [5]).
>>
>> Based on adoption of SPARK-24441, I could find more flexible way to
>> resolve the issue (SPARK-24717) what I've mentioned in tl;dr.
>>
>> So 3 of 5 issues are coupled so far to track and resolve one issue. Hope
>> that it helps explaining worth of reviews for these patches.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-24717
>> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
>> 3. https://github.com/apache/spark/pull/21500
>> 4. https://issues.apache.org/jira/browse/SPARK-24441
>> 5. https://issues.apache.org/jira/browse/SPARK-24637
>>
>> ps. Before all mentioned issues I also submitted some other issues
>> regarding feature addition/refactor (2 of 5 issues).
>>
>>
>> 2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Bump. I have been having hard time working on making additional PRs
>>> since some of these rely on non-merged PRs, so spending additional time to
>>> decouple these things if possible.
>>>
>>> https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>> Pending 5 PRs so far, and may add more sooner or later.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>
>>>> Kindly reminder since around 2 weeks passed. I've added more PR during
>>>> 2 weeks and even planning to do more.
>>>>
>>>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I have couple of pull requests for structured streaming which are
>>>>> getting older and fading out from earlier pages in PR pages.
>>>>>
>>>>> https://github.com/apache/spark/pull/21469
>>>>> https://github.com/apache/spark/pull/21357
>>>>> https://github.com/apache/spark/pull/21222
>>>>>
>>>>> Two of them are in a kind of approval by couple of folks, but no
>>>>> approval from committers yet.
>>>>> One of them needs rebase and I would be happy to do it after reviewing
>>>>> or in progress of reviewing.
>>>>>
>>>>> Getting reviewed in time would be critical for contributors to be
>>>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>>>
>>>>> Thanks in advance,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>

Re: Asking for reviewing PRs regarding structured streaming

Posted by Jungtaek Lim <ka...@gmail.com>.

Now I'm adding one more issue (SPARK-24763 [1]), which proposes a new
option to enable optimization of state size in streaming aggregation
without hurting performance.

The idea is to remove data for key fields from value which is duplicated
between key and value in state row. This requires additional operations
like projection and join, but smaller state row would also give performance
benefit, which can offset each other.

Please refer the comment in JIRA issue [2] to see the numbers from simple
perf. test.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24763


2018년 7월 6일 (금) 오후 1:54, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Ted Yu suggested posting the improved numbers to this thread and I think
> it's good idea, so also posting here, but I also think explaining
> rationalization of my issues would help understanding why I'm submitting
> couple of patches, so I'll explain it first. (Sorry to post a wall of text).
>
> tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state
> store provider from 10x~80x of size of state for a batch according to
> various stateful workloads to less than or around 2x. The new option is
> flexible so it can be even around 1x or even effectively disable cache.
> Please refer the comment in the PR [2] to get more details. (hard to post
> detailed numbers in mail format so link a Github comment instead)
>
> I have interest on stateful streaming processing on Structured Streaming,
> and have been learning from codebase as well as analyzing memory usage as
> well as latency (while I admit it is hard to measure latency correctly...).
>
>
> https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html
>
> While took a look at HDFSBackedStateStoreProvider I indicated a kind of
> excessive caching. As I described in section "The impact of storing
> multiple versions from HDFSBackedStateStoreProvider" in above link, while
> multiple versions share the same UnsafeRow unless there's a change on the
> value which lessen the impact of caching multiple versions (credit to Jose
> Torres since I realized it from his comment). But in some workloads which
> lots of writes to state incurs in a batch, the overall memory usage of
> state is going to be out of expectation.
>
> Related patch [3] is also submitted from other contributor (so I'm not the
> one to notice this behavior), whereas the patch might not look enough
> generalized to apply.
>
> First I decided to track the overall memory size of state provider cache
> and expose to UI as well as query status (SPARK-24441 [4]). The metric
> looked like critical and worth to monitor, so I thought it is better to
> expose it (and watermark) to Dropwizard (SPARK-24637 [5]).
>
> Based on adoption of SPARK-24441, I could find more flexible way to
> resolve the issue (SPARK-24717) what I've mentioned in tl;dr.
>
> So 3 of 5 issues are coupled so far to track and resolve one issue. Hope
> that it helps explaining worth of reviews for these patches.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24717
> 2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
> 3. https://github.com/apache/spark/pull/21500
> 4. https://issues.apache.org/jira/browse/SPARK-24441
> 5. https://issues.apache.org/jira/browse/SPARK-24637
>
> ps. Before all mentioned issues I also submitted some other issues
> regarding feature addition/refactor (2 of 5 issues).
>
>
> 2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Bump. I have been having hard time working on making additional PRs since
>> some of these rely on non-merged PRs, so spending additional time to
>> decouple these things if possible.
>>
>> https://github.com/apache/spark/pulls/HeartSaVioR
>>
>> Pending 5 PRs so far, and may add more sooner or later.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Kindly reminder since around 2 weeks passed. I've added more PR during 2
>>> weeks and even planning to do more.
>>>
>>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>>
>>>> Hi Spark devs,
>>>>
>>>> I have couple of pull requests for structured streaming which are
>>>> getting older and fading out from earlier pages in PR pages.
>>>>
>>>> https://github.com/apache/spark/pull/21469
>>>> https://github.com/apache/spark/pull/21357
>>>> https://github.com/apache/spark/pull/21222
>>>>
>>>> Two of them are in a kind of approval by couple of folks, but no
>>>> approval from committers yet.
>>>> One of them needs rebase and I would be happy to do it after reviewing
>>>> or in progress of reviewing.
>>>>
>>>> Getting reviewed in time would be critical for contributors to be
>>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>>
>>>> Thanks in advance,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>

Re: Asking for reviewing PRs regarding structured streaming

Posted by Jungtaek Lim <ka...@gmail.com>.

Ted Yu suggested posting the improved numbers to this thread and I think
it's good idea, so also posting here, but I also think explaining
rationalization of my issues would help understanding why I'm submitting
couple of patches, so I'll explain it first. (Sorry to post a wall of text).

tl;dr. SPARK-24717 [1] can reduce the overall memory usage of HDFS state
store provider from 10x~80x of size of state for a batch according to
various stateful workloads to less than or around 2x. The new option is
flexible so it can be even around 1x or even effectively disable cache.
Please refer the comment in the PR [2] to get more details. (hard to post
detailed numbers in mail format so link a Github comment instead)

I have interest on stateful streaming processing on Structured Streaming,
and have been learning from codebase as well as analyzing memory usage as
well as latency (while I admit it is hard to measure latency correctly...).

https://community.hortonworks.com/content/kbentry/199257/memory-usage-of-state-in-structured-streaming.html

While took a look at HDFSBackedStateStoreProvider I indicated a kind of
excessive caching. As I described in section "The impact of storing
multiple versions from HDFSBackedStateStoreProvider" in above link, while
multiple versions share the same UnsafeRow unless there's a change on the
value which lessen the impact of caching multiple versions (credit to Jose
Torres since I realized it from his comment). But in some workloads which
lots of writes to state incurs in a batch, the overall memory usage of
state is going to be out of expectation.

Related patch [3] is also submitted from other contributor (so I'm not the
one to notice this behavior), whereas the patch might not look enough
generalized to apply.

First I decided to track the overall memory size of state provider cache
and expose to UI as well as query status (SPARK-24441 [4]). The metric
looked like critical and worth to monitor, so I thought it is better to
expose it (and watermark) to Dropwizard (SPARK-24637 [5]).

Based on adoption of SPARK-24441, I could find more flexible way to resolve
the issue (SPARK-24717) what I've mentioned in tl;dr.

So 3 of 5 issues are coupled so far to track and resolve one issue. Hope
that it helps explaining worth of reviews for these patches.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24717
2. https://github.com/apache/spark/pull/21700#issuecomment-402902576
3. https://github.com/apache/spark/pull/21500
4. https://issues.apache.org/jira/browse/SPARK-24441
5. https://issues.apache.org/jira/browse/SPARK-24637

ps. Before all mentioned issues I also submitted some other issues
regarding feature addition/refactor (2 of 5 issues).


2018년 7월 6일 (금) 오전 10:09, Jungtaek Lim <ka...@gmail.com>님이 작성:

> Bump. I have been having hard time working on making additional PRs since
> some of these rely on non-merged PRs, so spending additional time to
> decouple these things if possible.
>
> https://github.com/apache/spark/pulls/HeartSaVioR
>
> Pending 5 PRs so far, and may add more sooner or later.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 2018년 7월 1일 (일) 오전 6:21, Jungtaek Lim <ka...@gmail.com>님이 작성:
>
>> Kindly reminder since around 2 weeks passed. I've added more PR during 2
>> weeks and even planning to do more.
>>
>> 2018년 6월 19일 (화) 오후 6:34, Jungtaek Lim <ka...@gmail.com>님이 작성:
>>
>>> Hi Spark devs,
>>>
>>> I have couple of pull requests for structured streaming which are
>>> getting older and fading out from earlier pages in PR pages.
>>>
>>> https://github.com/apache/spark/pull/21469
>>> https://github.com/apache/spark/pull/21357
>>> https://github.com/apache/spark/pull/21222
>>>
>>> Two of them are in a kind of approval by couple of folks, but no
>>> approval from committers yet.
>>> One of them needs rebase and I would be happy to do it after reviewing
>>> or in progress of reviewing.
>>>
>>> Getting reviewed in time would be critical for contributors to be
>>> honest, so I'd like to ask dev mailing list to review my PRs.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>