Posted to dev@mesos.apache.org by Chengwei Yang <ch...@gmail.com> on 2014/08/30 12:12:28 UTC

Review Request 25184: Delete framework data in TaskStatus to avoid OOM

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/
-----------------------------------------------------------

Review request for mesos and Adam B.


Bugs: MESOS-1746
    https://issues.apache.org/jira/browse/MESOS-1746


Repository: mesos-git


Description
-------

A bug was found in which Spark uses TaskStatus.data to transfer computed
results, so the mesos-master's RES memory grows rapidly until the process is
eventually killed by the OOM killer.
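
The diff itself is not reproduced in this thread, so the following is only a minimal sketch of the idea in the title, under stated assumptions. `TaskStatusSketch`, `TaskRecordSketch` and `retainWithoutData` are hypothetical names used for illustration; they are not the types or functions in src/master/master.cpp. The sketch assumes the master keeps a copy of each status update and that the update itself is still forwarded to the framework unchanged.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the TaskStatus message; only the fields
// discussed in this thread are modelled.
struct TaskStatusSketch {
  std::string task_id;
  std::string message;  // Kept as-is.
  std::string data;     // Opaque framework payload, e.g. a Spark task result.
};

// Hypothetical stand-in for the state the master keeps per task.
struct TaskRecordSketch {
  std::vector<TaskStatusSketch> statuses;
};

// Records a status update in the master's bookkeeping without its payload.
// The caller's `update` is not modified, so it can still be forwarded to
// the framework with `data` intact.
inline void retainWithoutData(TaskRecordSketch& task,
                              const TaskStatusSketch& update) {
  TaskStatusSketch retained;
  retained.task_id = update.task_id;
  retained.message = update.message;
  // `data` is deliberately left empty so large payloads are never copied
  // into long-lived master state.
  task.statuses.push_back(std::move(retained));
}
```

Building the retained copy field by field, rather than copying the whole status and clearing `data` afterwards, avoids a temporary copy of a payload that may be many megabytes.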


Diffs
-----

  src/master/master.cpp 2508b38e86b8399886bffcbaca8ec11c731363d8 

Diff: https://reviews.apache.org/r/25184/diff/


Testing
-------

Tested with Spark.


Thanks,

Chengwei Yang


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review51961
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25184]

All tests passed.

- Mesos ReviewBot


On Aug. 30, 2014, 10:12 a.m., Chengwei Yang wrote:
> (Updated Aug. 30, 2014, 10:12 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by "Timothy St. Clair" <ts...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54685
-----------------------------------------------------------


Please rebase to master as that function has diverged.

- Timothy St. Clair


On Sept. 6, 2014, 3:38 a.m., Chengwei Yang wrote:
> (Updated Sept. 6, 2014, 3:38 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by "Timothy St. Clair" <ts...@redhat.com>.

> On Oct. 14, 2014, 2:21 p.m., Timothy St. Clair wrote:
> > I tested locally but not to any great extent, and it passed my make check.  
> > 
> > Could you please elaborate on your testing in the review?

Once there is a second ship-it and the testing comment is updated, I'll push posthaste.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review56526
-----------------------------------------------------------


On Oct. 9, 2014, 2 p.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 2 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by "Timothy St. Clair" <ts...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review56526
-----------------------------------------------------------

Ship it!


I tested locally but not to any great extent, and it passed my make check.  

Could you please elaborate on your testing in the review?

- Timothy St. Clair


On Oct. 9, 2014, 2 p.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 2 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review55983
-----------------------------------------------------------


Patch looks great!

Reviews applied: [25184]

All tests passed.

- Mesos ReviewBot


On Oct. 9, 2014, 2 p.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 2 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.

> On Oct. 15, 2014, 5:02 p.m., Timothy Chen wrote:
> > src/master/master.cpp, line 4494
> > <https://reviews.apache.org/r/25184/diff/3/?file=716687#file716687line4494>
> >
> >     Sorry to keep nitpicking, but our style in Mesos is to have comments end with periods.
> >     
> >     We usually don't include the JIRA ticket number either. Please just remove the line ("MESOS-1746").

Or Tim, if you'd rather just fix the comment and commit it yourself :) That might be easier.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review56729
-----------------------------------------------------------


On Oct. 15, 2014, 2:23 a.m., Chengwei Yang wrote:
> (Updated Oct. 15, 2014, 2:23 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by "Timothy St. Clair" <ts...@redhat.com>.

> On Oct. 15, 2014, 5:02 p.m., Timothy Chen wrote:
> > src/master/master.cpp, line 4494
> > <https://reviews.apache.org/r/25184/diff/3/?file=716687#file716687line4494>
> >
> >     Sorry to keep nitpicking, but our style in Mesos is to have comments end with periods.
> >     
> >     We usually don't include the JIRA ticket number either. Please just remove the line ("MESOS-1746").
> 
> Timothy Chen wrote:
>     Or Tim, if you'd rather just fix the comment and commit it yourself :) That might be easier.

I'll do that.


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review56729
-----------------------------------------------------------


On Oct. 15, 2014, 2:23 a.m., Chengwei Yang wrote:
> (Updated Oct. 15, 2014, 2:23 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review56729
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/25184/#comment97115>

    Sorry to keep nitpicking, but our style in Mesos is to have comments end with periods.
    
    We usually don't include the JIRA ticket number either. Please just remove the line ("MESOS-1746").


- Timothy Chen


On Oct. 15, 2014, 2:23 a.m., Chengwei Yang wrote:
> (Updated Oct. 15, 2014, 2:23 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 10:23 a.m.)


Review request for mesos, Adam B and Timothy St. Clair.


Bugs: MESOS-1746
    https://issues.apache.org/jira/browse/MESOS-1746


Repository: mesos-git


Description
-------

A bug was found in which Spark uses TaskStatus.data to transfer computed
results, so the mesos-master's RES memory grows rapidly until the process is
eventually killed by the OOM killer.


Diffs
-----

  src/master/master.cpp cb46cec0674b3aa031450c5b4f48f4f8bb92767d 

Diff: https://reviews.apache.org/r/25184/diff/


Testing (updated)
-------

Tested with Spark. The issue is very easy to reproduce (100% of the time) with Spark: when Spark uses Mesos as its resource manager, its executor driver puts each task's result into TaskStatus. For example, the result of a single task looks like this:

14/08/22 13:29:18 INFO Executor: Serialized size of result for 248 is 17573033

That is about 16.8 MiB, and a Spark stage generally consists of perhaps hundreds of tasks and finishes within tens of seconds, so mesos-master is soon killed by the OOM killer.
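
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. The per-result size is the figure from the log line above; the task counts and the memory budget are illustrative assumptions, and the helper names are hypothetical. It ignores every other source of memory use in the master.

```cpp
#include <cstdint>

// Serialized result size taken from the Spark executor log line above.
constexpr std::uint64_t kResultBytes = 17573033;

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;
constexpr std::uint64_t kGiB = 1024ULL * kMiB;

// Bytes the master would hold if it kept `tasks` payloads of `bytesPerTask`.
constexpr std::uint64_t retainedBytes(std::uint64_t tasks,
                                      std::uint64_t bytesPerTask) {
  return tasks * bytesPerTask;
}

// Number of whole payloads that fit into `memoryBytes`.
constexpr std::uint64_t payloadsThatFit(std::uint64_t memoryBytes,
                                        std::uint64_t bytesPerTask) {
  return memoryBytes / bytesPerTask;
}
```

With these numbers, `retainedBytes(100, kResultBytes)` is 1,757,303,300 bytes (about 1.6 GiB), and `payloadsThatFit(10 * kGiB, kResultBytes)` is 611, so under these assumptions a single stage of a few hundred such tasks would already be enough to exhaust a 10 GiB budget.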


Thanks,

Chengwei Yang


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/
-----------------------------------------------------------

(Updated Oct. 9, 2014, 10 p.m.)


Review request for mesos, Adam B and Timothy St. Clair.


Bugs: MESOS-1746
    https://issues.apache.org/jira/browse/MESOS-1746


Repository: mesos-git


Description
-------

A bug was found in which Spark uses TaskStatus.data to transfer computed
results, so the mesos-master's RES memory grows rapidly until the process is
eventually killed by the OOM killer.


Diffs (updated)
-----

  src/master/master.cpp cb46cec0674b3aa031450c5b4f48f4f8bb92767d 

Diff: https://reviews.apache.org/r/25184/diff/


Testing
-------

Tested with Spark.


Thanks,

Chengwei Yang


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.

> On Oct. 7, 2014, 12:14 a.m., Timothy Chen wrote:
> > Chengwei, are you still able to work on this patch? I would like to see this get merged in 0.21.

@Timothy, sorry for the late reply. I was on a one-week holiday last week; I will update this patch this week.


- Chengwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review55509
-----------------------------------------------------------


On Sept. 27, 2014, 12:01 a.m., Chengwei Yang wrote:
> (Updated Sept. 27, 2014, 12:01 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review55509
-----------------------------------------------------------


Chengwei, are you still able to work on this patch? I would like to see this get merged in 0.21.

- Timothy Chen


On Sept. 26, 2014, 4:01 p.m., Chengwei Yang wrote:
> (Updated Sept. 26, 2014, 4:01 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.

> On Sept. 26, 2014, 5:17 p.m., Adam B wrote:
> > src/master/master.cpp, line 3173
> > <https://reviews.apache.org/r/25184/diff/2/?file=681985#file681985line3173>
> >
> >     Update Brenden's TODO, since you're now wiping the data field.
> >     I'm not sure if we really want to wipe the message field, especially for terminal states that could be monitored by an external process (not the framework).

Is the message field being used at all though?


- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54698
-----------------------------------------------------------


On Sept. 26, 2014, 4:01 p.m., Chengwei Yang wrote:
> (Updated Sept. 26, 2014, 4:01 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.

> On Sept. 27, 2014, 1:17 a.m., Adam B wrote:
> > src/master/master.cpp, line 3173
> > <https://reviews.apache.org/r/25184/diff/2/?file=681985#file681985line3173>
> >
> >     Update Brenden's TODO, since you're now wiping the data field.
> >     I'm not sure if we really want to wipe the message field, especially for terminal states that could be monitored by an external process (not the framework).
> 
> Timothy Chen wrote:
>     Is the message field being used at all though?

I haven't looked into the message field yet. I think it's fine to let this patch focus on the data field, because that is the one that actually caused problems.


- Chengwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54698
-----------------------------------------------------------


On Oct. 9, 2014, 10 p.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 10 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54698
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/25184/#comment94958>

    Update Brenden's TODO, since you're now wiping the data field.
    I'm not sure if we really want to wipe the message field, especially for terminal states that could be monitored by an external process (not the framework).


- Adam B


On Sept. 26, 2014, 9:01 a.m., Chengwei Yang wrote:
> (Updated Sept. 26, 2014, 9:01 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Adam B <ad...@mesosphere.io>.

> On Sept. 26, 2014, 9:47 a.m., Timothy Chen wrote:
> > src/master/master.cpp, line 3181
> > <https://reviews.apache.org/r/25184/diff/2/?file=681985#file681985line3181>
> >
> >     Period at the end of the comment.
> 
> Chengwei Yang wrote:
>     I'm not sure I understand you correctly; please correct me if not. Did you mean that the comment should say how often, or after how long, mesos-master gets killed by the OOM killer?
>     
>     If so, what we observed is that when running Spark jobs, every task stored about 17MB of data in TaskStatus, and even a small Spark job consists of several thousand tasks, so the job cannot finish if the leading mesos-master runs on a machine with less than 10GB of memory.
>     
>     I'll give a common example of the OOM scenario in the comment.

To be honest, I think Tim just wanted you to add a '.' character at the end of the comment, to satisfy our style guidelines.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54695
-----------------------------------------------------------


On Oct. 9, 2014, 7 a.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 7 a.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.

> On Sept. 27, 2014, 12:47 a.m., Timothy Chen wrote:
> > src/master/master.cpp, line 3181
> > <https://reviews.apache.org/r/25184/diff/2/?file=681985#file681985line3181>
> >
> >     Period at the end of the comment.

I'm not sure I understand you correctly; please correct me if not. Did you mean that the comment should say how often, or after how long, mesos-master gets killed by the OOM killer?

If so, what we observed is that when running Spark jobs, every task stored about 17MB of data in TaskStatus, and even a small Spark job consists of several thousand tasks, so the job cannot finish if the leading mesos-master runs on a machine with less than 10GB of memory.

I'll give a common example of the OOM scenario in the comment.


- Chengwei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54695
-----------------------------------------------------------


On Oct. 9, 2014, 10 p.m., Chengwei Yang wrote:
> (Updated Oct. 9, 2014, 10 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review54695
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/25184/#comment94957>

    Period at the end of the comment.


- Timothy Chen


On Sept. 26, 2014, 4:01 p.m., Chengwei Yang wrote:
> (Updated Sept. 26, 2014, 4:01 p.m.)


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/
-----------------------------------------------------------

(Updated Sept. 26, 2014, 4:01 p.m.)


Review request for mesos, Adam B and Timothy St. Clair.


Bugs: MESOS-1746
    https://issues.apache.org/jira/browse/MESOS-1746


Repository: mesos-git


Description
-------

A bug was found in which Spark uses TaskStatus.data to transfer computed
results, so the mesos-master's RES memory grows rapidly until the process is
eventually killed by the OOM killer.


Diffs
-----

  src/master/master.cpp 2508b38e86b8399886bffcbaca8ec11c731363d8 

Diff: https://reviews.apache.org/r/25184/diff/


Testing
-------

Tested with Spark.


Thanks,

Chengwei Yang


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Mesos ReviewBot <de...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review52544
-----------------------------------------------------------


Bad patch!

Reviews applied: [25184]

Failed command: ./bootstrap

Error:
 autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --warnings=all -I m4
autoreconf: configure.ac: tracing
configure.ac:45: warning: back quotes and double quotes must not be escaped in: unrecognized option: $[1]
configure.ac:45: Try \`$[0] --help' for more information.
aclocal.m4:625: LT_OUTPUT is expanded from...
configure.ac:45: the top level
configure.ac:45: warning: back quotes and double quotes must not be escaped in: unrecognized argument: $[1]
configure.ac:45: Try \`$[0] --help' for more information.
aclocal.m4:625: LT_OUTPUT is expanded from...
configure.ac:45: the top level
configure.ac:430: warning: cannot check for file existence when cross compiling
../../lib/autoconf/general.m4:2777: AC_CHECK_FILE is expanded from...
configure.ac:430: the top level
configure.ac:554: warning: The macro `AC_LANG_SAVE' is obsolete.
configure.ac:554: You should run autoupdate.
../../lib/autoconf/lang.m4:125: AC_LANG_SAVE is expanded from...
m4/acx_pthread.m4:63: ACX_PTHREAD is expanded from...
configure.ac:554: the top level
configure.ac:554: warning: The macro `AC_LANG_C' is obsolete.
configure.ac:554: You should run autoupdate.
../../lib/autoconf/c.m4:72: AC_LANG_C is expanded from...
m4/acx_pthread.m4:63: ACX_PTHREAD is expanded from...
configure.ac:554: the top level
configure.ac:554: warning: The macro `AC_TRY_LINK' is obsolete.
configure.ac:554: You should run autoupdate.
../../lib/autoconf/general.m4:2687: AC_TRY_LINK is expanded from...
m4/acx_pthread.m4:63: ACX_PTHREAD is expanded from...
configure.ac:554: the top level
configure.ac:554: warning: The macro `AC_LANG_RESTORE' is obsolete.
configure.ac:554: You should run autoupdate.
../../lib/autoconf/lang.m4:134: AC_LANG_RESTORE is expanded from...
m4/acx_pthread.m4:63: ACX_PTHREAD is expanded from...
configure.ac:554: the top level
configure.ac:905: warning: The macro `AC_PYTHON_DEVEL' is obsolete.
configure.ac:905: You should run autoupdate.
m4/ax_python_devel.m4:72: AC_PYTHON_DEVEL is expanded from...
configure.ac:905: the top level
autoreconf: configure.ac: adding subdirectory 3rdparty/libprocess to autoreconf
autoreconf: Entering directory `3rdparty/libprocess'
autom4te: cannot open > /tmp/ar9elUtK/am4t5EkGUD/traces.m4: No such file or directory
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1

- Mesos ReviewBot


On Sept. 6, 2014, 3:38 a.m., Chengwei Yang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25184/
> -----------------------------------------------------------
> 
> (Updated Sept. 6, 2014, 3:38 a.m.)
> 
> 
> Review request for mesos and Adam B.
> 
> 
> Bugs: MESOS-1746
>     https://issues.apache.org/jira/browse/MESOS-1746
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> A bug was found where Spark uses TaskStatus.data to transfer computed
> results, which makes mesos-master RES memory grow quickly until the master
> is eventually killed by the OOM killer.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 2508b38e86b8399886bffcbaca8ec11c731363d8 
> 
> Diff: https://reviews.apache.org/r/25184/diff/
> 
> 
> Testing
> -------
> 
> tested with spark
> 
> 
> Thanks,
> 
> Chengwei Yang
> 
>


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Chengwei Yang <ch...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/
-----------------------------------------------------------

(Updated Sept. 6, 2014, 11:38 a.m.)


Review request for mesos and Adam B.


Bugs: MESOS-1746
    https://issues.apache.org/jira/browse/MESOS-1746


Repository: mesos-git


Description
-------

A bug was found where Spark uses TaskStatus.data to transfer computed
results, which makes mesos-master RES memory grow quickly until the master
is eventually killed by the OOM killer.


Diffs (updated)
-----

  src/master/master.cpp 2508b38e86b8399886bffcbaca8ec11c731363d8 

Diff: https://reviews.apache.org/r/25184/diff/


Testing
-------

tested with spark


Thanks,

Chengwei Yang


Re: Review Request 25184: Delete framework data in TaskStatus to avoid OOM

Posted by Timothy Chen <tn...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25184/#review51963
-----------------------------------------------------------



src/master/master.cpp
<https://reviews.apache.org/r/25184/#comment90679>

    I don't think we want Spark-specific comments; we can instead describe a scenario where the master can OOM.
    Also, we generally don't use block comments; we use the single-line comment style in Mesos.
    
    Otherwise I think the change makes sense. You will want to run all the master unit tests just to make sure.


- Timothy Chen


On Aug. 30, 2014, 10:12 a.m., Chengwei Yang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25184/
> -----------------------------------------------------------
> 
> (Updated Aug. 30, 2014, 10:12 a.m.)
> 
> 
> Review request for mesos and Adam B.
> 
> 
> Bugs: MESOS-1746
>     https://issues.apache.org/jira/browse/MESOS-1746
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> A bug was found where Spark uses TaskStatus.data to transfer computed
> results, which makes mesos-master RES memory grow quickly until the master
> is eventually killed by the OOM killer.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 2508b38e86b8399886bffcbaca8ec11c731363d8 
> 
> Diff: https://reviews.apache.org/r/25184/diff/
> 
> 
> Testing
> -------
> 
> tested with spark
> 
> 
> Thanks,
> 
> Chengwei Yang
> 
>