Posted to dev@spark.apache.org by Kay Ousterhout <ke...@eecs.berkeley.edu> on 2017/01/04 00:35:38 UTC

Tests failing with GC limit exceeded

I've noticed a bunch of the recent builds failing because of GC limits, for
seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
been any recent changes in the build configuration that might be causing
this?  Does anyone else have any ideas about what's going on here?

-Kay

Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
quick update:

things are looking slightly...  better.  the number of failing builds
due to GC overhead has decreased slightly since the reboots last
week...  in fact, in the last three days the only builds to be
affected are spark-master-test-maven-hadoop-2.7 (three failures) and
spark-master-test-maven-hadoop-2.6 (five failures).

overall percentages (over two weeks) have also dropped from ~9% to
~7%, so at least the rate of failure is dropping.

so, while we're still bleeding, it's slowed down a bit.  we'll
still need to audit the java heap size allocs in the various tests,
however.
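
a rough sketch of what that audit would look like...  just finding every
place the build pins a jvm heap size for the tests.  the paths and flag
names below are from memory, so treat them as approximate:

# maven side: the forked test jvm heap lives in the scalatest/surefire
# argLine in pom.xml
$ grep -n "Xmx" pom.xml

# sbt side: the test javaOptions are set in the build definition
$ grep -n "Xmx" project/SparkBuild.scala

# jenkins side: whatever the job configs export for the build environment
$ env | grep -E "MAVEN_OPTS|SBT_OPTS"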

shane

On Fri, Jan 6, 2017 at 1:06 PM, shane knapp <sk...@berkeley.edu> wrote:
> (adding michael armbrust and josh rosen for visibility)
>
> ok.  roughly 9% of all spark test builds (including both PRB builds)
> are failing due to GC overhead limits.
>
> $ wc -l SPARK_TEST_BUILDS GC_FAIL
>  1350 SPARK_TEST_BUILDS
>   125 GC_FAIL
>
> here are the affected builds (over the past ~2 weeks):
> $ sort builds.raw | uniq -c
>       6 NewSparkPullRequestBuilder
>       1 spark-branch-2.0-test-sbt-hadoop-2.6
>       6 spark-branch-2.1-test-maven-hadoop-2.7
>       1 spark-master-test-maven-hadoop-2.4
>      10 spark-master-test-maven-hadoop-2.6
>      12 spark-master-test-maven-hadoop-2.7
>       5 spark-master-test-sbt-hadoop-2.2
>      15 spark-master-test-sbt-hadoop-2.3
>      11 spark-master-test-sbt-hadoop-2.4
>      16 spark-master-test-sbt-hadoop-2.6
>      22 spark-master-test-sbt-hadoop-2.7
>      20 SparkPullRequestBuilder
>
> please note i also included the spark 1.6 test builds in there just to
> check...  they last ran ~1 month ago, and had no GC overhead failures.
> this leads me to believe that this behavior is quite recent.
>
> so yeah...  looks like we (someone other than me?) need to take a
> look at the sbt and maven java opts.  :)
>
> shane


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
(adding michael armbrust and josh rosen for visibility)

ok.  roughly 9% of all spark test builds (including both PRB builds)
are failing due to GC overhead limits.

$ wc -l SPARK_TEST_BUILDS GC_FAIL
 1350 SPARK_TEST_BUILDS
  125 GC_FAIL

here are the affected builds (over the past ~2 weeks):
$ sort builds.raw | uniq -c
      6 NewSparkPullRequestBuilder
      1 spark-branch-2.0-test-sbt-hadoop-2.6
      6 spark-branch-2.1-test-maven-hadoop-2.7
      1 spark-master-test-maven-hadoop-2.4
     10 spark-master-test-maven-hadoop-2.6
     12 spark-master-test-maven-hadoop-2.7
      5 spark-master-test-sbt-hadoop-2.2
     15 spark-master-test-sbt-hadoop-2.3
     11 spark-master-test-sbt-hadoop-2.4
     16 spark-master-test-sbt-hadoop-2.6
     22 spark-master-test-sbt-hadoop-2.7
     20 SparkPullRequestBuilder
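
(for the record, here's roughly how those files were put together...  the
grep pattern and jenkins paths below are reconstructions from memory, not
the literal commands:)

# one line per spark test build log from the last ~2 weeks
$ find $JENKINS_HOME/jobs/*park*/builds -maxdepth 2 -name log -mtime -14 > SPARK_TEST_BUILDS

# builds whose console log hit the GC overhead limit
$ xargs grep -l "GC overhead limit exceeded" < SPARK_TEST_BUILDS > GC_FAIL

# strip each path down to its job name; builds.raw feeds the uniq -c above
$ sed 's|.*/jobs/||; s|/builds/.*||' GC_FAIL > builds.raw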

please note i also included the spark 1.6 test builds in there just to
check...  they last ran ~1 month ago, and had no GC overhead failures.
this leads me to believe that this behavior is quite recent.

so yeah...  looks like we (someone other than me?) need to take a
look at the sbt and maven java opts.  :)

shane


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
On Fri, Jan 6, 2017 at 12:20 PM, shane knapp <sk...@berkeley.edu> wrote:
> FYI, this is happening across all spark builds...  not just the PRB.

s/all/almost all/


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
FYI, this is happening across all spark builds...  not just the PRB.
i'm compiling a report now and will email that out this afternoon.
:(

On Thu, Jan 5, 2017 at 9:00 PM, shane knapp <sk...@berkeley.edu> wrote:
> unsurprisingly, we had another GC overhead failure:
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70949/console
>
> so, definitely not the system (everything looks hunky dory on the build node).
>
>> It can always be some memory leak; if we increase the memory settings
>> and OOMs still happen, that would be a good indication. Also if the
>> same tests tend to fail (even if unreliably).
>>
> yeah, this would be a great way to check for leaks. :)
>
>> Normally some existing code / feature requiring more memory than
>> before, especially some non-trivial amount, is suspicious. But
>> sometimes new features / new tests require more memory than configured
>> in the build scripts, and sometimes blow up.
>>
> yep...  i agree on both points.


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
unsurprisingly, we had another GC overhead failure:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70949/console

so, definitely not the system (everything looks hunky dory on the build node).

> It can always be some memory leak; if we increase the memory settings
> and OOMs still happen, that would be a good indication. Also if the
> same tests tend to fail (even if unreliably).
>
yeah, this would be a great way to check for leaks. :)

> Normally some existing code / feature requiring more memory than
> before, especially some non-trivial amount, is suspicious. But
> sometimes new features / new tests require more memory than configured
> in the build scripts, and sometimes blow up.
>
yep...  i agree on both points.


Re: Tests failing with GC limit exceeded

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Thu, Jan 5, 2017 at 4:58 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
> But is there any non-memory-leak reason why the tests should need more
> memory?  In theory each test should be cleaning up its own Spark Context
> etc. right? My memory is that OOM issues in the tests in the past have been
> indicative of memory leaks somewhere.

It can always be some memory leak; if we increase the memory settings
and OOMs still happen, that would be a good indication. Also if the
same tests tend to fail (even if unreliably).
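
Concretely, the "increase the memory settings" experiment just means
raising the heap in both build paths and re-running; the flags and values
below are illustrative, not what the build scripts currently use:

# Maven: MAVEN_OPTS only grows the Maven JVM itself; the forked test JVM
# gets its heap from the -Xmx in the scalatest/surefire argLine in pom.xml,
# so that is the number to bump for test OOMs.
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=512m"

# sbt: the launcher takes a heap size in MB, and the forked test JVMs take
# theirs from the Test javaOptions in project/SparkBuild.scala.
./build/sbt -mem 4096 core/test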

Normally some existing code / feature requiring more memory than
before, especially some non-trivial amount, is suspicious. But
sometimes new features / new tests require more memory than configured
in the build scripts, and sometimes blow up.

-- 
Marcelo


Re: Tests failing with GC limit exceeded

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
But is there any non-memory-leak reason why the tests should need more
memory?  In theory each test should be cleaning up its own Spark Context
etc. right? My memory is that OOM issues in the tests in the past have been
indicative of memory leaks somewhere.

I do agree that it doesn't seem likely that it's an infrastructure issue /
I can't explain why re-booting would improve things.

On Thu, Jan 5, 2017 at 4:38 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> Seems like the OOM is coming from tests, which most probably means
> it's not an infrastructure issue. Maybe tests just need more memory
> these days and we need to update maven / sbt scripts.
>
> On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <sk...@berkeley.edu> wrote:
> > as of first thing this morning, here's the list of recent GC overhead
> > build failures:
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
> >
> > i haven't really found anything that jumps out at me except perhaps
> > auditing/upping the java memory limits across the build.  this seems
> > to be a massive shot in the dark, and time consuming, so let's just
> > call this a "method of last resort".
> >
> > looking more closely at the systems themselves, it looked to me that
> > there was enough java "garbage" that had accumulated over the last 5
> > months (since the last reboot) that system reboots would be a good
> > first step.
> >
> > https://www.youtube.com/watch?v=nn2FB1P_Mn8
> >
> > over the course of this morning i've been sneaking in worker reboots
> > during quiet times...  the ganglia memory graphs look a lot better
> > (free memory up, cached memory down!), and i'll keep an eye on things
> > over the course of the next few days to see if the build failure
> > frequency is affected.
> >
> > also, i might be scheduling quarterly system reboots if this indeed
> > fixes the problem.
> >
> > shane
> >
> > On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <sk...@berkeley.edu> wrote:
> >> preliminary findings:  seems to be transient, and affecting 4% of
> >> builds from late december until now (which is as far back as we keep
> >> build records for the PRB builds).
> >>
> >>  408 builds
> >>   16 builds.gc   <--- failures
> >>
> >> it's also happening across all workers at about the same rate.
> >>
> >> and best of all, there seems to be no pattern to which tests are
> >> failing (different each time).  i'll look a little deeper and decide
> >> what to do next.
> >>
> >>> On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sk...@berkeley.edu> wrote:
> >>> nope, no changes to jenkins in the past few months.  ganglia graphs
> >>> show higher, but not worrying, memory usage on the workers when the
> >>> jobs failed...
> >>>
> >>> i'll take a closer look later tonite/first thing tomorrow morning.
> >>>
> >>> shane
> >>>
> >>> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
> >>>> I've noticed a bunch of the recent builds failing because of GC limits, for
> >>>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
> >>>> been any recent changes in the build configuration that might be causing
> >>>> this?  Does anyone else have any ideas about what's going on here?
> >>>>
> >>>> -Kay
> >>>>
> >>>>
> >
> >
>
>
>
> --
> Marcelo
>

Re: Tests failing with GC limit exceeded

Posted by Marcelo Vanzin <va...@cloudera.com>.
Seems like the OOM is coming from tests, which most probably means
it's not an infrastructure issue. Maybe tests just need more memory
these days and we need to update maven / sbt scripts.

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <sk...@berkeley.edu> wrote:
> as of first thing this morning, here's the list of recent GC overhead
> build failures:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
>
> i haven't really found anything that jumps out at me except perhaps
> auditing/upping the java memory limits across the build.  this seems
> to be a massive shot in the dark, and time consuming, so let's just
> call this a "method of last resort".
>
> looking more closely at the systems themselves, it looked to me that
> there was enough java "garbage" that had accumulated over the last 5
> months (since the last reboot) that system reboots would be a good
> first step.
>
> https://www.youtube.com/watch?v=nn2FB1P_Mn8
>
> over the course of this morning i've been sneaking in worker reboots
> during quiet times...  the ganglia memory graphs look a lot better
> (free memory up, cached memory down!), and i'll keep an eye on things
> over the course of the next few days to see if the build failure
> frequency is affected.
>
> also, i might be scheduling quarterly system reboots if this indeed
> fixes the problem.
>
> shane
>
> On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <sk...@berkeley.edu> wrote:
>> preliminary findings:  seems to be transient, and affecting 4% of
>> builds from late december until now (which is as far back as we keep
>> build records for the PRB builds).
>>
>>  408 builds
>>   16 builds.gc   <--- failures
>>
>> it's also happening across all workers at about the same rate.
>>
>> and best of all, there seems to be no pattern to which tests are
>> failing (different each time).  i'll look a little deeper and decide
>> what to do next.
>>
>> On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sk...@berkeley.edu> wrote:
>>> nope, no changes to jenkins in the past few months.  ganglia graphs
>>> show higher, but not worrying, memory usage on the workers when the
>>> jobs failed...
>>>
>>> i'll take a closer look later tonite/first thing tomorrow morning.
>>>
>>> shane
>>>
>>> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
>>>> I've noticed a bunch of the recent builds failing because of GC limits, for
>>>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
>>>> been any recent changes in the build configuration that might be causing
>>>> this?  Does anyone else have any ideas about what's going on here?
>>>>
>>>> -Kay
>>>>
>>>>
>
>



-- 
Marcelo


Re: Tests failing with GC limit exceeded

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
Thanks for looking into this Shane!

On Thu, Jan 5, 2017 at 1:19 PM, shane knapp <sk...@berkeley.edu> wrote:

> as of first thing this morning, here's the list of recent GC overhead
> build failures:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console
>
> i haven't really found anything that jumps out at me except perhaps
> auditing/upping the java memory limits across the build.  this seems
> to be a massive shot in the dark, and time consuming, so let's just
> call this a "method of last resort".
>
> looking more closely at the systems themselves, it looked to me that
> there was enough java "garbage" that had accumulated over the last 5
> months (since the last reboot) that system reboots would be a good
> first step.
>
> https://www.youtube.com/watch?v=nn2FB1P_Mn8
>
> over the course of this morning i've been sneaking in worker reboots
> during quiet times...  the ganglia memory graphs look a lot better
> (free memory up, cached memory down!), and i'll keep an eye on things
> over the course of the next few days to see if the build failure
> frequency is affected.
>
> also, i might be scheduling quarterly system reboots if this indeed
> fixes the problem.
>
> shane
>
> On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <sk...@berkeley.edu> wrote:
> > preliminary findings:  seems to be transient, and affecting 4% of
> > builds from late december until now (which is as far back as we keep
> > build records for the PRB builds).
> >
> >  408 builds
> >   16 builds.gc   <--- failures
> >
> > it's also happening across all workers at about the same rate.
> >
> > and best of all, there seems to be no pattern to which tests are
> > failing (different each time).  i'll look a little deeper and decide
> > what to do next.
> >
> > On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sk...@berkeley.edu> wrote:
> >> nope, no changes to jenkins in the past few months.  ganglia graphs
> >> show higher, but not worrying, memory usage on the workers when the
> >> jobs failed...
> >>
> >> i'll take a closer look later tonite/first thing tomorrow morning.
> >>
> >> shane
> >>
> >> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
> >>> I've noticed a bunch of the recent builds failing because of GC limits, for
> >>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
> >>> been any recent changes in the build configuration that might be causing
> >>> this?  Does anyone else have any ideas about what's going on here?
> >>>
> >>> -Kay
> >>>
> >>>
>

Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
as of first thing this morning, here's the list of recent GC overhead
build failures:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70891/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70874/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70842/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70927/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70551/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70835/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70841/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70869/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70598/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70898/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70629/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70686/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70620/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70871/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70873/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70622/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70837/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70626/console

i haven't really found anything that jumps out at me except perhaps
auditing/upping the java memory limits across the build.  this seems
to be a massive shot in the dark, and time consuming, so let's just
call this a "method of last resort".

looking more closely at the systems themselves, it looked to me that
there was enough java "garbage" that had accumulated over the last 5
months (since the last reboot) that system reboots would be a good
first step.
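
("looking more closely" here mostly meant hunting for long-lived, fat java
processes left over from old builds...  something along these lines, though
the exact fields and threshold are just what i'd reach for, not a saved
command:)

$ ps -eo pid,etime,rss,args --sort=-rss | awk '$3 > 1048576 && /java/' | head -20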

https://www.youtube.com/watch?v=nn2FB1P_Mn8

over the course of this morning i've been sneaking in worker reboots
during quiet times...  the ganglia memory graphs look a lot better
(free memory up, cached memory down!), and i'll keep an eye on things
over the course of the next few days to see if the build failure
frequency is affected.

also, i might be scheduling quarterly system reboots if this indeed
fixes the problem.
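
(if it comes to that, it'd just be a root crontab entry on each worker,
something like the sketch below...  the schedule is a placeholder, and i'd
stagger the workers rather than bounce them all at once:)

# min hr dom mon      dow  command
0     6  1   1,4,7,10 *    /sbin/shutdown -r +5 "quarterly jenkins worker reboot"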

shane

On Wed, Jan 4, 2017 at 1:22 PM, shane knapp <sk...@berkeley.edu> wrote:
> preliminary findings:  seems to be transient, and affecting 4% of
> builds from late december until now (which is as far back as we keep
> build records for the PRB builds).
>
>  408 builds
>   16 builds.gc   <--- failures
>
> it's also happening across all workers at about the same rate.
>
> and best of all, there seems to be no pattern to which tests are
> failing (different each time).  i'll look a little deeper and decide
> what to do next.
>
> On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sk...@berkeley.edu> wrote:
>> nope, no changes to jenkins in the past few months.  ganglia graphs
>> show higher, but not worrying, memory usage on the workers when the
>> jobs failed...
>>
>> i'll take a closer look later tonite/first thing tomorrow morning.
>>
>> shane
>>
>> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
>>> I've noticed a bunch of the recent builds failing because of GC limits, for
>>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
>>> been any recent changes in the build configuration that might be causing
>>> this?  Does anyone else have any ideas about what's going on here?
>>>
>>> -Kay
>>>
>>>


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
preliminary findings:  seems to be transient, and affecting 4% of
builds from late december until now (which is as far back as we keep
build records for the PRB builds).

 408 builds
  16 builds.gc   <--- failures
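
(quick sanity check on that 4% figure, from the counts above:)

$ awk 'BEGIN { printf "%.1f%%\n", 100 * 16 / 408 }'
3.9%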

it's also happening across all workers at about the same rate.

and best of all, there seems to be no pattern to which tests are
failing (different each time).  i'll look a little deeper and decide
what to do next.

On Tue, Jan 3, 2017 at 6:49 PM, shane knapp <sk...@berkeley.edu> wrote:
> nope, no changes to jenkins in the past few months.  ganglia graphs
> show higher, but not worrying, memory usage on the workers when the
> jobs failed...
>
> i'll take a closer look later tonite/first thing tomorrow morning.
>
> shane
>
> On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
>> I've noticed a bunch of the recent builds failing because of GC limits, for
>> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
>> been any recent changes in the build configuration that might be causing
>> this?  Does anyone else have any ideas about what's going on here?
>>
>> -Kay
>>
>>


Re: Tests failing with GC limit exceeded

Posted by shane knapp <sk...@berkeley.edu>.
nope, no changes to jenkins in the past few months.  ganglia graphs
show higher, but not worrying, memory usage on the workers when the
jobs failed...

i'll take a closer look later tonite/first thing tomorrow morning.

shane

On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout <ke...@eecs.berkeley.edu> wrote:
> I've noticed a bunch of the recent builds failing because of GC limits, for
> seemingly unrelated changes (e.g. 70818, 70840, 70842).  Shane, have there
> been any recent changes in the build configuration that might be causing
> this?  Does anyone else have any ideas about what's going on here?
>
> -Kay
>
>
