Posted to dev@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2013/04/06 01:39:30 UTC

Re: The Jenkins VMs are increasingly slow / overloaded

Looks like 4 EC2 Jenkins slaves are offline at the moment ...


On Wed, Mar 27, 2013 at 1:19 PM, Ted Yu <yu...@gmail.com> wrote:

> Looks like Apache Jenkins went offline several times this week.
>
> Is it difficult to hook up patch testing with the new Jenkins?
>
> Thanks
>
>
> On Wed, Mar 27, 2013 at 7:49 AM, Andrew Purtell <ap...@apache.org> wrote:
>
>> True, but unlike 0.94, the state of 0.95 and trunk is impacted by Stack's
>> wrangling with Maven to find a sane site and assembly; a number of build
>> failures are due to that. Also, you'll note that prior to yesterday the
>> Linux OOM killer was nuking the bloated Maven processes on the build
>> slaves. Let's give these builds a bit of time for this stuff to get sorted
>> out. The failures in 0.94 seem immediately actionable.
>>
>>
>> On Wed, Mar 27, 2013 at 3:38 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > Trunk and 0.95 builds are not in good shape.
>> > The 0.95 builds have failed 32 times.
>> >
>> > On Apache Jenkins, looks like TestAssignmentManagerOnCluster has failed
>> > quite often for 0.95 and trunk builds.
>> >
>> > On Wed, Mar 27, 2013 at 7:18 AM, Andrew Purtell <ap...@apache.org>
>> > wrote:
>> >
>> > > In general, moving from the m1.large (2 vcores, 7.5 GB RAM) to the
>> > > m1.xlarge (4 vcores, 15 GB RAM) instance type for the slaves helped with
>> > > a build/test timeout, so now I'd just about claim the test environment
>> > > is sane. We are now seeing that replication tests are flapping,
>> > > occasionally timing out internally:
>> > >
>> > > See
>> > >
>> > > http://54.241.6.143/job/HBase-0.94/org.apache.hbase$hbase/24/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailoverCompressed/queueFailover/
>> > >
>> > > and
>> > >
>> > > http://54.241.6.143/job/HBase-0.94-Security/org.apache.hbase$hbase/7/testReport/junit/org.apache.hadoop.hbase.replication/TestReplicationQueueFailover/queueFailover/
>> > >
>> > >
>> > > The 0.94 and 0.94-security builds are alternating between green and red
>> > > as a result.
>> > >
>> > > Perhaps we should reopen/revisit either adjusting the internal timeouts
>> > > for these tests or the other JIRA about moving minicluster replication
>> > > tests to hbase-it.
>> > >
>> > >
>> > > On Wed, Mar 27, 2013 at 1:49 AM, Nick Dimiduk <nd...@gmail.com> wrote:
>> > >
>> > > > On Tue, Mar 26, 2013 at 1:28 PM, Andrew Purtell <apurtell@apache.org>
>> > > > wrote:
>> > > >
>> > > > > The HBase 0.94 build is now testing green!
>> > > > > http://54.241.6.143/job/HBase-0.94/
>> > > > >
>> > > >
>> > > > ^5!
>> > > >
>> > > > On Tue, Mar 26, 2013 at 1:47 AM, Andrew Purtell <apurtell@apache.org>
>> > > > wrote:
>> > > > >
>> > > > > > I found that Maven was sometimes being killed on the slaves by the
>> > > > > > Linux OOM killer for >= 0.95. It seems the m1.large instance didn't
>> > > > > > have enough memory to host the Jenkins slave, Maven with its 3G+
>> > > > > > heap, and the forked JVMs for the medium and large tests at the
>> > > > > > same time. Switching to the m1.xlarge type resolved this. Now the
>> > > > > > 0.95 and trunk builds fail for what looks like a legitimate problem
>> > > > > > with a hanging test.
>> > > > > >
>> > > > >
>> > > > > --
>> > > > > Best regards,
>> > > > >
>> > > > >    - Andy
>> > > > >
>> > > > > Problems worthy of attack prove their worth by hitting back. - Piet
>> > > > > Hein (via Tom White)
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best regards,
>> > >
>> > >    - Andy
>> > >
>> > > Problems worthy of attack prove their worth by hitting back. - Piet
>> > > Hein (via Tom White)
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>

Re: The Jenkins VMs are increasingly slow / overloaded

Posted by Andrew Purtell <ap...@apache.org>.

Also, be careful to differentiate between slaves that are "offline" because
they are in the process of being launched and those that are offline
because of the bug I mentioned. (It doesn't happen often, but it does happen.)
If you kill an "offline" slave that is being launched, this will just cause
churn. And if this seems like something you don't want to bother with, then
just don't worry about it.



On Fri, Apr 5, 2013 at 4:44 PM, Andrew Purtell <ap...@apache.org> wrote:

> This is a bug in the EC2 module for Jenkins. There are other bugs which
> this one fixes, so it's not a big deal relative to those. You have an
> account on this system, so you can easily log on and delete the slaves that
> end up in the offline state.



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: The Jenkins VMs are increasingly slow / overloaded

Posted by Andrew Purtell <ap...@apache.org>.

This is a bug in the EC2 module for Jenkins. There are other bugs which
this one fixes, so it's not a big deal relative to those. You have an
account on this system, so you can easily log on and delete the slaves that
end up in the offline state.


On Fri, Apr 5, 2013 at 4:39 PM, Ted Yu <yu...@gmail.com> wrote:

> Looks like 4 EC2 Jenkins slaves are offline at the moment ...



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)