Posted to dev@mxnet.apache.org by "Jin, Hao" <hj...@amazon.com> on 2018/05/03 18:55:19 UTC

Problem with Jenkins GPU instances?

I’ve encountered two failed GPU builds due to “initialization error: driver error: failed to process request”. The links to the failed builds are:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline


Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
Sorry for the inconvenience. If there are any further issues, please let me
know.

Best regards,
Marco


Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
Great,
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/
seems to be passing without problems.


Re: Problem with Jenkins GPU instances?

Posted by "Jin, Hao" <hj...@amazon.com>.
The builds are running now, thanks!



Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
You're right, it seems like the Docker builds are hanging. I'm testing the
new auto-scaling feature on the test environment [1] and I noticed that all
jobs hung at the exact same spot until 2:40 AM German time. It seems like
some APT servers were having problems, and since apt does not have a
built-in timeout, it hung the build instead of failing gracefully. It's
5:13 AM now and it seems like my test builds have recovered. I'll check the
production environment and see if it's working fine over there as well.
I'll give you an update here as soon as I know more details.

-Marco

[1]:
http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
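
For reference, a minimal sketch of what "failing gracefully" could look
like here: a small Python wrapper that gives apt-get network timeouts plus
an outer deadline, so a stalled mirror fails the build step instead of
hanging it. The apt Acquire options and the subprocess timeout are real
mechanisms; the specific timeout values and the package list are
illustrative assumptions, not taken from this thread.

    # Sketch: make apt-get fail fast instead of hanging on a stalled mirror.
    import subprocess

    def apt_install(packages, budget_seconds=300):
        cmd = [
            "apt-get", "install", "-y",
            # apt-level network timeouts (seconds per request); the
            # options exist, the values are illustrative.
            "-o", "Acquire::http::Timeout=30",
            "-o", "Acquire::Retries=3",
        ] + list(packages)
        # Outer hard limit: raises subprocess.TimeoutExpired, which fails
        # the build step instead of hanging it indefinitely.
        subprocess.run(cmd, check=True, timeout=budget_seconds)

    if __name__ == "__main__":
        apt_install(["build-essential"])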


Re: Problem with Jenkins GPU instances?

Posted by "Jin, Hao" <hj...@amazon.com>.
Thanks for fixing the servers! However, I found that some of the builds are taking an extremely long time (not even starting after ~2 hrs):
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
Seems like they are stuck during the setup phase?
Hao



Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
Alright, we're back up.


Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
Seems like the CI will be down until some other people turn off their
instances...

Error
We currently do not have sufficient g3.8xlarge capacity in zones with
support for 'gp2' volumes. Our system will be working on provisioning
additional capacity.

-Marco
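
As an aside, one way to soften this kind of capacity error is to retry the
launch across availability zones before giving up. A hedged boto3 sketch
follows; run_instances and the InsufficientInstanceCapacity error code are
real AWS APIs, while the region, AMI ID, and zone list are illustrative
placeholders rather than details from this thread.

    # Sketch: retry a g3.8xlarge launch across AZs on capacity errors.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed

    def launch_gpu_slave(ami_id, zones=("us-east-1a", "us-east-1b", "us-east-1c")):
        for zone in zones:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,                      # AMI ID assumed
                    InstanceType="g3.8xlarge",
                    MinCount=1, MaxCount=1,
                    Placement={"AvailabilityZone": zone},
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise  # only retry on capacity errors
        raise RuntimeError("no g3.8xlarge capacity in any probed zone")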



Re: Problem with Jenkins GPU instances?

Posted by "Jin, Hao" <hj...@amazon.com>.
Thanks a lot Marco!
Hao



Re: Problem with Jenkins GPU instances?

Posted by Marco de Abreu <ma...@googlemail.com>.
Hello,

I'm already investigating the issue, and it seems to be related to the
recently introduced KVStore tests. They tend to hang, leading to the jobs
being forcefully terminated by Jenkins. The problem is that this does not
terminate the underlying Docker containers, leaving their resources
unreleased.

As an immediate solution, I will restart all slaves to ensure the CI is
running again. After that, I will try to find a way to detect and release
these containers.

Best regards,
Marco
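
A rough sketch of the "detect and release" idea: use the Docker Python SDK
to stop containers that have been running longer than a cutoff. The SDK
calls are real; the two-hour cutoff and the script itself are illustrative
assumptions, not the fix that was actually deployed.

    # Sketch: reap containers left behind by forcefully terminated jobs.
    from datetime import datetime, timedelta, timezone
    import docker

    MAX_AGE = timedelta(hours=2)  # cutoff is an illustrative assumption

    def reap_stale_containers():
        client = docker.from_env()
        now = datetime.now(timezone.utc)
        for container in client.containers.list():  # running containers only
            started = container.attrs["State"]["StartedAt"]
            # StartedAt looks like "2018-05-03T19:02:11.123456789Z";
            # trim to microseconds so fromisoformat can parse it.
            started_at = datetime.fromisoformat(started[:26] + "+00:00")
            if now - started_at > MAX_AGE:
                print("stopping stale container", container.short_id)
                container.stop(timeout=10)

    if __name__ == "__main__":
        reap_stale_containers()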
