Posted to dev@aurora.apache.org by Hussein Elgridly <hu...@broadinstitute.org> on 2015/04/06 19:30:21 UTC

Nuked sandbox directory and failing to run finalizers on disk exceeded

Hi folks,

I've just had my first task fail due to exceeding disk capacity, and I've
run into some strange behaviour.

It's a Java process that's running inside a Docker container specified in
the task config. The Java process is failing with java.io.IOException: No
space left on device when attempting to write a file.

Three things then happen (or fail to happen) that I think are just plain
wrong:

1. The task is being marked as failed (good!) but isn't reporting that it
exceeded disk limits (bad). I was expecting to see the "Disk limit
exceeded.  Reserved X bytes vs used Y bytes." message, but neither the
Mesos nor Aurora web interfaces are telling me this.
2. The task's sandbox directory is being nuked. All of it, immediately:
it was there while the job was running and vanished the moment the task
failed (I happened to be watching it live). This makes debugging
difficult, and the Aurora/Thermos web UI clearly has trouble with it,
because it reports the resource requests as all zero when they most
definitely weren't.
3. Finalizers aren't running. No finalizers = no error log = no debugging =
sadface. :(

I think what's actually happening here is that the process is running out
of disk on the machine itself, and the IOException is propagating up from
the kernel, rather than Mesos's disk usage monitoring killing the process.

As such, we're going to try configuring the Mesos slaves with
--resources='disk:some_smaller_value' to leave a little overhead, in the
hope that the Mesos disk monitor catches the overuse before the process
claims the last free block on disk.
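For concreteness, the slave invocation would look something like this
(values illustrative; the --resources string is the standard Mesos
"name:value" list, with disk in megabytes):

    # Advertise less disk than the volume actually has -- e.g. ~20GB of
    # headroom on a 500GB disk -- so the disk monitor trips before the
    # kernel returns ENOSPC.
    mesos-slave --master=zk://zookeeper:2181/mesos \
                --resources='cpus:8;mem:16384;disk:480000'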

I don't know why it'd be nuking the sandbox, though. And is the GC executor
more aggressive about cleaning out old sandbox directories if the disk is
low on free space?

If it helps, we're on Aurora commit
2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.

Thanks,

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard

Re: Nuked sandbox directory and failing to run finalizers on disk exceeded

Posted by Bill Farner <wf...@apache.org>.
Would you mind filing a Mesos ticket for this? Seems like a pretty major
wart in the integration.

-=Bill

Re: Nuked sandbox directory and failing to run finalizers on disk exceeded

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
I believe the root of the Docker filesystem for any given container goes in
/var/lib/docker/something/container_id/... on the host filesystem.

The GC executor cleans out the sandbox directory, but anything written
anywhere else will stick around on the host filesystem until docker rm is
called. Mesos handles that after a DOCKER_REMOVE_DELAY of six hours. [1]

[1]
https://github.com/apache/mesos/blob/2985ae05634038b70f974bbfed6b52fe47231418/src/slave/constants.cpp#L52

I think the takeaway here is to use /mnt/mesos/sandbox as your scratch
space, since that's all that Thermos watches. Mesos is not merciful if you
anger it.
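
One way to keep application writes inside the sandbox is to resolve it
from the environment rather than hard-coding the mount point. A minimal
sketch, assuming Mesos exports MESOS_SANDBOX inside the container (with
the Docker mount path as a fallback):

    import os
    import tempfile

    # Resolve the sandbox Mesos hands us; fall back to the in-container
    # mount point used by the Docker containerizer.
    scratch_root = os.environ.get('MESOS_SANDBOX', '/mnt/mesos/sandbox')

    # Temp files now land where Thermos's du-based monitor can see them.
    work_dir = tempfile.mkdtemp(prefix='scratch-', dir=scratch_root)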

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


Re: Nuked sandbox directory and failing to run finalizers on disk exceeded

Posted by Bill Farner <wf...@apache.org>.
I'm fairly ignorant of some of the practicalities here - if you don't
write to /mnt/mesos/sandbox, where do files land? Some other ephemeral
directory that dies with the container?

-=Bill

Re: Nuked sandbox directory and failing to run finalizers on disk exceeded

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
Thanks, that's helpful. I've also just discovered that Thermos only
monitors disk usage in the sandbox location, so if we launch a Docker job
and write anywhere other than /mnt/mesos/sandbox, we can exceed our disk
quota. I can work around this by turning our scratch space directories
into symlinks located under the sandbox, though.
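
A rough sketch of that workaround (the helper and paths are mine, not
anything Aurora provides):

    import os

    SANDBOX = os.environ.get('MESOS_SANDBOX', '/mnt/mesos/sandbox')

    def redirect_scratch(app_dir):
        # Make a real directory inside the sandbox and point the
        # application's expected scratch path at it, so everything
        # written there counts against the task's disk quota.
        target = os.path.join(SANDBOX, os.path.basename(app_dir.rstrip('/')))
        if not os.path.isdir(target):
            os.makedirs(target)
        # Assumes app_dir doesn't already exist as a real directory;
        # os.symlink raises OSError if the link name is taken.
        os.symlink(target, app_dir)

    redirect_scratch('/data/scratch')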

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


Re: Nuked sandbox directory and failing to run finalizers on disk exceeded

Posted by Zameer Manji <zm...@apache.org>.
Hey,

The deletion of sandbox directories is done by the Mesos slave, not the
GC executor. You will have to ask the Mesos devs about the relationship
between low disk and sandbox deletion.

The executor enforces disk usage by running `du` in the background
periodically. I suspect that in your case the process fails before the
executor notices the disk limit has been exceeded and marks the task as
failed. That would explain why the disk usage message is missing.
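
To make the race concrete, the enforcement amounts to something like the
polling loop below (a simplification, not Thermos's actual code). A fast
writer can fill the disk between two samples:

    import subprocess
    import time

    def sandbox_usage_bytes(path):
        # `du -sk` reports usage in kilobytes as the first field.
        out = subprocess.check_output(['du', '-sk', path])
        return int(out.split()[0]) * 1024

    def enforce(path, limit_bytes, interval_secs=60):
        while True:
            if sandbox_usage_bytes(path) > limit_bytes:
                # The real executor would mark the task FAILED here and
                # attach the "Disk limit exceeded" message.
                return 'FAILED'
            time.sleep(interval_secs)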

I'm not sure why the finalizers are not running, but note that they are
best-effort on the executor's part. The executor won't be able to run
them if, for example, Mesos tears down the container from underneath it.
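
For anyone following along: a finalizer in the Aurora/Thermos config DSL
is just a Process marked final=True. A minimal sketch (names and
cmdlines are illustrative; Process, Task, Resources, GB, etc. are
injected by the aurora client when it evaluates the config):

    # my_job.aurora
    main = Process(name = 'main', cmdline = 'java -jar app.jar')

    # Runs in the finalizing phase after the regular processes finish or
    # fail -- best effort, as noted above.
    save_logs = Process(
      name = 'save_logs',
      cmdline = 'tar czf logs.tgz *.log',
      final = True)

    task = Task(
      name = 'app',
      processes = [main, save_logs],
      resources = Resources(cpu = 1, ram = 1*GB, disk = 2*GB))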

--
Zameer Manji