
Posted to user@tez.apache.org by Bikas Saha <bi...@hortonworks.com> on 2014/08/01 19:41:28 UTC

RE: Reusing Containers Of Failed Tasks

A warning: master is tracking the 0.5 API stability release, so moving to
master will mean some porting work, although your code would end up a lot
cleaner. Master is expected to be unstable until next week or so.



Bikas



*From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
*Sent:* Wednesday, July 30, 2014 9:27 PM
*To:* user@tez.apache.org
*Subject:* Re: Reusing Containers Of Failed Tasks



Never mind, I was not on master.  I'll investigate that.



Thanks!



On Thu, Jul 31, 2014 at 12:14 AM, Thaddeus Diamond <
thaddeus.diamond@gmail.com> wrote:

I don't see that setting in TezConfiguration.java.  Do you happen to know
it offhand?



On Thu, Jul 31, 2014 at 12:10 AM, Bikas Saha <bi...@hortonworks.com> wrote:

There is no workaround without a code change in Tez.



The simplest code change would be to make this behavior configurable and
have the current behavior as default.
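
As a rough illustration of that suggestion, the release decision could sit
behind a flag whose default reproduces the current behavior. This is only a
sketch: ContainerReleasePolicy and releaseOnTaskFailure are hypothetical
names, not part of Tez, and the two boolean inputs mirror the taskSucceeded
and shouldReuseContainers variables in the TaskScheduler#deallocateTask(...)
snippet quoted in this thread.

```java
// Hypothetical sketch; this is not the actual Tez TaskScheduler code.
final class ContainerReleasePolicy {
    // Assumed flag. true keeps the existing behavior: a failed task
    // always causes its container to be released.
    private final boolean releaseOnTaskFailure;

    ContainerReleasePolicy(boolean releaseOnTaskFailure) {
        this.releaseOnTaskFailure = releaseOnTaskFailure;
    }

    // Default policy, equivalent to (!taskSucceeded || !shouldReuseContainers).
    static ContainerReleasePolicy defaults() {
        return new ContainerReleasePolicy(true);
    }

    boolean shouldRelease(boolean taskSucceeded, boolean shouldReuseContainers) {
        if (!shouldReuseContainers) {
            return true;              // reuse is off: always release
        }
        if (taskSucceeded) {
            return false;             // reuse is on and the task succeeded: keep
        }
        return releaseOnTaskFailure;  // reuse is on and the task failed: configurable
    }
}
```

Keeping the container in the scheduler would only be part of a fix, since
failed tasks currently make the task JVM exit.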



Btw, you can also try the session min held containers configuration that
was recently added. This ensures that your session will retain some minimum
resources. You can use the session min/max timeouts to decay excess
containers.
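
For readers who want to try this, a minimal sketch of the settings involved
follows. The property names are written from memory of the 0.5-era
TezConfiguration and are an assumption to verify against the version you
build; the values are illustrative only, and a plain map stands in for the
Hadoop Configuration object so that the example is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the keys below are assumed names, so check them
// against TezConfiguration in your build before relying on them.
final class SessionPoolSettings {
    static Map<String, String> build(int minHeldContainers,
                                     long minIdleMillis,
                                     long maxIdleMillis) {
        if (minHeldContainers < 0 || minIdleMillis < 0 || maxIdleMillis < minIdleMillis) {
            throw new IllegalArgumentException("invalid container retention settings");
        }
        Map<String, String> conf = new LinkedHashMap<>();
        // Container reuse must be on for held containers to be useful.
        conf.put("tez.am.container.reuse.enabled", "true");
        // Minimum number of containers the session tries to retain (best effort).
        conf.put("tez.am.session.min.held-containers", Integer.toString(minHeldContainers));
        // Idle containers above the minimum are released between these two timeouts.
        conf.put("tez.am.container.idle.release-timeout-min.millis", Long.toString(minIdleMillis));
        conf.put("tez.am.container.idle.release-timeout-max.millis", Long.toString(maxIdleMillis));
        return conf;
    }
}
```

Each entry would then be copied into the configuration used to start the
session, for example build(4, 10000L, 20000L).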



Bikas



*From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
*Sent:* Wednesday, July 30, 2014 8:51 PM
*To:* user@tez.apache.org
*Subject:* Re: Reusing Containers Of Failed Tasks



I see.  Is there a manual workaround you suggest for this?



The motivation is this: I have an application with low latency and max
concurrency SLAs.  The way we are trying to solve this with Tez is to keep
an application-level pool of Tez sessions and configure each to have
long-lived containers.  When users submit DAGs the application grabs an
idle Tez session from the pool and submits to that one. After the DAG
completes (successful or not) it is returned to the pool in an idle state.



If a session gets returned to the pool but no containers are spun up in it
because the DAG failed, I will fail to meet my SLAs on the next DAG
submission.
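
The pooling scheme described above can be sketched roughly as follows. This
is an illustration under assumptions, not the poster's code: IdleSessionPool
is a hypothetical class, the type parameter S stands in for whatever session
handle the application uses, and the work function stands in for submitting a
DAG and waiting for it to finish.

```java
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical application-level pool of idle sessions.
final class IdleSessionPool<S> {
    private final BlockingQueue<S> idle;

    IdleSessionPool(Collection<S> sessions) {
        this.idle = new LinkedBlockingQueue<>(sessions);
    }

    // Borrows an idle session, runs the work on it, and returns the session
    // to the pool whether the work succeeded or threw.
    <R> R withSession(long timeout, TimeUnit unit, Function<S, R> work) {
        S session;
        try {
            session = idle.poll(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an idle session", e);
        }
        if (session == null) {
            throw new IllegalStateException("no idle session within the timeout");
        }
        try {
            return work.apply(session);
        } finally {
            idle.add(session); // back to the pool, successful or not
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

The pool itself cannot tell whether a returned session still holds warm
containers, which is exactly the gap described in this message.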



On Wed, Jul 30, 2014 at 8:05 PM, Bikas Saha <bi...@hortonworks.com> wrote:

Currently, failed tasks make the JVM exit. There is no workaround for
that. Before we can change that, we would need to be able to verify that task
execution is isolated, such that a task failure does not end up “corrupting”
the host.



Bikas



*From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
*Sent:* Wednesday, July 30, 2014 3:15 PM
*To:* user@tez.apache.org
*Subject:* Reusing Containers Of Failed Tasks



Hi,



I turned on container reuse and upped the time that containers linger after
task vertex completion (tez.am.container.session.delay-allocation-millis),
but I'm still having an issue.  Sometimes, the Processor I created will
fail due to application logic in one DAG but not the next. The trivial
example is:



class MyProcessor implements LogicalIOProcessor {
  // Other non-application logic code
  public void run(...) {
    if (new Random().nextBoolean()) {
      throw new FooBarBazException();
    }
  }
}



In this case I don't want the task JVM to be deallocated, because it was
application logic that caused the failure, and the next time I start a DAG I
will incur the long task JVM startup delay.



I see the following code in the source (TaskScheduler#deallocateTask(...))
that I think is the cause of this:



    if (!taskSucceeded || !shouldReuseContainers) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Releasing container, containerId=" + container.getId()
            + ", taskSucceeded=" + taskSucceeded
            + ", reuseContainersFlag=" + shouldReuseContainers);
      }
      releaseContainer(container.getId());
    }



Is this something that can be fixed in master? Or is there a
workaround/conf I can set to get this working?



Thanks,

Thad


CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.





Re: Reusing Containers Of Failed Tasks

Posted by Siddharth Seth <ss...@apache.org>.

Tez does not close Inputs/Outputs/Processors in case there's an error
during task execution. We haven't really spent too much time defining
semantics in such cases - since the expectation is for the container not to
be re-used. Looks like this needs to be figured out - for such cases, as
well as LocalMode.
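
One possible semantics, offered only as a sketch and not as what Tez does or
plans to do, would be to close every component after a failed task and keep
the task's own exception as the primary error. CloseOnFailure and
runThenCloseAll are hypothetical names, and AutoCloseable stands in for the
Inputs, Outputs and Processor.

```java
import java.util.List;

// Hypothetical sketch of "always close on error"; not Tez code.
final class CloseOnFailure {
    // Runs the task body, then closes all components in reverse order.
    // Returns null on success, otherwise the first failure, with any later
    // close() failures attached to it as suppressed exceptions.
    static Exception runThenCloseAll(Runnable taskBody,
                                     List<? extends AutoCloseable> components) {
        Exception failure = null;
        try {
            taskBody.run();
        } catch (RuntimeException e) {
            failure = e;
        }
        for (int i = components.size() - 1; i >= 0; i--) {
            try {
                components.get(i).close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = e;
                } else if (e != failure) {
                    failure.addSuppressed(e);
                }
            }
        }
        return failure;
    }
}
```

Whether a container is safe to reuse after such a cleanup is a separate
question that this sketch does not answer.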


On Sun, Aug 3, 2014 at 5:51 PM, Thaddeus Diamond <thaddeus.diamond@gmail.com
> wrote:

> Thanks.  Created https://issues.apache.org/jira/browse/TEZ-1369 and
> uploaded a patch.

Re: Reusing Containers Of Failed Tasks

Posted by Thaddeus Diamond <th...@gmail.com>.

Thanks.  Created https://issues.apache.org/jira/browse/TEZ-1369 and
uploaded a patch.


On Sat, Aug 2, 2014 at 3:33 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> My suggestion of session min held containers was orthogonal to your main
> issue of failed tasks causing containers to be lost.
>
>
>
> It was more of a suggestion for your use case of maintaining an allocated
> session pool for low latency. Min held containers will maintain that
> minimum pool of containers (best effort), distributed evenly across
> your cluster (best effort), so that subsequent DAGs are assured of some
> minimum capacity.
>
>
>
> For your failed task to not fail the container, Tez would still need a
> minor code change to add a config for that behavior. Please feel free
> to create a JIRA and, if possible, provide a patch.
>
>
>
> Bikas

RE: Reusing Containers Of Failed Tasks

Posted by Bikas Saha <bi...@hortonworks.com>.

My suggestion of session min held containers was orthogonal to your main
issue of failed tasks causing containers to be lost.



It was more of a suggestion for your use case of maintaining an allocated
session pool for low latency. Min held containers will maintain that
minimum pool of containers (best effort), distributed evenly across
your cluster (best effort), so that subsequent DAGs are assured of some
minimum capacity.



For your failed task to not fail the container, Tez would still need a minor
code change to add a config for that behavior. Please feel free to create a
JIRA and, if possible, provide a patch.



Bikas



*From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
*Sent:* Friday, August 01, 2014 8:54 PM
*To:* user@tez.apache.org
*Subject:* Re: Reusing Containers Of Failed Tasks



Okay, so I built the source and used the target JARs to compile my project,
but I'm not seeing any improvement in the behavior.  What is the expected
behavior if I set the session min held containers property?  It still
doesn't start up the containers on session start and the failed containers
still get shut down.  Thoughts?




Re: Reusing Containers Of Failed Tasks

Posted by Thaddeus Diamond <th...@gmail.com>.

Okay, so I built the source and used the target JARs to compile my project,
but I'm not seeing any improvement in the behavior.  What is the expected
behavior if I set the session min held containers property?  It still
doesn't start up the containers on session start and the failed containers
still get shut down.  Thoughts?


On Fri, Aug 1, 2014 at 3:43 PM, Thaddeus Diamond <thaddeus.diamond@gmail.com
> wrote:

> Okay.  Is there a place I can get the latest JARs to compile my code
> against?  I need this and other configurations for development but the
> latest maven central artifacts are 0.4.1-incubating.  Don't worry about
> being unstable, I'm still in development with this project.
>
>
> On Fri, Aug 1, 2014 at 1:41 PM, Bikas Saha <bi...@hortonworks.com> wrote:
>
>> Warning. Master is tracking the 0.5 API stability release. Hence
>> transferring to master would mean work. But your code would be a lot
>> cleaner. Master is expected to be unstable until next week or so.
>>
>>
>>
>> Bikas
>>
>>
>>
>> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
>> *Sent:* Wednesday, July 30, 2014 9:27 PM
>>
>> *To:* user@tez.apache.org
>> *Subject:* Re: Reusing Containers Of Failed Tasks
>>
>>
>>
>> Never mind, I was not on master.  I'll investigate that.
>>
>>
>>
>> Thanks!
>>
>>
>>
>> On Thu, Jul 31, 2014 at 12:14 AM, Thaddeus Diamond <
>> thaddeus.diamond@gmail.com> wrote:
>>
>> I don't see that setting in TezConfiguration.java.  Do you happen to know
>> it offhand?
>>
>>
>>
>> On Thu, Jul 31, 2014 at 12:10 AM, Bikas Saha <bi...@hortonworks.com>
>> wrote:
>>
>> There is no workaround without code change in Tez.
>>
>>
>>
>> The simplest code change would be to make this behavior configurable and
>> have the current behavior as default.
>>
>>
>>
>> Btw, you can also try the session min held containers configuration that
>> was recently added. This ensures that your session will retain some minimum
>> resources. You can use the session min/max timeouts to decay excess
>> containers.
>>
>>
>>
>> Bikas
>>
>>
>>
>> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
>> *Sent:* Wednesday, July 30, 2014 8:51 PM
>> *To:* user@tez.apache.org
>> *Subject:* Re: Reusing Containers Of Failed Tasks
>>
>>
>>
>> I see.  Is there a manual workaround you suggest for this?
>>
>>
>>
>> The motivation is this: I have an application with low latency and max
>> concurrency SLAs.  The way we are trying to solve this with Tez is to keep
>> an application-level pool of Tez sessions and configure each to have
>> long-lived containers.  When users submit DAGs the application grabs an
>> idle Tez session from the pool and submits to that one. After the DAG
>> completes (successful or not) it is returned to the pool in an idle state.
>>
>>
>>
>> If a session gets returned to the pool but no containers are spun up in
>> it because the DAG failed, I will fail to meet my SLAs on the next DAG
>> submission.
>>
>>
>>
>> On Wed, Jul 30, 2014 at 8:05 PM, Bikas Saha <bi...@hortonworks.com>
>> wrote:
>>
>> Currently, failed tasks make the JVM exit. There is no workaround for
>> that. Before we can change that, we would need to be able to check that
>> task execution is isolated such that a task failure does not end up
>> “corrupting” the host.
>>
>>
>>
>> Bikas
>>
>>
>>
>> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
>> *Sent:* Wednesday, July 30, 2014 3:15 PM
>> *To:* user@tez.apache.org
>> *Subject:* Reusing Containers Of Failed Tasks
>>
>>
>>
>> Hi,
>>
>>
>>
>> I turned on container reuse and upped the time that containers linger
>> after task vertex completion
>> (tez.am.container.session.delay-allocation-millis), but I'm still having an
>> issue.  Sometimes, the Processor I created will fail due to application
>> logic in one DAG but not the next. The trivial example is:
>>
>>
>>
>> class MyProcessor implements LogicalIOProcessor {
>>   // Other non-application logic code
>>   public void run(...) {
>>     if (new Random().nextBoolean()) {
>>       throw new FooBarBazException();
>>     }
>>   }
>> }
>>
>>
>>
>> In this case I don't want the task JVM to be deallocated because it was
>> application logic that caused the failure and next time I start a DAG I
>> will have the long JVM task startup delay.
>>
>>
>>
>> I see the following code in the source
>> (TaskScheduler#deallocateTask(...)) that I think is the cause of this:
>>
>>
>>
>>   if (!taskSucceeded || !shouldReuseContainers) {
>>     if (LOG.isDebugEnabled()) {
>>       LOG.debug("Releasing container, containerId=" + container.getId()
>>           + ", taskSucceeded=" + taskSucceeded
>>           + ", reuseContainersFlag=" + shouldReuseContainers);
>>     }
>>     releaseContainer(container.getId());
>>   }
>>
>>
>>
>> Is this something that can be fixed in master? Or is there a
>> workaround/conf I can set to get this working?
>>
>>
>>
>> Thanks,
>>
>> Thad
>>
>>
>
>

Re: Reusing Containers Of Failed Tasks

Posted by Thaddeus Diamond <th...@gmail.com>.

Okay.  Is there a place I can get the latest JARs to compile my code
against?  I need this and other configurations for development but the
latest Maven Central artifacts are 0.4.1-incubating.  Don't worry about
it being unstable, I'm still in development with this project.


On Fri, Aug 1, 2014 at 1:41 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> Warning. Master is tracking the 0.5 API stability release. Hence
> transferring to master would mean work. But your code would be a lot
> cleaner. Master is expected to be unstable until next week or so.
>
>
>
> Bikas
>
>
>
> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
> *Sent:* Wednesday, July 30, 2014 9:27 PM
>
> *To:* user@tez.apache.org
> *Subject:* Re: Reusing Containers Of Failed Tasks
>
>
>
> Never mind, I was not on master.  I'll investigate that.
>
>
>
> Thanks!
>
>
>
> On Thu, Jul 31, 2014 at 12:14 AM, Thaddeus Diamond <
> thaddeus.diamond@gmail.com> wrote:
>
> I don't see that setting in TezConfiguration.java.  Do you happen to know
> it offhand?
>
>
>
> On Thu, Jul 31, 2014 at 12:10 AM, Bikas Saha <bi...@hortonworks.com>
> wrote:
>
> There is no workaround without code change in Tez.
>
>
>
> The simplest code change would be to make this behavior configurable and
> have the current behavior as default.
>
>
>
> Btw, you can also try the session min held containers configuration that
> was recently added. This ensures that your session will retain some minimum
> resources. You can use the session min/max timeouts to decay excess
> containers.
>
>
>
> Bikas
>
>
>
> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
> *Sent:* Wednesday, July 30, 2014 8:51 PM
> *To:* user@tez.apache.org
> *Subject:* Re: Reusing Containers Of Failed Tasks
>
>
>
> I see.  Is there a manual workaround you suggest for this?
>
>
>
> The motivation is this: I have an application with low latency and max
> concurrency SLAs.  The way we are trying to solve this with Tez is to keep
> an application-level pool of Tez sessions and configure each to have
> long-lived containers.  When users submit DAGs the application grabs an
> idle Tez session from the pool and submits to that one. After the DAG
> completes (successful or not) it is returned to the pool in an idle state.
>
>
>
> If a session gets returned to the pool but no containers are spun up in it
> because the DAG failed, I will fail to meet my SLAs on the next DAG
> submission.
>
>
>
> On Wed, Jul 30, 2014 at 8:05 PM, Bikas Saha <bi...@hortonworks.com> wrote:
>
> Currently, failed tasks make the JVM exit. There is no workaround for
> that. Before we can change that, we would need to be able to check that
> task execution is isolated such that a task failure does not end up
> “corrupting” the host.
>
>
>
> Bikas
>
>
>
> *From:* Thaddeus Diamond [mailto:thaddeus.diamond@gmail.com]
> *Sent:* Wednesday, July 30, 2014 3:15 PM
> *To:* user@tez.apache.org
> *Subject:* Reusing Containers Of Failed Tasks
>
>
>
> Hi,
>
>
>
> I turned on container reuse and upped the time that containers linger
> after task vertex completion
> (tez.am.container.session.delay-allocation-millis), but I'm still having an
> issue.  Sometimes, the Processor I created will fail due to application
> logic in one DAG but not the next. The trivial example is:
>
>
>
> class MyProcessor implements LogicalIOProcessor {
>   // Other non-application logic code
>   public void run(...) {
>     if (new Random().nextBoolean()) {
>       throw new FooBarBazException();
>     }
>   }
> }
>
>
>
> In this case I don't want the task JVM to be deallocated because it was
> application logic that caused the failure and next time I start a DAG I
> will have the long JVM task startup delay.
>
>
>
> I see the following code in the source (TaskScheduler#deallocateTask(...))
> that I think is the cause of this:
>
>
>
>   if (!taskSucceeded || !shouldReuseContainers) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Releasing container, containerId=" + container.getId()
>           + ", taskSucceeded=" + taskSucceeded
>           + ", reuseContainersFlag=" + shouldReuseContainers);
>     }
>     releaseContainer(container.getId());
>   }
>
>
>
> Is this something that can be fixed in master? Or is there a
> workaround/conf I can set to get this working?
>
>
>
> Thanks,
>
> Thad
>
>