Posted to dev@tez.apache.org by Jonathan Eagles <je...@gmail.com> on 2015/09/11 18:38:31 UTC

Tez session max attempts of one when recovery is disabled

I am running Pig on Tez (0.7.1 pre-release) with recovery disabled and noticed
that when the AM fails there are no further attempts. What is it about
sessions versus non-sessions (what bad thing are we preventing) that keeps
us from retrying when recovery is disabled?

(background) Pig only runs sessions, even when executing a single DAG,
and recovery is fragile in 0.7.1, where hangs are likely; this is fixed only in 0.8.
I want Pig on Tez to be as stable as Pig on MR, since AM failures are going
to dissuade users from migrating to Pig on Tez.

Jon

Re: Tez session max attempts of one when recovery is disabled

Posted by Hitesh Shah <hi...@apache.org>.
Hello Jon,

If recovery is disabled, there is no clear way to know whether the previous attempt was in the process of doing a commit and was aborted at that point. Given that there is no safe way to restart or reprocess the work, I believe the Tez client sets max attempts to 1 if recovery is disabled. Furthermore, with sessions, DAGs are submitted over RPC and not via the ApplicationSubmissionContext, so there will be no record of the DAG being submitted if recovery is disabled. The second attempt in this case will launch but will not do anything unless the client re-submits the DAG.
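
As a rough illustration of the behavior described above, here is a minimal sketch in plain Java. It is not the actual Tez client code: the class and method names are hypothetical, a plain Map stands in for the real configuration object, and the fallback values used when a key is unset are assumptions made only for this example. The two property names are the Tez settings for DAG recovery and AM max attempts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real Tez client implementation.
class MaxAttemptsSketch {
    // Tez property names; the fallback values below are assumptions for this sketch.
    static final String DAG_RECOVERY_ENABLED = "tez.dag.recovery.enabled";
    static final String AM_MAX_APP_ATTEMPTS = "tez.am.max.app.attempts";

    // Returns the number of AM attempts the client would request from YARN.
    static int effectiveMaxAttempts(Map<String, String> conf) {
        boolean recoveryEnabled =
            Boolean.parseBoolean(conf.getOrDefault(DAG_RECOVERY_ENABLED, "true"));
        if (!recoveryEnabled) {
            // Without recovery there is no record of a commit in progress (or, for
            // sessions, of the submitted DAG), so a retry cannot safely resume work.
            return 1;
        }
        return Integer.parseInt(conf.getOrDefault(AM_MAX_APP_ATTEMPTS, "2"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(AM_MAX_APP_ATTEMPTS, "4");

        conf.put(DAG_RECOVERY_ENABLED, "true");
        System.out.println("recovery on:  " + effectiveMaxAttempts(conf));

        conf.put(DAG_RECOVERY_ENABLED, "false");
        System.out.println("recovery off: " + effectiveMaxAttempts(conf));
    }
}
```

In this sketch the configured attempt count is ignored whenever recovery is off, which matches the symptom Jon reported: a single AM failure ends the application.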

I think we should look at backporting all relevant recovery fixes to branch 0.7 if you would like to stabilize on that branch. Are there any known fixes on master that we should backport?

Jeff has been driving a lot of changes for recovery, with many of the fixes being tracked off https://issues.apache.org/jira/browse/TEZ-2581. It would be good if you could help review and test these patches. I believe Jeff was planning to do a full rebase after TEZ-2003 got merged in but may not have done that yet.

thanks
— Hitesh 
