Posted to user@tez.apache.org by Cheolsoo Park <pi...@gmail.com> on 2014/02/19 20:11:01 UTC

Vertex re-running?

Hello Tez,

After upgrading Tez to HEAD [1], I am seeing the following problem. Some
vertices re-run after they have succeeded, and the DAG stops making progress:

2014-02-19 18:54:46,473 [JobControl] INFO
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 524 Succeeded: 105 Running: 140
Failed: 0 Killed: 0, diagnostics=, counters=null

2014-02-19 18:54:47,475 [JobControl] INFO
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 524 Succeeded: 0 Running: 139 Failed:
0 Killed: 0, diagnostics=Vertex re-running, vertexName=scope-800,
vertexId=vertex_1392684249837_0017_1_01

Vertex re-running, vertexName=scope-799,
vertexId=vertex_1392684249837_0017_1_00
Vertex re-running, vertexName=scope-812,
vertexId=vertex_1392684249837_0017_1_07
Vertex re-running, vertexName=scope-811,
vertexId=vertex_1392684249837_0017_1_06, counters=null

As can be seen, succeeded tasks went down from 105 to 0. Can you please
provide any insights on what's going on? It used to work before [2]. Can
you suggest how far I should roll back if this is a regression in master?

Thanks,
Cheolsoo

*[1] Now*
commit 87f3ea351db18cde8c50385dde37e69f534515ad
Author: Siddharth Seth <ss...@apache.org>
Date:   Tue Feb 18 14:47:24 2014 -0800

    TEZ-787. Revert Guava dependency to 11.0.2. (sseth)

*[2] Before*
commit f55dbfb81fff6b6612890d378824c2404b118272
Author: Siddharth Seth <ss...@apache.org>
Date:   Wed Jan 29 22:16:45 2014 -0800

    TEZ-646. Introduce a CompositeDataMovementEvent to avoid multiple copies
    of the same payload in the AM. (sseth)
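
The drop in succeeded tasks described above (105 to 0 between two consecutive
status lines) can be spotted automatically in a long client log. The following
is a minimal sketch, assuming the log keeps the "DAG Status" format shown in
this message; the function names and the SAMPLE constant are hypothetical and
are not part of Pig or Tez.

```python
import re

# Matches a timestamp followed (possibly across wrapped lines) by the
# progress counters printed in the "DAG Status" lines quoted above.
# Assumption: every timestamped record contains a TotalTasks field; a record
# without one would be merged with the next record by the lazy match.
STATUS_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"TotalTasks:\s*(\d+)\s+Succeeded:\s*(\d+)\s+Running:\s*(\d+)",
    re.DOTALL,
)


def parse_statuses(log_text):
    """Return a list of (timestamp, total, succeeded, running) tuples."""
    return [
        (ts, int(total), int(succeeded), int(running))
        for ts, total, succeeded, running in STATUS_RE.findall(log_text)
    ]


def find_succeeded_drops(statuses):
    """Return (timestamp, previous, current) wherever Succeeded decreased."""
    drops = []
    for prev, cur in zip(statuses, statuses[1:]):
        if cur[2] < prev[2]:
            drops.append((cur[0], prev[2], cur[2]))
    return drops


# The two status records from this message, with their original line wrapping.
SAMPLE = """\
2014-02-19 18:54:46,473 [JobControl] INFO
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 524 Succeeded: 105 Running: 140
Failed: 0 Killed: 0, diagnostics=, counters=null

2014-02-19 18:54:47,475 [JobControl] INFO
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status:
status=RUNNING, progress=TotalTasks: 524 Succeeded: 0 Running: 139 Failed:
0 Killed: 0, diagnostics=Vertex re-running, vertexName=scope-800,
vertexId=vertex_1392684249837_0017_1_01
"""

if __name__ == "__main__":
    for ts, before, after in find_succeeded_drops(parse_statuses(SAMPLE)):
        print(f"{ts}: Succeeded dropped from {before} to {after}")
```

Running the same two functions over a full client log would list every point
at which the succeeded count went backwards, which may help when attaching
evidence to a jira.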

Re: Vertex re-running?

Posted by Cheolsoo Park <pi...@gmail.com>.

Thanks Bikas for the answer. I'll file a jira.

I can't say whether this is a problem for all jobs. I was debugging a fairly
complex Pig job, and before the upgrade it was failing at the very last
vertex.

Last night I pulled Tez master and let the job run overnight. It is now stuck
at 0% progress, re-running tasks indefinitely.




On Wed, Feb 19, 2014 at 11:36 AM, Bikas Saha <bi...@hortonworks.com> wrote:

> This will happen when consumer tasks report read errors against producer
> tasks. Could you please share the AM logs, or open a jira and attach the
> AM logs there?
>
>
>
> Is this happening for all jobs? Multiple times in a single job? Does the
> job eventually pass?
>
>
>
> Thanks
>
> Bikas

RE: Vertex re-running?

Posted by Bikas Saha <bi...@hortonworks.com>.

This will happen when consumer tasks report read errors against producer
tasks. Could you please share the AM logs, or open a jira and attach the AM
logs there?
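
To illustrate why the succeeded count can go down, here is a toy model of the
behaviour described above: a producer task that had succeeded is put back
into the pending set once a consumer reports that it could not read its
output. This is only a sketch under that assumption. The ToyVertex class and
its methods are hypothetical and do not correspond to Tez's actual classes or
to its real retry logic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVertex:
    """Hypothetical stand-in for a producer vertex; not a Tez class."""
    name: str
    succeeded: set = field(default_factory=set)
    pending: set = field(default_factory=set)

    def task_succeeded(self, task_id):
        # Record a task as finished and no longer waiting to run.
        self.pending.discard(task_id)
        self.succeeded.add(task_id)

    def report_read_error(self, task_id):
        """A consumer could not read this producer task's output.

        Returns True if the task was moved back to pending (a re-run),
        False if it was not in the succeeded set.
        """
        if task_id not in self.succeeded:
            return False
        self.succeeded.discard(task_id)
        self.pending.add(task_id)
        return True


if __name__ == "__main__":
    producer = ToyVertex("scope-799")
    for task_id in range(3):
        producer.task_succeeded(task_id)
    print(len(producer.succeeded))  # three tasks done
    producer.report_read_error(1)   # consumer fails to fetch task 1's output
    print(len(producer.succeeded))  # count goes backwards
```

In this model, if read errors are reported against every succeeded task, the
succeeded count falls back to zero, which is consistent with the status lines
in the original report. Whether that is what actually happened would have to
be confirmed from the AM logs.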



Is this happening for all jobs? Multiple times in a single job? Does the
job eventually pass?



Thanks

Bikas




