Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2015/04/02 02:29:52 UTC

[jira] [Comment Edited] (TEZ-2251) Enabling auto reduce parallelism in certain jobs causes DAG to hang

    [ https://issues.apache.org/jira/browse/TEZ-2251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391855#comment-14391855 ] 

Bikas Saha edited comment on TEZ-2251 at 4/2/15 12:29 AM:
----------------------------------------------------------

[~rajesh.balamohan] I think we need this additional check for all get*SpecList()
{code}
public synchronized List<InputSpec> getInputSpecList(int taskIndex) throws AMUserCodeException {
//change to
public synchronized List<InputSpec> getInputSpecList(int taskIndex) throws AMUserCodeException {
  readLock.lock();
{code}

We should open a separate jira to change all VertexImpl methods that are currently synchronized to use either the read lock or the write lock. Currently the synchronization is mixed between "synchronized" and the read-write lock, which are independent of each other, so holding one does not exclude a thread holding the other.
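
To illustrate the direction, here is a minimal sketch of a getter guarded by the read lock while a parallelism change takes the write lock. VertexLockSketch, setParallelism() and the String "spec" are hypothetical stand-ins rather than the real VertexImpl code; only the readLock/writeLock pattern is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for VertexImpl (assumption, not the Tez class).
class VertexLockSketch {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  private final Lock readLock = rwLock.readLock();
  private final Lock writeLock = rwLock.writeLock();

  // State that a parallelism change mutates.
  private int numTasks;

  VertexLockSketch(int numTasks) {
    this.numTasks = numTasks;
  }

  // Writer path: stands in for a parallelism update. It holds the write
  // lock, so no reader can observe numTasks while it is being changed.
  void setParallelism(int newNumTasks) {
    writeLock.lock();
    try {
      numTasks = newNumTasks;
    } finally {
      writeLock.unlock();
    }
  }

  // Reader path: stands in for getInputSpecList(). It holds the read lock
  // instead of relying on "synchronized", which uses the object monitor
  // and therefore does not exclude a thread holding writeLock.
  List<String> getInputSpecList(int taskIndex) {
    readLock.lock();
    try {
      if (taskIndex < 0 || taskIndex >= numTasks) {
        throw new IndexOutOfBoundsException("task " + taskIndex + " of " + numTasks);
      }
      List<String> specs = new ArrayList<>();
      // Placeholder spec string; real code would build InputSpec objects.
      specs.add("task " + taskIndex + " of " + numTasks);
      return specs;
    } finally {
      // Always release in finally so an exception cannot leak the lock.
      readLock.unlock();
    }
  }
}
```

With this shape, several readers can run concurrently, but none of them can interleave with setParallelism().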


was (Author: bikassaha):
[~rajesh.balamohan] I think we need this additional check for all get*SpecList()
{code}
public synchronized List<InputSpec> getInputSpecList(int taskIndex) throws AMUserCodeException {
//change to
public synchronized List<InputSpec> getInputSpecList(int taskIndex) throws AMUserCodeException {
  readLock.lock();
{code}

We should open a separate jira to change all VertexImpl methods that are currently synchronized -> to use either readlock or writelock.

> Enabling auto reduce parallelism in certain jobs causes DAG to hang
> -------------------------------------------------------------------
>
>                 Key: TEZ-2251
>                 URL: https://issues.apache.org/jira/browse/TEZ-2251
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2251.VertexImpl.patch, TEZ-2251.fix_but_slows_down.patch, hive_console.png, tez_2251_dag.png
>
>
> Scenario:
> - Run TPCH query20 (https://github.com/cartershanklin/hive-testbench/blob/master/sample-queries-tpch/tpch_query20.sql) at 1 TB scale (tez-master branch, hive trunk)
> - Enable auto reduce parallelism
> - DAG didn't complete and got stuck in "Reducer 6"
> Vertex parallelism updates for "Reducer 5" and "Reducer 6" happen within a span of 3 milliseconds, and tasks of "Reducer 5" end up producing wrong partition details because they see the updated task count of "Reducer 6" when scheduled. This causes the job to hang.


