Posted to dev@tez.apache.org by Saikat Roychowdhury <sa...@yahoo-inc.com.INVALID> on 2015/08/13 20:54:09 UTC

Code pointers to Understand Auto Reduce parallelism

Hi all,
Can someone give some pointers to the code base to understand how auto reduce parallelism works in Tez?
-Saikat

Re: Code pointers to Understand Auto Reduce parallelism

Posted by Saikat Roychowdhury <sa...@yahoo-inc.com.INVALID>.

Thanks Hitesh.

Saikat 


Re: Code pointers to Understand Auto Reduce parallelism

Posted by Hitesh Shah <hi...@apache.org>.

When a vertex is ready to start launching tasks, it looks at the available information about the output generated by upstream vertices and, based on the configured optimal data size per task, re-configures the vertex’s parallelism. In the basic Tez or MapReduce shuffle implementation, the map stage generates, say, 1000 partitions if there are 1000 reducers. The default mode is to assign partition 1 to reducer 1, and so on. If the new parallelism is different, a new routing table is needed to map which partitions are handled by which reducers. Tez does this by overriding the default Shuffle Edge Manager with a custom Edge Manager (i.e., a new routing table).

You can look at ShuffleVertexManager.java and start from schedulePendingTasks() to see how parallelism is determined at runtime.
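
To make the routing-table idea concrete, here is a minimal standalone sketch. It is not the Tez code: the class and method names (AutoReduceSketch, chooseParallelism, buildRoutingTable) are hypothetical, and it assumes that partitions are merged in contiguous ranges and that the new parallelism is simply the total upstream output divided by the desired bytes per task. Check ShuffleVertexManager.java for what Tez actually does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Tez.
final class AutoReduceSketch {

    // Pick a task count so each task gets roughly desiredBytesPerTask of input,
    // never exceeding the originally configured parallelism.
    static int chooseParallelism(long totalOutputBytes, long desiredBytesPerTask,
                                 int configuredParallelism) {
        long needed = (totalOutputBytes + desiredBytesPerTask - 1) / desiredBytesPerTask; // ceiling
        return (int) Math.max(1, Math.min(configuredParallelism, needed));
    }

    // Build the routing table: entry t lists the partitions handled by task t.
    // Assumption: partitions are grouped into contiguous, equally sized ranges,
    // so the table can end up with fewer entries than numTasks.
    static List<List<Integer>> buildRoutingTable(int numPartitions, int numTasks) {
        int perTask = (numPartitions + numTasks - 1) / numTasks; // ceiling
        List<List<Integer>> table = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += perTask) {
            List<Integer> range = new ArrayList<>();
            int end = Math.min(start + perTask, numPartitions);
            for (int p = start; p < end; p++) {
                range.add(p);
            }
            table.add(range);
        }
        return table;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 1000 partitions were produced for 1000 configured reducers,
        // but upstream only wrote about 1000 MB in total.
        int tasks = chooseParallelism(1000 * mb, 100 * mb, 1000);
        List<List<Integer>> table = buildRoutingTable(1000, tasks);
        System.out.println("tasks = " + table.size());
        System.out.println("task 0 reads " + table.get(0).size() + " partitions");
    }
}
```

With the default one-to-one mode, the table would have 1000 entries of one partition each; after re-configuration each remaining task fetches a range of partitions instead, which is the role the custom Edge Manager plays in the description above.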

— Hitesh
