You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tez.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2017/08/14 17:46:00 UTC

[jira] [Created] (TEZ-3818) Support a new data routing policy for small partitions

Ming Ma created TEZ-3818:
----------------------------

             Summary: Support a new data routing policy for small partitions 
                 Key: TEZ-3818
                 URL: https://issues.apache.org/jira/browse/TEZ-3818
             Project: Apache Tez
          Issue Type: Sub-task
            Reporter: Ming Ma


Under the existing fair shuffle manager data routing policies of fair_parallelism and increase_parallelism, small partitions (total size up to the max desirable limit) are processed together by a single destination task.

We have the following use case that will prefer having one destination task process one small partition while still having multiple destination tasks process one large partition. When destination vertex is connected to MultiMROutput and the output format is parquet output format, each instance of parquet output stream consumes extra memory. So if a destination task ends up processing lots of small partitions, it ends up exceeding the task memory limit.

With the new data routing policy, here is the summary of what each data routing policy does.

* reduce_parallelism. The parallelism is decreased to a desired level by having one destination task process multiple consecutive partitions.
* fair_parallelism. The parallelism is adjusted to a desired level by having one destination task process multiple consecutive small partitions and multiple destination tasks process one large partition.
* The new increase_parallelism. The parallelism is increased to a desired level by having one destination task process each small partition and multiple destination tasks process one large partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)