Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/05/28 14:46:28 UTC

[jira] [Updated] (TEZ-2496) Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source

     [ https://issues.apache.org/jira/browse/TEZ-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2496:
----------------------------------
    Attachment: TEZ-2496.1.patch


Storing absolute values of partition sizes would cost lots of memory. Instead, the patch bucketizes the partition sizes (into 0, 1, 10, 100, and 1000 MB buckets) and packs all the details into a single BitSet per task. This information is stored in IOStatisticsImpl and gets aggregated at the vertex level (only when needed). ShuffleVertexManager schedules the tasks based on this information, which is available in VertexImpl.
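To illustrate the bucketing idea, here is a minimal Java sketch (the class and method names are hypothetical, not the patch's actual code). Each task's partition sizes collapse to one set bit per partition, so the per-task cost is a handful of bits per partition rather than a long per partition:
{noformat}
import java.util.BitSet;

// Hypothetical sketch: per-partition output sizes are reduced to coarse
// buckets and packed into a single BitSet per task.
public class PartitionSizeBuckets {
  // Bucket lower bounds in MB: 0, 1, 10, 100, 1000
  private static final long[] BUCKET_START_MB = {0, 1, 10, 100, 1000};
  private static final int NUM_BUCKETS = BUCKET_START_MB.length;

  // For each partition, set the bit of the highest bucket whose lower
  // bound the partition size reaches.
  public static BitSet bucketize(long[] partitionSizesInBytes) {
    BitSet bits = new BitSet(partitionSizesInBytes.length * NUM_BUCKETS);
    for (int p = 0; p < partitionSizesInBytes.length; p++) {
      int bucket = 0;
      for (int b = NUM_BUCKETS - 1; b >= 0; b--) {
        if (partitionSizesInBytes[p] >= BUCKET_START_MB[b] * 1024L * 1024L) {
          bucket = b;
          break;
        }
      }
      bits.set(p * NUM_BUCKETS + bucket);
    }
    return bits;
  }
}
{noformat}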

Some information on the testing:

||Without Patch||Total runtime (sec)||AM_CPU (ms)||TaskCounter_CPU (ms)||
|application_1431919257083_2741|317.63|38450|8901320|
|application_1431919257083_2742|307.41|36970|8945970|
|application_1431919257083_2743|307.22|37020|8926730|

\\

||With Patch||Total runtime (sec)||AM_CPU (ms)||TaskCounter_CPU (ms)||
|application_1431919257083_2745|247.92|34250|8881370|
|application_1431919257083_2746|260.75|35570|8867250|
|application_1431919257083_2747|260.29|34990|8935350|

That is around *15-21%* improvement in overall runtime.  As mentioned earlier, the degree of improvement depends on the job pattern.

Hive command used for testing:
{noformat}
$HIVE_HOME/bin/hive --hiveconf tez.queue.name=hive1 --hiveconf tez.runtime.pipelined-shuffle.enabled=true --hiveconf hive.vectorized.groupby.maxentries=1024 --hiveconf hive.tez.auto.reducer.parallelism=true --hiveconf tez.runtime.io.sort.factor=200  --hiveconf tez.runtime.io.sort.mb=1800 --hiveconf hive.tez.container.size=4096 --hiveconf hive.mapjoin.hybridgrace.hashtable=true -f test.sql
{noformat}


> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2496
>                 URL: https://issues.apache.org/jira/browse/TEZ-2496
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2496.1.patch
>
>
> Consider scheduling tasks in ShuffleVertexManager based on the partition sizes from the source.  This would be helpful in scenarios where there are limited resources (or concurrent jobs running, or multiple waves) with data skew, and the task which gets a large amount of data gets scheduled much later.
> e.g., consider the following Hive query running in a queue with limited capacity (42 slots in total) @ 200 GB scale:
> {noformat}
> CREATE TEMPORARY TABLE sampleData AS
>   SELECT CASE
>            WHEN ss_sold_time_sk IS NULL THEN 70429
>            ELSE ss_sold_time_sk
>        END AS ss_sold_time_sk,
>        ss_item_sk,
>        ss_customer_sk,
>        ss_cdemo_sk,
>        ss_hdemo_sk,
>        ss_addr_sk,
>        ss_store_sk,
>        ss_promo_sk,
>        ss_ticket_number,
>        ss_quantity,
>        ss_wholesale_cost,
>        ss_list_price,
>        ss_sales_price,
>        ss_ext_discount_amt,
>        ss_ext_sales_price,
>        ss_ext_wholesale_cost,
>        ss_ext_list_price,
>        ss_ext_tax,
>        ss_coupon_amt,
>        ss_net_paid,
>        ss_net_paid_inc_tax,
>        ss_net_profit,
>        ss_sold_date_sk
>   FROM store_sales distribute by ss_sold_time_sk;
> {noformat}
> This generated 39 maps and 134 reduce slots (3 reduce waves). When there are lots of NULLs for ss_sold_time_sk, the data tends to skew towards 70429.  If the reducer which gets this data is scheduled much earlier (i.e., in the first wave itself), the entire job would finish faster.
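For illustration, a minimal sketch of the scheduling idea (hypothetical names; the actual ShuffleVertexManager logic is more involved): order reducers by their aggregated partition-size estimates, descending, so a skewed reducer lands in the first wave.
{noformat}
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

// Hypothetical sketch: launch reducers with the largest estimated input
// first, so a skewed reducer (e.g. the one receiving the 70429 key above)
// runs in the first wave instead of the last.
public class SkewAwareOrdering {
  public static Integer[] scheduleOrder(long[] estimatedBytesPerReducer) {
    Integer[] order = IntStream.range(0, estimatedBytesPerReducer.length)
        .boxed().toArray(Integer[]::new);
    // Sort task indices by estimated input size, descending.
    Arrays.sort(order,
        Comparator.comparingLong((Integer i) -> estimatedBytesPerReducer[i]).reversed());
    return order;
  }
}
{noformat}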


