Posted to issues@tez.apache.org by "Tsuyoshi OZAWA (JIRA)" <ji...@apache.org> on 2014/09/01 13:18:20 UTC

[jira] [Commented] (TEZ-1528) Native support for multi cluster aggregations

    [ https://issues.apache.org/jira/browse/TEZ-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117315#comment-14117315 ] 

Tsuyoshi OZAWA commented on TEZ-1528:
-------------------------------------

Thanks for the interesting proposal, Arun. It is very close to what I mentioned on TEZ-145: new APIs to preserve locality at the DAG API level.

Recently, we benchmarked Tez against MapReduce on data ranging from 1GB to 100GB. We confirmed that the MapReduce combiner drastically reduces network and disk I/O for aggregation tasks. I'll share the results soon. I believe this feature would work well for ETL workloads.
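
To illustrate why a combiner cuts the data shipped for an aggregation, here is a minimal sketch in plain Python. It is not Tez or MapReduce API code; the function names (combine, reduce_partials) and the sample records are made up for illustration. Each cluster collapses its raw records into per-key partial sums, and only those partial sums are passed on to the final reducer.

```python
from collections import Counter


def combine(records):
    """Cluster-local combiner (hypothetical helper): collapse raw
    (key, value) records into one partial sum per key."""
    partial = Counter()
    for key, value in records:
        partial[key] += value
    return dict(partial)


def reduce_partials(partials):
    """Final reducer (hypothetical helper): merge per-cluster partial sums."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Made-up raw records held in two clusters; the raw rows stay where they are.
cluster_a = [("alice", 3), ("bob", 1), ("alice", 2)]
cluster_b = [("bob", 4), ("alice", 1), ("bob", 5)]

# Only the combined partials cross the cluster boundary.
partials = [combine(cluster_a), combine(cluster_b)]
totals = reduce_partials(partials)

raw_count = len(cluster_a) + len(cluster_b)     # records without a combiner
shipped_count = sum(len(p) for p in partials)   # records with a combiner
print(totals, raw_count, shipped_count)
```

With only two distinct keys, each cluster ships at most two records no matter how many raw rows it holds, so the saving grows with the number of rows per key.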

> Native support for multi cluster aggregations
> ---------------------------------------------
>
>                 Key: TEZ-1528
>                 URL: https://issues.apache.org/jira/browse/TEZ-1528
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>
> Increasingly, data-sets are partitioned across clusters due to legal or operational considerations. An example is a 'customer-activity' table with partitions for the same 'date', but with sub-partitions located in different clusters across which raw data cannot be moved/copied due to legal considerations.
> It would be nice to have Tez support aggregations across these clusters by providing native support for cross-cluster 'sub-dags' (think auto transform of mapper-reducer to mapper-combiner-reducer split across clusters), an 'edge' with *strict* limits on data-transfer across clusters, etc. Providing such a primitive would make it relatively easier for Hive, Pig etc. to provide SQL queries, ETL applications etc. across clusters. Limits on data-transfer are very important: we should support only transfer of aggregates; joins across clusters are an anti-goal.


