Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2016/02/11 17:16:18 UTC
[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142954#comment-15142954 ]
Rajesh Balamohan commented on TEZ-3113:
---------------------------------------
Is it possible to share the logs and the sort.mb settings for this job?
> massive increase of run time using PipelinedSorter rather than DefaultSorter
> ----------------------------------------------------------------------------
>
> Key: TEZ-3113
> URL: https://issues.apache.org/jira/browse/TEZ-3113
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2
> Environment: scalding 0.15-SNAPSHOT per https://github.com/twitter/scalding/pull/1446
> cascading 3.1.0-wip-54
> tez-0.8.2
> OpenJDK 8 on AMD64
> Hadoop 2.6.0 (YARN, HDFS); Apache distribution
> Debian Linux 8
> 8 * Intel Core i7-3770K
> Reporter: Cyrille Chépélov
>
> A (fairly complex) scalding DAG that ran fine under tez-0.6.2 suddenly takes an extremely long time to run under tez-0.8.2.
> Setting "tez.runtime.sorter.class" back to "LEGACY" restored proper behaviour.
> The problem can be traced to code of this shape:
> {code:scala}
> val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset
> x
> .group
> .mapValues(_ => 1L) // count one per value; avoids shadowing the outer x
> .sum
> .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
> {code}
> where the incoming data contains a very large number of distinct keys. With PipelinedSorter, several hundred thousand files end up flat in the same per-TezChild local temporary directories, and things become very slow (not alleging any causality).
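
For reference, the two settings mentioned in this thread can be sketched as follows. This is a minimal illustration, not part of the original report: a plain Scala Map stands in for the job configuration, the object and method names are hypothetical, and "tez.runtime.io.sort.mb" is assumed to be the full name of the "sort.mb" setting asked about above.

```scala
// Sketch only: a plain Map stands in for the real job configuration.
object SorterSettings {
  // Key quoted in the issue description.
  val SorterClassKey = "tez.runtime.sorter.class"
  // Assumed full name of the "sort.mb" setting requested in the comment.
  val SortMbKey = "tez.runtime.io.sort.mb"

  // Returns a copy of the configuration with the sorter reverted to LEGACY,
  // which is the workaround described in the report.
  def withLegacySorter(conf: Map[String, String]): Map[String, String] =
    conf + (SorterClassKey -> "LEGACY")

  // Lists the sorter-related settings, marking any that are not set explicitly.
  def describe(conf: Map[String, String]): Seq[String] =
    Seq(SorterClassKey, SortMbKey).map { key =>
      val value = conf.getOrElse(key, "<unset>")
      s"$key=$value"
    }
}
```

Calling SorterSettings.describe(SorterSettings.withLegacySorter(jobConf)) on a job's settings gives a short listing that could be attached to the issue alongside the logs; jobConf here is a placeholder for whatever key/value view of the configuration is available.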