Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2016/08/24 17:15:21 UTC

[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter

    [ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435318#comment-15435318 ] 

Hitesh Shah commented on TEZ-3113:
----------------------------------

[~mingma] [~cchepelov] Any chance of updating the logs to help [~rajesh.balamohan] debug further?

> massive increase of run time using PipelinedSorter rather than DefaultSorter
> ----------------------------------------------------------------------------
>
>                 Key: TEZ-3113
>                 URL: https://issues.apache.org/jira/browse/TEZ-3113
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: scalding 0.15-SNAPSHOT per https://github.com/twitter/scalding/pull/1446
> cascading 3.1.0-wip-54
> tez-0.8.2
> OpenJDK 8 on AMD64
> Hadoop 2.6.0 (YARN, HDFS); Apache distribution
> Debian Linux 8
> 8 * Intel Core i7-3770K 
>            Reporter: Cyrille Chépélov
>
> A (fairly complex) scalding DAG that ran fine under tez-0.6.2 suddenly takes an extremely long time to run under tez-0.8.2.
> Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour.
> Difficulties can be traced to this shape of code:
> {code:scala}
> val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset 
> x
>   .group
>   .mapValues(x => 1L)
>   .sum
>   .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
> {code}
> where the incoming data contains many, many different keys. The observed behaviour of PipelinedSorter is that several hundred thousand different files are placed flat in the same per-TezChild local temporary directories, and things become very slow (not alleging any causality).
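>
> A minimal sketch of the workaround mentioned above, assuming the job's settings can be handed over as plain key/value pairs. The object name, the helper and the placeholder "job.name" entry are hypothetical and only illustrate the override; how the resulting map reaches the job depends on the scalding/cascading setup in use.
> {code:scala}
> // Hypothetical helper: adds the sorter override to whatever
> // key/value settings the job already carries.
> object SorterWorkaround {
>   // Property named in this report; "LEGACY" selects the older sorter.
>   val SorterKey = "tez.runtime.sorter.class"
>
>   def withLegacySorter(settings: Map[String, String]): Map[String, String] =
>     settings + (SorterKey -> "LEGACY")
>
>   def main(args: Array[String]): Unit = {
>     // Placeholder entry standing in for the job's existing settings.
>     val base = Map("job.name" -> "count-by-key")
>     withLegacySorter(base).foreach { case (k, v) => println(s"$k=$v") }
>   }
> }
> {code}
> Keeping the override in a single helper makes it easy to drop once the PipelinedSorter behaviour is understood.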



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)