Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/05/14 11:01:59 UTC

[jira] [Resolved] (SPARK-7630) Running TeraSort, when the test data has more than 200 partitions, the sorted output cannot pass TeraValidate.

     [ https://issues.apache.org/jira/browse/SPARK-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-7630.
------------------------------
    Resolution: Invalid

I don't believe this is a problem with Spark itself, or at least it has not been narrowed down enough to show the problem.

> Running TeraSort, when the test data has more than 200 partitions, the sorted output cannot pass TeraValidate.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7630
>                 URL: https://issues.apache.org/jira/browse/SPARK-7630
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.3.1
>            Reporter: Yun Zhao
>
> I took the TeraSort source from https://github.com/ehiggs/spark-terasort/tree/master/src/main/scala/com/github/ehiggs/spark/terasort. The main code is below:
>     val dataset = sc.newAPIHadoopFile[Array[Byte], Array[Byte], TeraInputFormat](inputFile)
>     val sorted = dataset.partitionBy(new TeraSortPartitioner(dataset.partitions.size)).sortByKey()
>     sorted.saveAsNewAPIHadoopFile[TeraOutputFormat](outputFile)
> Every time TeraSort finishes, I validate the output with TeraValidate to check that it is sorted correctly, and I find that sometimes it is not. Specifically, when the dataset has more than 200 partitions the output is not sorted correctly; with 200 or fewer partitions it is. (A rough boundary check is sketched below.)
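> As a concrete illustration, here is a rough sanity check, not the real TeraValidate (the lessEq helper and the boundary logic are illustrative), which continues from the 'sorted' RDD above and verifies that partition boundaries are globally non-decreasing:
>     // Rough sanity check, not the real TeraValidate: collect each partition's
>     // first and last key and verify the boundaries never go backwards.
>     def lessEq(a: Array[Byte], b: Array[Byte]): Boolean = {
>       val n = math.min(a.length, b.length)
>       var i = 0
>       while (i < n) {
>         val x = a(i) & 0xff
>         val y = b(i) & 0xff
>         if (x != y) return x < y
>         i += 1
>       }
>       a.length <= b.length
>     }
>     // (partition index, first key, last key) for every non-empty partition.
>     val bounds = sorted.mapPartitionsWithIndex { (idx, it) =>
>       if (it.isEmpty) Iterator.empty
>       else {
>         var first: Array[Byte] = null
>         var last: Array[Byte] = null
>         it.foreach { case (k, _) => if (first == null) first = k; last = k }
>         Iterator((idx, first, last))
>       }
>     }.collect().sortBy(_._1)
>     // Adjacent partitions must not overlap if the output is globally sorted.
>     val globallySorted = bounds.sliding(2).forall {
>       case Array((_, _, prevLast), (_, nextFirst, _)) => lessEq(prevLast, nextFirst)
>       case _ => true // zero or one non-empty partition
>     }
>     println(s"globally sorted boundaries: $globallySorted")
> (This only spot-checks the partition boundaries; TeraValidate itself scans every record.)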
> After thinking about this for a long time, I set 'spark.shuffle.manager' to 'hash' (its default value is 'sort'). With that setting, the output is sorted correctly no matter how many partitions the dataset has.
> Studying the source code, I found that the '200' is tied to the parameter 'spark.shuffle.sort.bypassMergeThreshold', whose default value is 200. I then set 'spark.shuffle.sort.bypassMergeThreshold' to 400, and, as expected, partition counts between 200 and 400 then also produce correctly sorted output (see the sketch below).
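> For reference, a minimal sketch of how the two workarounds above can be applied when constructing the context (both property keys are real Spark settings; the app name is illustrative, and one would normally pick only one of the two workarounds):
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       .setAppName("TeraSort") // illustrative name
>       // Workaround 1: switch from the default 'sort' shuffle manager to 'hash'.
>       .set("spark.shuffle.manager", "hash")
>       // Workaround 2 (alternative): raise the bypass-merge threshold above the
>       // job's partition count so the sort shuffle's bypass path is not taken.
>       .set("spark.shuffle.sort.bypassMergeThreshold", "400")
>     val sc = new SparkContext(conf)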
> However, I still don't know how to solve this bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org