Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2014/08/18 09:44:18 UTC

[jira] [Commented] (SPARK-3098) In some cases, the groupByKey operation returns wrong results

    [ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100402#comment-14100402 ] 

Sean Owen commented on SPARK-3098:
----------------------------------

What are your key types? Are you sure they are suitable as keys, i.e. that they implement hashCode and equals correctly?
Are you certain the key-value pairs in the result do not *also* appear in your input? It's not clear what is in the RDD before and after the groupByKey operation. It seems better to rule out some small error in this analysis first.
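To illustrate the first question, here is a minimal sketch (the RawKey and GoodKey classes are hypothetical, not from the report) of how a key type without proper equals/hashCode silently splits one logical key into several groups, while a case class groups correctly:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical key type relying on default reference equality:
// two instances with the same id are NOT equal and get different
// identity hash codes, so groupByKey treats them as distinct keys.
class RawKey(val id: Int) extends Serializable

// A case class gets structural equals/hashCode for free, so
// instances with the same id land in the same group.
case class GoodKey(id: Int)

object KeySuitability {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("KeySuitability").setMaster("local[*]"))

    val bad = sc.parallelize(Seq((new RawKey(1), "a"), (new RawKey(1), "b")))
    println(bad.groupByKey().count())   // prints 2: one group per instance

    val good = sc.parallelize(Seq((GoodKey(1), "a"), (GoodKey(1), "b")))
    println(good.groupByKey().count())  // prints 1: a single group

    sc.stop()
  }
}
{code}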

>  In some cases, the groupByKey operation returns wrong results
> --------------------------------------------------------------
>
>                 Key: SPARK-3098
>                 URL: https://issues.apache.org/jira/browse/SPARK-3098
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.1
>            Reporter: Guoqiang Li
>            Priority: Blocker
>
> I do not know how to reproduce the bug.
> Here is the case: when I ran groupByKey over 10 billion records, the results were wrong:
> {noformat}
> (4696501, 370568)
> (4696501, 376672)
> (4696501, 374880)
> .....
> (4696502, 350264)
> (4696502, 358458)
> (4696502, 398502)
> ......
> {noformat} 
> => 
> {noformat}
> (4696501,ArrayBuffer(350264, 358458, 398502 ........)), (4696502,ArrayBuffer(376621, ......))
> {noformat}
> Code:
> {code}
>     val dealOuts = clickPreferences(sc, dealOutPath, periodTime)
>     val dealOrders = orderPreferences(sc, dealOrderPath, periodTime)
>     val favorites = favoritePreferences(sc, favoritePath, periodTime)
>     val allBehaviors = (dealOrders ++ favorites ++ dealOuts)
>     val preferences = allBehaviors.groupByKey().map { ... }
> {code}
> spark-defaults.conf:
> {code}
> spark.default.parallelism    280
> {code}
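For anyone trying to reproduce this, here is a self-contained sketch of the same union-then-groupByKey pattern (the placeholder data is hypothetical; the real clickPreferences/orderPreferences/favoritePreferences helpers are elided in the report):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object GroupByKeyRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("GroupByKeyRepro").setMaster("local[*]"))

    // Hypothetical stand-ins for the three behavior RDDs:
    val dealOuts   = sc.parallelize(Seq((4696501L, 370568L), (4696501L, 376672L)))
    val dealOrders = sc.parallelize(Seq((4696501L, 374880L), (4696502L, 350264L)))
    val favorites  = sc.parallelize(Seq((4696502L, 358458L), (4696502L, 398502L)))

    // Union the three RDDs, then collect all values per key;
    // spark.default.parallelism controls the shuffle partition
    // count of the groupByKey stage.
    val allBehaviors = dealOrders ++ favorites ++ dealOuts
    val preferences  = allBehaviors.groupByKey().map { case (k, vs) => (k, vs.toSeq.sorted) }

    preferences.collect().foreach(println)
    sc.stop()
  }
}
{code}

With correct grouping, each key in this toy input should appear exactly once in the output, with all of its input values and nothing else.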



