
Posted to dev@crunch.apache.org by "Gabriel Reid (JIRA)" <ji...@apache.org> on 2014/08/07 20:58:12 UTC

[jira] [Commented] (CRUNCH-458) Eliminate potentially random MR split-point decisions

    [ https://issues.apache.org/jira/browse/CRUNCH-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089608#comment-14089608 ] 

Gabriel Reid commented on CRUNCH-458:
-------------------------------------

Making the planning logic deterministic definitely sounds like a good plan. I was thinking that it might be a bit better to use sorted collections (TreeSet and the like) instead of LinkedHashSet, so that the behavior is the same regardless of the order in which node paths are added; that would protect against calling code using a HashSet somewhere. On the other hand, that might be over-thinking it, and it would mean we need Comparators for PCollections and NodePaths. Anyhow, just something to consider.
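
To make the trade-off concrete, here is a minimal sketch of the difference. It is not Crunch code: PathStub and BY_NAME are hypothetical stand-ins for a NodePath and the Comparator it would need. It only shows that a LinkedHashSet iterates in whatever order the caller inserted, while a TreeSet iterates in comparator order no matter how the elements arrived.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DeterministicOrderSketch {

    // Hypothetical stand-in for a planner node path; NOT a Crunch class.
    static final class PathStub {
        final String name;
        PathStub(String name) { this.name = name; }
    }

    // Hypothetical comparator; real NodePaths would need some stable sort key.
    static final Comparator<PathStub> BY_NAME = Comparator.comparing(p -> p.name);

    // Adds one stub per name to the given set, then returns the names in the
    // set's iteration order.
    static List<String> namesVia(Set<PathStub> set, String... names) {
        for (String n : names) {
            set.add(new PathStub(n));
        }
        List<String> out = new ArrayList<>();
        for (PathStub p : set) {
            out.add(p.name);
        }
        return out;
    }

    // LinkedHashSet: iteration order mirrors the caller's insertion order.
    static List<String> insertionOrder(String... names) {
        return namesVia(new LinkedHashSet<>(), names);
    }

    // TreeSet: iteration order is fixed by the comparator alone.
    static List<String> sortedOrder(String... names) {
        return namesVia(new TreeSet<>(BY_NAME), names);
    }

    public static void main(String[] args) {
        // Same elements, added in two different orders (as a HashSet-backed
        // caller might do from one run to the next).
        System.out.println(insertionOrder("gbk2", "gbk1"));
        System.out.println(insertionOrder("gbk1", "gbk2"));
        System.out.println(sortedOrder("gbk2", "gbk1"));
        System.out.println(sortedOrder("gbk1", "gbk2"));
    }
}
```

The two insertionOrder calls disagree with each other, while the two sortedOrder calls agree. The cost is exactly the one mentioned above: every element type stored this way needs a consistent Comparator.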

I'm curious, what was the NPE that you were getting when an alternate plan was being created? Was that something in your own code, or in Crunch?

> Eliminate potentially random MR split-point decisions
> -----------------------------------------------------
>
>                 Key: CRUNCH-458
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-458
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Josh Wills
>         Attachments: CRUNCH-458.patch
>
>
> I'm running into a pipeline in which the decision of where to split two dependent jobs seems to be random from run to run (I only noticed it because one of the runs causes the pipeline to throw an NPE, and the other does not). I'd like to investigate this and try to eliminate any potential sources of randomness in the way that two dependent GBK operations are split.


