Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2014/06/06 19:36:02 UTC

[jira] [Created] (MAHOUT-1573) More explicit parallelism adjustments in math-scala DRM apis; elements of automatic re-adjustments

Dmitriy Lyubimov created MAHOUT-1573:
----------------------------------------

             Summary: More explicit parallelism adjustments in math-scala DRM apis; elements of automatic re-adjustments
                 Key: MAHOUT-1573
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1573
             Project: Mahout
          Issue Type: Task
    Affects Versions: 0.9
            Reporter: Dmitriy Lyubimov
            Assignee: Dmitriy Lyubimov
             Fix For: 1.0


(1) Add a minSplit parameter pass-through to drmFromHDFS so that parallelism can be increased explicitly.

(2) Add a parallelism readjustment parameter to the checkpoint() call. If specified, this implies a shuffle-less coalesce() translation applied to the data set before it is requested to be cached.
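
To illustrate why the two knobs are complementary: a shuffle-less coalesce can only merge existing partitions, so it lowers parallelism but never raises it, whereas minSplit in (1) is the way to raise it. The snippet below is a toy model in plain Scala, not Spark's or Mahout's implementation; the names CoalesceModel and coalesceNoShuffle are hypothetical, and real coalescing also takes data locality into account.

```scala
// Toy model only (hypothetical names): partitions are plain in-memory vectors.
object CoalesceModel {
  /** Merge neighbouring partitions into at most `target` groups.
    * Rows are never redistributed individually, so the partition count
    * cannot grow: asking for more partitions than exist is a no-op. */
  def coalesceNoShuffle[A](parts: Vector[Vector[A]], target: Int): Vector[Vector[A]] = {
    require(target > 0, "target must be positive")
    if (target >= parts.size) parts
    else
      parts.zipWithIndex
        // Map partition index i to one of `target` contiguous groups.
        .groupBy { case (_, i) => (i.toLong * target / parts.size).toInt }
        .toVector
        .sortBy(_._1)
        // Concatenate whole partitions, preserving their original order.
        .map { case (_, group) => group.sortBy(_._2).flatMap(_._1) }
  }
}
```

Under this model, requesting 2 partitions from 4 merges them pairwise, while requesting 8 leaves all 4 untouched.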

Going forward, we should probably try to figure out how to automate this, at least a little bit. For example, the simplest automatic adjustment might re-adjust parallelism on load to simply fit the cluster size (95% or 180% of cluster size, for example), with some rule-of-thumb safeguards, e.g. we cannot exceed a factor of, say, 8 (or whatever we configure) when splitting each original HDFS split. We should be able to get reasonable parallelism performance out of the box with simple heuristics like that.
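
A minimal sketch of such a heuristic, in plain Scala, might look as follows. Everything here is an assumption for illustration: ParallelismHeuristic, targetParallelism and their parameters are hypothetical names and not part of the Mahout API, the number of cluster task slots is assumed to be known, and the sketch assumes that loading can only split further (so the result never drops below the original split count).

```scala
// Hypothetical helper, not part of Mahout: picks a partition count on load.
object ParallelismHeuristic {
  /** @param originalSplits number of HDFS splits the input arrives with
    * @param clusterSlots   total task slots in the cluster (assumed known)
    * @param fillFactor     fraction of the cluster to target, e.g. 0.95 or 1.8
    * @param maxSplitFactor safeguard: max partitions per original HDFS split */
  def targetParallelism(originalSplits: Int,
                        clusterSlots: Int,
                        fillFactor: Double = 0.95,
                        maxSplitFactor: Int = 8): Int = {
    require(originalSplits > 0 && clusterSlots > 0, "counts must be positive")
    require(fillFactor > 0 && maxSplitFactor >= 1, "invalid tuning parameters")
    // Desired parallelism: a fixed fraction (or multiple) of the cluster size.
    val desired = math.max(1, math.round(clusterSlots * fillFactor).toInt)
    // Safeguard: never cut an original split into more than maxSplitFactor pieces.
    val ceiling = originalSplits.toLong * maxSplitFactor
    // Assumption: loading can only split further, so never go below originalSplits.
    math.min(math.max(desired, originalSplits).toLong, ceiling).toInt
  }
}
```

For example, with 10 original splits on a 100-slot cluster the desired value of 95 is capped at 10 * 8 = 80, whereas with 40 original splits and a fill factor of 1.8 the desired value of 180 fits under the cap and is used as is.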



--
This message was sent by Atlassian JIRA
(v6.2#6252)