Posted to dev@mahout.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2014/06/18 23:36:25 UTC
[jira] [Commented] (MAHOUT-1573) More explicit parallelism
adjustments in math-scala DRM apis; elements of automatic parallelism
management
[ https://issues.apache.org/jira/browse/MAHOUT-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036429#comment-14036429 ]
ASF GitHub Bot commented on MAHOUT-1573:
----------------------------------------
GitHub user asfgit closed the pull request at:
https://github.com/apache/mahout/pull/13
> More explicit parallelism adjustments in math-scala DRM apis; elements of automatic parallelism management
> ----------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1573
> URL: https://issues.apache.org/jira/browse/MAHOUT-1573
> Project: Mahout
> Issue Type: Task
> Affects Versions: 0.9
> Reporter: Dmitriy Lyubimov
> Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> (1) add minSplit parameter pass-thru to drmFromHDFS to be able to explicitly increase parallelism.
> (2) add a parallelism readjustment parameter to the checkpoint() call. This implies a shuffle-less coalesce() translation applied to the data set before it is requested to be cached (if specified).
> Going forward, we should probably try to figure out how to automate this, at least a little. For example, the simplest automatic adjustment might re-adjust parallelism on load to fit the cluster size (95% or 180% of cluster size, for example), with some rule-of-thumb safeguards, e.g. we cannot exceed a factor of, say, 8 (or whatever we configure) when splitting each original HDFS split. Simple heuristics like that should give reasonable parallelism performance out of the box.
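
The load-time heuristic described in the issue can be sketched as plain arithmetic. This is only an illustration, not Mahout code: the function name target_parallelism and its parameters (original_splits, cluster_slots, load_percent, max_split_factor) are hypothetical, and the lower bound at the original split count is an assumption, based on the reading that a minSplit-style parameter can only increase parallelism.

```python
def target_parallelism(original_splits, cluster_slots,
                       load_percent=95, max_split_factor=8):
    """Suggest a partition count for a data set at load time.

    original_splits  -- number of splits the input has by default
    cluster_slots    -- number of parallel task slots in the cluster
    load_percent     -- share of the cluster to fill, e.g. 95 or 180
    max_split_factor -- safeguard: how finely one original split may be cut
    """
    if original_splits < 1 or cluster_slots < 1:
        raise ValueError("original_splits and cluster_slots must be >= 1")
    if load_percent < 1 or max_split_factor < 1:
        raise ValueError("load_percent and max_split_factor must be >= 1")

    # Desired parallelism: the requested share of the cluster, rounded up.
    # Integer arithmetic avoids floating-point surprises in the rounding.
    desired = (cluster_slots * load_percent + 99) // 100

    # Assumption: loading can only increase parallelism, never reduce it.
    desired = max(desired, original_splits)

    # Safeguard: never cut an original split into more than
    # max_split_factor pieces.
    return min(desired, original_splits * max_split_factor)


if __name__ == "__main__":
    # 10 input splits on a 100-slot cluster: the 95% target (95) is
    # capped by the factor-of-8 safeguard.
    print(target_parallelism(original_splits=10, cluster_slots=100))
```

With 20 input splits on the same hypothetical 100-slot cluster, the 95% target of 95 partitions is within the cap of 160 and is returned unchanged; at 180% the target of 180 would be cut back to 160.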
--
This message was sent by Atlassian JIRA
(v6.2#6252)