Posted to issues@systemml.apache.org by "LI Guobao (JIRA)" <ji...@apache.org> on 2018/06/18 09:50:00 UTC
[jira] [Closed] (SYSTEMML-2398) Paramserv ASP aggregation overhead on update per epoch
[ https://issues.apache.org/jira/browse/SYSTEMML-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao closed SYSTEMML-2398.
-------------------------------
Resolved by no longer having multiple worker threads invoke the synchronized _updateModel_ method, which caused heavy serialization between the parameter server (ps) and the workers.
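The closing comment does not show the actual patch, so the following is only a minimal sketch of one way to keep worker threads from contending on a synchronized update method: workers enqueue their gradients, and a single aggregator thread is the only one that touches the model. The class SingleAggregatorSketch, its fields, and the dummy all-ones gradient are hypothetical and are not taken from the SystemML code base.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not SystemML's parameter server.
public class SingleAggregatorSketch {
    // Sentinel that tells the aggregator thread to stop.
    private static final double[] POISON = new double[0];

    final double[] model;
    int appliedUpdates;

    public SingleAggregatorSketch(int size) {
        this.model = new double[size];
    }

    public void run(int workers, int pushesPerWorker) throws Exception {
        BlockingQueue<double[]> queue = new LinkedBlockingQueue<>();

        // Only this thread updates the model, so no synchronized
        // update method is shared among the workers.
        ExecutorService aggregator = Executors.newSingleThreadExecutor();
        Future<Integer> applied = aggregator.submit(() -> {
            int count = 0;
            while (true) {
                double[] g = queue.take();
                if (g == POISON) {
                    return count;
                }
                for (int i = 0; i < model.length; i++) {
                    model[i] -= g[i];
                }
                count++;
            }
        });

        // Workers hand over gradients without blocking on the model.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                for (int p = 0; p < pushesPerWorker; p++) {
                    double[] g = new double[model.length];
                    Arrays.fill(g, 1.0); // dummy gradient (assumption)
                    queue.add(g);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers timed out");
        }

        queue.add(POISON);
        appliedUpdates = applied.get(); // also makes model writes visible here
        aggregator.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SingleAggregatorSketch s = new SingleAggregatorSketch(3);
        s.run(4, 2);
        System.out.println("applied updates: " + s.appliedUpdates);
        System.out.println("model: " + Arrays.toString(s.model));
    }
}
```

In this sketch the workers only pay for a queue insertion per push, and the cost of applying updates stays on one thread instead of being serialized across all workers behind a lock.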
> Paramserv ASP aggregation overhead on update per epoch
> ------------------------------------------------------
>
> Key: SYSTEMML-2398
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Fix For: SystemML 1.2
>
>
> Here are the statistics of mnist60K, 2 epochs, 80 workers in ASP
> {code}
> SystemML Statistics:
> Total elapsed time: 449.548 sec.
> Total compilation time: 1.995 sec.
> Total execution time: 447.553 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 970241/0/0/2.
> Cache writes (WB, FS, HDFS): 55191/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.048/0.120/1.087/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/13582.
> HOP DAGs recompile time: 24.473 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.013 sec.
> Paramserv func number of workers: 79.
> Paramserv func total gradients compute time: 1714.962 secs.
> Paramserv func total aggregation time: 428.499 secs.
> Paramserv func model broadcasting time: 2.080 secs.
> Paramserv func total batch slicing time: 0.0190000000 secs.
> Total JIT compile time: 37.461 sec.
> Total JVM GC count: 66.
> Total JVM GC time: 7.098 sec.
> Heavy hitter instructions:
> # Instruction Time(s) Count
> 1 conv2d_bias_add 719.111 13768
> 2 paramserv 437.051 1
> 3 relu_backward 210.414 20370
> 4 ba+* 180.001 40928
> 5 conv2d_backward_filter 175.104 13580
> 6 +* 156.714 81480
> 7 conv2d_backward_data 140.779 6790
> 8 * 123.502 95173
> 9 -* 104.058 54320
> 10 - 94.502 74985
> {code}
> As we can see, the aggregation is a major bottleneck. This is unexpected given the coarse-grained update per epoch. [~Guobao] could you please have a look and profile where this is coming from?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)