Posted to issues@systemml.apache.org by "LI Guobao (JIRA)" <ji...@apache.org> on 2018/06/18 09:50:00 UTC
[jira] [Closed] (SYSTEMML-2398) Paramserv ASP aggregation overhead on update per epoch

     [ https://issues.apache.org/jira/browse/SYSTEMML-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

LI Guobao closed SYSTEMML-2398.
-------------------------------

It was resolved by no longer having multiple worker threads invoke the synchronized _updateModel_ method, which led to heavy serialization between the parameter server (ps) and the workers.
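
For illustration only, here is a minimal Java sketch of one way to keep many worker threads from contending on a single synchronized update method: workers hand their gradients to a queue, and one aggregator thread applies them to the model. This is not the SystemML code and not necessarily how the fix was implemented. The class and method names (SingleAggregatorSketch, Model, applyGradient, run), the dummy gradients, and the fixed learning rate are all assumptions made for the example, and the step where workers pull the updated model back is left out.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class SingleAggregatorSketch {

    // Hypothetical stand-in for the model; not a SystemML class.
    static final class Model {
        private final double[] weights;

        Model(int dim) {
            weights = new double[dim];
        }

        // Called only from the single aggregator thread, so it needs no lock.
        void applyGradient(double[] grad, double learningRate) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] -= learningRate * grad[i];
            }
        }

        double[] snapshot() {
            return weights.clone();
        }
    }

    // Runs numWorkers threads that each push updatesPerWorker dummy gradients,
    // while one aggregator thread applies them. Returns the final weights.
    static double[] run(int numWorkers, int updatesPerWorker, int dim)
            throws InterruptedException {
        Model model = new Model(dim);
        BlockingQueue<double[]> gradients = new LinkedBlockingQueue<>();
        int totalUpdates = numWorkers * updatesPerWorker;

        // Single consumer: the only thread that ever touches the model.
        Thread aggregator = new Thread(() -> {
            try {
                for (int k = 0; k < totalUpdates; k++) {
                    model.applyGradient(gradients.take(), 0.5);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        aggregator.start();

        // Producers: workers enqueue gradients instead of calling a
        // synchronized update method on the model themselves.
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        for (int w = 0; w < numWorkers; w++) {
            workers.submit(() -> {
                for (int u = 0; u < updatesPerWorker; u++) {
                    double[] grad = new double[dim];
                    Arrays.fill(grad, 1.0); // dummy gradient, assumption for the demo
                    gradients.add(grad);
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);

        // join() makes the aggregator's writes visible to this thread.
        aggregator.join();
        return model.snapshot();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(Arrays.toString(run(8, 10, 4)));
    }
}
```

In this sketch the workers block only on the queue's internal lock for the short enqueue, rather than waiting for a full model update to finish under a method-level lock.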

> Paramserv ASP aggregation overhead on update per epoch
> ------------------------------------------------------
>
>                 Key: SYSTEMML-2398
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>
> Here are the statistics for mnist60K, 2 epochs, 80 workers in ASP:
> {code}
> SystemML Statistics:
> Total elapsed time:		449.548 sec.
> Total compilation time:		1.995 sec.
> Total execution time:		447.553 sec.
> Number of compiled MR Jobs:	0.
> Number of executed MR Jobs:	0.
> Cache hits (Mem, WB, FS, HDFS):	970241/0/0/2.
> Cache writes (WB, FS, HDFS):	55191/0/0.
> Cache times (ACQr/m, RLS, EXP):	1.048/0.120/1.087/0.000 sec.
> HOP DAGs recompiled (PRED, SB):	0/13582.
> HOP DAGs recompile time:	24.473 sec.
> Functions recompiled:		1.
> Functions recompile time:	0.013 sec.
> Paramserv func number of workers:	79.
> Paramserv func total gradients compute time:	1714.962 secs.
> Paramserv func total aggregation time:	428.499 secs.
> Paramserv func model broadcasting time:	2.080 secs.
> Paramserv func total batch slicing time:	0.0190000000 secs.
> Total JIT compile time:		37.461 sec.
> Total JVM GC count:		66.
> Total JVM GC time:		7.098 sec.
> Heavy hitter instructions:
>   #  Instruction             Time(s)  Count
>   1  conv2d_bias_add         719.111  13768
>   2  paramserv               437.051      1
>   3  relu_backward           210.414  20370
>   4  ba+*                    180.001  40928
>   5  conv2d_backward_filter  175.104  13580
>   6  +*                      156.714  81480
>   7  conv2d_backward_data    140.779   6790
>   8  *                       123.502  95173
>   9  -*                      104.058  54320
>  10  -                        94.502  74985
> {code}
> As we can see, the aggregation is a major bottleneck. This is unexpected given the coarse-grained update per epoch. [~Guobao] could you please have a look and profile where this is coming from?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)