Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/10/05 04:06:20 UTC
[jira] [Commented] (SYSTEMML-1011) Slow sparse append cbind (sparse row re-allocations)

    [ https://issues.apache.org/jira/browse/SYSTEMML-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15547539#comment-15547539 ] 

Matthias Boehm commented on SYSTEMML-1011:
------------------------------------------

With an allocate-once policy, we see small end-to-end improvements (as shown below). However, especially for sparse data, we should execute these append operations multi-threaded (SYSTEMML-1012), which has the potential to mitigate allocation inefficiencies.

{code}
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
-- Running runLinearRegCG on 10M_1k_sparse (all configs)
LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 8
LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 14
LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 14
{code}  
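
The multi-threaded direction mentioned in the comment could look roughly like the sketch below. It is an illustration only, not the SystemML implementation: the class ParallelAppendSketch, the method appendColumn, and the row layout (one int[] of column indexes plus one double[] of values per row) are hypothetical, and Java parallel streams are used simply because the output rows are independent of each other.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: append one dense column to a row-wise sparse matrix,
// building the output rows in parallel. Not SystemML code.
final class ParallelAppendSketch {

    /**
     * @param idx      column indexes of the non-zeros, one array per row
     * @param val      values of the non-zeros, aligned with idx
     * @param col      dense column to append (e.g. all 1s for the intercept)
     * @param leftCols number of columns of the left-hand matrix
     * @param outIdx   receives the output column indexes per row
     * @param outVal   receives the output values per row
     */
    static void appendColumn(int[][] idx, double[][] val, double[] col, int leftCols,
                             int[][] outIdx, double[][] outVal) {
        // Each row is written by exactly one task, so no synchronization is needed.
        IntStream.range(0, idx.length).parallel().forEach(i -> {
            int lnnz = idx[i].length;
            int extra = (col[i] != 0) ? 1 : 0;   // appended cell only stored if non-zero
            // Single allocation per array, sized to the final number of non-zeros.
            int[] oi = Arrays.copyOf(idx[i], lnnz + extra);
            double[] ov = Arrays.copyOf(val[i], lnnz + extra);
            if (extra == 1) {
                oi[lnnz] = leftCols;             // new column index after the left block
                ov[lnnz] = col[i];
            }
            outIdx[i] = oi;
            outVal[i] = ov;
        });
    }
}
```

Partitioning by row keeps each task's allocations local to that task. Whether this actually reduces the GC overhead reported in the issue would have to be measured.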

> Slow sparse append cbind (sparse row re-allocations)
> ----------------------------------------------------
>
>                 Key: SYSTEMML-1011
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1011
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>
> All algorithms that support the 'intercept' option (e.g., LinregCG, LinregDS, L2SVM, MSVM, Mlogreg, and GLM) append a column of 1s at the beginning of the script. On large sparse data, this append sometimes dominates end-to-end performance. For example, here are the LinregCG results for a 10Mx1K scenario with sparsity 0.01.
> {code}
> -- Running runLinearRegCG on 10M_1k_sparse (all configs)
> LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
> LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
> LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
> -- Running runLinearRegCG on 10M_1k_sparse (all configs)
> LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
> LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
> LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
> -- Running runLinearRegCG on 10M_1k_sparse (all configs)
> LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 6
> LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 15
> LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 16
> -- Running runLinearRegCG on 10M_1k_sparse (all configs)
> LinRegCG train ict=0 on mbperftest/binomial/X10M_1k_sparse: 7
> LinRegCG train ict=1 on mbperftest/binomial/X10M_1k_sparse: 16
> LinRegCG train ict=2 on mbperftest/binomial/X10M_1k_sparse: 15
> {code}
> and here is the related -stats output for ict=1.
> {code}
> Total elapsed time:		16.893 sec.
> Total compilation time:		2.412 sec.
> Total execution time:		14.480 sec.
> Number of compiled Spark inst:	0.
> Number of executed Spark inst:	0.
> Cache hits (Mem, WB, FS, HDFS):	172/0/0/2.
> Cache writes (WB, FS, HDFS):	77/0/1.
> Cache times (ACQr/m, RLS, EXP):	1.734/0.003/2.143/0.209 sec.
> HOP DAGs recompiled (PRED, SB):	0/0.
> HOP DAGs recompile time:	0.000 sec.
> Spark ctx create time (lazy):	0.000 sec.
> Spark trans counts (par,bc,col):0/0/0.
> Spark trans times (par,bc,col):	0.000/0.000/0.000 secs.
> Total JIT compile time:		5.357 sec.
> Total JVM GC count:		2.
> Total JVM GC time:		5.628 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) 	append 	8.595 sec 	26
> -- 2) 	mmchain 	4.443 sec 	8
> -- 3) 	ba+* 	0.537 sec 	10
> -- 4) 	r' 	0.411 sec 	10
> -- 5) 	write 	0.210 sec 	1
> -- 6) 	- 	0.087 sec 	20
> -- 7) 	uak+ 	0.059 sec 	2
> -- 8) 	tsmm 	0.049 sec 	11
> -- 9) 	rand 	0.043 sec 	5
> -- 10) 	+* 	0.007 sec 	24
> {code}
> The large GC time (5.628 sec of 14.480 sec total execution time) indicates that sparse row re-allocations are a major issue here. We should compute the joint number of non-zeros (nnz) per output row, and allocate the output sparse row just once.
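
The allocate-once idea from the issue description can be sketched as follows. This is a minimal illustration under stated assumptions, not SystemML's actual sparse row or append code: the classes SparseRowSketch and AppendOnceSketch and the methods cbindRow and cbind are hypothetical names. The point is only that the output row is sized to nnz(left) + nnz(right) up front, so appending never triggers a re-allocation.

```java
// Hypothetical minimal sparse row: parallel arrays of column indexes and values.
final class SparseRowSketch {
    final int[] indexes;
    final double[] values;
    int size; // number of non-zeros currently stored

    SparseRowSketch(int capacity) {
        indexes = new int[capacity];
        values = new double[capacity];
    }

    // Appends a non-zero; callers size the row exactly, so this never grows the arrays.
    void append(int col, double val) {
        indexes[size] = col;
        values[size] = val;
        size++;
    }
}

final class AppendOnceSketch {

    // cbind of one row pair: compute the joint nnz first, then allocate the output once.
    static SparseRowSketch cbindRow(SparseRowSketch left, SparseRowSketch right, int leftCols) {
        int lnnz = (left == null) ? 0 : left.size;    // null represents an empty row
        int rnnz = (right == null) ? 0 : right.size;
        SparseRowSketch out = new SparseRowSketch(lnnz + rnnz); // single allocation
        for (int k = 0; k < lnnz; k++)
            out.append(left.indexes[k], left.values[k]);
        for (int k = 0; k < rnnz; k++)
            out.append(leftCols + right.indexes[k], right.values[k]); // shift column index
        return out;
    }

    // cbind of two row-wise sparse matrices with the same number of rows.
    static SparseRowSketch[] cbind(SparseRowSketch[] left, SparseRowSketch[] right, int leftCols) {
        SparseRowSketch[] out = new SparseRowSketch[left.length];
        for (int i = 0; i < left.length; i++)
            out[i] = cbindRow(left[i], right[i], leftCols);
        return out;
    }
}
```

For the intercept case, right would hold a single non-zero (the 1) per row, so every output row gets exactly nnz(left) + 1 slots instead of being grown and copied while values are appended.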