Posted to dev@singa.apache.org by GitBox <gi...@apache.org> on 2020/05/31 16:28:10 UTC

[GitHub] [singa] chrishkchris opened a new pull request #716: [WIP] SINGA-510 Distributed Training Time Profiling

chrishkchris opened a new pull request #716:
URL: https://github.com/apache/singa/pull/716


   I am working on time profiling for distributed training.
   Done:
   1. Moved the CUDA streams from `Communicator` to `Context`
   2. Time profiling for `Communicator` buffered operations
   To do:
   1. Rewrite the backward time profiling (verbosity=1) to adapt it for distributed training
   2. Double-checking
   
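   For context on the report format below, each buffered operation is timed individually and printed as `OP_ID<n>. <name> : <seconds> sec`. As a rough host-side illustration only (hypothetical names; SINGA's actual implementation records device time through the CUDA streams in `Context`, not wall-clock time), a minimal profiler producing a report in that style could look like:
   ```python
   import time
   from contextlib import contextmanager

   class OpProfiler:
       """Collects a wall-clock duration per operation and prints a
       report in the same style as the log below (hypothetical sketch)."""

       def __init__(self):
           self.records = []  # list of (op_id, name, seconds)

       @contextmanager
       def profile(self, name):
           # Time the enclosed block and record it under the next OP_ID.
           start = time.perf_counter()
           yield
           elapsed = time.perf_counter() - start
           self.records.append((len(self.records), name, elapsed))

       def report(self):
           lines = ["Time Profiling:"]
           for op_id, name, sec in self.records:
               lines.append(f"OP_ID{op_id}. {name} : {sec:g} sec")
           return "\n".join(lines)

   prof = OpProfiler()
   with prof.profile("SetValue"):
       sum(range(1000))          # stand-in for a real kernel launch
   with prof.profile("cudnnConvForward"):
       sum(range(1000))
   print(prof.report())
   ```
   Note that for GPU kernels, wall-clock timing on the host is misleading because launches are asynchronous; timing on the device (e.g. with CUDA events on the operation's stream) is what makes the per-operation numbers below meaningful.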
   Current result, tested on panda 7 (two GPUs):
   ```
   Time Profiling:
   OP_ID0. SetValue : 1.1173e-05 sec
   OP_ID1. cudnnConvForward : 0.000604602 sec
   OP_ID2. GpuBatchNormForwardTraining : 0.000561895 sec
   OP_ID3. ReLU : 0.000372956 sec
   OP_ID4. GpuPoolingForward : 0.000242873 sec
   OP_ID5. SetValue : 6.48188e-06 sec
   OP_ID6. cudnnConvForward : 0.000125797 sec
   OP_ID7. GpuBatchNormForwardTraining : 0.000892871 sec
   OP_ID8. ReLU : 9.91812e-05 sec
   OP_ID9. SetValue : 6.59137e-06 sec
   OP_ID10. cudnnConvForward : 0.000641241 sec
   OP_ID11. GpuBatchNormForwardTraining : 0.000152563 sec
   OP_ID12. ReLU : 9.87247e-05 sec
   OP_ID13. SetValue : 6.35137e-06 sec
   OP_ID14. cudnnConvForward : 0.000403831 sec
   OP_ID15. GpuBatchNormForwardTraining : 0.00137244 sec
   OP_ID16. SetValue : 6.39059e-06 sec
   OP_ID17. cudnnConvForward : 0.000403334 sec
   OP_ID18. GpuBatchNormForwardTraining : 0.000570597 sec
   OP_ID19. Add : 0.000542958 sec
   OP_ID20. ReLU : 0.000917093 sec
   OP_ID21. SetValue : 8.19075e-06 sec
   OP_ID22. cudnnConvForward : 0.000331162 sec
   OP_ID23. GpuBatchNormForwardTraining : 0.000149028 sec
   OP_ID24. ReLU : 9.89948e-05 sec
   OP_ID25. SetValue : 6.336e-06 sec
   OP_ID26. cudnnConvForward : 0.000638993 sec
   OP_ID27. GpuBatchNormForwardTraining : 0.000152312 sec
   OP_ID28. ReLU : 9.89239e-05 sec
   OP_ID29. SetValue : 6.12643e-06 sec
   OP_ID30. cudnnConvForward : 0.000408918 sec
   OP_ID31. GpuBatchNormForwardTraining : 0.00109342 sec
   OP_ID32. Add : 0.000542411 sec
   OP_ID33. ReLU : 0.000372632 sec
   OP_ID34. SetValue : 6.22839e-06 sec
   OP_ID35. cudnnConvForward : 0.000263843 sec
   OP_ID36. GpuBatchNormForwardTraining : 0.000149717 sec
   OP_ID37. ReLU : 0.000245991 sec
   OP_ID38. SetValue : 9.47551e-05 sec
   OP_ID39. cudnnConvForward : 0.00126357 sec
   OP_ID40. GpuBatchNormForwardTraining : 0.000149358 sec
   OP_ID41. ReLU : 9.88433e-05 sec
   OP_ID42. SetValue : 6.02039e-06 sec
   OP_ID43. cudnnConvForward : 0.000403975 sec
   OP_ID44. GpuBatchNormForwardTraining : 0.000618823 sec
   OP_ID45. Add : 0.00155586 sec
   OP_ID46. ReLU : 0.000372686 sec
   OP_ID47. SetValue : 6.06933e-06 sec
   OP_ID48. cudnnConvForward : 0.000499369 sec
   OP_ID49. GpuBatchNormForwardTraining : 0.000293853 sec
   OP_ID50. ReLU : 0.000213278 sec
   OP_ID51. SetValue : 6.02823e-06 sec
   OP_ID52. cudnnConvForward : 0.00238728 sec
   OP_ID53. GpuBatchNormForwardTraining : 8.67325e-05 sec
   OP_ID54. ReLU : 5.25023e-05 sec
   OP_ID55. SetValue : 5.95733e-06 sec
   OP_ID56. cudnnConvForward : 0.000295534 sec
   OP_ID57. GpuBatchNormForwardTraining : 0.000824675 sec
   OP_ID58. SetValue : 5.95921e-06 sec
   OP_ID59. cudnnConvForward : 0.000591925 sec
   OP_ID60. GpuBatchNormForwardTraining : 0.000293985 sec
   OP_ID61. Add : 0.000275258 sec
   OP_ID62. ReLU : 0.000189845 sec
   OP_ID63. SetValue : 5.65113e-06 sec
   OP_ID64. cudnnConvForward : 0.000368874 sec
   OP_ID65. GpuBatchNormForwardTraining : 0.000226366 sec
   OP_ID66. ReLU : 0.000212913 sec
   OP_ID67. SetValue : 0.000102895 sec
   OP_ID68. cudnnConvForward : 0.000735088 sec
   OP_ID69. GpuBatchNormForwardTraining : 8.56194e-05 sec
   OP_ID70. ReLU : 5.24138e-05 sec
   OP_ID71. SetValue : 5.93223e-06 sec
   OP_ID72. cudnnConvForward : 0.000293387 sec
   OP_ID73. GpuBatchNormForwardTraining : 0.000295833 sec
   OP_ID74. Add : 0.000275058 sec
   OP_ID75. ReLU : 0.000192881 sec
   OP_ID76. SetValue : 5.63608e-06 sec
   OP_ID77. cudnnConvForward : 0.00131212 sec
   OP_ID78. GpuBatchNormForwardTraining : 9.09983e-05 sec
   OP_ID79. ReLU : 5.23912e-05 sec
   OP_ID80. SetValue : 5.70196e-06 sec
   OP_ID81. cudnnConvForward : 0.00055266 sec
   OP_ID82. GpuBatchNormForwardTraining : 8.59699e-05 sec
   OP_ID83. ReLU : 5.26049e-05 sec
   OP_ID84. SetValue : 5.91655e-06 sec
   OP_ID85. cudnnConvForward : 0.000294112 sec
   OP_ID86. GpuBatchNormForwardTraining : 0.000296516 sec
   OP_ID87. Add : 0.000466364 sec
   OP_ID88. ReLU : 0.000881861 sec
   OP_ID89. SetValue : 5.73867e-06 sec
   OP_ID90. cudnnConvForward : 0.000401748 sec
   OP_ID91. GpuBatchNormForwardTraining : 8.48119e-05 sec
   OP_ID92. ReLU : 5.25264e-05 sec
   OP_ID93. SetValue : 5.60063e-06 sec
   OP_ID94. cudnnConvForward : 0.000539294 sec
   OP_ID95. GpuBatchNormForwardTraining : 8.27498e-05 sec
   OP_ID96. ReLU : 4.85327e-05 sec
   OP_ID97. SetValue : 3.61223e-06 sec
   OP_ID98. cudnnConvForward : 0.000288547 sec
   OP_ID99. GpuBatchNormForwardTraining : 0.000460903 sec
   OP_ID100. Add : 0.000961548 sec
   OP_ID101. ReLU : 0.000184901 sec
   OP_ID102. SetValue : 3.41835e-06 sec
   OP_ID103. cudnnConvForward : 0.000465582 sec
   OP_ID104. GpuBatchNormForwardTraining : 0.00015375 sec
   OP_ID105. ReLU : 9.39084e-05 sec
   OP_ID106. SetValue : 3.34839e-06 sec
   OP_ID107. cudnnConvForward : 0.000739092 sec
   OP_ID108. GpuBatchNormForwardTraining : 2.95197e-05 sec
   OP_ID109. ReLU : 4.71257e-05 sec
   OP_ID110. SetValue : 3.70886e-06 sec
   OP_ID111. cudnnConvForward : 0.000729521 sec
   OP_ID112. GpuBatchNormForwardTraining : 0.000162648 sec
   OP_ID113. SetValue : 3.66588e-06 sec
   OP_ID114. cudnnConvForward : 0.000552125 sec
   OP_ID115. GpuBatchNormForwardTraining : 0.000110431 sec
   OP_ID116. Add : 0.000136142 sec
   OP_ID117. ReLU : 9.43153e-05 sec
   OP_ID118. SetValue : 3.45192e-06 sec
   OP_ID119. cudnnConvForward : 0.0002493 sec
   OP_ID120. GpuBatchNormForwardTraining : 2.82673e-05 sec
   OP_ID121. ReLU : 4.85173e-05 sec
   OP_ID122. SetValue : 3.32392e-06 sec
   OP_ID123. cudnnConvForward : 0.000632525 sec
   OP_ID124. GpuBatchNormForwardTraining : 5.23012e-05 sec
   OP_ID125. ReLU : 7.41651e-05 sec
   OP_ID126. SetValue : 5.48329e-06 sec
   OP_ID127. cudnnConvForward : 0.000743826 sec
   OP_ID128. GpuBatchNormForwardTraining : 0.000105363 sec
   OP_ID129. Add : 0.000136166 sec
   OP_ID130. ReLU : 9.44919e-05 sec
   OP_ID131. SetValue : 3.40298e-06 sec
   OP_ID132. cudnnConvForward : 0.000246384 sec
   OP_ID133. GpuBatchNormForwardTraining : 2.92612e-05 sec
   OP_ID134. ReLU : 2.42613e-05 sec
   OP_ID135. SetValue : 3.424e-06 sec
   OP_ID136. cudnnConvForward : 0.000479882 sec
   OP_ID137. GpuBatchNormForwardTraining : 2.7339e-05 sec
   OP_ID138. ReLU : 2.45948e-05 sec
   OP_ID139. SetValue : 3.69192e-06 sec
   OP_ID140. cudnnConvForward : 0.0002578 sec
   OP_ID141. GpuBatchNormForwardTraining : 0.000152226 sec
   OP_ID142. Add : 0.000349039 sec
   OP_ID143. ReLU : 0.000539722 sec
   OP_ID144. SetValue : 2.51548e-05 sec
   OP_ID145. cudnnConvForward : 0.000463997 sec
   OP_ID146. GpuBatchNormForwardTraining : 2.93804e-05 sec
   OP_ID147. ReLU : 2.43536e-05 sec
   OP_ID148. SetValue : 3.42055e-06 sec
   OP_ID149. cudnnConvForward : 0.000458101 sec
   OP_ID150. GpuBatchNormForwardTraining : 2.76612e-05 sec
   OP_ID151. ReLU : 2.42648e-05 sec
   OP_ID152. SetValue : 3.73365e-06 sec
   OP_ID153. cudnnConvForward : 0.00025759 sec
   OP_ID154. GpuBatchNormForwardTraining : 0.000107765 sec
   OP_ID155. Add : 0.000136457 sec
   OP_ID156. ReLU : 9.44404e-05 sec
   OP_ID157. SetValue : 3.3898e-06 sec
   OP_ID158. cudnnConvForward : 0.000294264 sec
   OP_ID159. GpuBatchNormForwardTraining : 5.24452e-05 sec
   OP_ID160. ReLU : 6.85327e-05 sec
   OP_ID161. SetValue : 3.36659e-06 sec
   OP_ID162. cudnnConvForward : 0.00115037 sec
   OP_ID163. GpuBatchNormForwardTraining : 2.7813e-05 sec
   OP_ID164. ReLU : 2.43812e-05 sec
   OP_ID165. SetValue : 3.75749e-06 sec
   OP_ID166. cudnnConvForward : 0.000251883 sec
   OP_ID167. GpuBatchNormForwardTraining : 0.000106659 sec
   OP_ID168. Add : 0.000136356 sec
   OP_ID169. ReLU : 9.38396e-05 sec
   OP_ID170. SetValue : 3.40141e-06 sec
   OP_ID171. cudnnConvForward : 0.000274543 sec
   OP_ID172. GpuBatchNormForwardTraining : 2.90845e-05 sec
   OP_ID173. ReLU : 2.43893e-05 sec
   OP_ID174. SetValue : 3.75749e-06 sec
   OP_ID175. cudnnConvForward : 0.000489212 sec
   OP_ID176. GpuBatchNormForwardTraining : 2.77186e-05 sec
   OP_ID177. ReLU : 7.05308e-05 sec
   OP_ID178. SetValue : 3.70164e-06 sec
   OP_ID179. cudnnConvForward : 0.000874302 sec
   OP_ID180. GpuBatchNormForwardTraining : 0.000200876 sec
   OP_ID181. Add : 0.00015361 sec
   OP_ID182. ReLU : 9.39011e-05 sec
   OP_ID183. SetValue : 3.61255e-06 sec
   OP_ID184. cudnnConvForward : 0.000500288 sec
   OP_ID185. GpuBatchNormForwardTraining : 5.63156e-05 sec
   OP_ID186. ReLU : 7.19357e-05 sec
   OP_ID187. SetValue : 3.56361e-06 sec
   OP_ID188. cudnnConvForward : 0.000834152 sec
   OP_ID189. GpuBatchNormForwardTraining : 1.92238e-05 sec
   OP_ID190. ReLU : 1.08248e-05 sec
   OP_ID191. SetValue : 2.61512e-05 sec
   OP_ID192. cudnnConvForward : 0.000762623 sec
   OP_ID193. GpuBatchNormForwardTraining : 0.000107159 sec
   OP_ID194. SetValue : 5.97678e-06 sec
   OP_ID195. cudnnConvForward : 0.000745796 sec
   OP_ID196. GpuBatchNormForwardTraining : 6.61951e-05 sec
   OP_ID197. Add : 6.96229e-05 sec
   OP_ID198. ReLU : 4.85587e-05 sec
   OP_ID199. SetValue : 3.71263e-06 sec
   OP_ID200. cudnnConvForward : 0.000355318 sec
   OP_ID201. GpuBatchNormForwardTraining : 1.65788e-05 sec
   OP_ID202. ReLU : 1.09208e-05 sec
   OP_ID203. SetValue : 3.59027e-06 sec
   OP_ID204. cudnnConvForward : 0.00119127 sec
   OP_ID205. GpuBatchNormForwardTraining : 3.92022e-05 sec
   OP_ID206. ReLU : 1.11642e-05 sec
   OP_ID207. SetValue : 1.82005e-05 sec
   OP_ID208. cudnnConvForward : 0.000356234 sec
   OP_ID209. GpuBatchNormForwardTraining : 6.75047e-05 sec
   OP_ID210. Add : 7.42124e-05 sec
   OP_ID211. ReLU : 4.8645e-05 sec
   OP_ID212. SetValue : 3.59372e-06 sec
   OP_ID213. cudnnConvForward : 0.000332925 sec
   OP_ID214. GpuBatchNormForwardTraining : 1.66958e-05 sec
   OP_ID215. ReLU : 1.10679e-05 sec
   OP_ID216. SetValue : 3.56204e-06 sec
   OP_ID217. cudnnConvForward : 0.000758647 sec
   OP_ID218. GpuBatchNormForwardTraining : 4.04969e-05 sec
   OP_ID219. ReLU : 1.07827e-05 sec
   OP_ID220. SetValue : 4.21364e-06 sec
   OP_ID221. cudnnConvForward : 0.000565189 sec
   OP_ID222. GpuBatchNormForwardTraining : 0.000248861 sec
   OP_ID223. Add : 0.000280032 sec
   OP_ID224. ReLU : 9.83251e-05 sec
   OP_ID225. GpuPoolingForward : 7.88555e-05 sec
   OP_ID226. GEMM : 3.08941e-05 sec
   OP_ID227. SetValue : 3.54478e-06 sec
   OP_ID228. GEMM : 1.05518e-05 sec
   OP_ID230. SoftMax : 1.22017e-05 sec
   OP_ID231. ComputeCrossEntropy : 6.26635e-06 sec
   OP_ID232. SetValue : 5.88768e-06 sec
   OP_ID233. SumAll : 1.73609e-05 sec
   OP_ID234. Div : 5.07168e-06 sec
   OP_ID236. CopyDataToFrom : 1.16828e-05 sec
   OP_ID237. SoftmaxCrossEntropyBackward : 6.73725e-06 sec
   OP_ID238. Div : 5.25459e-06 sec
   OP_ID239. SetValue : 7.44753e-06 sec
   OP_ID240. GEMV : 8.38526e-06 sec
   OP_ID241. Dist_c1c1:fusedSynch_Filling : 2.0699e-05 sec
   OP_ID242. GEMM : 4.15291e-05 sec
   OP_ID243. GEMM : 2.6011e-05 sec
   OP_ID244. Dist_c1c1:fusedSynch_Filling : 0.000417212 sec
   OP_ID245. GpuPoolingBackward : 0.000276932 sec
   OP_ID246. ReLUBackward : 8.68697e-05 sec
   OP_ID247. GpuBatchNormBackward : 8.28863e-05 sec
   OP_ID248. Dist_c1c1:fusedSynch_Filling : 9.19595e-05 sec
   OP_ID249. Dist_c1c1:fusedSynch_Filling : 6.97129e-06 sec
   OP_ID250. cudnnConvolutionBackwardData : 0.000336988 sec
   OP_ID251. cudnnConvolutionBackwardFilter : 0.000513113 sec
   OP_ID252. Dist_c1c1:fusedSynch_Filling : 0.000901595 sec
   OP_ID253. Dist_sc1:fusedSynch_Transfer : 0.00144644 sec
   OP_ID254. ReLUBackward : 2.46146e-05 sec
   OP_ID255. GpuBatchNormBackward : 3.61901e-05 sec
   OP_ID256. Dist_c1c1:fusedSynch_Filling : 6.07843e-06 sec
   OP_ID257. Dist_c1c1:fusedSynch_Filling : 5.96016e-06 sec
   OP_ID258. cudnnConvolutionBackwardData : 0.00139958 sec
   OP_ID259. cudnnConvolutionBackwardFilter : 0.000735867 sec
   OP_ID260. synch : 2.92894e-06 sec
   OP_ID261. ReLUBackward : 2.51454e-05 sec
   OP_ID262. GpuBatchNormBackward : 3.18425e-05 sec
   OP_ID263. Dist_c1c1:fusedSynch_Filling : 0.00208151 sec
   OP_ID264. Dist_c1c1:fusedSynch_Filling : 8.05051e-06 sec
   OP_ID265. cudnnConvolutionBackwardData : 0.000539205 sec
   OP_ID266. cudnnConvolutionBackwardFilter : 0.000713407 sec
   OP_ID267. Add : 0.000107023 sec
   OP_ID268. Dist_c1c1:fusedSynch_Filling : 0.00160548 sec
   OP_ID269. ReLUBackward : 0.000240142 sec
   OP_ID270. GpuBatchNormBackward : 0.000123936 sec
   OP_ID271. Dist_c1c1:fusedSynch_Filling : 0.000117176 sec
   OP_ID272. Dist_c1c1:fusedSynch_Filling : 8.46996e-06 sec
   OP_ID273. cudnnConvolutionBackwardData : 0.000339708 sec
   OP_ID274. cudnnConvolutionBackwardFilter : 0.000588619 sec
   OP_ID275. Dist_c1c1:fusedSynch_Filling : 0.000950344 sec
   OP_ID276. Dist_sc1:fusedSynch_Transfer : 0.00293659 sec
   OP_ID277. ReLUBackward : 2.50845e-05 sec
   OP_ID278. GpuBatchNormBackward : 6.21136e-05 sec
   OP_ID279. Dist_c1c1:fusedSynch_Filling : 6.53208e-06 sec
   OP_ID280. Dist_c1c1:fusedSynch_Filling : 5.91843e-06 sec
   OP_ID281. cudnnConvolutionBackwardData : 0.00109241 sec
   OP_ID282. cudnnConvolutionBackwardFilter : 0.00104219 sec
   OP_ID283. synch : 2.94682e-06 sec
   OP_ID284. ReLUBackward : 2.53719e-05 sec
   OP_ID285. GpuBatchNormBackward : 5.95413e-05 sec
   OP_ID286. Dist_c1c1:fusedSynch_Filling : 0.00211412 sec
   OP_ID287. Dist_c1c1:fusedSynch_Filling : 7.99059e-06 sec
   OP_ID288. cudnnConvolutionBackwardData : 0.00038087 sec
   OP_ID289. cudnnConvolutionBackwardFilter : 0.00107717 sec
   OP_ID290. Add : 0.000145052 sec
   OP_ID291. Dist_c1c1:fusedSynch_Filling : 0.00167427 sec
   OP_ID292. ReLUBackward : 7.44816e-05 sec
   OP_ID293. GpuBatchNormBackward : 9.21766e-05 sec
   OP_ID294. Dist_c1c1:fusedSynch_Filling : 0.000136159 sec
   OP_ID295. Dist_c1c1:fusedSynch_Filling : 0.000850771 sec
   OP_ID296. cudnnConvolutionBackwardData : 0.00156798 sec
   OP_ID297. cudnnConvolutionBackwardFilter : 0.00193751 sec
   OP_ID298. Dist_c1c1:fusedSynch_Filling : 0.00281473 sec
   OP_ID299. Dist_sc1:fusedSynch_Transfer : 0.00593443 sec
   OP_ID300. GpuBatchNormBackward : 0.000179956 sec
   OP_ID301. Dist_c1c1:fusedSynch_Filling : 6.74886e-06 sec
   OP_ID302. Dist_c1c1:fusedSynch_Filling : 6.54902e-06 sec
   OP_ID303. cudnnConvolutionBackwardData : 0.000548626 sec
   OP_ID304. cudnnConvolutionBackwardFilter : 0.000554376 sec
   OP_ID305. Dist_c1c1:fusedSynch_Filling : 0.00075892 sec
   OP_ID306. ReLUBackward : 2.62823e-05 sec
   OP_ID307. GpuBatchNormBackward : 3.38306e-05 sec
   OP_ID308. Dist_c1c1:fusedSynch_Filling : 5.50158e-05 sec
   OP_ID309. Dist_c1c1:fusedSynch_Filling : 0.00138753 sec
   OP_ID310. cudnnConvolutionBackwardData : 0.00163799 sec
   OP_ID311. cudnnConvolutionBackwardFilter : 0.000739837 sec
   OP_ID312. synch : 2.83106e-06 sec
   OP_ID313. ReLUBackward : 7.14124e-05 sec
   OP_ID314. GpuBatchNormBackward : 0.000117993 sec
   OP_ID315. Dist_c1c1:fusedSynch_Filling : 0.00119169 sec
   OP_ID316. Dist_c1c1:fusedSynch_Filling : 8.44015e-06 sec
   OP_ID317. cudnnConvolutionBackwardData : 0.00083556 sec
   OP_ID318. cudnnConvolutionBackwardFilter : 0.00097282 sec
   OP_ID319. Add : 0.000143578 sec
   OP_ID320. Dist_c1c1:fusedSynch_Filling : 0.00206324 sec
   OP_ID321. ReLUBackward : 0.00014268 sec
   OP_ID322. GpuBatchNormBackward : 0.000246442 sec
   OP_ID323. Dist_c1c1:fusedSynch_Filling : 0.000252399 sec
   OP_ID324. Dist_c1c1:fusedSynch_Filling : 8.23717e-06 sec
   OP_ID325. cudnnConvolutionBackwardData : 0.000380665 sec
   OP_ID326. cudnnConvolutionBackwardFilter : 0.000741035 sec
   OP_ID327. Dist_c1c1:fusedSynch_Filling : 0.00114966 sec
   OP_ID328. ReLUBackward : 4.04822e-05 sec
   OP_ID329. GpuBatchNormBackward : 6.46519e-05 sec
   OP_ID330. Dist_c1c1:fusedSynch_Filling : 7.90796e-05 sec
   OP_ID331. Dist_c1c1:fusedSynch_Filling : 0.000134841 sec
   OP_ID332. cudnnConvolutionBackwardData : 0.000631252 sec
   OP_ID333. cudnnConvolutionBackwardFilter : 0.000588826 sec
   OP_ID334. Dist_c1c1:fusedSynch_Filling : 0.00112109 sec
   OP_ID335. Dist_sc1:fusedSynch_Transfer : 0.00624962 sec
   OP_ID336. ReLUBackward : 4.24386e-05 sec
   OP_ID337. GpuBatchNormBackward : 6.823e-05 sec
   OP_ID338. Dist_c1c1:fusedSynch_Filling : 4.74673e-05 sec
   OP_ID339. Dist_c1c1:fusedSynch_Filling : 6.89747e-05 sec
   OP_ID340. cudnnConvolutionBackwardData : 0.000413136 sec
   OP_ID341. cudnnConvolutionBackwardFilter : 0.00072062 sec
   OP_ID342. Add : 0.000256353 sec
   OP_ID343. Dist_c1c1:fusedSynch_Filling : 0.000370799 sec
   OP_ID344. ReLUBackward : 0.000199168 sec
   OP_ID345. GpuBatchNormBackward : 0.000398816 sec
   OP_ID346. Dist_c1c1:fusedSynch_Filling : 0.00041531 sec
   OP_ID347. Dist_c1c1:fusedSynch_Filling : 3.14074e-05 sec
   OP_ID348. cudnnConvolutionBackwardData : 0.00047725 sec
   OP_ID349. cudnnConvolutionBackwardFilter : 0.000463326 sec
   OP_ID350. Dist_c1c1:fusedSynch_Filling : 0.000921659 sec
   OP_ID351. ReLUBackward : 3.92091e-05 sec
   OP_ID352. GpuBatchNormBackward : 0.000134884 sec
   OP_ID353. Dist_c1c1:fusedSynch_Filling : 0.000149363 sec
   OP_ID354. Dist_c1c1:fusedSynch_Filling : 0.00042736 sec
   OP_ID355. cudnnConvolutionBackwardData : 0.000975802 sec
   OP_ID356. cudnnConvolutionBackwardFilter : 0.00101968 sec
   OP_ID357. Dist_c1c1:fusedSynch_Filling : 0.00165023 sec
   OP_ID358. ReLUBackward : 9.00837e-05 sec
   OP_ID359. GpuBatchNormBackward : 9.20185e-05 sec
   OP_ID360. Dist_c1c1:fusedSynch_Filling : 9.6102e-05 sec
   OP_ID361. Dist_c1c1:fusedSynch_Filling : 8.47843e-06 sec
   OP_ID362. cudnnConvolutionBackwardData : 0.00041172 sec
   OP_ID363. cudnnConvolutionBackwardFilter : 0.000460769 sec
   OP_ID364. Add : 0.000189111 sec
   OP_ID365. Dist_c1c1:fusedSynch_Filling : 0.00121564 sec
   OP_ID366. ReLUBackward : 0.000142642 sec
   OP_ID367. GpuBatchNormBackward : 0.000357559 sec
   OP_ID368. Dist_c1c1:fusedSynch_Filling : 0.000386966 sec
   OP_ID369. Dist_c1c1:fusedSynch_Filling : 9.84188e-06 sec
   OP_ID370. cudnnConvolutionBackwardData : 0.000566859 sec
   OP_ID371. cudnnConvolutionBackwardFilter : 0.000575604 sec
   OP_ID372. Dist_c1c1:fusedSynch_Filling : 0.00114785 sec
   OP_ID373. ReLUBackward : 4.21666e-05 sec
   OP_ID374. GpuBatchNormBackward : 6.36976e-05 sec
   OP_ID375. Dist_c1c1:fusedSynch_Filling : 7.89186e-05 sec
   OP_ID376. Dist_c1c1:fusedSynch_Filling : 0.000211579 sec
   OP_ID377. cudnnConvolutionBackwardData : 0.000529962 sec
   OP_ID378. cudnnConvolutionBackwardFilter : 0.000613635 sec
   OP_ID379. Dist_c1c1:fusedSynch_Filling : 0.000975731 sec
   OP_ID380. Dist_sc1:fusedSynch_Transfer : 0.0105351 sec
   OP_ID381. ReLUBackward : 5.40119e-05 sec
   OP_ID382. GpuBatchNormBackward : 0.000123302 sec
   OP_ID383. Dist_c1c1:fusedSynch_Filling : 0.000156116 sec
   OP_ID384. Dist_c1c1:fusedSynch_Filling : 0.000108925 sec
   OP_ID385. cudnnConvolutionBackwardData : 0.000547811 sec
   OP_ID386. cudnnConvolutionBackwardFilter : 0.000481152 sec
   OP_ID387. Add : 0.000167894 sec
   OP_ID388. Dist_c1c1:fusedSynch_Filling : 0.000176278 sec
   OP_ID389. ReLUBackward : 0.000189876 sec
   OP_ID390. GpuBatchNormBackward : 0.000227941 sec
   OP_ID391. Dist_c1c1:fusedSynch_Filling : 0.000128344 sec
   OP_ID392. Dist_c1c1:fusedSynch_Filling : 8.59953e-06 sec
   OP_ID393. cudnnConvolutionBackwardData : 0.000345485 sec
   OP_ID394. cudnnConvolutionBackwardFilter : 0.000486787 sec
   OP_ID395. Dist_c1c1:fusedSynch_Filling : 0.000965114 sec
   OP_ID396. ReLUBackward : 0.000145278 sec
   OP_ID397. GpuBatchNormBackward : 0.000112226 sec
   OP_ID398. Dist_c1c1:fusedSynch_Filling : 0.000148055 sec
   OP_ID399. Dist_c1c1:fusedSynch_Filling : 0.000210897 sec
   OP_ID400. cudnnConvolutionBackwardData : 0.000686638 sec
   OP_ID401. cudnnConvolutionBackwardFilter : 0.000620595 sec
   OP_ID402. Dist_c1c1:fusedSynch_Filling : 0.00112662 sec
   OP_ID403. ReLUBackward : 4.06287e-05 sec
   OP_ID404. GpuBatchNormBackward : 6.28643e-05 sec
   OP_ID405. Dist_c1c1:fusedSynch_Filling : 6.75056e-05 sec
   OP_ID406. Dist_c1c1:fusedSynch_Filling : 8.43514e-06 sec
   OP_ID407. cudnnConvolutionBackwardData : 0.000343444 sec
   OP_ID408. cudnnConvolutionBackwardFilter : 0.000480312 sec
   OP_ID409. Add : 0.000311576 sec
   OP_ID410. Dist_c1c1:fusedSynch_Filling : 0.0013929 sec
   OP_ID411. ReLUBackward : 0.000268452 sec
   OP_ID412. GpuBatchNormBackward : 0.000360346 sec
   OP_ID413. Dist_c1c1:fusedSynch_Filling : 0.00036786 sec
   OP_ID414. Dist_c1c1:fusedSynch_Filling : 8.19388e-06 sec
   OP_ID415. cudnnConvolutionBackwardData : 0.000544672 sec
   OP_ID416. cudnnConvolutionBackwardFilter : 0.000524391 sec
   OP_ID417. Dist_c1c1:fusedSynch_Filling : 0.00109717 sec
   OP_ID418. ReLUBackward : 4.04932e-05 sec
   OP_ID419. GpuBatchNormBackward : 6.24439e-05 sec
   OP_ID420. Dist_c1c1:fusedSynch_Filling : 7.83178e-05 sec
   OP_ID421. Dist_c1c1:fusedSynch_Filling : 0.000375623 sec
   OP_ID422. cudnnConvolutionBackwardData : 0.000601101 sec
   OP_ID423. cudnnConvolutionBackwardFilter : 0.00100906 sec
   OP_ID424. Dist_c1c1:fusedSynch_Filling : 0.00126519 sec
   OP_ID425. Dist_sc1:fusedSynch_Transfer : 0.0101198 sec
   OP_ID426. ReLUBackward : 4.10024e-05 sec
   OP_ID427. GpuBatchNormBackward : 8.5408e-05 sec
   OP_ID428. Dist_c1c1:fusedSynch_Filling : 0.000180643 sec
   OP_ID429. Dist_c1c1:fusedSynch_Filling : 0.000117019 sec
   OP_ID430. cudnnConvolutionBackwardData : 0.000543283 sec
   OP_ID431. cudnnConvolutionBackwardFilter : 0.000633979 sec
   OP_ID432. Add : 0.000142057 sec
   OP_ID433. Dist_c1c1:fusedSynch_Filling : 0.000228257 sec
   OP_ID434. ReLUBackward : 0.000191387 sec
   OP_ID435. GpuBatchNormBackward : 0.000325436 sec
   OP_ID436. Dist_c1c1:fusedSynch_Filling : 0.000301465 sec
   OP_ID437. Dist_c1c1:fusedSynch_Filling : 0.000582661 sec
   OP_ID438. cudnnConvolutionBackwardData : 0.00156517 sec
   OP_ID439. cudnnConvolutionBackwardFilter : 0.00091854 sec
   OP_ID440. Dist_c1c1:fusedSynch_Filling : 0.00194469 sec
   OP_ID441. GpuBatchNormBackward : 0.000287492 sec
   OP_ID442. Dist_c1c1:fusedSynch_Filling : 1.22773e-05 sec
   OP_ID443. Dist_c1c1:fusedSynch_Filling : 1.05744e-05 sec
   OP_ID444. cudnnConvolutionBackwardData : 0.000509951 sec
   OP_ID445. cudnnConvolutionBackwardFilter : 0.000659549 sec
   OP_ID446. Dist_c1c1:fusedSynch_Filling : 0.00121603 sec
   OP_ID447. ReLUBackward : 6.40806e-05 sec
   OP_ID448. GpuBatchNormBackward : 6.27128e-05 sec
   OP_ID449. Dist_c1c1:fusedSynch_Filling : 0.000114209 sec
   OP_ID450. Dist_c1c1:fusedSynch_Filling : 0.000922965 sec
   OP_ID451. cudnnConvolutionBackwardData : 0.00126743 sec
   OP_ID452. cudnnConvolutionBackwardFilter : 0.00117937 sec
   OP_ID453. Dist_c1c1:fusedSynch_Filling : 0.00167664 sec
   OP_ID454. ReLUBackward : 0.000197352 sec
   OP_ID455. GpuBatchNormBackward : 0.000287667 sec
   OP_ID456. Dist_c1c1:fusedSynch_Filling : 0.000291127 sec
   OP_ID457. Dist_c1c1:fusedSynch_Filling : 8.72188e-06 sec
   OP_ID458. cudnnConvolutionBackwardData : 0.000609728 sec
   OP_ID459. cudnnConvolutionBackwardFilter : 0.00103231 sec
   OP_ID460. Add : 0.000375005 sec
   OP_ID461. Dist_c1c1:fusedSynch_Filling : 0.00245611 sec
   OP_ID462. ReLUBackward : 0.000449388 sec
   OP_ID463. GpuBatchNormBackward : 0.000682592 sec
   OP_ID464. Dist_c1c1:fusedSynch_Filling : 0.000688698 sec
   OP_ID465. Dist_c1c1:fusedSynch_Filling : 8.71153e-06 sec
   OP_ID466. cudnnConvolutionBackwardData : 0.00032988 sec
   OP_ID467. cudnnConvolutionBackwardFilter : 0.000582202 sec
   OP_ID468. Dist_c1c1:fusedSynch_Filling : 0.00104164 sec
   OP_ID469. ReLUBackward : 0.0001444 sec
   OP_ID470. GpuBatchNormBackward : 0.000172296 sec
   OP_ID471. Dist_c1c1:fusedSynch_Filling : 0.000496346 sec
   OP_ID472. Dist_c1c1:fusedSynch_Filling : 9.12891e-05 sec
   OP_ID473. cudnnConvolutionBackwardData : 0.000920184 sec
   OP_ID474. cudnnConvolutionBackwardFilter : 0.000940664 sec
   OP_ID475. Dist_c1c1:fusedSynch_Filling : 0.00157309 sec
   OP_ID476. ReLUBackward : 0.000126831 sec
   OP_ID477. GpuBatchNormBackward : 0.000148184 sec
   OP_ID478. Dist_c1c1:fusedSynch_Filling : 0.000156464 sec
   OP_ID479. Dist_c1c1:fusedSynch_Filling : 8.65412e-06 sec
   OP_ID480. cudnnConvolutionBackwardData : 0.000427521 sec
   OP_ID481. cudnnConvolutionBackwardFilter : 0.000855174 sec
   OP_ID482. Add : 0.000373829 sec
   OP_ID483. Dist_c1c1:fusedSynch_Filling : 0.00206805 sec
   OP_ID484. ReLUBackward : 0.000423871 sec
   OP_ID485. GpuBatchNormBackward : 0.000560101 sec
   OP_ID486. Dist_c1c1:fusedSynch_Filling : 0.000568155 sec
   OP_ID487. Dist_c1c1:fusedSynch_Filling : 3.11429e-05 sec
   OP_ID488. cudnnConvolutionBackwardData : 0.000387184 sec
   OP_ID489. cudnnConvolutionBackwardFilter : 0.000718611 sec
   OP_ID490. Dist_c1c1:fusedSynch_Filling : 0.00125309 sec
   OP_ID491. Dist_sc1:fusedSynch_Transfer : 0.0217816 sec
   OP_ID492. ReLUBackward : 0.000161743 sec
   OP_ID493. GpuBatchNormBackward : 0.000250255 sec
   OP_ID494. Dist_c1c1:fusedSynch_Filling : 3.50748e-05 sec
   OP_ID495. Dist_c1c1:fusedSynch_Filling : 1.02638e-05 sec
   OP_ID496. cudnnConvolutionBackwardData : 0.000735119 sec
   OP_ID497. cudnnConvolutionBackwardFilter : 0.000801258 sec
   OP_ID498. Dist_c1c1:fusedSynch_Filling : 3.4352e-05 sec
   OP_ID499. ReLUBackward : 7.55345e-05 sec
   OP_ID500. GpuBatchNormBackward : 0.000190799 sec
   OP_ID501. Dist_c1c1:fusedSynch_Filling : 0.000112515 sec
   OP_ID502. Dist_c1c1:fusedSynch_Filling : 0.00021745 sec
   OP_ID503. cudnnConvolutionBackwardData : 0.000423726 sec
   OP_ID504. cudnnConvolutionBackwardFilter : 0.000856039 sec
   OP_ID505. Add : 0.000333065 sec
   OP_ID506. Dist_c1c1:fusedSynch_Filling : 0.00103454 sec
   OP_ID507. ReLUBackward : 0.000281486 sec
   OP_ID508. GpuBatchNormBackward : 0.000580162 sec
   OP_ID509. Dist_c1c1:fusedSynch_Filling : 0.000589751 sec
   OP_ID510. Dist_c1c1:fusedSynch_Filling : 8.68267e-06 sec
   OP_ID511. cudnnConvolutionBackwardData : 0.000450668 sec
   OP_ID512. cudnnConvolutionBackwardFilter : 0.000676487 sec
   OP_ID513. Dist_c1c1:fusedSynch_Filling : 0.00121196 sec
   OP_ID514. ReLUBackward : 9.93365e-05 sec
   OP_ID515. GpuBatchNormBackward : 0.000120631 sec
   OP_ID516. Dist_c1c1:fusedSynch_Filling : 0.000407734 sec
   OP_ID517. Dist_c1c1:fusedSynch_Filling : 8.75369e-05 sec
   OP_ID518. cudnnConvolutionBackwardData : 0.000719416 sec
   OP_ID519. cudnnConvolutionBackwardFilter : 0.000893315 sec
   OP_ID520. Dist_c1c1:fusedSynch_Filling : 0.00137058 sec
   OP_ID521. ReLUBackward : 8.66111e-05 sec
   OP_ID522. GpuBatchNormBackward : 0.000226898 sec
   OP_ID523. Dist_c1c1:fusedSynch_Filling : 0.000191338 sec
   OP_ID524. Dist_c1c1:fusedSynch_Filling : 5.27084e-05 sec
   OP_ID525. cudnnConvolutionBackwardData : 0.000543402 sec
   OP_ID526. cudnnConvolutionBackwardFilter : 0.000831682 sec
   OP_ID527. Add : 0.00032223 sec
   OP_ID528. Dist_c1c1:fusedSynch_Filling : 0.00199479 sec
   OP_ID529. ReLUBackward : 0.000356539 sec
   OP_ID530. GpuBatchNormBackward : 0.000734911 sec
   OP_ID531. Dist_c1c1:fusedSynch_Filling : 0.00104239 sec
   OP_ID532. Dist_c1c1:fusedSynch_Filling : 0.00089491 sec
   OP_ID533. cudnnConvolutionBackwardData : 0.00155672 sec
   OP_ID534. cudnnConvolutionBackwardFilter : 0.00120545 sec
   OP_ID535. Dist_c1c1:fusedSynch_Filling : 0.00229868 sec
   OP_ID536. GpuBatchNormBackward : 0.000855846 sec
   OP_ID537. Dist_c1c1:fusedSynch_Filling : 0.000105602 sec
   OP_ID538. Dist_c1c1:fusedSynch_Filling : 5.36753e-05 sec
   OP_ID539. cudnnConvolutionBackwardData : 0.000392553 sec
   OP_ID540. cudnnConvolutionBackwardFilter : 0.000599381 sec
   OP_ID541. Dist_c1c1:fusedSynch_Filling : 0.0010769 sec
   OP_ID542. ReLUBackward : 0.000144799 sec
   OP_ID543. GpuBatchNormBackward : 0.000147809 sec
   OP_ID544. Dist_c1c1:fusedSynch_Filling : 0.000379803 sec
   OP_ID545. Dist_c1c1:fusedSynch_Filling : 0.000168204 sec
   OP_ID546. cudnnConvolutionBackwardData : 0.00214344 sec
   OP_ID547. cudnnConvolutionBackwardFilter : 0.000917329 sec
   OP_ID548. Dist_c1c1:fusedSynch_Filling : 0.00303214 sec
   OP_ID549. ReLUBackward : 0.00037046 sec
   OP_ID550. GpuBatchNormBackward : 0.00065897 sec
   OP_ID551. Dist_c1c1:fusedSynch_Filling : 0.000666668 sec
   OP_ID552. Dist_c1c1:fusedSynch_Filling : 8.272e-06 sec
   OP_ID553. cudnnConvolutionBackwardData : 0.000680572 sec
   OP_ID554. cudnnConvolutionBackwardFilter : 0.0017049 sec
   OP_ID555. Add : 0.000755777 sec
   OP_ID556. Dist_c1c1:fusedSynch_Filling : 0.00379687 sec
   OP_ID557. ReLUBackward : 0.000669526 sec
   OP_ID558. GpuBatchNormBackward : 0.00141841 sec
   OP_ID559. Dist_c1c1:fusedSynch_Filling : 0.00142914 sec
   OP_ID560. Dist_c1c1:fusedSynch_Filling : 8.55906e-06 sec
   OP_ID561. cudnnConvolutionBackwardData : 0.000473758 sec
   OP_ID562. cudnnConvolutionBackwardFilter : 0.00116066 sec
   OP_ID563. Dist_c1c1:fusedSynch_Filling : 0.00182434 sec
   OP_ID564. ReLUBackward : 0.000206452 sec
   OP_ID565. GpuBatchNormBackward : 0.000302718 sec
   OP_ID566. Dist_c1c1:fusedSynch_Filling : 0.000369426 sec
   OP_ID567. Dist_c1c1:fusedSynch_Filling : 9.34522e-05 sec
   OP_ID568. cudnnConvolutionBackwardData : 0.00108618 sec
   OP_ID569. cudnnConvolutionBackwardFilter : 0.00133329 sec
   OP_ID570. Dist_c1c1:fusedSynch_Filling : 0.00244867 sec
   OP_ID571. ReLUBackward : 0.0001878 sec
   OP_ID572. GpuBatchNormBackward : 0.000285802 sec
   OP_ID573. Dist_c1c1:fusedSynch_Filling : 0.000297231 sec
   OP_ID574. Dist_c1c1:fusedSynch_Filling : 3.09352e-05 sec
   OP_ID575. cudnnConvolutionBackwardData : 0.000588568 sec
   OP_ID576. cudnnConvolutionBackwardFilter : 0.00143151 sec
   OP_ID577. Add : 0.00063398 sec
   OP_ID578. Dist_c1c1:fusedSynch_Filling : 0.00354693 sec
   OP_ID579. ReLUBackward : 0.000931225 sec
   OP_ID580. GpuBatchNormBackward : 0.00121922 sec
   OP_ID581. Dist_c1c1:fusedSynch_Filling : 0.00123103 sec
   OP_ID582. Dist_c1c1:fusedSynch_Filling : 8.27294e-06 sec
   OP_ID583. cudnnConvolutionBackwardData : 0.000389641 sec
   OP_ID584. cudnnConvolutionBackwardFilter : 0.00126543 sec
   OP_ID585. Dist_c1c1:fusedSynch_Filling : 0.00183042 sec
   OP_ID586. ReLUBackward : 0.000191214 sec
   OP_ID587. GpuBatchNormBackward : 0.000390581 sec
   OP_ID588. Dist_c1c1:fusedSynch_Filling : 0.000452165 sec
   OP_ID589. Dist_c1c1:fusedSynch_Filling : 9.33726e-05 sec
   OP_ID590. cudnnConvolutionBackwardData : 0.000740172 sec
   OP_ID591. cudnnConvolutionBackwardFilter : 0.00129798 sec
   OP_ID592. Dist_c1c1:fusedSynch_Filling : 0.00206863 sec
   OP_ID593. ReLUBackward : 0.000183125 sec
   OP_ID594. GpuBatchNormBackward : 0.00032698 sec
   OP_ID595. Dist_c1c1:fusedSynch_Filling : 0.000339722 sec
   OP_ID596. Dist_c1c1:fusedSynch_Filling : 8.25851e-06 sec
   OP_ID597. cudnnConvolutionBackwardData : 0.000475595 sec
   OP_ID598. cudnnConvolutionBackwardFilter : 0.00140305 sec
   OP_ID599. Add : 0.00076655 sec
   OP_ID600. Dist_c1c1:fusedSynch_Filling : 0.00336851 sec
   OP_ID601. ReLUBackward : 0.000739154 sec
   OP_ID602. GpuBatchNormBackward : 0.00148457 sec
   OP_ID603. Dist_c1c1:fusedSynch_Filling : 0.00151882 sec
   OP_ID604. Dist_c1c1:fusedSynch_Filling : 8.51294e-06 sec
   OP_ID605. cudnnConvolutionBackwardData : 0.000480648 sec
   OP_ID606. cudnnConvolutionBackwardFilter : 0.00113185 sec
   OP_ID607. Dist_c1c1:fusedSynch_Filling : 0.00235924 sec
   OP_ID608. GpuBatchNormBackward : 0.00144066 sec
   OP_ID609. Dist_c1c1:fusedSynch_Filling : 0.00066504 sec
   OP_ID610. Dist_c1c1:fusedSynch_Filling : 8.43451e-06 sec
   OP_ID611. cudnnConvolutionBackwardData : 0.000470185 sec
   OP_ID612. cudnnConvolutionBackwardFilter : 0.00111511 sec
   OP_ID613. Dist_c1c1:fusedSynch_Filling : 0.00174279 sec
   OP_ID614. ReLUBackward : 0.000175645 sec
   OP_ID615. GpuBatchNormBackward : 0.000366169 sec
   OP_ID616. Dist_c1c1:fusedSynch_Filling : 0.00044987 sec
   OP_ID617. Dist_c1c1:fusedSynch_Filling : 7.79555e-05 sec
   OP_ID618. cudnnConvolutionBackwardData : 0.000944252 sec
   OP_ID619. cudnnConvolutionBackwardFilter : 0.00122981 sec
   OP_ID620. Dist_c1c1:fusedSynch_Filling : 0.00217873 sec
   OP_ID621. ReLUBackward : 0.000164283 sec
   OP_ID622. GpuBatchNormBackward : 0.000311484 sec
   OP_ID623. Dist_c1c1:fusedSynch_Filling : 0.00032374 sec
   OP_ID624. Dist_c1c1:fusedSynch_Filling : 3.05233e-05 sec
   OP_ID625. cudnnConvolutionBackwardData : 0.000174195 sec
   OP_ID626. cudnnConvolutionBackwardFilter : 0.000599772 sec
   OP_ID627. Add : 0.000137017 sec
   OP_ID628. Dist_c1c1:fusedSynch_Filling : 0.00119403 sec
   OP_ID629. GpuPoolingBackward : 0.00109889 sec
   OP_ID630. ReLUBackward : 0.000826437 sec
   OP_ID631. GpuBatchNormBackward : 0.00145682 sec
   OP_ID632. Dist_c1c1:fusedSynch_Filling : 0.00310224 sec
   OP_ID633. Dist_c1c1:fusedSynch_Filling : 0.00153533 sec
   OP_ID634. cudnnConvolutionBackwardData : 0.00196348 sec
   OP_ID635. cudnnConvolutionBackwardFilter : 0.00217321 sec
   OP_ID636. Dist_c1c1:fusedSynch_Filling : 0.00256668 sec
   OP_ID637. Dist_sc1:fusedSynch_Transfer : 0.0590358 sec
   OP_ID639. Div : 3.62823e-06 sec
   OP_ID640. Axpy : 3.49176e-06 sec
   OP_ID641. EltwiseMult : 3.45318e-06 sec
   OP_ID642. Axpy : 3.32643e-06 sec
   OP_ID643. Axpy : 3.3142e-06 sec
   OP_ID644. Div : 3.465e-05 sec
   OP_ID645. Axpy : 7.04894e-05 sec
   OP_ID646. EltwiseMult : 6.0229e-05 sec
   OP_ID647. Axpy : 0.000115972 sec
   OP_ID648. Axpy : 4.57283e-05 sec
   OP_ID649. Div : 3.66902e-06 sec
   OP_ID650. Axpy : 3.5131e-06 sec
   OP_ID651. EltwiseMult : 3.29349e-06 sec
   OP_ID652. Axpy : 2.57365e-05 sec
   OP_ID653. Axpy : 3.13788e-06 sec
   OP_ID654. Div : 3.47514e-06 sec
   OP_ID655. Axpy : 5.19404e-06 sec
   OP_ID656. EltwiseMult : 3.39733e-06 sec
   OP_ID657. Axpy : 5.0971e-06 sec
   OP_ID658. Axpy : 3.14635e-06 sec
   OP_ID659. Div : 1.81245e-05 sec
   OP_ID660. Axpy : 6.94965e-05 sec
   OP_ID661. EltwiseMult : 1.81415e-05 sec
   OP_ID662. Axpy : 4.67937e-05 sec
   OP_ID663. Axpy : 2.53082e-05 sec
   OP_ID664. Div : 3.55984e-06 sec
   OP_ID665. Axpy : 3.4491e-06 sec
   OP_ID666. EltwiseMult : 3.38918e-06 sec
   OP_ID667. Axpy : 3.25616e-06 sec
   OP_ID668. Axpy : 3.22133e-06 sec
   OP_ID669. Div : 3.38259e-06 sec
   OP_ID670. Axpy : 3.31827e-06 sec
   OP_ID671. EltwiseMult : 3.39294e-06 sec
   OP_ID672. Axpy : 2.62337e-05 sec
   OP_ID673. Axpy : 3.22133e-06 sec
   OP_ID674. Div : 3.63721e-05 sec
   OP_ID675. Axpy : 7.61387e-05 sec
   OP_ID676. EltwiseMult : 4.68973e-05 sec
   OP_ID677. Axpy : 5.23548e-05 sec
   OP_ID678. Axpy : 7.98108e-05 sec
   OP_ID679. Div : 3.46918e-06 sec
   OP_ID680. Axpy : 3.53882e-06 sec
   OP_ID681. EltwiseMult : 3.36721e-06 sec
   OP_ID682. Axpy : 3.24925e-06 sec
   OP_ID683. Axpy : 3.21443e-06 sec
   OP_ID684. Div : 3.28972e-06 sec
   OP_ID685. Axpy : 3.41082e-06 sec
   OP_ID686. EltwiseMult : 3.35749e-06 sec
   OP_ID687. Axpy : 3.17584e-06 sec
   OP_ID688. Axpy : 3.23702e-06 sec
   OP_ID689. Div : 1.80298e-05 sec
   OP_ID690. Axpy : 2.42011e-05 sec
   OP_ID691. EltwiseMult : 1.81885e-05 sec
   OP_ID692. Axpy : 2.25854e-05 sec
   OP_ID693. Axpy : 2.54792e-05 sec
   OP_ID694. Div : 3.49898e-06 sec
   OP_ID695. Axpy : 3.55451e-06 sec
   OP_ID696. EltwiseMult : 3.44031e-06 sec
   OP_ID697. Axpy : 3.26965e-06 sec
   OP_ID698. Axpy : 3.3327e-06 sec
   OP_ID699. Div : 3.42118e-06 sec
   OP_ID700. Axpy : 3.50651e-06 sec
   OP_ID701. EltwiseMult : 3.49929e-06 sec
   OP_ID702. Axpy : 3.26243e-06 sec
   OP_ID703. Axpy : 5.05098e-06 sec
   OP_ID704. Div : 1.78911e-05 sec
   OP_ID705. Axpy : 2.26187e-05 sec
   OP_ID706. EltwiseMult : 4.65274e-05 sec
   OP_ID707. Axpy : 2.56038e-05 sec
   OP_ID708. Axpy : 2.51153e-05 sec
   OP_ID709. Div : 3.55074e-06 sec
   OP_ID710. Axpy : 3.29757e-06 sec
   OP_ID711. EltwiseMult : 2.64354e-05 sec
   OP_ID712. Axpy : 3.21098e-06 sec
   OP_ID713. Axpy : 3.23357e-06 sec
   OP_ID714. Div : 3.36376e-06 sec
   OP_ID715. Axpy : 3.28753e-06 sec
   OP_ID716. EltwiseMult : 3.56329e-06 sec
   OP_ID717. Axpy : 3.06667e-06 sec
   OP_ID718. Axpy : 3.31953e-06 sec
   OP_ID719. Div : 5.81923e-05 sec
   OP_ID720. Axpy : 7.56471e-05 sec
   OP_ID721. EltwiseMult : 5.84599e-05 sec
   OP_ID722. Axpy : 5.25606e-05 sec
   OP_ID723. Axpy : 7.61509e-05 sec
   OP_ID724. Div : 3.52314e-06 sec
   OP_ID725. Axpy : 2.6422e-05 sec
   OP_ID726. EltwiseMult : 3.37004e-06 sec
   OP_ID727. Axpy : 3.27937e-06 sec
   OP_ID728. Axpy : 3.15325e-06 sec
   OP_ID729. Div : 3.36408e-06 sec
   OP_ID730. Axpy : 3.40988e-06 sec
   OP_ID731. EltwiseMult : 3.46102e-06 sec
   OP_ID732. Axpy : 3.22792e-06 sec
   OP_ID733. Axpy : 3.17365e-06 sec
   OP_ID734. Div : 1.80624e-05 sec
   OP_ID735. Axpy : 2.22911e-05 sec
   OP_ID736. EltwiseMult : 1.82507e-05 sec
   OP_ID737. Axpy : 4.6369e-05 sec
   OP_ID738. Axpy : 2.51947e-05 sec
   OP_ID739. Div : 3.59059e-06 sec
   OP_ID740. Axpy : 3.49145e-06 sec
   OP_ID741. EltwiseMult : 3.46039e-06 sec
   OP_ID742. Axpy : 2.56822e-05 sec
   OP_ID743. Axpy : 3.22259e-06 sec
   OP_ID744. Div : 3.47294e-06 sec
   OP_ID745. Axpy : 3.44596e-06 sec
   OP_ID746. EltwiseMult : 3.40768e-06 sec
   OP_ID747. Axpy : 3.26118e-06 sec
   OP_ID748. Axpy : 3.26149e-06 sec
   OP_ID749. Div : 3.28643e-05 sec
   OP_ID750. Axpy : 9.76957e-05 sec
   OP_ID751. EltwiseMult : 3.73387e-05 sec
   OP_ID752. Axpy : 5.30525e-05 sec
   OP_ID753. Axpy : 5.17904e-05 sec
   OP_ID754. Div : 3.50651e-06 sec
   OP_ID755. Axpy : 3.536e-06 sec
   OP_ID756. EltwiseMult : 3.50431e-06 sec
   OP_ID757. Axpy : 5.03121e-06 sec
   OP_ID758. Axpy : 3.29192e-06 sec
   OP_ID759. Div : 3.3509e-06 sec
   OP_ID760. Axpy : 5.15231e-06 sec
   OP_ID761. EltwiseMult : 3.4171e-06 sec
   OP_ID762. Axpy : 3.24486e-06 sec
   OP_ID763. Axpy : 3.2869e-06 sec
   OP_ID764. Div : 4.09716e-05 sec
   OP_ID765. Axpy : 2.23947e-05 sec
   OP_ID766. EltwiseMult : 1.81945e-05 sec
   OP_ID767. Axpy : 6.77139e-05 sec
   OP_ID768. Axpy : 4.70965e-05 sec
   OP_ID769. Div : 5.3531e-06 sec
   OP_ID770. Axpy : 3.46823e-06 sec
   OP_ID771. EltwiseMult : 3.4378e-06 sec
   OP_ID772. Axpy : 3.36659e-06 sec
   OP_ID773. Axpy : 3.22949e-06 sec
   OP_ID774. Div : 5.05568e-06 sec
   OP_ID775. Axpy : 3.392e-06 sec
   OP_ID776. EltwiseMult : 2.57948e-05 sec
   OP_ID777. Axpy : 3.26933e-06 sec
   OP_ID778. Axpy : 2.53082e-05 sec
   OP_ID779. Div : 6.37427e-05 sec
   OP_ID780. Axpy : 8.01678e-05 sec
   OP_ID781. EltwiseMult : 3.64875e-05 sec
   OP_ID782. Axpy : 9.89463e-05 sec
   OP_ID783. Axpy : 5.54673e-05 sec
   OP_ID784. Div : 3.54133e-06 sec
   OP_ID785. Axpy : 3.44533e-06 sec
   OP_ID786. EltwiseMult : 3.41961e-06 sec
   OP_ID787. Axpy : 3.28659e-06 sec
   OP_ID788. Axpy : 3.19435e-06 sec
   OP_ID789. Div : 3.46604e-06 sec
   OP_ID790. Axpy : 3.51592e-06 sec
   OP_ID791. EltwiseMult : 5.55796e-06 sec
   OP_ID792. Axpy : 3.24235e-06 sec
   OP_ID793. Axpy : 3.21443e-06 sec
   OP_ID794. Div : 5.33757e-05 sec
   OP_ID795. Axpy : 3.20125e-05 sec
   OP_ID796. EltwiseMult : 3.28411e-05 sec
   OP_ID797. Axpy : 2.85092e-05 sec
   OP_ID798. Axpy : 9.91216e-06 sec
   OP_ID799. Div : 3.52627e-06 sec
   OP_ID800. Axpy : 3.46604e-06 sec
   OP_ID801. EltwiseMult : 5.15764e-06 sec
   OP_ID802. Axpy : 3.28031e-06 sec
   OP_ID803. Axpy : 3.27121e-06 sec
   OP_ID804. Div : 3.44376e-06 sec
   OP_ID805. Axpy : 3.47325e-06 sec
   OP_ID806. EltwiseMult : 3.40078e-06 sec
   OP_ID807. Axpy : 3.2367e-06 sec
   OP_ID808. Axpy : 3.32925e-06 sec
   OP_ID809. Div : 6.8389e-06 sec
   OP_ID810. Axpy : 6.14525e-06 sec
   OP_ID811. EltwiseMult : 6.47655e-06 sec
   OP_ID812. Axpy : 5.00172e-06 sec
   OP_ID813. Axpy : 4.89129e-06 sec
   OP_ID814. Div : 3.36031e-06 sec
   OP_ID815. Axpy : 3.29443e-06 sec
   OP_ID816. EltwiseMult : 3.35843e-06 sec
   OP_ID817. Axpy : 3.13004e-06 sec
   OP_ID818. Axpy : 3.28314e-06 sec
   OP_ID819. Div : 3.32392e-06 sec
   OP_ID820. Axpy : 3.29694e-06 sec
   OP_ID821. EltwiseMult : 3.36094e-06 sec
   OP_ID822. Axpy : 3.17333e-06 sec
   OP_ID823. Axpy : 2.4939e-05 sec
   OP_ID824. Div : 3.44373e-05 sec
   OP_ID825. Axpy : 8.76329e-06 sec
   OP_ID826. EltwiseMult : 1.10742e-05 sec
   OP_ID827. Axpy : 8.04172e-06 sec
   OP_ID828. Axpy : 3.41039e-05 sec
   OP_ID829. Div : 2.61584e-05 sec
   OP_ID830. Axpy : 3.32957e-06 sec
   OP_ID831. EltwiseMult : 3.30855e-06 sec
   OP_ID832. Axpy : 2.59492e-05 sec
   OP_ID833. Axpy : 3.20847e-06 sec
   OP_ID834. Div : 3.29819e-06 sec
   OP_ID835. Axpy : 3.30886e-06 sec
   OP_ID836. EltwiseMult : 3.38604e-06 sec
   OP_ID837. Axpy : 3.13631e-06 sec
   OP_ID838. Axpy : 3.26808e-06 sec
   OP_ID839. Div : 6.68925e-06 sec
   OP_ID840. Axpy : 6.15435e-06 sec
   OP_ID841. EltwiseMult : 6.55529e-06 sec
   OP_ID842. Axpy : 5.00392e-06 sec
   OP_ID843. Axpy : 2.7691e-05 sec
   OP_ID844. Div : 2.52417e-05 sec
   OP_ID845. Axpy : 3.43812e-06 sec
   OP_ID846. EltwiseMult : 3.4927e-06 sec
   OP_ID847. Axpy : 3.26776e-06 sec
   OP_ID848. Axpy : 3.27592e-06 sec
   OP_ID849. Div : 3.36502e-06 sec
   OP_ID850. Axpy : 3.37318e-06 sec
   OP_ID851. EltwiseMult : 2.6e-05 sec
   OP_ID852. Axpy : 3.21035e-06 sec
   OP_ID853. Axpy : 3.26777e-06 sec
   OP_ID854. Div : 8.33223e-06 sec
   OP_ID855. Axpy : 6.24345e-06 sec
   OP_ID856. EltwiseMult : 6.35921e-06 sec
   OP_ID857. Axpy : 4.96314e-06 sec
   OP_ID858. Axpy : 4.90321e-06 sec
   OP_ID859. Div : 3.3302e-06 sec
   OP_ID860. Axpy : 3.34745e-06 sec
   OP_ID861. EltwiseMult : 3.42023e-06 sec
   OP_ID862. Axpy : 3.14729e-06 sec
   OP_ID863. Axpy : 3.28314e-06 sec
   OP_ID864. Div : 3.53474e-06 sec
   OP_ID865. Axpy : 3.4949e-06 sec
   OP_ID866. EltwiseMult : 3.2229e-06 sec
   OP_ID867. Axpy : 3.30353e-06 sec
   OP_ID868. Axpy : 3.24141e-06 sec
   OP_ID869. Div : 1.14331e-05 sec
   OP_ID870. Axpy : 9.26777e-06 sec
   OP_ID871. EltwiseMult : 1.10061e-05 sec
   OP_ID872. Axpy : 7.98149e-06 sec
   OP_ID873. Axpy : 1.13133e-05 sec
   OP_ID874. Div : 3.3669e-06 sec
   OP_ID875. Axpy : 2.60514e-05 sec
   OP_ID876. EltwiseMult : 3.45851e-06 sec
   OP_ID877. Axpy : 3.29976e-06 sec
   OP_ID878. Axpy : 3.26996e-06 sec
   OP_ID879. Div : 3.3851e-06 sec
   OP_ID880. Axpy : 3.41333e-06 sec
   OP_ID881. EltwiseMult : 3.33584e-06 sec
   OP_ID882. Axpy : 3.25114e-06 sec
   OP_ID883. Axpy : 4.93521e-06 sec
   OP_ID884. Div : 6.8251e-06 sec
   OP_ID885. Axpy : 6.12894e-06 sec
   OP_ID886. EltwiseMult : 6.34792e-06 sec
   OP_ID887. Axpy : 4.85176e-06 sec
   OP_ID888. Axpy : 5.07326e-06 sec
   OP_ID889. Div : 3.39953e-06 sec
   OP_ID890. Axpy : 3.47357e-06 sec
   OP_ID891. EltwiseMult : 3.49302e-06 sec
   OP_ID892. Axpy : 3.17145e-06 sec
   OP_ID893. Axpy : 2.56483e-05 sec
   OP_ID894. Div : 3.38541e-06 sec
   OP_ID895. Axpy : 3.45192e-06 sec
   OP_ID896. EltwiseMult : 2.65424e-05 sec
   OP_ID897. Axpy : 3.17929e-06 sec
   OP_ID898. Axpy : 3.30384e-06 sec
   OP_ID899. Div : 6.69898e-06 sec
   OP_ID900. Axpy : 6.09004e-06 sec
   OP_ID901. EltwiseMult : 6.44455e-06 sec
   OP_ID902. Axpy : 4.98165e-06 sec
   OP_ID903. Axpy : 4.80251e-06 sec
   OP_ID904. Div : 3.45443e-06 sec
   OP_ID905. Axpy : 3.38855e-06 sec
   OP_ID906. EltwiseMult : 3.31106e-06 sec
   OP_ID907. Axpy : 3.19467e-06 sec
   OP_ID908. Axpy : 3.16172e-06 sec
   OP_ID909. Div : 3.24894e-06 sec
   OP_ID910. Axpy : 3.27184e-06 sec
   OP_ID911. EltwiseMult : 3.4698e-06 sec
   OP_ID912. Axpy : 3.27749e-06 sec
   OP_ID913. Axpy : 3.26431e-06 sec
   OP_ID914. Div : 1.14754e-05 sec
   OP_ID915. Axpy : 8.69772e-06 sec
   OP_ID916. EltwiseMult : 3.37205e-05 sec
   OP_ID917. Axpy : 8.22243e-06 sec
   OP_ID918. Axpy : 1.13565e-05 sec
   OP_ID919. Div : 3.25867e-06 sec
   OP_ID920. Axpy : 3.55012e-06 sec
   OP_ID921. EltwiseMult : 3.42714e-06 sec
   OP_ID922. Axpy : 3.33616e-06 sec
   OP_ID923. Axpy : 3.17302e-06 sec
   OP_ID924. Div : 3.34682e-06 sec
   OP_ID925. Axpy : 3.32863e-06 sec
   OP_ID926. EltwiseMult : 3.31074e-06 sec
   OP_ID927. Axpy : 3.19843e-06 sec
   OP_ID928. Axpy : 3.26086e-06 sec
   OP_ID929. Div : 6.73035e-06 sec
   OP_ID930. Axpy : 6.13176e-06 sec
   OP_ID931. EltwiseMult : 6.37553e-06 sec
   OP_ID932. Axpy : 4.90039e-06 sec
   OP_ID933. Axpy : 4.97129e-06 sec
   OP_ID934. Div : 3.47639e-06 sec
   OP_ID935. Axpy : 3.39922e-06 sec
   OP_ID936. EltwiseMult : 3.4651e-06 sec
   OP_ID937. Axpy : 3.296e-06 sec
   OP_ID938. Axpy : 7.20376e-06 sec
   OP_ID939. Div : 3.52847e-06 sec
   OP_ID940. Axpy : 3.46165e-06 sec
   OP_ID941. EltwiseMult : 3.39765e-06 sec
   OP_ID942. Axpy : 3.19404e-06 sec
   OP_ID943. Axpy : 3.29443e-06 sec
   OP_ID944. Div : 6.85302e-06 sec
   OP_ID945. Axpy : 6.21961e-06 sec
   OP_ID946. EltwiseMult : 6.56565e-06 sec
   OP_ID947. Axpy : 4.92173e-06 sec
   OP_ID948. Axpy : 5.02776e-06 sec
   OP_ID949. Div : 3.33239e-06 sec
   OP_ID950. Axpy : 3.3302e-06 sec
   OP_ID951. EltwiseMult : 3.42243e-06 sec
   OP_ID952. Axpy : 3.28533e-06 sec
   OP_ID953. Axpy : 3.24894e-06 sec
   OP_ID954. Div : 3.24455e-06 sec
   OP_ID955. Axpy : 3.39608e-06 sec
   OP_ID956. EltwiseMult : 3.35153e-06 sec
   OP_ID957. Axpy : 3.25584e-06 sec
   OP_ID958. Axpy : 3.19906e-06 sec
   OP_ID959. Div : 1.40141e-05 sec
   OP_ID960. Axpy : 3.1157e-05 sec
   OP_ID961. EltwiseMult : 1.11031e-05 sec
   OP_ID962. Axpy : 7.86572e-06 sec
   OP_ID963. Axpy : 1.11551e-05 sec
   OP_ID964. Div : 3.3258e-06 sec
   OP_ID965. Axpy : 5.02776e-06 sec
   OP_ID966. EltwiseMult : 2.52565e-05 sec
   OP_ID967. Axpy : 3.25867e-06 sec
   OP_ID968. Axpy : 3.29725e-06 sec
   OP_ID969. Div : 3.25396e-06 sec
   OP_ID970. Axpy : 3.3829e-06 sec
   OP_ID971. EltwiseMult : 3.32612e-06 sec
   OP_ID972. Axpy : 3.16769e-06 sec
   OP_ID973. Axpy : 3.14416e-06 sec
   OP_ID974. Div : 6.77929e-06 sec
   OP_ID975. Axpy : 6.10855e-06 sec
   OP_ID976. EltwiseMult : 6.54431e-06 sec
   OP_ID977. Axpy : 4.86431e-06 sec
   OP_ID978. Axpy : 5.02086e-06 sec
   OP_ID979. Div : 3.4949e-06 sec
   OP_ID980. Axpy : 3.55765e-06 sec
   OP_ID981. EltwiseMult : 3.40831e-06 sec
   OP_ID982. Axpy : 3.36721e-06 sec
   OP_ID983. Axpy : 3.14604e-06 sec
   OP_ID984. Div : 3.43467e-06 sec
   OP_ID985. Axpy : 3.50337e-06 sec
   OP_ID986. EltwiseMult : 2.50453e-05 sec
   OP_ID987. Axpy : 3.3622e-06 sec
   OP_ID988. Axpy : 3.09553e-06 sec
   OP_ID989. Div : 6.92204e-06 sec
   OP_ID990. Axpy : 6.11796e-06 sec
   OP_ID991. EltwiseMult : 2.01885e-05 sec
   OP_ID992. Axpy : 4.9371e-06 sec
   OP_ID993. Axpy : 4.97977e-06 sec
   OP_ID994. Div : 3.30478e-06 sec
   OP_ID995. Axpy : 3.40361e-06 sec
   OP_ID996. EltwiseMult : 3.3851e-06 sec
   OP_ID997. Axpy : 3.20596e-06 sec
   OP_ID998. Axpy : 3.20031e-06 sec
   OP_ID999. Div : 3.28188e-06 sec
   OP_ID1000. Axpy : 1.71683e-05 sec
   OP_ID1001. EltwiseMult : 3.37788e-06 sec
   OP_ID1002. Axpy : 4.94525e-06 sec
   OP_ID1003. Axpy : 3.29349e-06 sec
   OP_ID1004. Div : 5.67357e-05 sec
   OP_ID1005. Axpy : 2.27216e-05 sec
   OP_ID1006. EltwiseMult : 2.57177e-05 sec
   OP_ID1007. Axpy : 8.06211e-06 sec
   OP_ID1008. Axpy : 1.13669e-05 sec
   OP_ID1009. Div : 3.31294e-06 sec
   OP_ID1010. Axpy : 3.47984e-06 sec
   OP_ID1011. EltwiseMult : 1.72998e-05 sec
   OP_ID1012. Axpy : 3.2502e-06 sec
   OP_ID1013. Axpy : 3.22259e-06 sec
   OP_ID1014. Div : 3.31796e-06 sec
   OP_ID1015. Axpy : 3.32925e-06 sec
   OP_ID1016. EltwiseMult : 3.31984e-06 sec
   OP_ID1017. Axpy : 4.96722e-06 sec
   OP_ID1018. Axpy : 3.19184e-06 sec
   OP_ID1019. Div : 6.7178e-06 sec
   OP_ID1020. Axpy : 6.11796e-06 sec
   OP_ID1021. EltwiseMult : 6.49788e-06 sec
   OP_ID1022. Axpy : 4.93929e-06 sec
   OP_ID1023. Axpy : 4.8411e-06 sec
   OP_ID1024. Div : 3.43027e-06 sec
   OP_ID1025. Axpy : 1.72012e-05 sec
   OP_ID1026. EltwiseMult : 3.52721e-06 sec
   OP_ID1027. Axpy : 3.36565e-06 sec
   OP_ID1028. Axpy : 4.99263e-06 sec
   OP_ID1029. Div : 1.73509e-05 sec
   OP_ID1030. Axpy : 3.42431e-06 sec
   OP_ID1031. EltwiseMult : 3.38447e-06 sec
   OP_ID1032. Axpy : 3.31576e-06 sec
   OP_ID1033. Axpy : 3.26494e-06 sec
   OP_ID1034. Div : 3.32326e-05 sec
   OP_ID1035. Axpy : 3.16985e-05 sec
   OP_ID1036. EltwiseMult : 1.02372e-05 sec
   OP_ID1037. Axpy : 7.04972e-06 sec
   OP_ID1038. Axpy : 9.92596e-06 sec
   OP_ID1039. Div : 3.53976e-06 sec
   OP_ID1040. Axpy : 3.46071e-06 sec
   OP_ID1041. EltwiseMult : 3.45537e-06 sec
   OP_ID1042. Axpy : 3.30886e-06 sec
   OP_ID1043. Axpy : 3.25741e-06 sec
   OP_ID1044. Div : 3.4149e-06 sec
   OP_ID1045. Axpy : 5.15294e-06 sec
   OP_ID1046. EltwiseMult : 3.51216e-06 sec
   OP_ID1047. Axpy : 3.21192e-06 sec
   OP_ID1048. Axpy : 5.86541e-06 sec
   OP_ID1049. Div : 6.74259e-06 sec
   OP_ID1050. Axpy : 6.1829e-06 sec
   OP_ID1051. EltwiseMult : 4.27253e-05 sec
   OP_ID1052. Axpy : 4.98886e-06 sec
   OP_ID1053. Axpy : 4.99106e-06 sec
   OP_ID1054. Div : 3.34777e-06 sec
   OP_ID1055. Axpy : 1.90485e-05 sec
   OP_ID1056. EltwiseMult : 3.2342e-06 sec
   OP_ID1057. Axpy : 5.05129e-06 sec
   OP_ID1058. Axpy : 4.89976e-06 sec
   OP_ID1059. Div : 2.7952e-05 sec
   OP_ID1060. Axpy : 3.38478e-06 sec
   OP_ID1061. EltwiseMult : 3.38604e-06 sec
   OP_ID1062. Axpy : 3.32643e-06 sec
   OP_ID1063. Axpy : 3.18463e-06 sec
   OP_ID1064. Div : 1.41478e-05 sec
   OP_ID1065. Axpy : 9.04659e-06 sec
   OP_ID1066. EltwiseMult : 3.25487e-05 sec
   OP_ID1067. Axpy : 3.26808e-05 sec
   OP_ID1068. Axpy : 1.11809e-05 sec
   OP_ID1069. Div : 3.42525e-06 sec
   OP_ID1070. Axpy : 1.72442e-05 sec
   OP_ID1071. EltwiseMult : 3.40706e-06 sec
   OP_ID1072. Axpy : 3.34337e-06 sec
   OP_ID1073. Axpy : 3.20847e-06 sec
   OP_ID1074. Div : 3.32486e-06 sec
   OP_ID1075. Axpy : 3.29757e-06 sec
   OP_ID1076. EltwiseMult : 3.30541e-06 sec
   OP_ID1077. Axpy : 3.19059e-06 sec
   OP_ID1078. Axpy : 3.1931e-06 sec
   OP_ID1079. Div : 4.9669e-06 sec
   OP_ID1080. Axpy : 5.15169e-06 sec
   OP_ID1081. EltwiseMult : 2.76552e-05 sec
   OP_ID1082. Axpy : 4.17631e-06 sec
   OP_ID1083. Axpy : 4.06431e-06 sec
   OP_ID1084. Div : 3.41992e-06 sec
   OP_ID1085. Axpy : 3.26369e-06 sec
   OP_ID1086. EltwiseMult : 3.45694e-06 sec
   OP_ID1087. Axpy : 3.18243e-06 sec
   OP_ID1088. Axpy : 6.93365e-06 sec
   OP_ID1089. Div : 2.6357e-05 sec
   OP_ID1090. Axpy : 3.29945e-06 sec
   OP_ID1091. EltwiseMult : 3.41082e-06 sec
   OP_ID1092. Axpy : 3.16172e-06 sec
   OP_ID1093. Axpy : 3.24486e-06 sec
   OP_ID1094. Div : 3.96172e-06 sec
   OP_ID1095. Axpy : 4.21678e-06 sec
   OP_ID1096. EltwiseMult : 5.68408e-06 sec
   OP_ID1097. Axpy : 3.59153e-06 sec
   OP_ID1098. Axpy : 3.62917e-06 sec
   OP_ID1099. Div : 3.34745e-06 sec
   OP_ID1100. Axpy : 3.29569e-06 sec
   OP_ID1101. EltwiseMult : 3.30322e-06 sec
   OP_ID1102. Axpy : 3.08392e-06 sec
   OP_ID1103. Axpy : 3.24455e-06 sec
   OP_ID1104. Div : 3.1818e-06 sec
   OP_ID1105. Axpy : 2.48329e-05 sec
   OP_ID1106. EltwiseMult : 3.24424e-06 sec
   OP_ID1107. Axpy : 3.27247e-06 sec
   OP_ID1108. Axpy : 3.23263e-06 sec
   OP_ID1109. Div : 6.9578e-06 sec
   OP_ID1110. Axpy : 7.00392e-06 sec
   OP_ID1111. EltwiseMult : 4.97318e-06 sec
   OP_ID1112. Axpy : 4.16063e-06 sec
   OP_ID1113. Axpy : 4.16188e-06 sec
   OP_ID1114. Div : 2.4922e-05 sec
   OP_ID1115. Axpy : 3.24518e-06 sec
   OP_ID1116. EltwiseMult : 3.23671e-06 sec
   OP_ID1117. Axpy : 3.1109e-06 sec
   OP_ID1118. Axpy : 3.17741e-06 sec
   OP_ID1119. Div : 3.22196e-06 sec
   OP_ID1120. Axpy : 3.31608e-06 sec
   OP_ID1121. EltwiseMult : 3.27122e-06 sec
   OP_ID1122. Axpy : 1.69553e-05 sec
   OP_ID1123. Axpy : 3.22165e-06 sec
   OP_ID1124. Div : 4.016e-06 sec
   OP_ID1125. Axpy : 4.31467e-06 sec
   OP_ID1126. EltwiseMult : 3.82588e-06 sec
   OP_ID1127. Axpy : 3.63702e-06 sec
   OP_ID1128. Axpy : 3.64204e-06 sec
   OP_ID1129. Div : 3.37945e-06 sec
   OP_ID1130. Axpy : 3.3647e-06 sec
   OP_ID1131. EltwiseMult : 3.23231e-06 sec
   OP_ID1132. Axpy : 3.18306e-06 sec
   OP_ID1133. Axpy : 3.19435e-06 sec
   OP_ID1134. Div : 3.38384e-06 sec
   OP_ID1135. Axpy : 3.4218e-06 sec
   OP_ID1136. EltwiseMult : 3.488e-06 sec
   OP_ID1137. Axpy : 3.19153e-06 sec
   OP_ID1138. Axpy : 3.25616e-06 sec
   OP_ID1139. Div : 4.02823e-06 sec
   OP_ID1140. Axpy : 6.13929e-06 sec
   OP_ID1141. EltwiseMult : 3.88863e-06 sec
   OP_ID1142. Axpy : 3.56863e-06 sec
   OP_ID1143. Axpy : 5.36188e-06 sec
   OP_ID1144. Div : 3.46353e-06 sec
   OP_ID1145. Axpy : 3.24706e-06 sec
   OP_ID1146. EltwiseMult : 3.42306e-06 sec
   OP_ID1147. Axpy : 3.16235e-06 sec
   OP_ID1148. Axpy : 3.23639e-06 sec
   OP_ID1149. Div : 3.30133e-06 sec
   OP_ID1150. Axpy : 3.35749e-06 sec
   OP_ID1151. EltwiseMult : 3.31106e-06 sec
   OP_ID1152. Axpy : 4.93302e-06 sec
   OP_ID1153. Axpy : 3.18682e-06 sec
   OP_ID1154. Div : 4.95404e-06 sec
   OP_ID1155. Axpy : 5.30227e-06 sec
   OP_ID1156. EltwiseMult : 4.89318e-06 sec
   OP_ID1157. Axpy : 4.10792e-06 sec
   OP_ID1158. Axpy : 2.55981e-05 sec
   OP_ID1159. Div : 5.1109e-06 sec
   OP_ID1160. Axpy : 2.4784e-05 sec
   OP_ID1161. EltwiseMult : 2.44681e-05 sec
   OP_ID1162. Axpy : 3.15639e-06 sec
   OP_ID1163. Axpy : 3.15702e-06 sec
   OP_ID1164. Div : 3.34682e-06 sec
   OP_ID1165. Axpy : 4.97945e-06 sec
   OP_ID1166. EltwiseMult : 3.32768e-06 sec
   OP_ID1167. Axpy : 3.2891e-06 sec
   OP_ID1168. Axpy : 3.24706e-06 sec
   OP_ID1169. Div : 4.08125e-06 sec
   OP_ID1170. Axpy : 4.31184e-06 sec
   OP_ID1171. EltwiseMult : 3.76345e-06 sec
   OP_ID1172. Axpy : 3.58902e-06 sec
   OP_ID1173. Axpy : 3.60847e-06 sec
   OP_ID1174. Div : 3.40643e-06 sec
   OP_ID1175. Axpy : 3.33929e-06 sec
   OP_ID1176. EltwiseMult : 2.47774e-05 sec
   OP_ID1177. Axpy : 3.13569e-06 sec
   OP_ID1178. Axpy : 3.23671e-06 sec
   OP_ID1179. Div : 3.3967e-06 sec
   OP_ID1180. Axpy : 3.34463e-06 sec
   OP_ID1181. EltwiseMult : 3.26369e-06 sec
   OP_ID1182. Axpy : 3.16267e-06 sec
   OP_ID1183. Axpy : 3.17961e-06 sec
   OP_ID1184. Div : 4.13553e-06 sec
   OP_ID1185. Axpy : 4.38651e-06 sec
   OP_ID1186. EltwiseMult : 2.47297e-05 sec
   OP_ID1187. Axpy : 3.54416e-06 sec
   OP_ID1188. Axpy : 2.66312e-05 sec
   OP_ID1189. Div : 3.38165e-06 sec
   OP_ID1190. Axpy : 4.96565e-06 sec
   OP_ID1191. EltwiseMult : 3.33333e-06 sec
   OP_ID1192. Axpy : 3.07106e-06 sec
   OP_ID1193. Axpy : 3.19561e-06 sec
   OP_ID1194. Div : 3.25616e-06 sec
   OP_ID1195. Axpy : 3.31859e-06 sec
   OP_ID1196. EltwiseMult : 3.23263e-06 sec
   OP_ID1197. Axpy : 3.18086e-06 sec
   OP_ID1198. Axpy : 4.95749e-06 sec
   OP_ID1199. Div : 5.22008e-06 sec
   OP_ID1200. Axpy : 5.18369e-06 sec
   OP_ID1201. EltwiseMult : 4.93804e-06 sec
   OP_ID1202. Axpy : 4.22651e-06 sec
   OP_ID1203. Axpy : 6.03671e-06 sec
   OP_ID1204. Div : 3.31294e-06 sec
   OP_ID1205. Axpy : 3.34839e-06 sec
   OP_ID1206. EltwiseMult : 3.25145e-06 sec
   OP_ID1207. Axpy : 3.22635e-06 sec
   OP_ID1208. Axpy : 3.08706e-06 sec
   OP_ID1209. Div : 3.32863e-06 sec
   OP_ID1210. Axpy : 3.25365e-06 sec
   OP_ID1211. EltwiseMult : 3.3691e-06 sec
   OP_ID1212. Axpy : 3.1749e-06 sec
   OP_ID1213. Axpy : 3.17114e-06 sec
   OP_ID1214. Div : 2.54033e-05 sec
   OP_ID1215. Axpy : 4.18761e-06 sec
   OP_ID1216. EltwiseMult : 2.68091e-05 sec
   OP_ID1217. Axpy : 3.60941e-06 sec
   OP_ID1218. Axpy : 3.56988e-06 sec
   OP_ID1219. Div : 3.4949e-06 sec
   OP_ID1220. Axpy : 3.28596e-06 sec
   OP_ID1221. EltwiseMult : 3.28157e-06 sec
   OP_ID1222. Axpy : 3.18996e-06 sec
   OP_ID1223. Axpy : 3.24235e-06 sec
   OP_ID1224. Div : 3.33616e-06 sec
   OP_ID1225. Axpy : 3.25898e-06 sec
   OP_ID1226. EltwiseMult : 3.36784e-06 sec
   OP_ID1227. Axpy : 3.12251e-06 sec
   OP_ID1228. Axpy : 3.27247e-06 sec
   OP_ID1229. Div : 4.92643e-06 sec
   OP_ID1230. Axpy : 5.0767e-06 sec
   OP_ID1231. EltwiseMult : 2.61148e-05 sec
   OP_ID1232. Axpy : 4.17663e-06 sec
   OP_ID1233. Axpy : 4.15874e-06 sec
   OP_ID1234. Div : 3.32204e-06 sec
   OP_ID1235. Axpy : 2.52618e-05 sec
   OP_ID1236. EltwiseMult : 3.39482e-06 sec
   OP_ID1237. Axpy : 3.10306e-06 sec
   OP_ID1238. Axpy : 3.1429e-06 sec
   OP_ID1239. Div : 3.23294e-06 sec
   OP_ID1240. Axpy : 3.43498e-06 sec
   OP_ID1241. EltwiseMult : 3.39388e-06 sec
   OP_ID1242. Axpy : 3.2138e-06 sec
   OP_ID1243. Axpy : 3.20282e-06 sec
   OP_ID1244. Div : 4.05302e-06 sec
   OP_ID1245. Axpy : 4.45427e-06 sec
   OP_ID1246. EltwiseMult : 3.88204e-06 sec
   OP_ID1247. Axpy : 3.56172e-06 sec
   OP_ID1248. Axpy : 3.59906e-06 sec
   OP_ID1249. Div : 3.39514e-06 sec
   OP_ID1250. Axpy : 3.29663e-06 sec
   OP_ID1251. EltwiseMult : 3.3098e-06 sec
   OP_ID1252. Axpy : 3.13286e-06 sec
   OP_ID1253. Axpy : 3.13192e-06 sec
   OP_ID1254. Div : 3.27561e-06 sec
   OP_ID1255. Axpy : 3.32894e-06 sec
   OP_ID1256. EltwiseMult : 3.25145e-06 sec
   OP_ID1257. Axpy : 3.17616e-06 sec
   OP_ID1258. Axpy : 3.11373e-06 sec
   OP_ID1259. Div : 5.24172e-06 sec
   OP_ID1260. Axpy : 5.17145e-06 sec
   OP_ID1261. EltwiseMult : 4.88251e-06 sec
   OP_ID1262. Axpy : 4.20235e-06 sec
   OP_ID1263. Axpy : 4.20392e-06 sec
   OP_ID1264. Div : 3.28973e-06 sec
   OP_ID1265. Axpy : 3.29286e-06 sec
   OP_ID1266. EltwiseMult : 3.28941e-06 sec
   OP_ID1267. Axpy : 3.15451e-06 sec
   OP_ID1268. Axpy : 3.12125e-06 sec
   OP_ID1269. Div : 3.37129e-06 sec
   OP_ID1270. Axpy : 3.33459e-06 sec
   OP_ID1271. EltwiseMult : 3.38667e-06 sec
   OP_ID1272. Axpy : 3.18682e-06 sec
   OP_ID1273. Axpy : 3.17961e-06 sec
   OP_ID1274. Div : 3.75749e-06 sec
   OP_ID1275. Axpy : 3.98525e-06 sec
   OP_ID1276. EltwiseMult : 3.65459e-06 sec
   OP_ID1277. Axpy : 3.43341e-06 sec
   OP_ID1278. Axpy : 3.42494e-06 sec
   OP_ID1279. Div : 3.28502e-06 sec
   OP_ID1280. Axpy : 3.31733e-06 sec
   OP_ID1281. EltwiseMult : 3.30227e-06 sec
   OP_ID1282. Axpy : 3.29129e-06 sec
   OP_ID1283. Axpy : 3.15325e-06 sec
   OP_ID1284. Div : 3.27027e-06 sec
   OP_ID1285. Axpy : 3.29976e-06 sec
   OP_ID1286. EltwiseMult : 3.44188e-06 sec
   OP_ID1287. Axpy : 3.16957e-06 sec
   OP_ID1288. Axpy : 3.21851e-06 sec
   OP_ID1289. Div : 3.52659e-06 sec
   OP_ID1290. Axpy : 3.7349e-06 sec
   OP_ID1291. EltwiseMult : 3.61004e-06 sec
   OP_ID1292. Axpy : 3.31043e-06 sec
   OP_ID1293. Axpy : 3.36e-06 sec
   OP_ID1294. Div : 5.05098e-06 sec
   OP_ID1295. Axpy : 3.27184e-06 sec
   OP_ID1296. EltwiseMult : 3.27529e-06 sec
   OP_ID1297. Axpy : 3.11278e-06 sec
   OP_ID1298. Axpy : 3.13882e-06 sec
   OP_ID1299. Div : 3.27498e-06 sec
   OP_ID1300. Axpy : 3.27655e-06 sec
   OP_ID1301. EltwiseMult : 2.78761e-05 sec
   OP_ID1302. Axpy : 3.20314e-06 sec
   OP_ID1303. Axpy : 3.1178e-06 sec
   OP_ID1304. Div : 3.70227e-06 sec
   OP_ID1305. Axpy : 3.88768e-06 sec
   OP_ID1306. EltwiseMult : 3.67498e-06 sec
   OP_ID1307. Axpy : 3.44157e-06 sec
   OP_ID1308. Axpy : 3.44439e-06 sec
   OP_ID1309. Div : 3.28565e-06 sec
   OP_ID1310. Axpy : 3.27843e-06 sec
   OP_ID1311. EltwiseMult : 3.328e-06 sec
   OP_ID1312. Axpy : 3.11655e-06 sec
   OP_ID1313. Axpy : 3.14259e-06 sec
   OP_ID1314. Div : 3.27341e-06 sec
   OP_ID1315. Axpy : 3.31451e-06 sec
   OP_ID1316. EltwiseMult : 3.31294e-06 sec
   OP_ID1317. Axpy : 3.1702e-06 sec
   OP_ID1318. Axpy : 3.15576e-06 sec
   OP_ID1319. Div : 3.64267e-06 sec
   OP_ID1320. Axpy : 3.60282e-06 sec
   OP_ID1321. EltwiseMult : 3.76533e-06 sec
   OP_ID1322. Axpy : 3.34274e-06 sec
   OP_ID1323. Axpy : 2.64031e-05 sec
   OP_ID1324. Div : 3.32235e-06 sec
   OP_ID1325. Axpy : 3.28816e-06 sec
   OP_ID1326. EltwiseMult : 3.37694e-06 sec
   OP_ID1327. Axpy : 3.07294e-06 sec
   OP_ID1328. Axpy : 3.30322e-06 sec
   OP_ID1329. Div : 5.01051e-06 sec
   OP_ID1330. Axpy : 3.37663e-06 sec
   OP_ID1331. EltwiseMult : 3.15137e-06 sec
   OP_ID1332. Axpy : 3.26243e-06 sec
   OP_ID1333. Axpy : 3.05098e-06 sec
   OP_ID1334. Div : 3.56549e-06 sec
   OP_ID1335. Axpy : 3.64204e-06 sec
   OP_ID1336. EltwiseMult : 3.43843e-06 sec
   OP_ID1337. Axpy : 3.38792e-06 sec
   OP_ID1338. Axpy : 3.29474e-06 sec
   OP_ID1339. Div : 3.41992e-06 sec
   OP_ID1340. Axpy : 3.20125e-06 sec
   OP_ID1341. EltwiseMult : 3.31765e-06 sec
   OP_ID1342. Axpy : 3.11404e-06 sec
   OP_ID1343. Axpy : 3.17835e-06 sec
   OP_ID1344. Div : 3.27121e-06 sec
   OP_ID1345. Axpy : 3.3622e-06 sec
   OP_ID1346. EltwiseMult : 3.25553e-06 sec
   OP_ID1347. Axpy : 3.16455e-06 sec
   OP_ID1348. Axpy : 3.11498e-06 sec
   OP_ID1349. Div : 3.75435e-06 sec
   OP_ID1350. Axpy : 3.9087e-06 sec
   OP_ID1351. EltwiseMult : 3.60157e-06 sec
   OP_ID1352. Axpy : 3.45569e-06 sec
   OP_ID1353. Axpy : 3.47168e-06 sec
   OP_ID1354. Div : 3.26274e-06 sec
   OP_ID1355. Axpy : 3.29631e-06 sec
   OP_ID1356. EltwiseMult : 3.31608e-06 sec
   OP_ID1357. Axpy : 3.11969e-06 sec
   OP_ID1358. Axpy : 3.20816e-06 sec
   OP_ID1359. Div : 3.30667e-06 sec
   OP_ID1360. Axpy : 3.31482e-06 sec
   OP_ID1361. EltwiseMult : 3.33835e-06 sec
   OP_ID1362. Axpy : 3.15263e-06 sec
   OP_ID1363. Axpy : 3.19184e-06 sec
   OP_ID1364. Div : 3.55263e-06 sec
   OP_ID1365. Axpy : 3.88511e-05 sec
   OP_ID1366. EltwiseMult : 3.62321e-06 sec
   OP_ID1367. Axpy : 2.49904e-05 sec
   OP_ID1368. Axpy : 3.32423e-06 sec
   OP_ID1369. Div : 3.28565e-06 sec
   OP_ID1370. Axpy : 3.23263e-06 sec
   OP_ID1371. EltwiseMult : 5.11372e-06 sec
   OP_ID1372. Axpy : 3.14667e-06 sec
   OP_ID1373. Axpy : 3.19843e-06 sec
   OP_ID1374. Div : 3.2891e-06 sec
   OP_ID1375. Axpy : 5.13568e-06 sec
   OP_ID1376. EltwiseMult : 3.32016e-06 sec
   OP_ID1377. Axpy : 3.25898e-06 sec
   OP_ID1378. Axpy : 3.18902e-06 sec
   OP_ID1379. Div : 3.5658e-06 sec
   OP_ID1380. Axpy : 3.6731e-06 sec
   OP_ID1381. EltwiseMult : 3.46102e-06 sec
   OP_ID1382. Axpy : 3.34306e-06 sec
   OP_ID1383. Axpy : 3.26714e-06 sec
   OP_ID1384. Div : 3.27718e-06 sec
   OP_ID1385. Axpy : 3.2891e-06 sec
   OP_ID1386. EltwiseMult : 3.21694e-06 sec
   OP_ID1387. Axpy : 2.6192e-05 sec
   OP_ID1388. Axpy : 3.14824e-06 sec
   OP_ID1389. Div : 3.29725e-06 sec
   OP_ID1390. Axpy : 3.33835e-06 sec
   OP_ID1391. EltwiseMult : 3.34808e-06 sec
   OP_ID1392. Axpy : 3.24298e-06 sec
   OP_ID1393. Axpy : 3.22071e-06 sec
   OP_ID1394. Div : 3.5749e-06 sec
   OP_ID1395. Axpy : 3.75498e-06 sec
   OP_ID1396. EltwiseMult : 3.53067e-06 sec
   OP_ID1397. Axpy : 3.36972e-06 sec
   OP_ID1398. Axpy : 3.26808e-06 sec
   OP_ID1399. Div : 3.35812e-06 sec
   OP_ID1400. Axpy : 3.30604e-06 sec
   OP_ID1401. EltwiseMult : 2.62231e-05 sec
   OP_ID1402. Axpy : 3.1291e-06 sec
   OP_ID1403. Axpy : 3.1338e-06 sec
   OP_ID1404. Div : 3.27843e-06 sec
   OP_ID1405. Axpy : 3.2709e-06 sec
   OP_ID1406. EltwiseMult : 5.13223e-06 sec
   OP_ID1407. Axpy : 2.46751e-05 sec
   OP_ID1408. Axpy : 3.13412e-06 sec
   OP_ID1409. Div : 1.93832e-05 sec
   OP_ID1410. Axpy : 3.88737e-06 sec
   OP_ID1411. EltwiseMult : 3.7578e-06 sec
   OP_ID1412. Axpy : 3.39921e-06 sec
   OP_ID1413. Axpy : 3.49459e-06 sec
   OP_ID1414. Div : 3.36439e-06 sec
   OP_ID1415. Axpy : 3.26776e-06 sec
   OP_ID1416. EltwiseMult : 3.30698e-06 sec
   OP_ID1417. Axpy : 3.09616e-06 sec
   OP_ID1418. Axpy : 3.15231e-06 sec
   OP_ID1419. Div : 3.28251e-06 sec
   OP_ID1420. Axpy : 3.32643e-06 sec
   OP_ID1421. EltwiseMult : 3.30635e-06 sec
   OP_ID1422. Axpy : 1.69939e-05 sec
   OP_ID1423. Axpy : 3.14886e-06 sec
   OP_ID1424. Div : 3.46071e-06 sec
   OP_ID1425. Axpy : 3.43968e-06 sec
   OP_ID1426. EltwiseMult : 3.56831e-06 sec
   OP_ID1427. Axpy : 3.29694e-06 sec
   OP_ID1428. Axpy : 3.28533e-06 sec
   OP_ID1429. Div : 2.49004e-05 sec
   OP_ID1430. Axpy : 1.71181e-05 sec
   OP_ID1431. EltwiseMult : 3.32267e-06 sec
   OP_ID1432. Axpy : 3.10149e-06 sec
   OP_ID1433. Axpy : 3.18274e-06 sec
   OP_ID1434. Div : 3.15012e-06 sec
   OP_ID1435. Axpy : 5.18275e-06 sec
   OP_ID1436. EltwiseMult : 3.32517e-06 sec
   OP_ID1437. Axpy : 3.23796e-06 sec
   OP_ID1438. Axpy : 2.57349e-05 sec
   OP_ID1439. Div : 4.00706e-05 sec
   OP_ID1440. Axpy : 1.73026e-05 sec
   OP_ID1441. EltwiseMult : 2.48612e-05 sec
   OP_ID1442. Axpy : 3.41584e-06 sec
   OP_ID1443. Axpy : 3.24298e-06 sec
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436278541



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       Then maybe we can pass the CUDA stream from the buffered op and save it in the node?







[GitHub] [singa] chrishkchris commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640070499


   > @chrishkchris In the communicator.cc, we split the original operation into multiple. Is this to facilitate recording the time?
   
   Yes, in order to record the time, those operations that are used for waiting should not be included. Otherwise the recorded time would include the waiting time.
   
   For example, if I do not split the operator, a buffer-filling operation would have its time profiled as the sum of:
   1. waiting for the NULL stream before starting to fill the buffer
   2. filling the buffer
   
   If there were no need to wait, the time spent on step 1 would be zero, but sometimes it still needs to wait.
   
   The aim of the time profiling design is to figure out how much actual time (excluding waiting time) each operator needs, so that in the future we may be able to do better scheduling.
   
   Therefore, all the "waiting" operations should be excluded from the time profiling of operators.
   
   Also, since there is no need to record the waiting time, I renamed all the waiting operators to "Waiting" so that we skip time profiling for them:
   https://github.com/chrishkchris/singa/blob/SINGA-510_2/src/core/scheduler/scheduler.cc#L292
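   The name-based skip described above can be sketched as follows (a hypothetical stand-in for the linked scheduler check, not the actual SINGA code):
   
   ```cpp
   #include <cassert>
   #include <string>
   
   // Sketch only: skip time profiling for operators whose name marks them
   // as pure waiting/synchronization ops ("Waiting" is the renamed op above).
   bool ShouldProfile(const std::string &op_name) {
     return op_name.find("Waiting") == std::string::npos;
   }
   
   int main() {
     assert(ShouldProfile("cudnnConvForward"));  // real work: profile it
     assert(!ShouldProfile("Waiting"));          // pure wait: skip it
     return 0;
   }
   ```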





[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436278378



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       I mean not to record the event based on the operator name.







[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436273939



##########
File path: src/io/communicator.cc
##########
@@ -179,14 +175,14 @@ void Communicator::wait() {
   device_->Exec(
       [this](Context *ctx) mutable {
         // synchronizing on all the CUDA streams used by communicator
-        CUDA_CHECK(cudaEventRecord(event, s));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c1));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c2));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->s));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c1));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c2));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
       },
-      blocks_, blocks_, "wait");
+      blocks_, blocks_, "Sync");

Review comment:
       Got it.







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436277347



##########
File path: include/singa/core/common.h
##########
@@ -100,13 +100,21 @@ typedef struct _Context {
   std::mt19937 random_generator;
 #ifdef USE_CUDA
   cublasHandle_t cublas_handle;
-  cudaStream_t stream;
-  curandGenerator_t curand_generator;
+  cudaStream_t stream; 
+  curandGenerator_t curand_generator; 
+
 #ifdef USE_CUDNN
   cudnnHandle_t cudnn_handle;
 #endif
 #endif  // USE_CUDA
 
+#ifdef USE_DIST

Review comment:
       All of the builds in Travis CI are CPU-only versions
   https://travis-ci.org/github/apache/singa/builds/693742421







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436274471



##########
File path: src/io/communicator.cc
##########
@@ -134,7 +131,6 @@ void Communicator::sparsInit() {
   CUDA_CHECK(cudaMalloc(&xInd, (int)(sizeof(int) * maxSize)));
   CUDA_CHECK(cudaMalloc(&xVal, (int)(sizeof(float) * maxSize)));
   CUSPARSE_CHECK(cusparseCreate(&cusparse_handle));
-  CUSPARSE_CHECK(cusparseSetStream(cusparse_handle, c2));

Review comment:
       Yes, now the stream is not in the communicator.
   Even in the original version, I was using different streams for different operations in cusparse.
   The API lets me select the CUDA stream for cusparse right before calling the cusparse operators, so users can use different streams for different operations.
   







[GitHub] [singa] XJDKC commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640073996


   You are right. The waiting time cannot be included in the execution time of the operation. But for some operators that use two CUDA streams, we determine which stream to record events on based on the name of the operator. I don't think it's an elegant scheme; any ideas about this?
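   The name-based dispatch being questioned could look roughly like this (a made-up sketch: the enum and the "Dist" prefix are stand-ins, since the real code records cudaEvents on the streams held in the context):
   
   ```cpp
   #include <cassert>
   #include <string>
   
   // Stand-ins for the communicator streams (ctx->s / ctx->c1 / ctx->c2)
   // versus the default compute stream; the real code uses cudaStream_t.
   enum class Stream { kCompute, kComm };
   
   // Sketch of dispatching on the operator name: communicator ops get their
   // start/end events recorded on the communication stream.
   Stream PickEventStream(const std::string &op_name) {
     return op_name.rfind("Dist", 0) == 0 ? Stream::kComm : Stream::kCompute;
   }
   
   int main() {
     assert(PickEventStream("DistOpt") == Stream::kComm);
     assert(PickEventStream("Axpy") == Stream::kCompute);
     return 0;
   }
   ```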





[GitHub] [singa] XJDKC commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640083464


   Summary:
   * fix typos
   * remove some outdated code mentioned above.
   
   Future plan:
   * maybe we can add a struct named OpDescriptor to record ever-increasing context regarding the operator
   * the time of the same type of operator can be accumulated and sorted, which is convenient for viewing the execution time of a certain type of operator. Then we can know where the bottleneck is.
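   The accumulate-and-sort idea can be sketched in plain C++ (a hypothetical helper; `RankOpTypes` and the sample numbers are not from the PR):
   
   ```cpp
   #include <algorithm>
   #include <cassert>
   #include <map>
   #include <string>
   #include <utility>
   #include <vector>
   
   // Sketch of the future plan: sum the recorded time of each operator type,
   // then sort descending so the bottleneck type comes first.
   std::vector<std::pair<std::string, double>> RankOpTypes(
       const std::vector<std::pair<std::string, double>> &records) {
     std::map<std::string, double> total;
     for (const auto &r : records) total[r.first] += r.second;  // accumulate
     std::vector<std::pair<std::string, double>> ranked(total.begin(),
                                                        total.end());
     std::sort(ranked.begin(), ranked.end(),
               [](const auto &a, const auto &b) { return a.second > b.second; });
     return ranked;
   }
   
   int main() {
     // Made-up (name, seconds) records in the style of the profiling log above.
     auto ranked = RankOpTypes(
         {{"Axpy", 3.3e-06}, {"Div", 1.9e-05}, {"Axpy", 2.5e-05}});
     assert(ranked[0].first == "Axpy");  // 2.83e-05 s in total, the largest
     assert(ranked[1].first == "Div");
     return 0;
   }
   ```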





[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436273917



##########
File path: src/io/communicator.cc
##########
@@ -179,14 +175,14 @@ void Communicator::wait() {
   device_->Exec(
       [this](Context *ctx) mutable {
         // synchronizing on all the CUDA streams used by communicator
-        CUDA_CHECK(cudaEventRecord(event, s));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c1));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c2));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->s));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c1));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c2));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
       },
-      blocks_, blocks_, "wait");
+      blocks_, blocks_, "Sync");

Review comment:
       > I would like to rename all the waiting operators to be "waiting" so I can use a single condition in the if statement:
   > https://github.com/chrishkchris/singa/blob/SINGA-510_2/src/core/scheduler/scheduler.cc#L292
   > 
   > For the typo, I need to change it back to waiting instead of wait
   
   Got it.







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436276471



##########
File path: src/io/communicator.cc
##########
@@ -134,7 +131,6 @@ void Communicator::sparsInit() {
   CUDA_CHECK(cudaMalloc(&xInd, (int)(sizeof(int) * maxSize)));
   CUDA_CHECK(cudaMalloc(&xVal, (int)(sizeof(float) * maxSize)));
   CUSPARSE_CHECK(cusparseCreate(&cusparse_handle));
-  CUSPARSE_CHECK(cusparseSetStream(cusparse_handle, c2));

Review comment:
       In the original kernel I also did it this way for many operators










[GitHub] [singa] nudles merged pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
nudles merged pull request #716:
URL: https://github.com/apache/singa/pull/716


   





[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436279395



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       Yeah, we need big changes to solve this problem (e.g. passing a struct named OpDescriptor to the device). I think we should just keep it like this for now.
   







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436278084



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       Which one do you prefer? I can use USE_CUDA in common.h instead.







[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436263014



##########
File path: include/singa/core/common.h
##########
@@ -100,13 +100,21 @@ typedef struct _Context {
   std::mt19937 random_generator;
 #ifdef USE_CUDA
   cublasHandle_t cublas_handle;
-  cudaStream_t stream;
-  curandGenerator_t curand_generator;
+  cudaStream_t stream; 
+  curandGenerator_t curand_generator; 
+
 #ifdef USE_CUDNN
   cudnnHandle_t cudnn_handle;
 #endif
 #endif  // USE_CUDA
 
+#ifdef USE_DIST

Review comment:
       I think this part should be wrapped by USE_CUDA. Otherwise we may create these streams when not using CUDA.
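   A minimal sketch of the suggested nesting (compiled here with neither macro defined, so a CPU-only build never sees the stream members; the member names follow the diff but the struct is trimmed):
   
   ```cpp
   #include <cassert>
   #include <random>
   
   // Sketch only: the USE_DIST streams live inside the USE_CUDA block, so
   // they are declared only when CUDA itself is enabled.
   typedef struct _Context {
     std::mt19937 random_generator;
   #ifdef USE_CUDA
     cudaStream_t stream;
   #ifdef USE_DIST
     cudaStream_t s, c1, c2;  // communicator streams from the diff
   #endif  // USE_DIST
   #endif  // USE_CUDA
   } Context;
   
   int main() {
     Context ctx;  // builds cleanly without CUDA: no stream members exist
     ctx.random_generator.seed(42);
     assert(sizeof(Context) > 0);
     return 0;
   }
   ```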
   

##########
File path: src/io/communicator.cc
##########
@@ -179,14 +175,14 @@ void Communicator::wait() {
   device_->Exec(
       [this](Context *ctx) mutable {
         // synchronizing on all the CUDA streams used by communicator
-        CUDA_CHECK(cudaEventRecord(event, s));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c1));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c2));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->s));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c1));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c2));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
       },
-      blocks_, blocks_, "wait");
+      blocks_, blocks_, "Sync");

Review comment:
       The operator name is a bit confusing.

##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       Do we have a better scheme? 

##########
File path: include/singa/io/communicator.h
##########
@@ -98,16 +98,16 @@ class Communicator {
   void generateBlocks(Tensor &t);
   void generateBlocks(std::vector<Tensor> &t);
   void allReduce(int size, void *sendbuff, void *recvbuff,
-                 ncclDataType_t ncclType);
+                 ncclDataType_t ncclType, Context *ctx);

Review comment:
       Remember to delete the streams created in the communicator

##########
File path: src/io/communicator.cc
##########
@@ -134,7 +131,6 @@ void Communicator::sparsInit() {
   CUDA_CHECK(cudaMalloc(&xInd, (int)(sizeof(int) * maxSize)));
   CUDA_CHECK(cudaMalloc(&xVal, (int)(sizeof(float) * maxSize)));
   CUSPARSE_CHECK(cusparseCreate(&cusparse_handle));
-  CUSPARSE_CHECK(cusparseSetStream(cusparse_handle, c2));

Review comment:
       Should this line be deleted?







[GitHub] [singa] chrishkchris commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-637350685


   Ready for Review/merge





[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436277347



##########
File path: include/singa/core/common.h
##########
@@ -100,13 +100,21 @@ typedef struct _Context {
   std::mt19937 random_generator;
 #ifdef USE_CUDA
   cublasHandle_t cublas_handle;
-  cudaStream_t stream;
-  curandGenerator_t curand_generator;
+  cudaStream_t stream; 
+  curandGenerator_t curand_generator; 
+
 #ifdef USE_CUDNN
   cudnnHandle_t cudnn_handle;
 #endif
 #endif  // USE_CUDA
 
+#ifdef USE_DIST

Review comment:
       All of the builds in Travis CI are CPU-only versions; our Travis CI doesn't have a GPU or the GPU include headers and libraries.
   https://travis-ci.org/github/apache/singa/builds/693742421







[GitHub] [singa] chrishkchris commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640075943


   > You are right. The waiting time cannot be included in the execution time of the operation. But for some operators that use two cuda streams, we determine which stream to record events based on the name of the operator. I think it's not an elegant scheme, any ideas about this?
   
   For time profiling, the ideal situation is that all the buffered communicator operators use only one CUDA stream; two streams are not good because one stream has to wait for the other. So I broke down most of the operations.
   
   The only kernel I have not broken down yet is the sparse communication kernel, which is too long, so I did not include breaking it down in this PR.
   https://github.com/chrishkchris/singa/blob/SINGA-510_2/src/io/communicator.cc#L444
   
   My original plan for this PR is to record the fp32/fp16 communication time seamlessly. If breaking the large kernel down provides better time profiling for the sparse communication, it can be included in a future PR.
   





[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436276062



##########
File path: src/io/communicator.cc
##########
@@ -134,7 +131,6 @@ void Communicator::sparsInit() {
   CUDA_CHECK(cudaMalloc(&xInd, (int)(sizeof(int) * maxSize)));
   CUDA_CHECK(cudaMalloc(&xVal, (int)(sizeof(float) * maxSize)));
   CUSPARSE_CHECK(cusparseCreate(&cusparse_handle));
-  CUSPARSE_CHECK(cusparseSetStream(cusparse_handle, c2));

Review comment:
       I see, you moved this line to the position before each call to sparse. https://github.com/apache/singa/blob/6021e705106677d6f4209e3c0f8386d01602ca1b/src/io/communicator.cc#L626
















[GitHub] [singa] XJDKC commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436268516



##########
File path: src/io/communicator.cc
##########
@@ -179,14 +175,14 @@ void Communicator::wait() {
   device_->Exec(
       [this](Context *ctx) mutable {
         // synchronizing on all the CUDA streams used by communicator
-        CUDA_CHECK(cudaEventRecord(event, s));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c1));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c2));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->s));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c1));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c2));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
       },
-      blocks_, blocks_, "wait");
+      blocks_, blocks_, "Sync");

Review comment:
       The operator name and the function name are a bit confusing







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436273681



##########
File path: src/io/communicator.cc
##########
@@ -179,14 +175,14 @@ void Communicator::wait() {
   device_->Exec(
       [this](Context *ctx) mutable {
         // synchronizing on all the CUDA streams used by communicator
-        CUDA_CHECK(cudaEventRecord(event, s));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c1));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
-        CUDA_CHECK(cudaEventRecord(event, c2));
-        CUDA_CHECK(cudaStreamWaitEvent(NULL, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->s));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c1));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
+        CUDA_CHECK(cudaEventRecord(event, ctx->c2));
+        CUDA_CHECK(cudaStreamWaitEvent(ctx->stream, event, 0));
       },
-      blocks_, blocks_, "wait");
+      blocks_, blocks_, "Sync");

Review comment:
       I would like to rename all the waiting operators to be "waiting" so I can use a single condition in the if statement:
   https://github.com/chrishkchris/singa/blob/SINGA-510_2/src/core/scheduler/scheduler.cc#L292
   
   For the typo, I need to change it back to waiting instead of wait










[GitHub] [singa] XJDKC commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
XJDKC commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640060635


   @chrishkchris In communicator.cc, we split the original operation into multiple operations. Is this to make it easier to record the time?





[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436275077



##########
File path: include/singa/io/communicator.h
##########
@@ -98,16 +98,16 @@ class Communicator {
   void generateBlocks(Tensor &t);
   void generateBlocks(std::vector<Tensor> &t);
   void allReduce(int size, void *sendbuff, void *recvbuff,
-                 ncclDataType_t ncclType);
+                 ncclDataType_t ncclType, Context *ctx);

Review comment:
       Oh, I will delete these three lines in the communicator.h header:
     cudaStream_t s;
     cudaStream_t c1;
     cudaStream_t c2;







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436274471



##########
File path: src/io/communicator.cc
##########
@@ -134,7 +131,6 @@ void Communicator::sparsInit() {
   CUDA_CHECK(cudaMalloc(&xInd, (int)(sizeof(int) * maxSize)));
   CUDA_CHECK(cudaMalloc(&xVal, (int)(sizeof(float) * maxSize)));
   CUSPARSE_CHECK(cusparseCreate(&cusparse_handle));
-  CUSPARSE_CHECK(cusparseSetStream(cusparse_handle, c2));

Review comment:
       Yes, now the stream is no longer in the communicator.
   Even in the original version, I was using different streams for different operations in cusparse.
   The API lets me select the CUDA stream for cusparse right before calling the cusparse operators, so users can use different streams for different cusparse operations.







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436277284



##########
File path: include/singa/core/common.h
##########
@@ -100,13 +100,21 @@ typedef struct _Context {
   std::mt19937 random_generator;
 #ifdef USE_CUDA
   cublasHandle_t cublas_handle;
-  cudaStream_t stream;
-  curandGenerator_t curand_generator;
+  cudaStream_t stream; 
+  curandGenerator_t curand_generator; 
+
 #ifdef USE_CUDNN
   cudnnHandle_t cudnn_handle;
 #endif
 #endif  // USE_CUDA
 
+#ifdef USE_DIST

Review comment:
       In the CPU-only version, there is no CUDA library, so it will not have those CUDA-related headers.
   In the build process, USE_DIST defaults to 0.
   So the Travis CI build of the CPU version does not include this part; you may check the CI CPU build result.
   My concern with using USE_CUDA instead is that it would add three useless CUDA streams to the single-GPU version.
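   A minimal sketch of the guard layout under discussion (not the actual common.h; the CUDA handle and stream types come from the CUDA headers when those macros are on, and the sketch compiles even when neither macro is defined):

   ```cpp
   #include <random>

   // Sketch of the _Context guards discussed above. The three communicator
   // streams sit behind USE_DIST, which defaults to OFF at build time, so
   // CPU-only and single-GPU builds never carry them.
   typedef struct _Context {
     std::mt19937 random_generator;  // always present, CPU or GPU build
   #ifdef USE_CUDA
     cublasHandle_t cublas_handle;
     cudaStream_t stream;
     curandGenerator_t curand_generator;
   #endif  // USE_CUDA
   #ifdef USE_DIST
     cudaStream_t s;   // streams used only by the distributed communicator
     cudaStream_t c1;
     cudaStream_t c2;
   #endif  // USE_DIST
   } Context;

   int main() {
     Context ctx;  // in a CPU-only build this holds just the RNG state
     ctx.random_generator.seed(42);
     return 0;
   }
   ```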







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436278007



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       If I use USE_CUDA instead of USE_DIST in common.h, this #ifdef can be deleted







[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436278541



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       Then maybe I can pass the CUDA stream from the buffered op and save it in the node,
   so I can call cudaEventRecord with node->stream.
   Oh, this cannot be done in the CPU version; I should think of another approach.










[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436277284



##########
File path: include/singa/core/common.h
##########
@@ -100,13 +100,21 @@ typedef struct _Context {
   std::mt19937 random_generator;
 #ifdef USE_CUDA
   cublasHandle_t cublas_handle;
-  cudaStream_t stream;
-  curandGenerator_t curand_generator;
+  cudaStream_t stream; 
+  curandGenerator_t curand_generator; 
+
 #ifdef USE_CUDNN
   cudnnHandle_t cudnn_handle;
 #endif
 #endif  // USE_CUDA
 
+#ifdef USE_DIST

Review comment:
       In the CPU-only version, there is no CUDA library, so it will not have those CUDA-related headers.
   In the build process, USE_DIST defaults to 0. Note that USE_DIST and USE_CUDA are determined at build time, not at runtime.
   So the Travis CI build of the CPU version does not include this part; you may check the CI CPU build result.
   My concern with using USE_CUDA instead is that it would add three useless CUDA streams to the single-GPU version.







[GitHub] [singa] chrishkchris commented on pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on pull request #716:
URL: https://github.com/apache/singa/pull/716#issuecomment-640084210


   > Summary:
   > 
   > * fix typos
   > * remove some outdated code mentioned above.
   > 
   > Future plan:
   > 
   > * maybe we can add a struct named OpDescriptor to record ever-increasing context regarding the operator
   > * the time of the same type of operator can be accumulated and sorted, which is convenient for viewing the execution time of a certain type of operator. Then we can know where the bottleneck is.
   
   thanks for the summary!





[GitHub] [singa] chrishkchris commented on a change in pull request #716: SINGA-510 Distributed Training Time Profiling

Posted by GitBox <gi...@apache.org>.
chrishkchris commented on a change in pull request #716:
URL: https://github.com/apache/singa/pull/716#discussion_r436279603



##########
File path: src/core/device/cuda_gpu.cc
##########
@@ -120,9 +132,41 @@ void CudaGPU::TimeProfilingDoExec(function<void(Context*)>&& fn, int executor,
   cudaEventCreate(&(node->start_));
   cudaEventCreate(&(node->end_));
 
+#ifdef USE_DIST

Review comment:
       > Yeah, we need big changes to solve this problem (e.g. passing a struct named OpDescriptor to the device). I think we just keep it like this for now.
   
   Yes, I guess OpDescriptor would be useful for recording a lot of context about each operator. I think it would be good to include it in the future.



