Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2022/08/03 14:07:21 UTC

[GitHub] [systemds] phaniarnab opened a new pull request, #1676: [SYSTEMDS-3390] Improve performance of countDistinctApprox()

phaniarnab opened a new pull request, #1676:
URL: https://github.com/apache/systemds/pull/1676

   This patch improves the performance of countDistinctApprox() row/col
   aggregation by replacing matrix slicing with direct operations on the
   input matrix. The impact is largest in local CP execution mode, as
   some simple experiments show:
   
   (numbers are averages over 3 runs)
   1. row aggregation
       (A) dense: 10000x1000 with sparsity=0.9
       1.198s with slicing, 0.874s without slicing - a 27% improvement
   
       (B) sparse: 10000x1000 with sparsity=0.1
       0.528s with slicing, 0.512s without slicing - a 3% improvement
   
   As expected, the larger and denser the input matrix,
   the larger the performance improvement.
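
The percentages quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming each percentage is the relative reduction in runtime, (with slicing - without slicing) / with slicing. The class and method names are hypothetical and are not part of SystemDS.

```java
final class Improvement {
    // Relative runtime reduction in percent: (before - after) / before * 100.
    // Assumption: this is how the percentages in the description were derived.
    static double percent(double beforeSec, double afterSec) {
        return (beforeSec - afterSec) / beforeSec * 100.0;
    }

    public static void main(String[] args) {
        // Timings in seconds, copied from the row-aggregation runs above.
        System.out.printf("dense:  %.0f%%%n", percent(1.198, 0.874)); // prints "dense:  27%"
        System.out.printf("sparse: %.0f%%%n", percent(0.528, 0.512)); // prints "sparse: 3%"
    }
}
```

Under that assumption, the dense case rounds to 27% and the sparse case to 3%, matching the reported figures.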
   
   2. col aggregation
       (A) dense: 1000x10000 with sparsity=0.9
       1.186s with slicing, 1.036s without slicing - a 13% improvement
   
       (B) sparse: 1000x10000 with sparsity=0.1
       1.272s with slicing, 0.647s without slicing - a 49% improvement
   
   In this case, the sparser the input matrix, the larger the performance
   improvement. This is because the implementation uses a hash map M:
   as the RxC input matrix becomes denser, the size of M's key set
   approaches C, and the runtime approaches that of the baseline,
   which uses slicing.
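
The difference between the two column-aggregation strategies can be illustrated roughly as follows. This is a sketch under stated assumptions, not the SystemDS implementation: it counts distinct values exactly with a HashSet, whereas countDistinctApprox() is approximate, it uses a plain double[][] in place of the SystemDS matrix block, and all class and method names are hypothetical. The baseline copies each column out before counting; the direct variant makes one pass over the nonzero cells and groups values by column index in a hash map, so the map only holds keys for columns that contain at least one nonzero.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ColDistinctSketch {
    // Baseline: copy each column out of the matrix (a "slice"), then count its distinct values.
    static int[] withSlicing(double[][] m, int cols) {
        int[] out = new int[cols];
        for (int c = 0; c < cols; c++) {
            double[] slice = new double[m.length];
            for (int r = 0; r < m.length; r++) slice[r] = m[r][c];
            Set<Double> seen = new HashSet<>();
            for (double v : slice) seen.add(v);
            out[c] = seen.size();
        }
        return out;
    }

    // Direct variant: one pass over the nonzero cells, grouped by column index in a hash map.
    static int[] withoutSlicing(double[][] m, int cols) {
        Map<Integer, Set<Double>> byCol = new HashMap<>();
        int[] nnz = new int[cols]; // nonzero cells seen per column
        for (double[] row : m) {
            for (int c = 0; c < cols; c++) {
                if (row[c] == 0.0) continue; // a sparse format would not store this cell at all
                byCol.computeIfAbsent(c, k -> new HashSet<>()).add(row[c]);
                nnz[c]++;
            }
        }
        int[] out = new int[cols];
        for (int c = 0; c < cols; c++) {
            Set<Double> seen = byCol.get(c);
            int distinctNonZero = (seen == null) ? 0 : seen.size();
            // Zero is one more distinct value whenever the column has at least one zero cell.
            out[c] = distinctNonZero + (nnz[c] < m.length ? 1 : 0);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = { {1, 0, 2}, {1, 0, 3}, {4, 0, 2} };
        System.out.println(Arrays.toString(withSlicing(m, 3)));
        System.out.println(Arrays.toString(withoutSlicing(m, 3)));
    }
}
```

In this sketch an all-zero column never gets a map entry, which is one way to read the observation above: the sparser the input, the fewer keys the map holds and the less work the direct variant does relative to slicing.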


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [systemds] phaniarnab commented on pull request #1676: [SYSTEMDS-3390] Improve performance of countDistinctApprox()

Posted by GitBox <gi...@apache.org>.
phaniarnab commented on PR #1676:
URL: https://github.com/apache/systemds/pull/1676#issuecomment-1204000051

   To run tests after rebasing PR #1650 




[GitHub] [systemds] phaniarnab closed pull request #1676: [SYSTEMDS-3390] Improve performance of countDistinctApprox()

Posted by GitBox <gi...@apache.org>.
phaniarnab closed pull request #1676: [SYSTEMDS-3390] Improve performance of countDistinctApprox()
URL: https://github.com/apache/systemds/pull/1676

