You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@systemds.apache.org by GitBox <gi...@apache.org> on 2022/07/03 19:16:16 UTC

[GitHub] [systemds] BACtaki opened a new pull request, #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

BACtaki opened a new pull request, #1650:
URL: https://github.com/apache/systemds/pull/1650

   JIRA: https://issues.apache.org/jira/browse/SYSTEMDS-3390
   
   This patch improves the performance of countDistinctApprox() row/col aggregation by replacing matrix slicing with direct
   ops on the input matrix. This has the most impact in CP execution mode given the smaller input size (max 1000x1000); some
   simple experiments demonstrate this:
   
   (numbers represent average over 3 runs)
   1. row aggregation
       (A) dense: 10000x1000 with sparsity=0.9
       1.198s with slicing, 0.874s without slicing - a 27% improvement
   
       (B) sparse: 10000x1000 with sparsity=0.1
       0.528s with slicing, 0.512s without slicing - a 3% improvement
   
   As expected, the larger and the more dense the input matrix, the larger the performance improvement.
   
   2. col aggregation
       (A) dense: 10000x1000 with sparsity=0.9
       1.186s with slicing, 1.036s without slicing - a 13% improvement
   
       (B) sparse: 10000x1000 with sparsity=0.1
       1.272s with slicing, 0.647s without slicing - a 49% improvement
   
   In this case, the sparser the input matrix, the larger the performance improvement. This phenomenon is a result of
   employing a hash map M in the implementation: as the RxC input matrix becomes denser, M's keyset size approaches C,
   and the performance approaches the baseline, which uses slicing.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] BACtaki commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

BACtaki commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1201664485

   Not sure why [**.functions.builtin.part2.**](https://github.com/apache/systemds/runs/7579235369?check_suite_focus=true#logs) test is failing- maybe it's a transient issue?
   
   ```
   script file: ./src/test/scripts/functions/builtin/multipleBuiltins.R
   Error:    Run 2: MultipleBuiltinsTest.testMultipleBuiltinsDefaultSP:54->runMultipleBuiltinsTest:76->AutomatedTestBase.runRScript:1258 ERROR: R has ended irregularly
   R Standard output :
   
   R Standard Error  :
   Error in library("DescTools") : there is no package called ‘DescTools’
   Execution halted
   
   
   script file: ./src/test/scripts/functions/builtin/multipleBuiltins.R
   Error:    Run 3: MultipleBuiltinsTest.testMultipleBuiltinsDefaultSP:54->runMultipleBuiltinsTest:76->AutomatedTestBase.runRScript:1258 ERROR: R has ended irregularly
   R Standard output :
   
   R Standard Error  :
   Error in library("DescTools") : there is no package called ‘DescTools’
   Execution halted
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on a diff in pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on code in PR #1650:
URL: https://github.com/apache/systemds/pull/1650#discussion_r932944938


##########
src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixCountDistinct.java:
##########
@@ -102,66 +114,255 @@ else if(in.getNonZeros() < minimumSize) {
 	 * 
 	 * Benefit: precise, but uses memory, on the scale of inputs number of distinct values.
 	 * 
-	 * @param in The input matrix to count number distinct values in
-	 * @return The absolute distinct count
+	 * @param blkIn The input matrix to count number distinct values in
+	 * @return A matrix block containing the absolute distinct count for the entire input or along given row/col axis
 	 */
-	private static int countDistinctValuesNaive(MatrixBlock in) {
+	private static MatrixBlock countDistinctValuesNaive(MatrixBlock blkIn, CountDistinctOperator op) {
+
+		if (blkIn.isEmpty()) {
+			return new MatrixBlock(1);
+		}
+		else if(blkIn instanceof CompressedMatrixBlock) {
+			throw new NotImplementedException("countDistinct() does not support CompressedMatrixBlock");
+		}
+
 		Set<Double> distinct = new HashSet<>();
+		MatrixBlock blkOut;
 		double[] data;
-		if(in.isEmpty())
-			return 1;
-		else if(in instanceof CompressedMatrixBlock)
-			throw new NotImplementedException();
 
-		long nonZeros = in.getNonZeros();
+		if (op.getDirection().isRowCol()) {
+			blkOut = new MatrixBlock(1, 1, false);
 
-		if(nonZeros != -1 && nonZeros < in.getNumColumns() * in.getNumRows()) {
-			distinct.add(0d);
-		}
+			long distinctCount = 0;
+			long nonZeros = blkIn.getNonZeros();
 
-		if(in.sparseBlock != null) {
-			SparseBlock sb = in.sparseBlock;
+			// Check if input matrix contains any 0 values for RowCol case.
+			// This does not apply to row/col case, where we count nnz per row or col during iteration.
+			if(nonZeros != -1 && nonZeros < blkIn.getNumColumns() * blkIn.getNumRows()) {
+				distinct.add(0d);
+			}
 
-			if(in.sparseBlock.isContiguous()) {
-				data = sb.values(0);
-				countDistinctValuesNaive(data, distinct);
+			if(blkIn.getSparseBlock() != null) {
+				SparseBlock sb = blkIn.getSparseBlock();
+				if(blkIn.getSparseBlock().isContiguous()) {
+					// COO, CSR
+					data = sb.values(0);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				} else {
+					// MCSR
+					for(int i = 0; i < blkIn.getNumRows(); i++) {
+						if(!sb.isEmpty(i)) {
+							data = blkIn.getSparseBlock().values(i);
+							distinctCount = countDistinctValuesNaive(data, distinct);
+						}
+					}
+				}
+			} else if(blkIn.getDenseBlock() != null) {
+				DenseBlock db = blkIn.getDenseBlock();
+				for (int i = 0; i <= db.numBlocks(); i++) {
+					data = db.valuesAt(i);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				}
 			}
-			else {
-				for(int i = 0; i < in.getNumRows(); i++) {
-					if(!sb.isEmpty(i)) {
-						data = in.sparseBlock.values(i);
+
+			blkOut.setValue(0, 0, distinctCount);
+		} else if (op.getDirection().isRow()) {
+			blkOut = blkIn.slice(0, blkIn.getNumRows() - 1, 0, 0);

Review Comment:
   Why deep copying (slice) the first column is required? You are anyway going to set individual values. Won't just allocating a dense column vector work? Something like below.
     `blkOut = new MatrixBlock(blkIn.getNumRows(), 1, false, blkIn.getNumRows());
      blkOut.allocateBlock();`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] BACtaki commented on a diff in pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

BACtaki commented on code in PR #1650:
URL: https://github.com/apache/systemds/pull/1650#discussion_r934874887


##########
src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixCountDistinct.java:
##########
@@ -102,66 +114,255 @@ else if(in.getNonZeros() < minimumSize) {
 	 * 
 	 * Benefit: precise, but uses memory, on the scale of inputs number of distinct values.
 	 * 
-	 * @param in The input matrix to count number distinct values in
-	 * @return The absolute distinct count
+	 * @param blkIn The input matrix to count number distinct values in
+	 * @return A matrix block containing the absolute distinct count for the entire input or along given row/col axis
 	 */
-	private static int countDistinctValuesNaive(MatrixBlock in) {
+	private static MatrixBlock countDistinctValuesNaive(MatrixBlock blkIn, CountDistinctOperator op) {
+
+		if (blkIn.isEmpty()) {
+			return new MatrixBlock(1);
+		}
+		else if(blkIn instanceof CompressedMatrixBlock) {
+			throw new NotImplementedException("countDistinct() does not support CompressedMatrixBlock");
+		}
+
 		Set<Double> distinct = new HashSet<>();
+		MatrixBlock blkOut;
 		double[] data;
-		if(in.isEmpty())
-			return 1;
-		else if(in instanceof CompressedMatrixBlock)
-			throw new NotImplementedException();
 
-		long nonZeros = in.getNonZeros();
+		if (op.getDirection().isRowCol()) {
+			blkOut = new MatrixBlock(1, 1, false);
 
-		if(nonZeros != -1 && nonZeros < in.getNumColumns() * in.getNumRows()) {
-			distinct.add(0d);
-		}
+			long distinctCount = 0;
+			long nonZeros = blkIn.getNonZeros();
 
-		if(in.sparseBlock != null) {
-			SparseBlock sb = in.sparseBlock;
+			// Check if input matrix contains any 0 values for RowCol case.
+			// This does not apply to row/col case, where we count nnz per row or col during iteration.
+			if(nonZeros != -1 && nonZeros < blkIn.getNumColumns() * blkIn.getNumRows()) {
+				distinct.add(0d);
+			}
 
-			if(in.sparseBlock.isContiguous()) {
-				data = sb.values(0);
-				countDistinctValuesNaive(data, distinct);
+			if(blkIn.getSparseBlock() != null) {
+				SparseBlock sb = blkIn.getSparseBlock();
+				if(blkIn.getSparseBlock().isContiguous()) {
+					// COO, CSR
+					data = sb.values(0);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				} else {
+					// MCSR
+					for(int i = 0; i < blkIn.getNumRows(); i++) {
+						if(!sb.isEmpty(i)) {
+							data = blkIn.getSparseBlock().values(i);
+							distinctCount = countDistinctValuesNaive(data, distinct);
+						}
+					}
+				}
+			} else if(blkIn.getDenseBlock() != null) {
+				DenseBlock db = blkIn.getDenseBlock();
+				for (int i = 0; i <= db.numBlocks(); i++) {
+					data = db.valuesAt(i);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				}
 			}
-			else {
-				for(int i = 0; i < in.getNumRows(); i++) {
-					if(!sb.isEmpty(i)) {
-						data = in.sparseBlock.values(i);
+
+			blkOut.setValue(0, 0, distinctCount);
+		} else if (op.getDirection().isRow()) {
+			blkOut = blkIn.slice(0, blkIn.getNumRows() - 1, 0, 0);

Review Comment:
   >
   			// Result is a Mx1 matrix block.
   			// TODO We are reusing memory allocated for input matrix block here - explore whether it would be better to
   			//  allocate new
   			MatrixBlock resultMatrix = in.slice(0, in.getNumRows() - 1, 0, 0);
   
   I was trying to save on new memory allocations by reusing previously allocated, but you are right: it is probably unnecessary, not to mention that it is likely more expensive performance-wise to compute slice than to allocate new- fixed in next iteration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on a diff in pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on code in PR #1650:
URL: https://github.com/apache/systemds/pull/1650#discussion_r932961300


##########
src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixCountDistinct.java:
##########
@@ -102,66 +114,255 @@ else if(in.getNonZeros() < minimumSize) {
 	 * 
 	 * Benefit: precise, but uses memory, on the scale of inputs number of distinct values.
 	 * 
-	 * @param in The input matrix to count number distinct values in
-	 * @return The absolute distinct count
+	 * @param blkIn The input matrix to count number distinct values in
+	 * @return A matrix block containing the absolute distinct count for the entire input or along given row/col axis
 	 */
-	private static int countDistinctValuesNaive(MatrixBlock in) {
+	private static MatrixBlock countDistinctValuesNaive(MatrixBlock blkIn, CountDistinctOperator op) {
+
+		if (blkIn.isEmpty()) {
+			return new MatrixBlock(1);
+		}
+		else if(blkIn instanceof CompressedMatrixBlock) {
+			throw new NotImplementedException("countDistinct() does not support CompressedMatrixBlock");
+		}
+
 		Set<Double> distinct = new HashSet<>();
+		MatrixBlock blkOut;
 		double[] data;
-		if(in.isEmpty())
-			return 1;
-		else if(in instanceof CompressedMatrixBlock)
-			throw new NotImplementedException();
 
-		long nonZeros = in.getNonZeros();
+		if (op.getDirection().isRowCol()) {
+			blkOut = new MatrixBlock(1, 1, false);
 
-		if(nonZeros != -1 && nonZeros < in.getNumColumns() * in.getNumRows()) {
-			distinct.add(0d);
-		}
+			long distinctCount = 0;
+			long nonZeros = blkIn.getNonZeros();
 
-		if(in.sparseBlock != null) {
-			SparseBlock sb = in.sparseBlock;
+			// Check if input matrix contains any 0 values for RowCol case.
+			// This does not apply to row/col case, where we count nnz per row or col during iteration.
+			if(nonZeros != -1 && nonZeros < blkIn.getNumColumns() * blkIn.getNumRows()) {
+				distinct.add(0d);
+			}
 
-			if(in.sparseBlock.isContiguous()) {
-				data = sb.values(0);
-				countDistinctValuesNaive(data, distinct);
+			if(blkIn.getSparseBlock() != null) {
+				SparseBlock sb = blkIn.getSparseBlock();
+				if(blkIn.getSparseBlock().isContiguous()) {
+					// COO, CSR
+					data = sb.values(0);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				} else {
+					// MCSR
+					for(int i = 0; i < blkIn.getNumRows(); i++) {
+						if(!sb.isEmpty(i)) {
+							data = blkIn.getSparseBlock().values(i);
+							distinctCount = countDistinctValuesNaive(data, distinct);
+						}
+					}
+				}
+			} else if(blkIn.getDenseBlock() != null) {
+				DenseBlock db = blkIn.getDenseBlock();
+				for (int i = 0; i <= db.numBlocks(); i++) {
+					data = db.valuesAt(i);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				}
 			}
-			else {
-				for(int i = 0; i < in.getNumRows(); i++) {
-					if(!sb.isEmpty(i)) {
-						data = in.sparseBlock.values(i);
+
+			blkOut.setValue(0, 0, distinctCount);
+		} else if (op.getDirection().isRow()) {
+			blkOut = blkIn.slice(0, blkIn.getNumRows() - 1, 0, 0);
+
+			if (blkIn.getDenseBlock() != null) {
+				// The naive approach would be to iterate through every (i, j) in the input. However, can do better
+				// by exploiting the physical layout of dense blocks - contiguous blocks in row-major order - in memory.
+				DenseBlock db = blkIn.getDenseBlock();
+				for (int bix=0; bix<db.numBlocks(); ++bix) {
+					data = db.valuesAt(bix);
+					for (int rix=bix * db.blockSize(); rix<blkIn.getNumRows(); rix++) {
+						distinct.clear();
+						for (int cix=0; cix<blkIn.getNumColumns(); ++cix) {
+							distinct.add(data[db.pos(rix, cix)]);
+						}
+						blkOut.setValue(rix, 0, distinct.size());
+					}
+				}
+			} else if (blkIn.getSparseBlock() != null) {
+				// Each sparse block type - COO, CSR, MCSR - has a different data representation, which we will exploit
+				// separately.
+				SparseBlock sb = blkIn.getSparseBlock();
+				if (SparseBlockFactory.isSparseBlockType(sb, SparseBlock.Type.MCSR)) {
+					// Currently, SparseBlockIterator only provides an interface for cell-wise iteration.
+					// TODO Explore row-wise and column-wise methods for SparseBlockIterator
+
+					// MCSR enables O(1) access to column values per row
+					for (int rix=0; rix<blkIn.getNumRows(); ++rix) {
+						if (sb.isEmpty(rix)) {
+							continue;
+						}
+						distinct.clear();
+						data = sb.values(rix);
 						countDistinctValuesNaive(data, distinct);
+						blkOut.setValue(rix, 0, distinct.size());
+					}
+				} else if (SparseBlockFactory.isSparseBlockType(sb, SparseBlock.Type.CSR)) {
+					// Casting is safe given if-condition
+					SparseBlockCSR csrBlock = (SparseBlockCSR) sb;
+
+					// Data lies in one contiguous block in CSR format. We will iterate in row-major using O(1) op
+					// size(row) to determine the number of columns per row.
+					data = csrBlock.values();
+					// We want to iterate through all rows to keep track of the row index for constructing the output
+					for (int rix=0; rix<blkIn.getNumRows(); ++rix) {
+						if (csrBlock.isEmpty(rix)) {
+							continue;
+						}
+						distinct.clear();
+						int rpos = csrBlock.pos(rix);
+						int clen = csrBlock.size(rix);
+						for (int colOffset=0; colOffset<clen; ++colOffset) {
+							distinct.add(data[rpos + colOffset]);
+						}
+						blkOut.setValue(rix, 0, distinct.size());
+					}
+				} else { // COO
+					if (!(sb instanceof SparseBlockCOO)) {
+						throw new IllegalArgumentException("Input matrix is of unrecognized type: "
+								+ sb.getClass().getSimpleName());
+					}
+					SparseBlockCOO cooBlock = (SparseBlockCOO) sb;
+
+					// For COO, we want to avoid using pos(row) and size(row) as they use binary search, which is a
+					// O(log N) op. Also, isEmpty(row) uses pos(row) internally.
+					int[] rixs = cooBlock.rowIndexes();
+					data = cooBlock.values();
+					int i = 0;  // data iterator
+					int rix = 0;  // row index
+					while (rix < cooBlock.numRows() && i < rixs.length) {
+						distinct.clear();
+						while (i + 1 < rixs.length && rixs[i] == rixs[i + 1]) {
+							distinct.add(data[i]);
+							i++;
+						}
+						if (i + 1 < rixs.length) {  // rixs[i] != rixs[i + 1]
+							distinct.add(data[i]);
+						}
+						blkOut.setValue(rix, 0, distinct.size());
+						rix = (i + 1 < rixs.length)? rixs[i + 1] : rix;
+						i++;
 					}
 				}
 			}
-		}
-		else if(in.denseBlock != null) {
-			DenseBlock db = in.denseBlock;
-			for(int i = 0; i <= db.numBlocks(); i++) {
-				data = db.valuesAt(i);
-				countDistinctValuesNaive(data, distinct);
+		} else {  // Col aggregation
+			blkOut = blkIn.slice(0, 0, 0, blkIn.getNumColumns() - 1);

Review Comment:
   Similar to above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1197059845

   Thanks for the PR @BACtaki. I had an initial look at the code changes today. Before I get into detailed comments, I'd like to clarify a few things.
   - Are the optimizations only improve the naive cases where the input dimension is less than 1024 for the given direction (row/col)? 
   - I see that you are now iterating the dense and sparse inputs in a more cache-conscious manner (reducing CPU cache misses). Are there any other optimizations you are employing (e.g. reducing the number of intermediates)?
   
   Sorry for the delay. I will add my comments and suggestions tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] BACtaki commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

BACtaki commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1201640792

Thanks for your review @phaniarnab! Comments inline:

> Are the optimizations only improve the naive cases where the input dimension is less than 1024 for the given direction (row/col)?

The cache-conscious iteration optimization only applies to the naive counting case where `nnz < minimumSize
= 1024`. Aside from this, I also refactored `MatrixSketch` and `KMVSketch` to replace the `getScalarValue() -> Integer` method with the `getValue() -> MatrixBlock` method to unify the handling of RowCol, Row, and Col cases (as per @Baunsgaard 's suggestion from an earlier PR).

> I see that you are now iterating the dense and sparse inputs in a more cache-conscious manner (reducing CPU cache misses). Are there any other optimizations you are employing (e.g. reducing the number of intermediates)?

Really good point re intermediates. We are actually creating an intermediate `SmallestPriorityQueue` of max size `k = 64` per row/col. The way we are constructing the intermediates places an upper bound on additional memory:
1. RowCol: max 64 double values (64 bits each)
2. Row: `SmallestPriorityQueue spq` is reset between each iteration, so the intermediate is at most 64 double values
3. Col: identical to Row, but the iteration occurs in column-major order and `spq` is reset between iterations of consecutive columns

There are no further optimizations beyond cache-conscious iteration in CP case.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1204058823

   Thanks again @BACtaki. While merging I fixed a few minor warnings (redundant imports, missing cast, etc.).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1202410005

   Thanks, @BACtaki. The changes now look good to me. I will merge the PR in some time.
   Don't worry about the test failure. That is due to some docker + R issues. @Shafaq-Siddiqi is working on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] BACtaki commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

BACtaki commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1204225801

   > Thanks again @BACtaki. While merging I fixed a few minor warnings (redundant imports, missing cast, etc.).
   
   Thanks for your review @phaniarnab ! I will definitely be more careful with warnings going forward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] Baunsgaard commented on pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

Baunsgaard commented on PR #1650:
URL: https://github.com/apache/systemds/pull/1650#issuecomment-1194745175

   Sorry, missed this PR, will take a look tomorrow!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab closed pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab closed pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops
URL: https://github.com/apache/systemds/pull/1650


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [systemds] phaniarnab commented on a diff in pull request #1650: [SYSTEMDS-3390] Replace slicing for countDistinctApprox() row/col aggregation with direct ops

Posted by GitBox <gi...@apache.org>.

phaniarnab commented on code in PR #1650:
URL: https://github.com/apache/systemds/pull/1650#discussion_r932944938


##########
src/main/java/org/apache/sysds/runtime/matrix/data/LibMatrixCountDistinct.java:
##########
@@ -102,66 +114,255 @@ else if(in.getNonZeros() < minimumSize) {
 	 * 
 	 * Benefit: precise, but uses memory, on the scale of inputs number of distinct values.
 	 * 
-	 * @param in The input matrix to count number distinct values in
-	 * @return The absolute distinct count
+	 * @param blkIn The input matrix to count number distinct values in
+	 * @return A matrix block containing the absolute distinct count for the entire input or along given row/col axis
 	 */
-	private static int countDistinctValuesNaive(MatrixBlock in) {
+	private static MatrixBlock countDistinctValuesNaive(MatrixBlock blkIn, CountDistinctOperator op) {
+
+		if (blkIn.isEmpty()) {
+			return new MatrixBlock(1);
+		}
+		else if(blkIn instanceof CompressedMatrixBlock) {
+			throw new NotImplementedException("countDistinct() does not support CompressedMatrixBlock");
+		}
+
 		Set<Double> distinct = new HashSet<>();
+		MatrixBlock blkOut;
 		double[] data;
-		if(in.isEmpty())
-			return 1;
-		else if(in instanceof CompressedMatrixBlock)
-			throw new NotImplementedException();
 
-		long nonZeros = in.getNonZeros();
+		if (op.getDirection().isRowCol()) {
+			blkOut = new MatrixBlock(1, 1, false);
 
-		if(nonZeros != -1 && nonZeros < in.getNumColumns() * in.getNumRows()) {
-			distinct.add(0d);
-		}
+			long distinctCount = 0;
+			long nonZeros = blkIn.getNonZeros();
 
-		if(in.sparseBlock != null) {
-			SparseBlock sb = in.sparseBlock;
+			// Check if input matrix contains any 0 values for RowCol case.
+			// This does not apply to row/col case, where we count nnz per row or col during iteration.
+			if(nonZeros != -1 && nonZeros < blkIn.getNumColumns() * blkIn.getNumRows()) {
+				distinct.add(0d);
+			}
 
-			if(in.sparseBlock.isContiguous()) {
-				data = sb.values(0);
-				countDistinctValuesNaive(data, distinct);
+			if(blkIn.getSparseBlock() != null) {
+				SparseBlock sb = blkIn.getSparseBlock();
+				if(blkIn.getSparseBlock().isContiguous()) {
+					// COO, CSR
+					data = sb.values(0);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				} else {
+					// MCSR
+					for(int i = 0; i < blkIn.getNumRows(); i++) {
+						if(!sb.isEmpty(i)) {
+							data = blkIn.getSparseBlock().values(i);
+							distinctCount = countDistinctValuesNaive(data, distinct);
+						}
+					}
+				}
+			} else if(blkIn.getDenseBlock() != null) {
+				DenseBlock db = blkIn.getDenseBlock();
+				for (int i = 0; i <= db.numBlocks(); i++) {
+					data = db.valuesAt(i);
+					distinctCount = countDistinctValuesNaive(data, distinct);
+				}
 			}
-			else {
-				for(int i = 0; i < in.getNumRows(); i++) {
-					if(!sb.isEmpty(i)) {
-						data = in.sparseBlock.values(i);
+
+			blkOut.setValue(0, 0, distinctCount);
+		} else if (op.getDirection().isRow()) {
+			blkOut = blkIn.slice(0, blkIn.getNumRows() - 1, 0, 0);

Review Comment:
   Why deep copying (slice) the first column is required? You are anyway going to set individual values. Won't just allocating a dense column vector work? Something like below.
   `blkOut = new MatrixBlock(blkIn.getNumRows(), 1, false, blkIn.getNumRows());
    blkOut.allocateBlock();`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@systemds.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org