Posted to user@commons.apache.org by Chris Lucas <cl...@inf.ed.ac.uk> on 2016/04/05 16:43:04 UTC

[math] matrix multiplication surprisingly slow?

I recently ran a benchmark comparing the performance of Commons Math 3.6.1's
linear algebra library to that of Scala Breeze
(https://github.com/scalanlp/breeze).

I looked at det, inverse, Cholesky factorization, addition, and
multiplication, including matrices of dimension 10, 100, 500, and 1000,
with symmetric, non-symmetric, and non-square cases where applicable.

In general, I was pleasantly surprised: Commons Math performed about as
well as Breeze, despite the latter relying on native libraries. There was
one exception, however:

    m0.multiply(m1)

where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
poorly in Commons Math, being much slower both than nominally more expensive
operations like inv and than the Breeze implementation. Does anyone have a
thought as to what's going on? In case it's useful, one representative test
involves multiplying two instances of

    new Array2DRowRealMatrix(matVals)

where matVals is 1000x1000 entries of Math.random() and the second instance
is created as part of the loop. This creation step is not specific to the
expensive multiplication case, and takes very little time relative to the
multiplication itself. I'm using System.nanoTime() for the timing, and take
the average time over several consecutive iterations, on a 3.5 GHz
Intel Core i7, Oracle JRE (build 1.8.0_05-b13).

Thanks,

Chris

Re: [math] matrix multiplication surprisingly slow?

Posted by Gilles <gi...@harfang.homelinux.org>.
On Wed, 6 Apr 2016 12:55:06 +0100, Chris Lucas wrote:
> Thanks for the quick reply!
>
> I've pasted a small self-contained example at the bottom. It creates the
> matrices in advance, but nothing meaningful changes if they're created
> on a per-operation basis.
>
> Results for 50 multiplications of [size]x[size] matrices:
>    Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
>    Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
>    Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
>    Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
>    Size: 600, total time: 17.532 seconds, time per: 0.351 seconds
>
> For comparison:
>
> Results for 50 additions of [size]x[size] matrices (which should be
> faster, but the extent of the difference is nonetheless striking to me):
>    Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
>    Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
>    Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
>    Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
>    Size: 600, total time: 0.050 seconds, time per: 0.001 seconds
>
> Results for 50 inversions of a [size]x[size] matrix, which I'd expect
> to be slower than multiplication for larger matrices:
>    Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
>    Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
>    Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
>    Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
>    Size: 600, total time: 9.297 seconds, time per: 0.186 seconds
>
> I hope this is useful, and if I'm doing something wrong that's leading
> to this performance gap, I'd love to know.

Here is the result of a JMH run (only benchmarking matrix multiplication):
---CUT---
[...]

Benchmark                           Mode  Cnt        Score      Error  Units
MyBenchmarkMatrix.size100x100_a2d  thrpt  200      738.787 ±    8.501  ops/s
MyBenchmarkMatrix.size100x100_b    thrpt  200     1527.522 ±   16.158  ops/s
MyBenchmarkMatrix.size10x10_a2d    thrpt  200   697662.468 ± 4807.916  ops/s
MyBenchmarkMatrix.size10x10_b      thrpt  200  1003800.610 ± 6612.299  ops/s
MyBenchmarkMatrix.size600x600_a2d  thrpt  200        0.669 ±    0.007  ops/s
MyBenchmarkMatrix.size600x600_b    thrpt  200        7.794 ±    0.071  ops/s
---CUT---

where the suffix indicates the layout type:
   a2d -> Array2DRowRealMatrix
     b -> BlockRealMatrix
[So, even for small matrices, "BlockRealMatrix" is a clear win.]
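
[For reference, a minimal sketch of how such a layout comparison can be set
up with JMH. The class, method names, and seed below are illustrative only;
the actual MyBenchmarkMatrix source is not shown in this thread:]

----
import java.util.Random;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class LayoutComparisonBenchmark {

    private static final int SIZE = 600;   // one of the size buckets above

    private RealMatrix a2d0, a2d1;         // row-major layout
    private RealMatrix block0, block1;     // cache-friendly block layout

    @Setup
    public void setup() {
        double[][] v0 = randomValues(SIZE);
        double[][] v1 = randomValues(SIZE);
        a2d0 = new Array2DRowRealMatrix(v0);
        a2d1 = new Array2DRowRealMatrix(v1);
        block0 = new BlockRealMatrix(v0);
        block1 = new BlockRealMatrix(v1);
    }

    // Corresponds to the "_a2d" rows of the table above.
    @Benchmark
    public RealMatrix multiplyA2d() {
        // Returning the result prevents JMH from treating it as dead code.
        return a2d0.multiply(a2d1);
    }

    // Corresponds to the "_b" rows of the table above.
    @Benchmark
    public RealMatrix multiplyBlock() {
        return block0.multiply(block1);
    }

    private static double[][] randomValues(int n) {
        Random rng = new Random(20160406); // fixed seed for repeatability
        double[][] v = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                v[i][j] = rng.nextDouble();
            }
        }
        return v;
    }
}
----

[Run through the usual JMH harness; the scores above are throughput, so
higher is better.]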

As for the multiplication getting slower for larger sizes, I'd assume that
it's related to the O(n^2) nature of the algorithm (where "n" is the number
of elements of the matrix).
For the two layouts provided in CM and for the other library, you could
perhaps confirm the exact dependency by fitting the parameters "a", "b", "c"
of the function
   t(n) = a + b n + c n^2
using e.g.
http://commons.apache.org/proper/commons-math/javadocs/api-3.6.1/org/apache/commons/math3/fitting/PolynomialCurveFitter.html
(and more data points).
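
[A minimal sketch of such a fit. The class name is made up, and the data
points are the per-multiplication Array2DRowRealMatrix timings quoted above
(total time divided by 50 samples):]

----
import org.apache.commons.math3.fitting.PolynomialCurveFitter;
import org.apache.commons.math3.fitting.WeightedObservedPoints;

public class ScalingFit {
    public static void main(String[] args) {
        // (size, seconds per multiplication) pairs from the timings above.
        double[][] timings = {
            { 10, 0.012 / 50 }, { 100, 0.062 / 50 }, { 300, 3.050 / 50 },
            { 500, 15.186 / 50 }, { 600, 17.532 / 50 }
        };

        WeightedObservedPoints obs = new WeightedObservedPoints();
        for (double[] point : timings) {
            obs.add(point[0], point[1]);
        }

        // Fit t(n) = a + b*n + c*n^2; the coefficients are returned in
        // increasing degree order: { a, b, c }.
        double[] coeff = PolynomialCurveFitter.create(2).fit(obs.toList());
        System.out.printf("a = %g, b = %g, c = %g%n",
                          coeff[0], coeff[1], coeff[2]);
    }
}
----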


Regards,
Gilles

>
> [...]



Re: [math] matrix multiplication surprisingly slow?

Posted by Gilles <gi...@harfang.homelinux.org>.
On Thu, 7 Apr 2016 16:17:52 +0100, Chris Lucas wrote:
> Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
> comparing the two, along with some other libraries.

Email is not really suited for tables (see your message below).
What are the points you want to make?

As said before, "reports" on CM features (benchmarks, among other things)
are welcome but should be formatted for inclusion in the userguide (for
now, until we might create a separate document for data that is subject
to change more or less rapidly).

If you suspect a problem with the code, then I suggest that you file a
JIRA report, where files can be attached (or tables can be viewed without
being deformed, thanks to the "{noformat}" tag).

Regards,
Gilles

> [...]




Re: [math] matrix multiplication surprisingly slow?

Posted by Chris Lucas <cl...@inf.ed.ac.uk>.
Thanks for suggesting the BlockRealMatrix. I've run a few benchmarks
comparing the two, along with some other libraries.

I'm using JMH 1.11.3, JDK 1.8.0_05, VM 25.5-b02, with 2 warmup and 10 test
iterations.

The suffixes denote what's being tested:
  MC: AMC 3.6.1 using Array2DRowRealMatrix
  SB: Scala Breeze 0.12 with native libraries
  OJ: OJAlgo 39.0. Some other benchmarks suggest OJAlgo is an especially
speedy pure-Java library.
  BMC: MC using a BlockRealMatrix.

I'm using the same matrix creation/multiplication/inverse code as I
mentioned in my previous note. When testing BlockRealMatrix, I'm using
fixed random matrices and annotating my class with @State(Scope.Benchmark).
For the curious, my rationale for building matrices on the spot is that I'm
already caching results pretty heavily and rarely perform the same operation
on the same matrix twice -- I'm most curious about performance in the
absence of caching benefits.
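
[To make the two measurement styles concrete, a rough sketch of the shape of
the cached-matrix setup; names and sizes are illustrative, not the actual
MatrixBenchmarks source:]

----
import org.apache.commons.math3.linear.BlockRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class CachedMatrixBenchmark {

    private static final int SIZE = 500;

    // Fixed random matrices, created once and reused by every invocation.
    private RealMatrix m0, m1;

    @Setup
    public void setup() {
        m0 = randomBlockMatrix(SIZE);
        m1 = randomBlockMatrix(SIZE);
    }

    // "multCache" style: measures the multiplication alone.
    @Benchmark
    public RealMatrix multCache() {
        return m0.multiply(m1);
    }

    // "buildMult" style: matrix creation is part of the measured work.
    @Benchmark
    public RealMatrix buildMult() {
        return randomBlockMatrix(SIZE).multiply(randomBlockMatrix(SIZE));
    }

    private static BlockRealMatrix randomBlockMatrix(int n) {
        double[][] v = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                v[i][j] = Math.random();
            }
        }
        return new BlockRealMatrix(v);
    }
}
----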

Test 1a: Creating and multiplying two random/uniform (i.e., all entries
drawn from Math.random()) 100x100 matrices:
[info] MatrixBenchmarks.buildMultTestMC100  thrpt   10   836.579 ±   7.120  ops/s
[info] MatrixBenchmarks.buildMultTestSB100  thrpt   10  1649.089 ± 170.649  ops/s
[info] MatrixBenchmarks.buildMultTestOJ100  thrpt   10  1485.014 ±  44.158  ops/s
[info] MatrixBenchmarks.multBMC100          thrpt   10  1051.055 ±   2.290  ops/s

Test 1b: Creating and multiplying two random/uniform 500x500 matrices:
[info] MatrixBenchmarks.buildMultTestMC500  thrpt   10   8.997 ±  0.064  ops/s
[info] MatrixBenchmarks.buildMultTestSB500  thrpt   10  80.558 ±  0.627  ops/s
[info] MatrixBenchmarks.buildMultTestOJ500  thrpt   10  34.626 ±  2.505  ops/s
[info] MatrixBenchmarks.multBMC500          thrpt   10   9.132 ±  0.059  ops/s
[info] MatrixBenchmarks.multCacheBMC500     thrpt   10  13.630 ±  0.107  ops/s  [no matrix creation]

---

Test 2a: Creating a single 500x500 matrix (to get a sense of whether the
mult itself is driving differences):
[info] MatrixBenchmarks.buildMC500  thrpt   10  155.026 ±  2.456  ops/s
[info] MatrixBenchmarks.buildSB500  thrpt   10  197.257 ±  4.619  ops/s
[info] MatrixBenchmarks.buildOJ500  thrpt   10  176.036 ±  2.121  ops/s

Test 2b: Creating a single 1000x1000 matrix (to show it scales w/N):
[info] MatrixBenchmarks.buildMC1000   thrpt   10  36.112 ±  2.775  ops/s
[info] MatrixBenchmarks.buildSB1000   thrpt   10  45.557 ±  2.893  ops/s
[info] MatrixBenchmarks.buildOJ1000   thrpt   10  47.438 ±  1.370  ops/s
[info] MatrixBenchmarks.buildBMC1000  thrpt   10  37.779 ±  0.871  ops/s

---

Test 3a: Inverting a random 100x100 matrix:
[info] MatrixBenchmarks.invMC100   thrpt   10  1033.011 ±  9.928  ops/s
[info] MatrixBenchmarks.invSB100   thrpt   10  1664.833 ± 15.170  ops/s
[info] MatrixBenchmarks.invOJ100   thrpt   10  1174.044 ± 52.083  ops/s
[info] MatrixBenchmarks.invBMC100  thrpt   10  1043.858 ± 22.144  ops/s

Test 3b: Inverting a random 500x500 matrix:
[info] MatrixBenchmarks.invMC500        thrpt   10   9.089 ± 0.284  ops/s
[info] MatrixBenchmarks.invSB500        thrpt   10  43.857 ± 1.083  ops/s
[info] MatrixBenchmarks.invOJ500        thrpt   10  23.444 ± 1.484  ops/s
[info] MatrixBenchmarks.invBMC500       thrpt   10   9.156 ± 0.052  ops/s
[info] MatrixBenchmarks.invCacheBMC500  thrpt   10   9.627 ± 0.084  ops/s  [no matrix creation]

Test 3c: Inverting a random 1000x1000 matrix:
[info] MatrixBenchmarks.invMC1000   thrpt   10  0.704 ± 0.040  ops/s
[info] MatrixBenchmarks.invSB1000   thrpt   10  8.611 ± 0.557  ops/s
[info] MatrixBenchmarks.invOJ1000   thrpt   10  3.539 ± 0.229  ops/s
[info] MatrixBenchmarks.invBMC1000  thrpt   10  0.691 ± 0.095  ops/s

Also, isn't matrix multiplication at least O(N^2.37), rather than O(N^2)?
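
[For reference, two size conventions are in play here: with N the matrix
dimension and n = N^2 the number of elements, the schoolbook O(N^3)
arithmetic that CM's multiply performs can be written as

  \[
    T_{\text{mult}}(N) = \Theta(N^3) = \Theta(n^{3/2}), \qquad n = N^2,
  \]

whereas bounds like Strassen's O(N^{2.81}) and the roughly O(N^{2.37})
results are stated in terms of the dimension N.]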

On 6 April 2016 at 12:55, Chris Lucas <cl...@inf.ed.ac.uk> wrote:

> [...]

Re: [math] matrix multiplication surprisingly slow?

Posted by Chris Lucas <cl...@inf.ed.ac.uk>.
Thanks for the quick reply!

I've pasted a small self-contained example at the bottom. It creates the
matrices in advance, but nothing meaningful changes if they're created on a
per-operation basis.

Results for 50 multiplications of [size]x[size] matrices:
   Size: 10, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.062 seconds, time per: 0.001 seconds
   Size: 300, total time: 3.050 seconds, time per: 0.061 seconds
   Size: 500, total time: 15.186 seconds, time per: 0.304 seconds
   Size: 600, total time: 17.532 seconds, time per: 0.351 seconds

For comparison:

Results for 50 additions of [size]x[size] matrices (which should be faster,
but the extent of the difference is nonetheless striking to me):
   Size: 10, total time: 0.011 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.012 seconds, time per: 0.000 seconds
   Size: 300, total time: 0.020 seconds, time per: 0.000 seconds
   Size: 500, total time: 0.032 seconds, time per: 0.001 seconds
   Size: 600, total time: 0.050 seconds, time per: 0.001 seconds

Results for 50 inversions of a [size]x[size] matrix, which I'd expect to be
slower than multiplication for larger matrices:
   Size: 10, total time: 0.014 seconds, time per: 0.000 seconds
   Size: 100, total time: 0.074 seconds, time per: 0.001 seconds
   Size: 300, total time: 1.005 seconds, time per: 0.020 seconds
   Size: 500, total time: 5.024 seconds, time per: 0.100 seconds
   Size: 600, total time: 9.297 seconds, time per: 0.186 seconds

I hope this is useful, and if I'm doing something wrong that's leading to
this performance gap, I'd love to know.

----
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.LUDecomposition;
import org.apache.commons.math3.linear.RealMatrix;

public class AMCMatrices {

  public static void main(String[] args) {
    miniTest(0);
  }

  // tType selects the operation: 0 = multiply, 1 = add, 2 = invert via LU.
  public static void miniTest(int tType) {
    int samples = 50;
    int[] sizes = { 10, 100, 300, 500, 600 };

    for (int sI = 0; sI < sizes.length; sI++) {
      int mSize = sizes[sI];

      RealMatrix m0 = buildM(mSize, mSize);
      RealMatrix m1 = buildM(mSize, mSize);

      long start = System.nanoTime();
      for (int n = 0; n < samples; n++) {
        // Results are discarded, so in principle the JIT could treat the
        // operations as dead code; a harness like JMH avoids that risk.
        switch (tType) {
        case 0:
          m0.multiply(m1);
          break;
        case 1:
          m0.add(m1);
          break;
        case 2:
          new LUDecomposition(m0).getSolver().getInverse();
          break;
        }
      }
      long end = System.nanoTime();

      double dt = (end - start) / 1e9; // elapsed wall-clock time in seconds
      System.out.println(String.format(
          "Size: %d, total time: %3.3f seconds, time per: %3.3f seconds",
          mSize, dt, dt / samples));
    }
  }

  // Builds an r x c matrix with entries drawn uniformly from [0, 1).
  public static Array2DRowRealMatrix buildM(int r, int c) {
    double[][] matVals = new double[r][c];
    for (int i = 0; i < r; i++) {
      for (int j = 0; j < c; j++) {
        matVals[i][j] = Math.random();
      }
    }
    return new Array2DRowRealMatrix(matVals);
  }
}

----

On 5 April 2016 at 19:36, Gilles <gi...@harfang.homelinux.org> wrote:

> [...]

Re: [math] matrix multiplication surprisingly slow?

Posted by Gilles <gi...@harfang.homelinux.org>.
Hi.

On Tue, 5 Apr 2016 15:43:04 +0100, Chris Lucas wrote:
> I recently ran a benchmark comparing the performance of Commons Math
> 3.6.1's linear algebra library to that of Scala Breeze
> (https://github.com/scalanlp/breeze).
>
> I looked at det, inverse, Cholesky factorization, addition, and
> multiplication, including matrices of dimension 10, 100, 500, and 1000,
> with symmetric, non-symmetric, and non-square cases where applicable.

It would be interesting to add this to the CM documentation:
   https://issues.apache.org/jira/browse/MATH-1194

> In general, I was pleasantly surprised: Commons Math performed about as
> well as Breeze, despite the latter relying on native libraries. There
> was one exception, however:
>
>     m0.multiply(m1)
>
> where m0 and m1 are both Array2DRowRealMatrix instances. It scaled very
> poorly in Commons Math, being much slower both than nominally more
> expensive operations like inv and than the Breeze implementation. Does
> anyone have a thought as to what's going on?

Could you provide more information, such as a plot of performance vs size?
Self-contained, minimal code to run would be nice too.
See the CM micro-benchmarking tool here:
   https://github.com/apache/commons-math/blob/master/src/test/java/org/apache/commons/math4/PerfTestUtils.java
And an example of how to use it:
   https://github.com/apache/commons-math/blob/master/src/userguide/java/org/apache/commons/math4/userguide/FastMathTestPerformance.java

> In case it's useful, one representative test
> involves multiplying two instances of
>
>     new Array2DRowRealMatrix(matVals)
>
> where matVals is 1000x1000 entries of Math.random() and the second
> instance is created as part of the loop. This creation step is not
> specific to the expensive multiplication case, and takes very little
> time relative to the multiplication itself. I'm using System.nanoTime()
> for the timing, and take the average time over several consecutive
> iterations, on a 3.5 GHz Intel Core i7, Oracle JRE (build 1.8.0_05-b13).

You might want to confirm the behaviour using JMH (becoming the Java
standard for benchmarking):
   http://openjdk.java.net/projects/code-tools/jmh/


Best regards,
Gilles

>
> Thanks,
>
> Chris

