Posted to issues@lucene.apache.org by "rmuir (via GitHub)" <gi...@apache.org> on 2023/10/28 03:50:01 UTC

[PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

rmuir opened a new pull request, #12731:
URL: https://github.com/apache/lucene/pull/12731

   The Intel FMA is nice, and it's easier to reason about when looking at the assembly. We basically reduce the error for free where it's available. Along with another change (reducing the unrolling for cosine, since it already has 3 FMA ops), we can speed up cosine from ~6 to ~8 ops/us.
   
   On ARM, FMA leads to a slight slowdown, so we don't use it there. It's not much, just something like 10%, but it seems like the wrong tradeoff.
   
   If you run the code with `-XX:-UseFMA` there's no slowdown, but no speedup either. And obviously, there are no changes for ARM here.
   
   ```
   Skylake AVX-256
   
   Main:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.624 ± 0.041  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   5.988 ± 0.111  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.959 ± 0.032  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.058 ± 0.920  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.422 ± 0.018  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5   9.837 ± 0.154  ops/us
   
   Patch:
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.638 ± 0.006  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   8.164 ± 0.084  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.027  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.486 ± 0.163  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.445 ± 0.014  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  11.682 ± 0.129  ops/us
   
   Patch (with -jvmArgsAppend '-XX:-UseFMA'):
   Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.641 ± 0.005  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   6.102 ± 0.053  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   1.997 ± 0.007  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  12.177 ± 0.170  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   1.450 ± 0.027  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  10.464 ± 0.154  ops/us
   ```
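
   For illustration, here is a minimal, hedged sketch of what an FMA-based cosine kernel looks like with the Panama Vector API. This is not the actual PanamaVectorUtilSupport code (class and variable names are made up, and the unrolling and tail handling are simplified), but the three `fma()` calls in the loop body are the "3 fma ops" mentioned above; `fma()` maps to a fused multiply-add where the hardware does it fast, which is where the error reduction comes from. It needs `--add-modules jdk.incubator.vector`:
   
   ```java
   import jdk.incubator.vector.FloatVector;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;
   
   final class CosineSketch {
     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
   
     static float cosine(float[] a, float[] b) {
       FloatVector dot = FloatVector.zero(SPECIES);
       FloatVector normA = FloatVector.zero(SPECIES);
       FloatVector normB = FloatVector.zero(SPECIES);
       int i = 0;
       final int bound = SPECIES.loopBound(a.length);
       for (; i < bound; i += SPECIES.length()) {
         FloatVector va = FloatVector.fromArray(SPECIES, a, i);
         FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
         dot = va.fma(vb, dot);     // dot   += a*b  (single rounding step where fused)
         normA = va.fma(va, normA); // normA += a*a
         normB = vb.fma(vb, normB); // normB += b*b
       }
       float sum = dot.reduceLanes(VectorOperators.ADD);
       float nA = normA.reduceLanes(VectorOperators.ADD);
       float nB = normB.reduceLanes(VectorOperators.ADD);
       for (; i < a.length; i++) { // scalar tail for the remaining elements
         sum += a[i] * b[i];
         nA += a[i] * a[i];
         nB += b[i] * b[i];
       }
       return (float) (sum / Math.sqrt((double) nA * (double) nB));
     }
   }
   ```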
   




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375252807


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   OK, that's fine, thanks for confirming that it works. The `requires static` allows access to the module (if available). So it looks like the code works fine.
   
   I think the only change I'd suggest is to add the FMA enablement to the logging message, as stated above.
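
   The diff above cuts off right at the reflective lookup, so for readers following along, here is a hedged, self-contained sketch of the pattern (the same one RamUsageEstimator uses): resolve the HotSpot diagnostic bean via `Class.forName`, ask it for the `UseFMA` VM option, and fall back to `false` if anything is missing. Names like `VMOptionSketch`/`readUseFMA` are illustrative, not the exact Lucene code:
   
   ```java
   import java.lang.reflect.Method;
   
   final class VMOptionSketch {
     private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
     private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
   
     /** Returns the value of the HotSpot UseFMA option, or false if it cannot be read. */
     static boolean readUseFMA() {
       try {
         // resolve the diagnostic bean reflectively, so there is no hard
         // compile-time dependency on the (optional) management modules
         final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
         final Object hotSpotBean =
             Class.forName(MANAGEMENT_FACTORY_CLASS)
                 .getMethod("getPlatformMXBean", Class.class)
                 .invoke(null, beanClazz);
         if (hotSpotBean == null) {
           return false;
         }
         // HotSpotDiagnosticMXBean#getVMOption("UseFMA") -> VMOption, then read its value
         final Object vmOption =
             beanClazz.getMethod("getVMOption", String.class).invoke(hotSpotBean, "UseFMA");
         final Method getValue = vmOption.getClass().getMethod("getValue");
         return Boolean.parseBoolean(getValue.invoke(vmOption).toString());
       } catch (ReflectiveOperationException | RuntimeException e) {
         return false; // module not readable, or the option does not exist on this JVM
       }
     }
   }
   ```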





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375324223


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   I pushed the logging change.





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375248973


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   Surely we can leave comments about the module system to another issue. It was somehow OK for RamUsageEstimator to do this, but not OK for the vectors code?
   
   Honestly, I haven't a clue about the module system (nor do I care), and I have no idea how it works or what `requires static` means or any of that. To me, it looks like more overengineered Java garbage (sorry). So I'm ill-equipped to be updating comments inside RamUsageEstimator. I just want to try to improve the vectorization here.





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1785549757

   > I think the Panama API should allow the user to figure out how many parallel units are available to somehow dynamically split work correctly.
   
   I'm not even sure OpenJDK/HotSpot knows this or even attempts to approximate it? It never deals with `-ffast-math`-style optimizations that would make use of it, due to its floating-point restrictions, right?
   
   But knowing the CPU info/model would help. Then at least folks could do it themselves.




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1785453474

   > Last time I tried to figure out WTF was happening here, I think I determined that floating-point reproducibility was still preventing this from happening? That there isn't a "bail out" from this on the Vector API, just some clever wording in the javadocs of `reduceLanes`.
   > 
   > Which is really sad: how is the Vector API supposed to be usable if everyone has to unroll their own loops in order to use 100% of the hardware instead of 25%?
   
   The float use case is problematic because the order of multiplications/sums changes the result. So you can't easily rewrite the code to run in parallel, as the result would be different. This is also the reason why the auto-vectorizer can't do anything.
   
   I think the Panama API should allow the user to figure out how many parallel units are available to somehow dynamically split work correctly.
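
   A tiny illustrative example of that ordering sensitivity (float addition is not associative, so reassociating a reduction legally changes the answer):
   
   ```java
   public class FloatOrderDemo {
     public static void main(String[] args) {
       float a = 1e8f, b = -1e8f, c = 1f;
       System.out.println((a + b) + c); // prints 1.0
       System.out.println(a + (b + c)); // prints 0.0 (the 1f is lost below the ulp of 1e8f)
     }
   }
   ```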




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375244023


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   It works with the module system at least: I tested it. If we want to move this code around I am fine with that, as long as I have a `static final` constant.





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1785178856

   Last time I tried to figure out WTF was happening here, I think I determined that floating-point reproducibility was still preventing this from happening? That there isn't a "bail out" from this on the Vector API, just some clever wording in the javadocs of `reduceLanes`.
   
   Which is really sad: how is the Vector API supposed to be usable if everyone has to unroll their own loops in order to use 100% of the hardware instead of 25%?
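
   For concreteness, a hedged sketch (illustrative names only, not Lucene's actual kernels) of the kind of manual unrolling meant here: independent accumulators break the loop-carried dependency on a single sum, so the CPU's multiple FMA units can actually be kept busy. Note that combining the accumulators at the end changes the summation order, which is exactly the reproducibility issue discussed above:
   
   ```java
   import jdk.incubator.vector.FloatVector;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;
   
   final class UnrolledDotSketch {
     private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
   
     static float dotProduct(float[] a, float[] b) {
       FloatVector acc0 = FloatVector.zero(SPECIES);
       FloatVector acc1 = FloatVector.zero(SPECIES);
       FloatVector acc2 = FloatVector.zero(SPECIES);
       FloatVector acc3 = FloatVector.zero(SPECIES);
       final int step = SPECIES.length();
       int i = 0;
       final int bound = a.length - (a.length % (4 * step));
       for (; i < bound; i += 4 * step) {
         // four independent chains: no accumulator waits on another
         acc0 = FloatVector.fromArray(SPECIES, a, i).fma(FloatVector.fromArray(SPECIES, b, i), acc0);
         acc1 = FloatVector.fromArray(SPECIES, a, i + step).fma(FloatVector.fromArray(SPECIES, b, i + step), acc1);
         acc2 = FloatVector.fromArray(SPECIES, a, i + 2 * step).fma(FloatVector.fromArray(SPECIES, b, i + 2 * step), acc2);
         acc3 = FloatVector.fromArray(SPECIES, a, i + 3 * step).fma(FloatVector.fromArray(SPECIES, b, i + 3 * step), acc3);
       }
       float dot = acc0.add(acc1).add(acc2.add(acc3)).reduceLanes(VectorOperators.ADD);
       for (; i < a.length; i++) {
         dot += a[i] * b[i]; // scalar tail
       }
       return dot;
     }
   }
   ```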
   




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1785145823

   > Ha! So just removing the overly aggressive unrolling in cosine improves things.
   
   Well, only in combination with the switch to FMA. It seems then it's able to keep the CPU busy multiplying.
   




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1785163931

   > .. and yes (I've not forgotten), we need something like a `java.lang.Architecture/Platform`, that is queryable for such low-level support (rather than resorting to beans - which actually works kinda ok, but is not ideal)
   
   And the compiler should be fixed to unroll basic loops to take advantage of the fact that you can do 4 of these things in parallel on modern CPUs.
   
   Or, failing that, if I'm going to have to unroll loops myself, then at least give me some basic info (e.g. the CPU model) so I can do it properly.
   
   Currently it is the worst of both worlds.




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "ChrisHegarty (via GitHub)" <gi...@apache.org>.
ChrisHegarty commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1783869625

   .. and yes (I've not forgotten), we need something like a `java.lang.Architecture/Platform`, that is queryable for such low-level support (rather than resorting to beans - which actually works kinda ok, but is not ideal)




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "asfgit (via GitHub)" <gi...@apache.org>.
asfgit merged PR #12731:
URL: https://github.com/apache/lucene/pull/12731




Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375243087


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   Haha, I know this code from the RamUsageEstimator code. The comment should possibly be updated in both places to mention the module system and that the module is optional there.
   
   This module is declared optional in our module-info: https://github.com/apache/lucene/blob/f5776c88449ff16f7347ccbe6e26e5bddd8c94f7/lucene/core/src/java/module-info.java#L26
   
   So basically this code is fine; we do not want to hardcode the module (as it is not part of the JDK platform standard). Maybe we should also add "FMA enabled" to the logger message. That should be easy by making the flag package-private and referring to it from the initialization code where the log message is printed: https://github.com/apache/lucene/blob/f5776c88449ff16f7347ccbe6e26e5bddd8c94f7/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L60-L67
   
   Let's add the same with `PanamaVectorUtilSupport.HAS_FAST_FMA ? "; FMA enabled" : ""`
   
   We should maybe move this code to some common class in the utils package (like `Constants#getVMOption(String name)`). We can create a separate PR for that.
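
   For illustration, the logging tweak could look roughly like the following. This is a hedged sketch only: the real message lives in PanamaVectorizationProvider and its exact wording differs, and `HAS_FAST_FMA` is the package-private flag proposed above; only the appended ternary comes from this comment:
   
   ```java
   import java.util.logging.Logger;
   
   final class LoggingSketch {
     // stand-in for the package-private PanamaVectorUtilSupport.HAS_FAST_FMA flag
     static final boolean HAS_FAST_FMA = true;
   
     static void logVectorizationStatus() {
       Logger log = Logger.getLogger(LoggingSketch.class.getName());
       // append "; FMA enabled" only when the fast-FMA path was detected
       log.info("Java vector incubator API enabled" + (HAS_FAST_FMA ? "; FMA enabled" : ""));
     }
   }
   ```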





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "ChrisHegarty (via GitHub)" <gi...@apache.org>.
ChrisHegarty commented on code in PR #12731:
URL: https://github.com/apache/lucene/pull/12731#discussion_r1375278303


##########
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##########
@@ -77,6 +77,47 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
         VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
+  private static final String MANAGEMENT_FACTORY_CLASS = "java.lang.management.ManagementFactory";
+  private static final String HOTSPOT_BEAN_CLASS = "com.sun.management.HotSpotDiagnosticMXBean";
+
+  // best effort to see if FMA is fast (this is architecture-independent option)
+  private static boolean hasFastFMA() {
+    // on ARM cpus, FMA works fine but is a slight slowdown: don't use it.
+    if (Constants.OS_ARCH.equals("amd64") == false) {
+      return false;
+    }
+    try {
+      final Class<?> beanClazz = Class.forName(HOTSPOT_BEAN_CLASS);
+      // we use reflection for this, because the management factory is not part
+      // of Java 8's compact profile:
+      final Object hotSpotBean =

Review Comment:
   > Let's add the same with PanamaVectorUtilSupport.HAS_FAST_FMA ? "; FMA enabled" : ""
   
   ++





Re: [PR] Speedup float cosine vectors, use FMA where fast and available to reduce error [lucene]

Posted by "ChrisHegarty (via GitHub)" <gi...@apache.org>.
ChrisHegarty commented on PR #12731:
URL: https://github.com/apache/lucene/pull/12731#issuecomment-1783869078

   Ha! So just removing the overly aggressive unrolling in cosine improves things. The check on FMA is nice - I had similar thoughts (you just beat me to it!), and it inlines nicely. I also agree we don't want to use FMA on ARM; it performs 10-15% worse on my M2.
   
   Sanity results from my Rocket Lake:
   
   main:
   ```
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.845 ± 0.001  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5   8.885 ± 0.005  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.406 ± 0.018  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  26.168 ± 0.009  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2.549 ± 0.005  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  19.283 ± 0.001  ops/us
   ```
   
   Robert's branch:
   ```
   VectorUtilBenchmark.floatCosineScalar        1024  thrpt    5   0.845 ± 0.003  ops/us
   VectorUtilBenchmark.floatCosineVector        1024  thrpt    5  14.636 ± 0.016  ops/us
   VectorUtilBenchmark.floatDotProductScalar    1024  thrpt    5   3.400 ± 0.083  ops/us
   VectorUtilBenchmark.floatDotProductVector    1024  thrpt    5  27.265 ± 0.065  ops/us
   VectorUtilBenchmark.floatSquareScalar        1024  thrpt    5   2.548 ± 0.012  ops/us
   VectorUtilBenchmark.floatSquareVector        1024  thrpt    5  25.529 ± 0.207  ops/us
   ```

