Posted to issues@lucene.apache.org by "Ankur (Jira)" <ji...@apache.org> on 2021/03/16 02:11:00 UTC

[jira] [Comment Edited] (LUCENE-9838) simd version of VectorUtil.dotProduct

    [ https://issues.apache.org/jira/browse/LUCENE-9838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302157#comment-17302157 ] 

Ankur edited comment on LUCENE-9838 at 3/16/21, 2:10 AM:
---------------------------------------------------------

This is cool - [~rcmuir]. 

I played with this a little on my MacBook Pro (2019, *Memory*: 32 GB 2667 MHz DDR4; *Processor*: 2.6 GHz 6-Core Intel Core i7) after downloading [OpenJDK build 16+36-2231|https://download.java.net/java/GA/jdk16/7863447f0ab643c585b9bdebf67c69db/36/GPL/openjdk-16_osx-x64_bin.tar.gz] and setting up a standalone [JMH benchmark|https://github.com/openjdk/jmh] project.

I copied the old dotProduct implementation and the new one from your patch into _MyBenchmark.java_ in the JMH project space. Here are the results I got:
{code:java}
Benchmark                  (size)   Mode  Cnt    Score   Error   Units
MyBenchmark.dotProductOld      16  thrpt    5   90.896 ± 5.302  ops/us
MyBenchmark.dotProductNew      16  thrpt    5  100.901 ± 5.105  ops/us

MyBenchmark.dotProductOld      32  thrpt    5   53.563 ± 2.378  ops/us
MyBenchmark.dotProductNew      32  thrpt    5   97.610 ± 5.393  ops/us

MyBenchmark.dotProductOld      64  thrpt    5   29.792 ± 1.246  ops/us
MyBenchmark.dotProductNew      64  thrpt    5   73.499 ± 3.640  ops/us

MyBenchmark.dotProductOld     128  thrpt    5   16.906 ± 0.751  ops/us
MyBenchmark.dotProductNew     128  thrpt    5   65.068 ± 3.986  ops/us

MyBenchmark.dotProductOld     256  thrpt    5    8.360 ± 0.125  ops/us
MyBenchmark.dotProductNew     256  thrpt    5   42.595 ± 2.958  ops/us

MyBenchmark.dotProductOld     512  thrpt    5    4.231 ± 0.158  ops/us
MyBenchmark.dotProductNew     512  thrpt    5   26.283 ± 0.640  ops/us

MyBenchmark.dotProductOld    1024  thrpt    5    2.104 ± 0.093  ops/us
MyBenchmark.dotProductNew    1024  thrpt    5   14.389 ± 0.720  ops/us

{code}
 

These benchmarks were run after adding annotations to disable TieredCompilation and the vector bounds check. For small vectors (*16 elements*) the improvement is about *10%*, but for large vectors (*128 or more* elements) it is *_roughly 4X or higher_* (about 3.8X at 128 elements, rising to about 6.8X at 1024).
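
For anyone who wants to check the ratios, here is a minimal sketch that divides the new throughput by the old one for each vector size. The scores are copied from the JMH output above; the class and method names ({{SpeedupTable}}, {{speedup}}) are made up for illustration and are not part of the patch or of JMH. It ignores the reported error margins, so treat the ratios as approximate.

```java
// Hypothetical helper, not part of the Lucene patch: derives speedup ratios
// from the JMH throughput scores reported in this comment.
class SpeedupTable {
    // Each row is {vector size, old score (ops/us), new score (ops/us)},
    // copied from the benchmark output above.
    static final double[][] SCORES = {
        {16, 90.896, 100.901},
        {32, 53.563, 97.610},
        {64, 29.792, 73.499},
        {128, 16.906, 65.068},
        {256, 8.360, 42.595},
        {512, 4.231, 26.283},
        {1024, 2.104, 14.389},
    };

    // Throughput is "higher is better", so the speedup is new / old.
    static double speedup(double oldScore, double newScore) {
        return newScore / oldScore;
    }

    public static void main(String[] args) {
        for (double[] row : SCORES) {
            System.out.printf("size=%4d  speedup=%.2fx%n",
                    (int) row[0], speedup(row[1], row[2]));
        }
    }
}
```

The ratio grows steadily with vector size, which is consistent with fixed per-call overhead mattering less as the vectorized loop does more of the work, though that explanation is my assumption and not something the benchmark itself demonstrates.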


> simd version of VectorUtil.dotProduct
> -------------------------------------
>
>                 Key: LUCENE-9838
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9838
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9838.patch
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Followup to LUCENE-9837
> Let's explore using JDK 16 vector API to speed this up more. It might be a hassle to try to MR-JAR/package up for users (adding commandline flags and stuff), but it gives good performance.



