You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/25 08:21:33 UTC

[GitHub] [arrow] jianxind opened a new pull request #7267: ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution

jianxind opened a new pull request #7267:
URL: https://github.com/apache/arrow/pull/7267


   1. Expand SumKernel benchmark to more types(Float, Double, Int8, Int16, Int32, Int64).
   2. Unlooped the aggregate kernel dense part to speculative add the result in parrel.
   
   Signed-off-by: Frank Du <fr...@intel.com>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm commented on pull request #7267: ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

wesm commented on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633600133


   This shows a small benefit on ARM architecture (ThunderX1):
   
   before:
   
   ```
   -----------------------------------------------------------------------------
   Benchmark                   Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------
   SumKernel/32768/0        19.5 us         19.5 us        35883 bytes_per_second=1.56853G/s null_percent=0 size=32.768k
   SumKernel/32768/1        19.2 us         19.2 us        36172 bytes_per_second=1.58548G/s null_percent=1 size=32.768k
   SumKernel/32768/10       21.7 us         21.7 us        32230 bytes_per_second=1.40723G/s null_percent=10 size=32.768k
   SumKernel/32768/50       22.4 us         22.4 us        31172 bytes_per_second=1.36246G/s null_percent=50 size=32.768k
   ```
   
   after (see the Int64 benchmarks):
   
   ```
   -----------------------------------------------------------------------------------
   Benchmark                         Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------------
   SumKernelFloat/32768/0         25.4 us         25.4 us        27590 bytes_per_second=1.20019G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         41.2 us         41.2 us        16973 bytes_per_second=758.715M/s null_percent=1 size=32.768k
   SumKernelFloat/32768/10        49.0 us         49.0 us        14285 bytes_per_second=638.254M/s null_percent=10 size=32.768k
   SumKernelFloat/32768/50        51.5 us         51.5 us        13625 bytes_per_second=606.657M/s null_percent=50 size=32.768k
   SumKernelDouble/32768/0        18.5 us         18.5 us        37953 bytes_per_second=1.65038G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        26.2 us         26.2 us        26852 bytes_per_second=1.16631G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/10       27.3 us         27.3 us        25685 bytes_per_second=1.11778G/s null_percent=10 size=32.768k
   SumKernelDouble/32768/50       27.7 us         27.7 us        25298 bytes_per_second=1.10219G/s null_percent=50 size=32.768k
   SumKernelInt8/32768/0          64.4 us         64.4 us        10857 bytes_per_second=485.21M/s null_percent=0 size=32.768k
   SumKernelInt8/32768/1          69.8 us         69.8 us        10025 bytes_per_second=448.008M/s null_percent=1 size=32.768k
   SumKernelInt8/32768/10         92.7 us         92.7 us         7554 bytes_per_second=337.084M/s null_percent=10 size=32.768k
   SumKernelInt8/32768/50          100 us          100 us         6974 bytes_per_second=311.424M/s null_percent=50 size=32.768k
   SumKernelInt16/32768/0         41.0 us         41.0 us        17076 bytes_per_second=762.469M/s null_percent=0 size=32.768k
   SumKernelInt16/32768/1         42.5 us         42.5 us        16513 bytes_per_second=735.553M/s null_percent=1 size=32.768k
   SumKernelInt16/32768/10        53.7 us         53.7 us        12977 bytes_per_second=581.673M/s null_percent=10 size=32.768k
   SumKernelInt16/32768/50        57.7 us         57.7 us        12103 bytes_per_second=541.264M/s null_percent=50 size=32.768k
   SumKernelInt32/32768/0         26.0 us         26.0 us        26859 bytes_per_second=1.17154G/s null_percent=0 size=32.768k
   SumKernelInt32/32768/1         26.5 us         26.5 us        26555 bytes_per_second=1.15362G/s null_percent=1 size=32.768k
   SumKernelInt32/32768/10        32.3 us         32.3 us        21653 bytes_per_second=967.29M/s null_percent=10 size=32.768k
   SumKernelInt32/32768/50        34.2 us         34.2 us        20513 bytes_per_second=914.616M/s null_percent=50 size=32.768k
   SumKernelInt64/32768/0         18.0 us         18.0 us        39059 bytes_per_second=1.69949G/s null_percent=0 size=32.768k
   SumKernelInt64/32768/1         19.3 us         19.3 us        36220 bytes_per_second=1.58067G/s null_percent=1 size=32.768k
   SumKernelInt64/32768/10        21.8 us         21.8 us        32171 bytes_per_second=1.40112G/s null_percent=10 size=32.768k
   SumKernelInt64/32768/50        22.5 us         22.5 us        31062 bytes_per_second=1.35471G/s null_percent=50 size=32.768k
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm closed pull request #7267: ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

wesm closed pull request #7267:
URL: https://github.com/apache/arrow/pull/7267


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #7267: ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633450931


   https://issues.apache.org/jira/browse/ARROW-8772


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jianxind edited a comment on pull request #7267: ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

jianxind edited a comment on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633449893


   We find the SumKernel benchmark dense(null percent 0) results is relatively low compared to sparse part for float and double type. Below is the result before unrolled the loop.
   
   - SumKernelFloat/32768/0         9.12 us         9.11 us        76809 bytes_per_second=3.35021G/s null_percent=0
   - SumKernelFloat/32768/1         4.38 us         4.38 us       159861 bytes_per_second=6.97134G/s null_percent=1
   - SumKernelDouble/32768/0        4.99 us         4.99 us       140190 bytes_per_second=6.11937G/s null_percent=0
   - SumKernelDouble/32768/1        2.14 us         2.14 us       327897 bytes_per_second=14.246G/s null_percent=1
   
   With the unroll change,  the dense sumkernel benchmark get 3.7x for float and 2.6x for double.
   
   - SumKernelFloat/32768/0         2.46 us         2.46 us       285185 bytes_per_second=12.4269G/s null_percent=0
   - SumKernelDouble/32768/0        1.90 us         1.90 us       370921 bytes_per_second=16.0846G/s null_percent=0
   
   Anyway, it can get more higher performance if using intrinsic, I'd like to work at later point.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jianxind edited a comment on pull request #7267: ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

jianxind edited a comment on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633449893


   We find the SumKernel benchmark dense(null percent 0) results is relatively low compared to sparse part for float and double type. Below is the result before unrolled the loop.
   
   - SumKernelFloat/32768/0         9.12 us         9.11 us        76809 bytes_per_second=3.35021G/s null_percent=0
   - SumKernelFloat/32768/1         4.38 us         4.38 us       159861 bytes_per_second=6.97134G/s null_percent=1
   - SumKernelDouble/32768/0        4.99 us         4.99 us       140190 bytes_per_second=6.11937G/s null_percent=0
   - SumKernelDouble/32768/1        2.14 us         2.14 us       327897 bytes_per_second=14.246G/s null_percent=1
   
   With the unroll change,  the dense sumkernel benchmark get 3.7x improvement for float and 2.6x speed for double.
   
   - SumKernelFloat/32768/0         2.46 us         2.46 us       285185 bytes_per_second=12.4269G/s null_percent=0
   - SumKernelDouble/32768/0        1.90 us         1.90 us       370921 bytes_per_second=16.0846G/s null_percent=0
   
   Anyway, it can get more higher performance if using intrinsic, I'd like to work at later point.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm edited a comment on pull request #7267: ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

wesm edited a comment on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633600133


   This shows a small benefit on ARM64 architecture (ThunderX1):
   
   before:
   
   ```
   -----------------------------------------------------------------------------
   Benchmark                   Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------
   SumKernel/32768/0        19.5 us         19.5 us        35883 bytes_per_second=1.56853G/s null_percent=0 size=32.768k
   SumKernel/32768/1        19.2 us         19.2 us        36172 bytes_per_second=1.58548G/s null_percent=1 size=32.768k
   SumKernel/32768/10       21.7 us         21.7 us        32230 bytes_per_second=1.40723G/s null_percent=10 size=32.768k
   SumKernel/32768/50       22.4 us         22.4 us        31172 bytes_per_second=1.36246G/s null_percent=50 size=32.768k
   ```
   
   after (see the Int64 benchmarks):
   
   ```
   -----------------------------------------------------------------------------------
   Benchmark                         Time             CPU   Iterations UserCounters...
   -----------------------------------------------------------------------------------
   SumKernelFloat/32768/0         25.4 us         25.4 us        27590 bytes_per_second=1.20019G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         41.2 us         41.2 us        16973 bytes_per_second=758.715M/s null_percent=1 size=32.768k
   SumKernelFloat/32768/10        49.0 us         49.0 us        14285 bytes_per_second=638.254M/s null_percent=10 size=32.768k
   SumKernelFloat/32768/50        51.5 us         51.5 us        13625 bytes_per_second=606.657M/s null_percent=50 size=32.768k
   SumKernelDouble/32768/0        18.5 us         18.5 us        37953 bytes_per_second=1.65038G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        26.2 us         26.2 us        26852 bytes_per_second=1.16631G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/10       27.3 us         27.3 us        25685 bytes_per_second=1.11778G/s null_percent=10 size=32.768k
   SumKernelDouble/32768/50       27.7 us         27.7 us        25298 bytes_per_second=1.10219G/s null_percent=50 size=32.768k
   SumKernelInt8/32768/0          64.4 us         64.4 us        10857 bytes_per_second=485.21M/s null_percent=0 size=32.768k
   SumKernelInt8/32768/1          69.8 us         69.8 us        10025 bytes_per_second=448.008M/s null_percent=1 size=32.768k
   SumKernelInt8/32768/10         92.7 us         92.7 us         7554 bytes_per_second=337.084M/s null_percent=10 size=32.768k
   SumKernelInt8/32768/50          100 us          100 us         6974 bytes_per_second=311.424M/s null_percent=50 size=32.768k
   SumKernelInt16/32768/0         41.0 us         41.0 us        17076 bytes_per_second=762.469M/s null_percent=0 size=32.768k
   SumKernelInt16/32768/1         42.5 us         42.5 us        16513 bytes_per_second=735.553M/s null_percent=1 size=32.768k
   SumKernelInt16/32768/10        53.7 us         53.7 us        12977 bytes_per_second=581.673M/s null_percent=10 size=32.768k
   SumKernelInt16/32768/50        57.7 us         57.7 us        12103 bytes_per_second=541.264M/s null_percent=50 size=32.768k
   SumKernelInt32/32768/0         26.0 us         26.0 us        26859 bytes_per_second=1.17154G/s null_percent=0 size=32.768k
   SumKernelInt32/32768/1         26.5 us         26.5 us        26555 bytes_per_second=1.15362G/s null_percent=1 size=32.768k
   SumKernelInt32/32768/10        32.3 us         32.3 us        21653 bytes_per_second=967.29M/s null_percent=10 size=32.768k
   SumKernelInt32/32768/50        34.2 us         34.2 us        20513 bytes_per_second=914.616M/s null_percent=50 size=32.768k
   SumKernelInt64/32768/0         18.0 us         18.0 us        39059 bytes_per_second=1.69949G/s null_percent=0 size=32.768k
   SumKernelInt64/32768/1         19.3 us         19.3 us        36220 bytes_per_second=1.58067G/s null_percent=1 size=32.768k
   SumKernelInt64/32768/10        21.8 us         21.8 us        32171 bytes_per_second=1.40112G/s null_percent=10 size=32.768k
   SumKernelInt64/32768/50        22.5 us         22.5 us        31062 bytes_per_second=1.35471G/s null_percent=50 size=32.768k
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] fsaintjacques commented on pull request #7267: ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

fsaintjacques commented on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-634179607


   I just wanted to point that this vary greatly depending on the toolchain, e.g. LLVM is usually more aggressive at unrolling than GCC. Since most of our users (python, R) will have libarrow built with gcc, I think it's fine.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jianxind commented on pull request #7267: ARROW-8772: [C++] Unrolled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

jianxind commented on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633756435


   Indeed the dense performance issue only happens for float and double type, compiler did a good job for all int types, that is why I add all data types to sumkernel benchmark.
   
   > @jianxind thank you for looking into this! I had suspected that there was room for improvement in this algorithm. You're certainly free to implement kernels requiring intrinsics and put them in aggregate_basic_$SIMD_VERSION.cc, and then we can utilize the CpuInfo from `ExecContext` during kernel dispatch to select a supported kernel.
   
   Sure I will work on this SIMD chance then, thanks.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jianxind commented on pull request #7267: ARROW-8772: [C++] Unrooled aggregate dense for better speculative execution

Posted by GitBox <gi...@apache.org>.

jianxind commented on pull request #7267:
URL: https://github.com/apache/arrow/pull/7267#issuecomment-633449893


   We find the SumKernel benchmark dense(null percent 0) results is relatively low compared to sparse part for float and double type. Below is the result before unrolled the loop.
   SumKernelFloat/32768/0         9.12 us         9.11 us        76809 bytes_per_second=3.35021G/s null_percent=0 size=32.768k
   SumKernelFloat/32768/1         4.38 us         4.38 us       159861 bytes_per_second=6.97134G/s null_percent=1 size=32.768k
   SumKernelDouble/32768/0        4.99 us         4.99 us       140190 bytes_per_second=6.11937G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/1        2.14 us         2.14 us       327897 bytes_per_second=14.246G/s null_percent=1 size=32.768k
   
   With the unroll change,  the dense sumkernel benchmark get 3.7x for float and 2.6x for double.
   SumKernelFloat/32768/0         2.46 us         2.46 us       285185 bytes_per_second=12.4269G/s null_percent=0 size=32.768k
   SumKernelDouble/32768/0        1.90 us         1.90 us       370921 bytes_per_second=16.0846G/s null_percent=0 size=32.768k
   
   Anyway, it can get more higher performance if using intrinsic, I'd like to work at later point.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org