Posted to issues@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2021/04/25 04:27:00 UTC

[jira] [Created] (ARROW-12533) Some benchmarks are slow on Arm64 Linux when built with clang

Yibo Cai created ARROW-12533:
--------------------------------

             Summary: Some benchmarks are slow on Arm64 Linux when built with clang
                 Key: ARROW-12533
                 URL: https://issues.apache.org/jira/browse/ARROW-12533
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Yibo Cai


Many benchmarks run very slowly on Arm64 Linux when built with clang.
Most of the time is spent preparing test data, not running the test itself.

Per my investigation, it boils down to the poor performance of `std::uniform_real_distribution`, which uses software-emulated `long double` arithmetic on Arm64 [1].
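
To reproduce the effect outside the Arrow benchmarks, a minimal timing sketch along the following lines may help. The helper name `TimeUniformRealFill`, the choice of `std::mt19937_64`, and the fixed seed are assumptions made for illustration; they are not taken from the Arrow code base.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of Arrow): fills `out` with `n` random
// doubles drawn from std::uniform_real_distribution over [0, 1) and
// returns the elapsed wall-clock time in seconds.
double TimeUniformRealFill(std::size_t n, std::vector<double>* out) {
  assert(out != nullptr);
  std::mt19937_64 rng(42);  // assumed engine and seed, for repeatability
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  out->resize(n);
  const auto start = std::chrono::steady_clock::now();
  for (auto& value : *out) {
    value = dist(rng);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling `TimeUniformRealFill(10000000, &values)` from a small `main` and comparing builds (clang vs gcc, with and without `-ffast-math`) should show whether data generation dominates on a given machine.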

Apple M1 doesn't have this issue. Clang for aarch64 sets the size of `long double` to 64 bits on macOS, but to 128 bits on Linux [2].
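
The platform difference can be checked directly. The sketch below uses a hypothetical helper, `DescribeLongDouble`, that reports the storage size and mantissa width of `long double` for whatever target it is compiled for.

```cpp
#include <limits>
#include <string>

// Hypothetical helper: describes the `long double` format of the target
// this translation unit is compiled for.
std::string DescribeLongDouble() {
  return "sizeof(long double) = " + std::to_string(sizeof(long double)) +
         " bytes, mantissa bits = " +
         std::to_string(std::numeric_limits<long double>::digits);
}
```

On aarch64 Linux this should report 16 bytes and 113 mantissa bits (IEEE binary128, which has no hardware support there), while on Apple arm64 it should report 8 bytes and 53 bits, i.e. the same format as `double`.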

GCC for aarch64 doesn't have this issue either. It doesn't use `long double` to generate random reals [1]. My guess is that clang uses an algorithm with better randomness.

The clang `-ffast-math` option removes the `long double` arithmetic (and applies other simplifications to floating-point arithmetic). It makes generating random reals about 100x faster on Arm64.

It may be worth investigating whether `long double` is really necessary, and whether `-ffast-math` is acceptable for generating test data.
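
As one possible direction (an assumption for illustration, not something proposed in this issue or implemented in Arrow), random doubles can be produced without any `long double` arithmetic by scaling the top 53 bits of a 64-bit random integer. The helper names below are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: maps 64 random bits to a double in [0, 1) using
// only the top 53 bits, so every step is exact in double precision.
inline double BitsToUnitDouble(std::uint64_t bits) {
  // 9007199254740992 == 2^53
  return static_cast<double>(bits >> 11) * (1.0 / 9007199254740992.0);
}

// Hypothetical helper: draws one double in [lo, hi) from a 64-bit engine.
inline double NextUniformDouble(std::mt19937_64& rng, double lo, double hi) {
  return lo + (hi - lo) * BitsToUnitDouble(rng());
}
```

This trades the standard distribution's guarantees for speed and predictability, which may or may not be acceptable for test data; the result for a non-unit range can still round up to `hi` in rare cases.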

[1] https://godbolt.org/z/Y3Tc6MTME
[2] https://en.wikipedia.org/wiki/Long_double



--
This message was sent by Atlassian Jira
(v8.3.4#803005)