You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2023/06/15 12:04:46 UTC

[GitHub] [arrow] pitrou opened a new issue, #36090: [C++] Excessive increase in debug build times

pitrou opened a new issue, #36090:
URL: https://github.com/apache/arrow/issues/36090

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Here is a full debug build with Acero disabled (572 build steps), as of changeset 95972cca1f65179b63a9fe5262d1092547e3d137:
   ```
   real	2m22,463s
   user	38m39,322s
   sys	2m26,218s
   ```
   
   Here is the same build with Acero enabled (642 build steps):
   ```
   real	2m39,304s
   user	45m11,410s
   sys	2m51,336s
   ```
   
   So compiling Acero + tests is worth 7 minutes of CPU time, which seems a bit large.
   
   More generally, Arrow C++ compile times (in debug mode) seem to have increased by 25% - or 7 additional minutes of CPU time - between 11.0.0 and 12.0.0, which is not sustainable:
   
   * debug build on 11.0.0 (552 build steps):
   ```
   real	2m17,473s
   user	37m37,749s
   sys	2m24,311s
   ```
   * same debug build on 12.0.0 (634 build steps):
   ```
   real	2m35,782s
   user	44m12,064s
   sys	2m48,723s
   ```
   
   ### Measurements notes
   
   * `ccache` was disabled
   * CUDA, Dataset, Flight, Gandiva were disabled
   * Compute, GCS, Parquet, Orc, S3 were enabled
   * shared libraries only
   * using `ld.gold`
   * target build directory resides on a [zram](https://en.wikipedia.org/wiki/Zram) filesystem
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1596795853

   > Consider enabling precompiled headers on builds that don't seem to be getting much benefit from ccache (I can look into this but I don't know much about how well ccache is or isn't working on these CI jobs so I'd appreciate any input)
   
   ccache is supposed to be beneficial as long as the cache is primed and not stale. I don't think we have any builds where ccache wouldn't be very beneficial (apart from the build(s) that do enable precompiled headers unfortunately).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1592917416

   cc @westonpace @felipecrv , in case one of you wants to investigate and find ways to reduce the compile tax.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1595869905

   Sure! I can help. I think that we can use "object library" for the propose: https://cmake.org/cmake/help/latest/command/add_library.html#object-libraries
   
   Could you open a new issue for a task what should I do?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1596816359

   In my experience, CMake unity builds are an easier and more hassle-free way to reduce header inclusion overhead.
   
   However, we could perhaps think about selectively enabling precompiled headers for unit tests (as CMake unity builds cannot encompass several targets, and GTest is a heavy inclusion unfortunately).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1595586144

   I investigated this today and I think I figured out the problem.  When we split out acero into it's own module we ended up introducing `test_util_internal.cc`  I believe this had originally been a part of the arrow testing library but I didn't dig too much into where it came from.
   
   To make it available to all the tests we created an `add_arrow_acero_test` function which had...
   
   ```
     add_arrow_test(${REL_TEST_NAME}
                    EXTRA_LINK_LIBS
                    ${ARROW_ACERO_TEST_LINK_LIBS}
                    PREFIX
                    ${PREFIX}
                    SOURCES
                    test_util_internal.cc
                    LABELS
                    ${LABELS}
                    ${ARG_UNPARSED_ARGUMENTS})
   ```
   
   We also added `test_util_internal.cc` to all of the dataset tests in a similar fashion.
   
   Unfortunately, adding this common utility in this way actually means that the file gets compiled again for every single test.  Since this file is fairly complex it ended up adding quite a bit to the overall compilation time.
   
   The fix is straightforward but I may need some cmake help to figure out the best way to do it in our existing infrastructure.  I'd like to create a small test library for acero that contains `src/arrow/acero/dummy_nodes.cc` and `src/arrow/acero/test_util_internal.cc`.  Also, I'd like to create a small test library for datasets that contains `src/arrow/acero/test_util_internal.cc` and `src/arrow/dataset/test_util_internal.cc`.
   
   I don't believe we want to export these libraries (like we do with arrow_testing).  I think they can just be small static archives that get added as dependencies to all of the tests.
   
   As a hack to test this out I added `test_util_internal.cc` to the list of acero sources.  That appears to restore the build time to what I got in 11.0.0.
   
   I did attempt to use ClangBuildAnalyzer to further reduce compile times.  I didn't find any clear wins.  However, I was able to tune precompiled header files to drop build times by an additional 35%.  In other words, on my system:
   
   Current Build Time: 2100
   After fixing duplicate test files: 1900
   With tuned precompiled headers: 1300
   
   The files I needed to add to the precompiled headers were:
   
    * datum.h - Since Datum is generally passed around by value this is an inevitable include for anything remotely involved with compute.
    * scalar.h - This is another header that was fairly ubiquitous.  However, we may be able to reduce this one with better forward declarations since scalars are typically passed around by reference/pointer.
    * compute/exec.h - This is needed for ExecBatch.  Similar to Datum I think this one is inevitable because ExecBatch is usually passed by value
    * compute/expression.h - Same as the Datum and ExecBatch as Expression is typically passed by value
    * compute/function.h
    * compute/kernel.h
   
   Adding datum/exec/expression to the pch for acero and datasets may be sufficient.  I don't think we need it in the top-level pch.
   
   These last two are not actually included excessively.  However, they are needed by all of the compute kernel files and they are very expensive includes (kernel.h in particular).  We could probably leave them out or, if we really want them, find a way to make a dedicated pch file for just compute kernels.
   
   For the testing pch I needed to add the gmock headers.  gtest/gtest.h isn't actually too bad but the gmock includes are very expensive.
   
   Action items:
   
    - [ ] Figure out how to make internal "common testing" libraries for acero & dataset (maybe @kou can help?)
    - [ ] See if we can reduce the number of times we include scalar.h (@westonpace)
    - [ ] Get `kernel.h` out of the compute api_xyz.h files (these only need FunctionOptions which we can put in its own file) @westonpace 
    - [ ] Consider enabling precompiled headers on builds that don't seem to be getting much benefit from ccache (I can look into this but I don't know much about how well ccache is or isn't working on these CI jobs so I'd appreciate any input) 
   
   I've prototyped some of these tasks in https://github.com/westonpace/arrow/tree/experiment/build-times 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1592934393

   https://github.com/aras-p/ClangBuildAnalyzer/ can help to analyze compilation slowness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on issue #36090: [C++] Excessive increase in debug build times

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1596793353

   > [ ] Figure out how to make internal "common testing" libraries for acero & dataset (maybe @kou can help?)
   
   Besides what @kou suggested, you can create an additional "testing" shared library as already exists for e.g. Flight.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou closed issue #36090: [C++] Excessive increase in debug build times

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou closed issue #36090: [C++] Excessive increase in debug build times
URL: https://github.com/apache/arrow/issues/36090


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org