Posted to issues@arrow.apache.org by "alexhudspith (via GitHub)" <gi...@apache.org> on 2023/05/08 22:24:36 UTC

[GitHub] [arrow] alexhudspith opened a new issue, #35498: [C++][Parquet] Parquet write_to_dataset performance regression

alexhudspith opened a new issue, #35498:
URL: https://github.com/apache/arrow/issues/35498

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   On Linux, `pyarrow.parquet.write_to_dataset` shows a large performance regression in Arrow 12.0 versus 11.0.
   
   The following results were collected using Ubuntu 22.04.2 LTS, 5.15.0-71-generic, Intel Haswell 4-core @ 3.6GHz, 16 GB RAM, Samsung 840 Pro SSD. They are elapsed times in seconds to write a single int64 column of integers [0,..., _length_-1] with no compression and no multi-threading:
   
   | Array length | Arrow 11 (s) | Arrow 12 (s) |
   |-----------------:|--------:|--------:|
   |1,000,000 | 0.1 | 0.1|
   | 2,000,000 | 0.2 | 0.4 |
   | 4,000,000 | 0.3 | 1.6 |
   | 8,000,000 | 0.8 | 6.2 |
   | 16,000,000 | 2.3 | 24.4 |
   | 32,000,000 | 6.5 | 94.1 |
   | 64,000,000 | 13.5 | 371.7 |
   
   The output directory was deleted before each run.
   ```
   """check.py"""
   import sys
   import time
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def main():
       path = '/tmp/test.parquet'
       length = 10_000_000 if len(sys.argv) < 2 else int(sys.argv[1])
       table = pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])
       t0 = time.perf_counter()
       pq.write_to_dataset(
           table, path, schema=table.schema, use_legacy_dataset=False, use_threads=False, compression=None
       )
       duration = time.perf_counter() - t0
       print(f'{duration:.2f}s')
   
   if __name__ == '__main__':
       main()
   ```
   
   Running `git bisect` on local builds leads me to this commit: 660d259f525d301f7ff5b90416622698fa8a5e9c: [C++] Add ordered/segmented aggregation Substrait extension (#34627).
   
   Following that change, Flamegraphs show a lot of additional time spent in `arrow::util::EnsureAlignment` calling glibc `memcpy`:
   
   Before (ddd0a337174e57cdc80b1ee30dc7e787acfc09f6)
   ![good-ddd0a33 perf](https://user-images.githubusercontent.com/13152260/236944113-e7b6abb3-9449-4ca6-8a4c-ab88c0f9ace9.svg)
   
   After (660d259f525d301f7ff5b90416622698fa8a5e9c)
   ![bad-660d259 perf](https://user-images.githubusercontent.com/13152260/236944165-fbad2d51-716d-4985-ac1e-a51b18bf76a8.svg)
   
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541498603

   > If possible, better yet is to get numpy to support memory-alignment configuration, so that Arrow-user-code would not need to change.
   
   Looks like numpy already supports this via [configurable memory routines](https://numpy.org/devdocs/reference/c-api/data_memory.html#configurable-memory-routines-in-numpy-nep-49) (see [NEP-49](https://numpy.org/neps/nep-0049.html) and [the corresponding PR](https://github.com/numpy/numpy/pull/17582)).




[GitHub] [arrow] mapleFU commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1539786060

   @jorisvandenbossche So is this due to alignment change ( `EnsureAlignment` ) between 11.0 and 12.0?




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541450260

   @icexelloss, what are your thoughts about this? 




[GitHub] [arrow] jorisvandenbossche commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541480968

   My understanding is that alignment is "recommended but not required _for in memory data_", it's only when serializing (IPC) that the requirement is enforced (https://arrow.apache.org/docs/dev/format/Columnar.html#buffer-alignment-and-padding)?




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1540487190

   > I am not directly sure of the reason for doing this here, cc @rtpsw @westonpace
   
   The reason for adding `EnsureAlignment` is that Flight generated misaligned buffers (#32276), and more generally that combining a zero-copy policy with a non-aligned format leads to misaligned buffers. Arrow requires [aligned buffers](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding), and I'm not sure there is a good way around this requirement.
   
   Note that so far I didn't look into the specifics of the current issue - it may be due to Flight or some other misaligned format used as input, or some other reason.




[GitHub] [arrow] westonpace commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1542647235

   > I don't know if no alignment enforcement would be OK
   
   Correct. As @mapleFU mentions, this could lead to undefined behavior via type punning (this is the error we were getting from Flight).
   
   > I don't know if no alignment enforcement would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change ipc::kArrowAlignment in https://github.com/apache/arrow/issues/35498#issuecomment-1539766664 to the byte size of the value's type. My understanding is that [numpy ensures this alignment condition](https://numpy.org/devdocs/dev/alignment.html#).
   
   This seems like the best solution.
   
   > My understanding is that alignment is "recommended but not required for in memory data", it's only when serializing (IPC) that the requirement is enforced
   
   Correct. Most of Arrow-C++ aims to tolerate unaligned buffers; however, Acero doesn't. These type-punning assumptions are subtle and there are a lot of compute functions. If someone wanted to support this, a first step would be to create unit tests covering all the compute functions on unaligned buffers.
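   Such a test would feed kernels deliberately misaligned input. As a sketch, one way to construct that input from Python is to carve the values out of a byte buffer at an odd offset (the offset trick below is illustrative, not an existing Arrow test utility):
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   # Carve a 10-element int64 array out of a byte buffer, starting 4 bytes
   # past the allocation, so the values cannot be 8-byte aligned whenever
   # the base allocation is (numpy/malloc allocations normally are).
   raw = np.zeros(8 * 10 + 4, dtype=np.uint8)
   misaligned = raw[4:].view(np.int64)
   misaligned[:] = np.arange(10)
   
   # pa.array() wraps the numpy buffer; per this issue the wrap is zero-copy,
   # so the resulting Arrow values buffer inherits the misalignment.
   arr = pa.array(misaligned)
   print(arr.buffers()[1].address % 64)
   ```
   
   Running compute kernels over `arr` and asserting the results match the aligned case would exercise the unaligned path.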




[GitHub] [arrow] jorisvandenbossche commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1539766664

   @alexhudspith Thanks a lot for the report and the detailed analysis.
   
   It's unfortunate that this got into the release, as it seems we actually could have caught it in our own benchmarks (https://conbench.ursa.dev/benchmark-results/2b587cc1079f4e3a97f542e6f11e883e/). We need a better process to check for regressions before a release.
   
   A call to `EnsureAlignment` indeed got added in the PR you reference (the full for-loop got added):
   
   https://github.com/apache/arrow/blob/de6c3cd2b6048641251fac0c64b70c2cd166e0c9/cpp/src/arrow/acero/source_node.cc#L106-L113
   
   I am not directly sure of the reason for doing this here, cc @rtpsw @westonpace 
   
   
   




[GitHub] [arrow] jorisvandenbossche commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1540492306

   But so that specific change wasn't really related to the rest of the PR?




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541449693

   Looking at [the code](https://github.com/apache/arrow/issues/35498#issue-1701030317), I suspect the performance degrades because the source table holds misaligned numpy arrays and every batch sliced from them gets realigned by `EnsureAlignment`: the default batch size is itself a multiple of the alignment, so batch-slicing preserves the misalignment. This would explain why the degradation worsens with larger arrays, which are sliced into more batches. One way to verify this theory is to increase the batch size in line with the array sizes; the degradation should then shrink.
   
   As for the cause, it looks like `pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])` produces arrays that are misaligned by Arrow's standards, due to zero-copy wrapping of [misaligned numpy arrays](https://numpy.org/devdocs/dev/alignment.html#), which the Arrow spec forbids. However, since this code is natural and has likely been accepted since the beginning, the realignment should probably be done within Arrow (maybe with a warning), or at least be configurable. Realigning the full arrays has a performance cost too, of course, but a much lower one than realigning many batches of each array. Looking further out, I'd suggest adding facilities for obtaining Arrow-aligned numpy arrays, with matching documentation. Better yet, if possible, would be for numpy to support memory-alignment configuration, so that Arrow user code would not need to change.
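   The alignment premise is easy to check from Python. As a sketch (nothing below is an Arrow diagnostic API, just buffer-address arithmetic mirroring the benchmark's setup):
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   np_arr = np.arange(10_000_000, dtype=np.int64)  # numpy guarantees 8-byte (itemsize) alignment
   arr = pa.array(np_arr)                          # zero-copy wrap of the numpy buffer
   values_buf = arr.buffers()[1]                   # buffers()[0] is the (absent) validity bitmap
   
   # 64 bytes is ipc::kArrowAlignment; numpy makes no such promise, so the
   # second value is frequently nonzero, and every sliced batch then pays
   # for an EnsureAlignment copy.
   print('offset modulo 8: ', values_buf.address % 8)
   print('offset modulo 64:', values_buf.address % 64)
   ```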




[GitHub] [arrow] mapleFU commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541516532

   I guess IPC can generate non-aligned buffers, e.g. a 1-byte offset into an `Int32Array`, which would cause undefined behavior via type punning. But here it shouldn't be necessary to force subslice alignment?




[GitHub] [arrow] westonpace closed issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression
URL: https://github.com/apache/arrow/issues/35498




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541528033

   > I guess IPC can generate non-aligned buffers, e.g. a 1-byte offset into an Int32Array, which would cause undefined behavior via type punning.
   
   Right, this is basically what happened.
   
   > But here it doesn't need to force subslice alignment?
   
   I don't know if no alignment enforcement would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change `ipc::kArrowAlignment` in [the code](https://github.com/apache/arrow/issues/35498#issuecomment-1539766664) to the byte size of the value's type. My understanding is that [numpy ensures this alignment condition](https://numpy.org/devdocs/dev/alignment.html#).
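   In Python terms, the proposed check would look roughly like this (a sketch of the idea only; the actual fix would change the C++ alignment constant, and `value_aligned` is a made-up name, not Arrow API):
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   def value_aligned(arr: pa.Array) -> bool:
       # Weaker, per-value alignment: the values buffer need only start at a
       # multiple of the type's byte width (8 for int64), the condition numpy
       # already guarantees, rather than Arrow IPC's 64-byte recommendation.
       byte_width = arr.type.bit_width // 8
       values_buf = arr.buffers()[1]  # buffers()[0] is the validity bitmap
       return values_buf.address % byte_width == 0
   
   print(value_aligned(pa.array(np.arange(4, dtype=np.int64))))  # True for numpy-backed arrays
   ```
   
   Under this condition, numpy-backed input would pass without the per-batch `memcpy`, while genuinely unaligned buffers (e.g. a 1-byte offset from IPC) would still be caught.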




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541508585

   > My understanding is that alignment is "recommended but not required _for in memory data_", it's only when serializing (IPC) that the requirement is enforced (https://arrow.apache.org/docs/dev/format/Columnar.html#buffer-alignment-and-padding)?
   
   That sounds right per the spec, though ignoring the recommendation may lead to other kinds of performance degradation. In practice, the existing Arrow code checks the alignment of in-memory buffers (at least in [the place I pointed to](https://github.com/apache/arrow/issues/35498#issuecomment-1541477582)), despite what the spec says, and changing that would be a much broader issue. Given all this, I think we need to consider what makes sense as a short-term versus a longer-term resolution.




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1543534278

   > > I don't know if no alignment enforcement would be OK, but it sounds like a smaller alignment would do. Perhaps a good quick fix is to change ipc::kArrowAlignment in [#35498 (comment)](https://github.com/apache/arrow/issues/35498#issuecomment-1539766664) to the byte size of the value's type. My understanding is that [numpy ensures this alignment condition](https://numpy.org/devdocs/dev/alignment.html#).
   > 
   > This seems like the best solution.
   
   I created #35541 as a proposed fix.




[GitHub] [arrow] jorisvandenbossche commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1540431108

   I assume so yes, based on the flamegraphs shown above. But I am not familiar with why this was added.




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1540530046

   > But so (for my understanding) that specific change wasn't really related to the rest of the PR?
   
   IIRC, the PR failed internal integration testing without the `EnsureAlignment` change. I'm not sure whether this means the change is unrelated.




[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541477582

   > IIRC, the PR failed internal integration testing without the `EnsureAlignment` change. I'm not sure whether this means the change is unrelated.
   
   Digging this up for reference, the failure was raised by [an invocation of `util::CheckAlignment` from `CompareBinaryColumnToRow`](https://github.com/apache/arrow/blob/97821aa5af650e6478116cae7c0128fe37dad067/cpp/src/arrow/compute/row/compare_internal.cc#L182).




[GitHub] [arrow] jorisvandenbossche commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1542498779

   Related issue about buffer alignment in Acero:
   
   * https://github.com/apache/arrow/issues/33313
   
   (which _proposes_ enforcing 64-byte alignment in the Acero source node, though in that issue this was still open for discussion)

