Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/10/03 14:50:00 UTC
[jira] [Comment Edited] (ARROW-17913) feather.read_table 150x slower when reading columns in newer versions

    [ https://issues.apache.org/jira/browse/ARROW-17913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612323#comment-17612323 ] 

Joris Van den Bossche edited comment on ARROW-17913 at 10/3/22 2:49 PM:
------------------------------------------------------------------------

I am not sure offhand what <=6.0 did differently, but looking at the current implementation this is somewhat expected (although it might still be possible to implement it in a better way): when columns are specified, each column is read separately from the MemoryMappedFile (instead of doing a single ReadAt call), and each chunk that is read is copied into a single output buffer. Because of this copy, memory mapping basically has no effect in this case (https://github.com/apache/arrow/blob/ec579df631deaa8f6186208ed2a4ebec00581dfa/cpp/src/arrow/io/file.h#L182-L185)

This can also be seen when you compare timings with and without memory mapping (with {{memory_map=False}}, there is no longer any difference between explicitly selecting all columns and not specifying columns at all):

{code}
In [5]: %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
29.4 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=False)
35.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit feather.read_table('test.feather', memory_map=True)
239 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit feather.read_table('test.feather', memory_map=False)
35 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

Now, I would have assumed that the buffers of all columns do not need to live in a single body, so I am not 100% sure why each field has to be copied into a single output.
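
To illustrate the two read paths described above, here is a minimal sketch using only Python's standard {{mmap}} module. It is not Arrow's implementation: the file layout, the names {{COLUMN_BYTES}}, {{read_zero_copy}} and {{read_columns_with_copy}} are all hypothetical, and it only shows why copying each per-column read into one output buffer loses the zero-copy benefit of a memory map.

```python
import mmap
import os
import tempfile

# Hypothetical layout: NUM_COLUMNS contiguous regions of COLUMN_BYTES each.
COLUMN_BYTES = 1 << 20
NUM_COLUMNS = 10


def write_test_file(path):
    """Write one fixed-size region per 'column', filled with the column index."""
    with open(path, "wb") as f:
        for col in range(NUM_COLUMNS):
            f.write(bytes([col]) * COLUMN_BYTES)


def read_zero_copy(mm):
    """Single view over the whole mapping: no bytes are copied."""
    return memoryview(mm)


def read_columns_with_copy(mm, columns):
    """Read each requested column separately and copy it into one output buffer."""
    out = bytearray(len(columns) * COLUMN_BYTES)
    for i, col in enumerate(columns):
        start = col * COLUMN_BYTES
        # Slicing an mmap returns a new bytes object, i.e. the data is copied.
        chunk = mm[start:start + COLUMN_BYTES]
        out[i * COLUMN_BYTES:(i + 1) * COLUMN_BYTES] = chunk
    return out


def demo():
    """Return (both paths yield identical bytes, size of the copied buffer)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "columns.bin")
        write_test_file(path)
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = read_zero_copy(mm)
            copied = read_columns_with_copy(mm, list(range(NUM_COLUMNS)))
            same_bytes = view == copied
            view.release()  # the view must be released before the map closes
    return same_bytes, len(copied)
```

Both paths return the same bytes, but the second one touches and copies every byte of every requested column, so its cost grows with the data size regardless of whether the source is memory-mapped.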



> feather.read_table 150x slower when reading columns in newer versions
> ---------------------------------------------------------------------
>
>                 Key: ARROW-17913
>                 URL: https://issues.apache.org/jira/browse/ARROW-17913
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0, 8.0.0, 9.0.0
>         Environment: python 3.9, ubuntu 20.04
>            Reporter: Håkon Magne Holmen
>            Priority: Major
>              Labels: feather, performance
>
> h3. Description
> Performance when reading columns using {{feather.read_table}} on Arrow 7.0.0-9.0.0 is drastically slower than it was in 6.0.0.
> Profiling the code below shows that the bottleneck is somewhere in the {{read_names}} function of {{pyarrow._feather.FeatherReader}}.
> h5. Example
> Setup code:
> {code}
> import pandas as pd
> from pyarrow import feather
> rows, cols = (1_000_000, 10)
> data = {f'c{c}': range(rows) for c in range(cols)}
> df = pd.DataFrame(data=data)
> feather.write_feather(df, 'test.feather', compression="uncompressed")
> {code}
> Benchmarks Arrow 9.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> Benchmarks Arrow 6.0.0:
> {code}
> %timeit feather.read_table('test.feather', memory_map=True)
> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> %timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}


