Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/07/26 22:18:00 UTC

[jira] [Commented] (ARROW-6051) [C++][Python] Parquet float column of NaN writing performance regression from 0.13.0 to 0.14.1

    [ https://issues.apache.org/jira/browse/ARROW-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894167#comment-16894167 ] 

Wes McKinney commented on ARROW-6051:
-------------------------------------

I made before and after flamegraphs using

{code}
export FLAMEGRAPH_PATH=/home/wesm/code/FlameGraph

function flamegraph {
    # Sample at 999 Hz with DWARF call graphs while running the given command
    perf record -F 999 -g --call-graph=dwarf -- "$@"
    # Demangle C++ symbols and fold the stacks before rendering the SVG
    perf script | c++filt | $FLAMEGRAPH_PATH/stackcollapse-perf.pl > out.perf-folded
    $FLAMEGRAPH_PATH/flamegraph.pl out.perf-folded > perf.svg
}
{code}

and 

{code}
flamegraph python bench.py
{code}

with bench.py as 

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

arr = pa.array([np.nan] * 10000000)
t = pa.Table.from_arrays([arr], names=['f0'])

for i in range(50):
    pq.write_table(t, '/home/wesm/tmp/nans.parquet')
{code}

The evidence suggests that dictionary encoding accounts for the performance difference. It doesn't seem to have been running at all in 0.13.0, or perhaps it was bailing out very quickly. Either way, it would be interesting to understand what changed.
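
A quick way to test that hypothesis would be to time the same write with dictionary encoding turned off via the {{use_dictionary}} argument to {{pq.write_table}}. A sketch along the lines of bench.py (the output path is arbitrary):

{code}
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([np.nan] * 10000000)
t = pa.Table.from_arrays([arr], names=['f0'])

for use_dict in (True, False):
    start = time.perf_counter()
    for _ in range(10):
        # use_dictionary=False bypasses the dictionary encoder entirely
        pq.write_table(t, '/tmp/nans.parquet', use_dictionary=use_dict)
    elapsed_ms = (time.perf_counter() - start) / 10 * 1000
    print('use_dictionary=%s: %.1f ms per write' % (use_dict, elapsed_ms))
{code}

If the 0.13.0-era timings come back with {{use_dictionary=False}}, that would point squarely at the dictionary encoding path.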

> [C++][Python] Parquet float column of NaN writing performance regression from 0.13.0 to 0.14.1
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6051
>                 URL: https://issues.apache.org/jira/browse/ARROW-6051
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: parquet
>         Attachments: perf.svg, perf_before.svg
>
>
> I'm not sure of the origin of the regression, but with
> pyarrow 0.13.0 from conda-forge I have
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> import pandas as pd
> arr = pa.array([np.nan] * 10000000)
> t = pa.Table.from_arrays([arr], names=['f0'])
> %timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
> 28.7 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> but in pyarrow 0.14.1 from conda-forge
> {code}
> %timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
> 88.1 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> I'm sorry to say, but this is what happens when benchmark data is not tracked and monitored.


