Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/08/20 17:20:00 UTC
[jira] [Closed] (ARROW-6051) [C++][Python] Parquet float column of NaN writing performance regression from 0.13.0 to 0.14.1
[ https://issues.apache.org/jira/browse/ARROW-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-6051.
-------------------------------
Resolution: Not A Problem
So the plot thickens here a little bit. On Arrow 0.13.0:
{code}
import pyarrow as pa
import numpy as np

# No explicit type, so the all-NaN list is left to type inference
arr = pa.array([np.nan] * 10000000)
t = pa.Table.from_arrays([arr], names=['f0'])

In [7]: arr
Out[7]:
<pyarrow.lib.NullArray object at 0x7f4dd5b23138>
10000000 nulls
{code}
but in 0.14.1/master:
{code}
In [2]: %paste
import pyarrow as pa
import numpy as np

arr = pa.array([np.nan] * 10000000)
## -- End pasted text --

In [3]: arr
Out[3]:
<pyarrow.lib.DoubleArray object at 0x7f2312461e08>
[
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
...
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan,
nan
]
{code}
So that explains the perf difference I saw: on 0.13.0 the all-NaN list was inferred as a NullArray, so the write was measuring a degenerate all-null column. I used this code instead to make the table and didn't see a meaningful perf difference:
{code}
import pyarrow as pa
import numpy as np

size = 10_000_000
# Start from a float64 NumPy array so pa.array() yields a DoubleArray
# on both versions, then overwrite every element with NaN
values = np.random.randn(size)
values[:] = np.nan
arr = pa.array(values)
t = pa.Table.from_arrays([arr], names=['f0'])
{code}
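Alternatively, passing an explicit type to pa.array should force a float64 column from the plain Python list on both versions (a minimal sketch; I haven't re-run the 0.13.0 timing this way):
{code}
import numpy as np
import pyarrow as pa

# An explicit type bypasses inference, so the all-NaN list becomes
# a DoubleArray rather than a NullArray
arr = pa.array([np.nan] * 10000000, type=pa.float64())
t = pa.Table.from_arrays([arr], names=['f0'])
{code}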
There may still be some slightly worse performance with NaN values in the hash table:
https://gist.github.com/wesm/436e37e398e61ca29031e43674084957
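If that residual cost comes from dictionary encoding probing NaN values, one way to isolate it (just a sketch; that the hash table belongs to dictionary encoding is my assumption here) is to write with dictionary encoding disabled and compare timings:
{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = np.full(10_000_000, np.nan)
t = pa.Table.from_arrays([pa.array(values)], names=['f0'])

# use_dictionary=False selects plain encoding, so no hash table is
# consulted during the write ('/tmp/nans_plain.parquet' is an example path)
pq.write_table(t, '/tmp/nans_plain.parquet', use_dictionary=False)
{code}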
> [C++][Python] Parquet float column of NaN writing performance regression from 0.13.0 to 0.14.1
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-6051
> URL: https://issues.apache.org/jira/browse/ARROW-6051
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Labels: parquet
> Attachments: perf.svg, perf_before.svg
>
>
> I'm not sure of the origin of the regression, but I see the following with
> pyarrow 0.13.0 from conda-forge
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> import pandas as pd
> arr = pa.array([np.nan] * 10000000)
> t = pa.Table.from_arrays([arr], names=['f0'])
> %timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
> 28.7 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> but in pyarrow 0.14.1 from conda-forge
> {code}
> %timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
> 88.1 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> I'm sorry to say, but this is what happens when benchmark data is not tracked and monitored.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)