Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/09/11 13:38:00 UTC

[jira] [Updated] (ARROW-6529) [C++] Feather: slow writing of NullArray

     [ https://issues.apache.org/jira/browse/ARROW-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-6529:
-----------------------------------------
    Description: 
From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none

A smaller example using only pyarrow shows that writing an array of nulls takes much longer than writing an array of, for example, int64 type, which seems strange:

{code}
In [93]: arr = pa.array([None]*1000, type='int64')

In [94]: %%timeit 
    ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
    ...: w.writer.write_array('x', arr) 
    ...: w.writer.close() 

31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [95]: arr = pa.array([None]*1000)  

In [96]: arr    
Out[96]: 
<pyarrow.lib.NullArray object at 0x7fa47a23ca40>
1000 nulls

In [97]: %%timeit 
    ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
    ...: w.writer.write_array('x', arr) 
    ...: w.writer.close() 

3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

So writing a NullArray of the same length takes roughly 100x longer (3.75 ms vs. 31.4 µs) than writing an all-null array of int64 type.
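
For anyone who wants to re-run the comparison outside IPython, here is a minimal sketch. It makes several assumptions: it uses pyarrow.feather.write_feather with a pyarrow Table (available in newer pyarrow releases) instead of the FeatherWriter calls shown above, so it may write a different Feather file version and the absolute numbers need not match. The helper names per_call_seconds and compare_feather_writes are made up for illustration, and the timing takes the minimum over repeats rather than the mean that %%timeit reports.

```python
import timeit


def per_call_seconds(fn, number=100, repeat=7):
    """Return the best-of-`repeat` average time per call of `fn`, in seconds."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


def compare_feather_writes(n=1000, path="__test.feather"):
    """Time writing an int64 all-null column vs. a NullArray column.

    Assumption: a pyarrow version where feather.write_feather accepts a Table.
    """
    # Imported lazily so the timing helper above has no pyarrow dependency.
    import pyarrow as pa
    from pyarrow import feather

    typed_nulls = pa.table({"x": pa.array([None] * n, type=pa.int64())})
    null_array = pa.table({"x": pa.array([None] * n)})  # inferred as null type

    t_typed = per_call_seconds(lambda: feather.write_feather(typed_nulls, path))
    t_null = per_call_seconds(lambda: feather.write_feather(null_array, path))
    return t_typed, t_null, t_null / t_typed


# Slowdown implied by the timings quoted above: 3.75 ms vs. 31.4 µs per loop.
reported_ratio = 3.75e-3 / 31.4e-6
```

Calling compare_feather_writes() returns both per-call timings and their ratio. The reported_ratio line only restates the arithmetic behind the "roughly 100x" figure using the two timings from the session above.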

  was:
From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none

Smaller example with just using pyarrow, it seems that writing an array of nulls takes much longer than an array of for example ints, which seems a bit strange:

{code}
In [93]: arr = pa.array([1]*1000)  

In [94]: %%timeit 
    ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
    ...: w.writer.write_array('x', arr) 
    ...: w.writer.close() 

31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [95]: arr = pa.array([None]*1000)  

In [96]: arr    
Out[96]: 
<pyarrow.lib.NullArray object at 0x7fa47a23ca40>
1000 nulls

In [97]: %%timeit 
    ...: w = pyarrow.feather.FeatherWriter('__test.feather') 
    ...: w.writer.write_array('x', arr) 
    ...: w.writer.close() 

3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

So writing the same length NullArray takes ca 100x more time.


> [C++] Feather: slow writing of NullArray
> ----------------------------------------
>
>                 Key: ARROW-6529
>                 URL: https://issues.apache.org/jira/browse/ARROW-6529
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: feather


