Posted to user@arrow.apache.org by Arun Joseph <aj...@gmail.com> on 2021/12/03 20:53:56 UTC

[Python] Why does reading an arrow file cause almost double the memory consumption?

Hi Apache Arrow Members,

My question is below but I've compiled a minimum reproducible example with
a public dataset:

import pandas as pd
from pyarrow import feather
import os
import psutil


def setup():
  df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
  df.to_feather('test.csv')

if __name__ == "__main__":
  # setup()
  process = psutil.Process(os.getpid())
  path = 'test.csv'

  mem_size = process.memory_info().rss / 1e9
  print(f'BEFORE mem_size: {mem_size}gb')

  df = feather.read_feather(path)

  df_size = df.memory_usage(deep=True).sum() / 1e9
  mem_size = process.memory_info().rss / 1e9
  print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')

I substituted my df with a sample CSV; I had trouble finding a sample CSV
of adequate size. However, my dataset is ~3GB, and I see memory usage of
close to 6GB.

Output with My Data:
BEFORE mem_size: 0.088891392gb
AFTER mem_size: 6.324678656gb df_size: 3.080121688gb

It seems strange that the overall memory usage of the process is
approximately double the size of the dataframe itself. Is there a reason for
this, and is there a way to mitigate it?

$ conda list pyarrow
#
# Name                    Version                   Build  Channel
pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge

Thank You,
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Wed, 8 Dec 2021 at 00:32, David Li <li...@apache.org> wrote:

> Just for edification (though I have limited understanding of the machinery
> here, someone more familiar with Pandas internals may have more
> insight/this may be wrong or very outdated!):
>
> zero_copy_only does not work for two reasons (well, one reason
> fundamentally): the representation in memory of a Pandas dataframe has been
> a dense, 2D NumPy array per column type. In other words, all data across
> all columns of the same type are contiguous in memory. (At least
> historically. My understanding is that this has changed/become more
> flexible relatively recently.) This is the representation that Arrow tries
> to generate by default. (See
> https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
>
Just to confirm that your explanation is fully correct.
(I have been working on changing this in pandas
(https://github.com/pandas-dev/pandas/issues/39146/), but right now this is
only an experimental opt-in feature.)


> However, the Arrow table you have is not contiguous: each column is
> allocated separately, and for a Table, each column is made up of a list of
> contiguous chunks. So there are very few cases where data can be
> zero-copied, it must instead be copied and "compacted".
>
> The split_blocks option *helps* work around this. It allows each column in
> the Pandas DataFrame to be its own allocation. However, each individual
> column must still be contiguous. If you try zero_copy_only with
> split_blocks, you'll get a different error message, this is because the
> columns of your Arrow Table have more than one chunk. If you create a small
> in-memory Table with only one column with one chunk, zero_copy_only +
> split_blocks will work!
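
For illustration, a minimal sketch of that single-chunk case (the in-memory
table built here is just an assumed example):

import numpy as np
import pyarrow as pa

# One column backed by a single contiguous chunk: zero_copy_only together
# with split_blocks can succeed here.
table = pa.table({'x': np.arange(1_000_000, dtype=np.float64)})
df = table.to_pandas(zero_copy_only=True, split_blocks=True)  # no copy, no error

# A column with more than one chunk (e.g. after concatenating tables) cannot
# be zero-copied, because each pandas column must be one contiguous array.
chunked = pa.concat_tables([table, table])
try:
    chunked.to_pandas(zero_copy_only=True, split_blocks=True)
except pa.ArrowInvalid as exc:
    print(exc)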
>
> split_blocks with self_destruct works in this case still because
> self_destruct will still copy data, it will just also try to free the Arrow
> data as each column is converted. (Note that to minimize the memory usage,
> you should also pass use_threads=False. In that case, the maximum memory
> overhead should be one column's worth.)
>
> -David
>
> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
>
> Thank you for the new example.
>
> # Why is it 2x?
>
> This is essentially a "peak RAM" usage of the operation.  Given that
> split_blocks helped I think we can attribute this doubling to the
> pandas conversion.
>
> # Why doesn't the memory get returned?
>
> It does, it just doesn't do so immediately.  If I put a 5 second sleep
> before I print the memory I see that the RSS shrinks down.  This is how
> jemalloc is configured in Arrow: it releases RSS back to the OS a short
> time (I think it is 1 second) after reaching peak consumption.
>
> BEFORE mem_size: 0.082276352gb
> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
> AFTER-ARROW: 3.281625024gb
> ---five second sleep---
> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
> AFTER-SLEEP-ARROW: 3.281625024gb
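
A minimal sketch of that measurement pattern, assuming the test.arrow file
from the scripts elsewhere in this thread:

import os
import time

import psutil
from pyarrow import feather

process = psutil.Process(os.getpid())

df = feather.read_feather('test.arrow')
print(f'AFTER mem_size: {process.memory_info().rss / 1e9}gb')

time.sleep(5)  # give jemalloc a moment to hand dirty pages back to the OS
print(f'AFTER-SLEEP mem_size: {process.memory_info().rss / 1e9}gb')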
>
> # Why didn't switching to the system allocator help?
>
> The problem isn't "the dynamic allocator is allocating more than it
> needs".  There is a point in this process where ~6GB are actually
> needed.  The system allocator either also holds on to that RSS for a
> little bit or the RSS numbers themselves take a little bit of time to
> update.  I'm not entirely sure.
>
> # Why isn't this a zero-copy conversion to pandas?
>
> That's a good question, I don't know the details.  If I try manually
> doing the conversion with zero_copy_only I get the error "Cannot do
> zero copy conversion into multi-column DataFrame block"
>
> # What is up with the numpy.ndarray objects in the heap?
>
> I'm pretty sure guppy3 is double-counting.  Note that the total size
> is ~20GB.  I've been able to reproduce this in cases where the heap is
> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
> once even managed to generate this:
>
> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
> AFTER-SLEEP-ARROW: 0.0gb
> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0     57   0 6563250864  49 6563250864  49 pandas.core.series.Series
>      1    133   0 3339213718  25 9902464582  74 numpy.ndarray
>      2      1   0 3339197360  25 13241661942  99
> pandas.core.frame.DataFrame
>
> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
>
> I did see some strange behavior when working with the
> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
> can work around this by deleting the reader.
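
A sketch of that workaround, assuming the same test.arrow IPC/Feather file
used in the other examples:

import pyarrow as pa
import pyarrow.ipc

source = pa.OSFile('test.arrow', 'rb')
reader = pa.ipc.RecordBatchFileReader(source)
table = reader.read_all()

# Drop the reader once the table is materialized so it does not keep extra
# buffers alive (see ARROW-15017); `reader = None` has the same effect.
del reader
source.close()

df = table.to_pandas()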
>
> # Can I return the data immediately / I don't want to use 2x memory
> consumption
>
> I think split_blocks and self_destruct is the best answer at the
> moment.  self_destruct has remained in the code since at least 1.0.0
> so perhaps it is time we remove the "experimental" flag and maybe
> replace it with a "caution" or "danger" flag (as it causes the table
> to become unusable afterwards).
>
> Jemalloc has some manual facilities to purge dirty memory and we
> expose some of them with
> pyarrow.default_memory_pool().release_unused() but that doesn't seem
> to be helping in this situation.  Either the excess memory is in the
> non-jemalloc pool or the jemalloc command can't quite release this
> memory, or the RSS stats are just stale.  I'm not entirely sure.
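
For reference, those calls look roughly like this (a sketch; as noted above,
release_unused() may not shrink RSS in this situation):

import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name, pool.bytes_allocated())

# Ask the allocator to hand unused memory back to the OS.
pool.release_unused()
print(pool.bytes_allocated())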
>
> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Slightly related, I have some other code that opens up an arrow file
> using a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch
> to a pandas dataframe. After this conversion is done, and I inspect the
> heap, I always see the following:
> >
> > hpy().heap()
> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
> >      1      1   0 9939585700  49 19879186734  99
> pandas.core.frame.DataFrame
> >      2      1   0 185786680   1 20064973414 100
> pandas.core.indexes.datetimes.DatetimeIndex
> >
> > Specifically the numpy.ndarray. It only shows up after the conversion
> and it does not seem to go away. It also seems to be roughly the same size
> as the dataframe itself.
> >
> > - Arun
> >
> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com> wrote:
> >>
> >> Just to follow up on this, is there a way to manually force the arrow
> pool to de-allocate? My usecase is essentially having multiple processes in
> a Pool or via Slurm read from an arrow file, do some work, and then exit.
> Issue is that the 2x memory consumption reduces the bandwidth on the
> machine to effectively half.
> >>
> >> Thank You,
> >> Arun
> >>
> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>
> >>> Additionally, I tested with my actual data, and did not see memory
> savings.
> >>>
> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>>
> >>>> Hi Joris,
> >>>>
> >>>> Thank you for the explanation. The 2x memory consumption on
> conversion makes sense if there is a copy, but it does seem like it
> persists longer than it should. Might be because of python's GC policies?
> >>>> I tried out your recommendations but they did not seem to work.
> However, I did notice an experimental option on `to_pandas`,
> `self_destruct`, which seems to address the issue I'm facing. Sadly, that
> itself did not work either... but, combined with `split_blocks=True`, I am
> seeing memory savings:
> >>>>
> >>>> import pandas as pd
> >>>> import numpy as np
> >>>> import pyarrow as pa
> >>>> from pyarrow import feather
> >>>> import os
> >>>> import psutil
> >>>> pa.set_memory_pool(pa.system_memory_pool())
> >>>> DATA_FILE = 'test.arrow'
> >>>>
> >>>> def setup():
> >>>>   np.random.seed(0)
> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
> columns=list([f'{i}' for i in range(57)]))
> >>>>   df.to_feather(DATA_FILE)
> >>>>   print(f'wrote {DATA_FILE}')
> >>>>   import sys
> >>>>   sys.exit()
> >>>>
> >>>> if __name__ == "__main__":
> >>>>   # setup()
> >>>>   process = psutil.Process(os.getpid())
> >>>>   path = DATA_FILE
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>
> >>>>   feather_table = feather.read_table(path)
> >>>>   # df = feather_table.to_pandas(split_blocks=True)
> >>>>   # df = feather_table.to_pandas()
> >>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   df_size = df.memory_usage().sum() / 1e9
> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>
> >>>>
> >>>> OUTPUT(to_pandas()):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> >>>> ARROW: 3.281625024gb
> >>>>
> >>>> OUTPUT (to_pandas(split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> I'm guessing since this feature is experimental, it might either go
> away, or might have strange behaviors. Is there anything I should look out
> for, or is there some alternative to reproduce these results?
> >>>>
> >>>> Thank You,
> >>>> Arun
> >>>>
> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >>>>>
> >>>>> Hi Arun, Weston,
> >>>>>
> >>>>> I didn't try running the script locally, but a quick note: the
> >>>>> `feather.read_feather` function reads the Feather file into an Arrow
> >>>>> table and directly converts it to a pandas DataFrame. A memory
> >>>>> consumption of 2x the size of the dataframe does not sound that
> >>>>> unexpected to me: most of the time, when converting an Arrow table to
> >>>>> a pandas DataFrame, the data will be copied to accommodate pandas'
> >>>>> specific internal memory layout (at least numeric columns will be
> >>>>> combined together in 2D arrays).
> >>>>>
> >>>>> To verify if this is the cause, you might want to do either of the
> >>>>> following:
> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which
> >>>>> will read the file as an Arrow table without doing any conversion to
> >>>>> pandas
> >>>>> - if you want to include the conversion to pandas, also use
> >>>>> `read_table` and do the conversion to pandas explicitly with a
> >>>>> `to_pandas()` call on the result. In that case, you can specify
> >>>>> `split_blocks=True` to use more zero-copy conversion in the
> >>>>> arrow->pandas conversion
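
A minimal sketch of the second option, reusing the test.csv file written by
the setup() function earlier in the thread:

from pyarrow import feather

table = feather.read_table('test.csv')     # Arrow table only, no pandas yet
df = table.to_pandas(split_blocks=True)    # explicit, more zero-copy-friendly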
> >>>>>
> >>>>> Joris
> >>>>>
> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
> >>>>> >
> >>>>> > Hi Wes,
> >>>>> >
> >>>>> > Sorry for the late reply on this, but I think I got a reproducible
> test case:
> >>>>> >
> >>>>> > import pandas as pd
> >>>>> > import numpy as np
> >>>>> > import pyarrow as pa
> >>>>> > from pyarrow import feather
> >>>>> > import os
> >>>>> > import psutil
> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> > DATA_FILE = 'test.arrow'
> >>>>> >
> >>>>> > def setup():
> >>>>> >   np.random.seed(0)
> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
> columns=list([f'i_{i}' for i in range(57)]))
> >>>>> >   df.to_feather(DATA_FILE)
> >>>>> >   print(f'wrote {DATA_FILE}')
> >>>>> >   import sys
> >>>>> >   sys.exit()
> >>>>> >
> >>>>> > if __name__ == "__main__":
> >>>>> >   # setup()
> >>>>> >   process = psutil.Process(os.getpid())
> >>>>> >   path = DATA_FILE
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >
> >>>>> >   df = feather.read_feather(path)
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   df_size = df.memory_usage().sum() / 1e9
> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>> >
> >>>>> > OUTPUT:
> >>>>> > BEFORE mem_size: 0.091795456gb
> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> >>>>> > ARROW: 3.281625024gb
> >>>>> >
> >>>>> > Let me know if you're able to see similar results.
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Arun
> >>>>> >
> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
> wrote:
> >>>>> >>
> >>>>> >> I get more or less the same results as you for the provided setup
> data
> >>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
> >>>>> >> which is to be expected).  The fact that the arrow size is much
> lower
> >>>>> >> than the dataframe size is not too surprising to me.  If a column
> >>>>> >> can't be zero copied then its memory will disappear from the arrow
> >>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
> >>>>> >> that they do not have in Arrow.
> >>>>> >>
> >>>>> >> The df_size issue for me seems to be tied to string columns.  I
> think
> >>>>> >> pandas is overestimating how much size is needed there (many of my
> >>>>> >> strings are similar and I wonder if some kind of object sharing is
> >>>>> >> happening).  But we can table this for another time.
> >>>>> >>
> >>>>> >> I tried writing my feather file with your parameters and it didn't
> >>>>> >> have much impact on any of the numbers.
> >>>>> >>
> >>>>> >> Since the arrow size for you is expected (nearly the same as the
> >>>>> >> df_size) I'm not sure what to investigate next.  The memory does
> not
> >>>>> >> seem to be retained by Arrow.  Is there any chance you could
> create a
> >>>>> >> reproducible test case using randomly generated numpy data (then
> you
> >>>>> >> could share that setup function)?
> >>>>> >>
> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com>
> wrote:
> >>>>> >> >
> >>>>> >> > Hi Wes,
> >>>>> >> >
> >>>>> >> > I'm not including the setup() call when I encounter the issue.
> I just kept it in there for ease of reproducibility. Memory usage is indeed
> higher when it is included, but that isn't surprising.
> >>>>> >> >
> >>>>> >> > I tried switching over to the system allocator but there is no
> change.
> >>>>> >> >
> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >>>>> >> >
> >>>>> >> > I updated my script to also include the Arrow bytes allocated
> and it gave me the following:
> >>>>> >> >
> >>>>> >> > MVE:
> >>>>> >> > import pandas as pd
> >>>>> >> > import pyarrow as pa
> >>>>> >> > from pyarrow import feather
> >>>>> >> > import os
> >>>>> >> > import psutil
> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > def setup():
> >>>>> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >>>>> >> >   df.to_feather('test.csv')
> >>>>> >> >
> >>>>> >> > if __name__ == "__main__":
> >>>>> >> >   # setup()
> >>>>> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >   path = 'test.csv'
> >>>>> >> >
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >
> >>>>> >> >   df = feather.read_feather(path)
> >>>>> >> >
> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>> >> >
> >>>>> >> > Output with my data:
> >>>>> >> > BEFORE mem_size: 0.08761344gb
> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >>>>> >> > ARROW: 3.080121792gb
> >>>>> >> >
> >>>>> >> > Output with Provided Setup Data:
> >>>>> >> > BEFORE mem_size: 0.09179136gb
> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >>>>> >> > ARROW: 0.00029664gb
> >>>>> >> >
> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes
> are distinct and non-overlapping, but it seems strange that the output with
> the provided data has the Arrow bytes allocated at ~0GB whereas the one
> with my data has the allocated data approximately equal to the dataframe
> size. I'm not sure if it affects anything but my file was written with the
> following:
> >>>>> >> >
> >>>>> >> > import pyarrow.lib as ext
> >>>>> >> > import pyarrow
> >>>>> >> > COMPRESSION_LEVEL = 19
> >>>>> >> > COMPRESSION_ALGO = 'zstd'
> >>>>> >> > KILOBYTE = 1 << 10
> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >>>>> >> > CHUNK_SIZE = MEGABYTE
> >>>>> >> >
> >>>>> >> > table = pyarrow.Table.from_pandas(df,
> preserve_index=preserve_index)
> >>>>> >> > ext.write_feather(table, dest, compression=compression,
> compression_level=compression_level,chunksize=chunk_size, version=2)
> >>>>> >> >
> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm
> not sure why that would be so off for you. Going off the docs, it seems
> like it should be accurate. My Dataframe in question is [7196546 rows x 56
> columns] where each column is mostly a float or integer and datetime index.
> 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >>>>> >> >
> >>>>> >> > Thank You,
> >>>>> >> > Arun
> >>>>> >> >
> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <
> weston.pace@gmail.com> wrote:
> >>>>> >> >>
> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you
> including the
> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV
> reader
> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes in
> memory
> >>>>> >> >> in case it needs to re-convert them later (because it realizes
> the
> >>>>> >> >> data type for the column is different).  I'm not sure if
> >>>>> >> >> pandas' CSV reader is similar.
> >>>>> >> >>
> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to
> hold on
> >>>>> >> >> to a bit more memory and hold onto it (for a little while at
> least)
> >>>>> >> >> even after it is no longer used.  Even malloc will hold onto
> memory
> >>>>> >> >> sometimes due to fragmentation or other concerns.  You could
> try
> >>>>> >> >> changing to the system allocator
> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of
> your file)
> >>>>> >> >> to see if that makes a difference.
> >>>>> >> >>
> >>>>> >> >> I'm not sure your method of calculating the dataframe size is
> >>>>> >> >> reliable.  I don't actually know enough about pandas but when
> I tried
> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up
> reporting:
> >>>>> >> >>
> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >>>>> >> >>
> >>>>> >> >> which seems suspicious.
> >>>>> >> >>
> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't
> seem
> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow
> itself was
> >>>>> >> >> only using ~1.9GB and I will naively assume the difference
> between the
> >>>>> >> >> two is bloat caused by object wrappers when converting to
> pandas.
> >>>>> >> >>
> >>>>> >> >> Another thing you might try and measure is
> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell
> you how
> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB
> then it
> >>>>> >> >> is a pretty good guess that memory is being held somewhere
> else.
> >>>>> >> >>
> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >>>>> >> >> >
> >>>>> >> >> > Hi Apache Arrow Members,
> >>>>> >> >> >
> >>>>> >> >> > My question is below but I've compiled a minimum
> reproducible example with a public dataset:
> >>>>> >> >> >
> >>>>> >> >> > import pandas as pd
> >>>>> >> >> > from pyarrow import feather
> >>>>> >> >> > import os
> >>>>> >> >> > import psutil
> >>>>> >> >> >
> >>>>> >> >> >
> >>>>> >> >> > def setup():
> >>>>> >> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >>>>> >> >> >   df.to_feather('test.csv')
> >>>>> >> >> >
> >>>>> >> >> > if __name__ == "__main__":
> >>>>> >> >> >   # setup()
> >>>>> >> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >> >   path = 'test.csv'
> >>>>> >> >> >
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >> >
> >>>>> >> >> >   df = feather.read_feather(path)
> >>>>> >> >> >
> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >> >
> >>>>> >> >> > I substituted my df with a sample csv. I had trouble finding
> a sample CSV of adequate size however, my dataset is ~3GB, and I see memory
> usage of close to 6GB.
> >>>>> >> >> >
> >>>>> >> >> > Output with My Data:
> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >>>>> >> >> >
> >>>>> >> >> > It seems strange that the overall memory usage of the
> process is approx double of the size of the dataframe itself. Is there a
> reason for this, and is there a way to mitigate this?
> >>>>> >> >> >
> >>>>> >> >> > $ conda list pyarrow
> >>>>> >> >> > #
> >>>>> >> >> > # Name                    Version                   Build
> Channel
> >>>>> >> >> > pyarrow                   4.0.1
> py37h0f64622_13_cpu    conda-forge
> >>>>> >> >> >
> >>>>> >> >> > Thank You,
> >>>>> >> >> > Arun Joseph
> >>>>> >> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > --
> >>>>> >> > Arun Joseph
> >>>>> >> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Arun Joseph
> >>>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Arun Joseph
> >>>>
> >>>
> >>>
> >>> --
> >>> Arun Joseph
> >>>
> >>
> >>
> >> --
> >> Arun Joseph
> >>
> >
> >
> > --
> > Arun Joseph
> >
>
>
>

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Whoops sorry Weston, that’s my bad! Thank you for the addition to the SO
post. I do see some improvements with deleting the reader now.

Arun

On Thu, Dec 9, 2021 at 20:21 Weston Pace <we...@gmail.com> wrote:

> > Thank you Wes and David for the in-depth responses.
>
> Just as an aside, I go by Weston, as there is already a Wes on this
> mailing list and it can get confusing ;)
>
> > I also created a stack overflow post...I hope that is ok and or useful.
> Otherwise I can remove it.
>
> I think that's fine, SO can have greater reach than the mailing list.
>
> > As for guppy double counting, that is really strange.
>
> I agree, maybe guppy has difficulty identifying when two python
> objects reference the same underlying chunk of C memory.  Here's a
> quick simple example of it getting confused (I'll add this to the SO
> post):
>
> import numpy as np
> import os
> import psutil
> import pyarrow as pa
> from guppy import hpy
>
> process = psutil.Process(os.getpid())
>
> x = np.random.rand(100000000)
> print(hpy().heap())
> print(process.memory_info().rss)
>
> # This is a zero-copy operation.  Note
> # that RSS remains consistent.  Both x
> # and arr reference the same underlying
> # array of doubles.
> arr = pa.array(x)
> print(hpy().heap())
> print(process.memory_info().rss)
>
> > By deleting the Reader, do you mean just doing a `del Reader` or `Reader
> = None`?
>
> I was thinking "del reader" but "reader = None" should achieve the
> same effect (I think?)
>
> On Tue, Dec 7, 2021 at 2:44 PM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Thank you Wes and David for the in-depth responses. I also created a
> stack overflow post on this with an example/outputs (before I saw your
> responses, but updated after I saw it), I hope that is ok and or useful.
> Otherwise I can remove it.
> >
> > # I'm pretty sure guppy3 is double-counting.
> > As for guppy double counting, that is really strange. I think you're
> looking at the cumulative size re: the dataframe size. Although, I did
> observe what you are describing when I was generating the test data as
> well. In the stack overflow post, I have another example which prints out
> the RSS and the guppy heap output and it does not seem like there is double
> counting for the normal run. I also included a sleep at the end before
> recording the heap and RSS as well.
> >
> > # I think split_blocks and self_destruct is the best answer at the
> moment.  self_destruct has remained in the code since at least 1.0.0 so
> perhaps it is time we remove the "experimental" flag and maybe replace it
> with a "caution" or "danger" flag (as it causes the table to become
> unusable afterwards). In terms of the closest immediate fix, split_blocks
> and self_destruct do seem like the best choices.
> > Yes I agree. I'll be incorporating these changes in my actual codebase.
> While they don't always work, there should be some improvement.
> >
> > # I did see some strange behavior when working with the
> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you can
> work around this by deleting the reader.
> > By deleting the Reader, do you mean just doing a `del Reader` or `Reader
> = None`?
> >
> > # Note that to minimize the memory usage, you should also pass
> use_threads=False
> > I will also try this out, thank you
> >
> > On Tue, Dec 7, 2021 at 6:32 PM David Li <li...@apache.org> wrote:
> >>
> >> Just for edification (though I have limited understanding of the
> machinery here, someone more familiar with Pandas internals may have more
> insight/this may be wrong or very outdated!):
> >>
> >> zero_copy_only does not work for two reasons (well, one reason
> fundamentally): the representation in memory of a Pandas dataframe has been
> a dense, 2D NumPy array per column type. In other words, all data across
> all columns of the same type are contiguous in memory. (At least
> historically. My understanding is that this has changed/become more
> flexible relatively recently.) This is the representation that Arrow tries
> to generate by default. (See
> https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
> >>
> >> However, the Arrow table you have is not contiguous: each column is
> allocated separately, and for a Table, each column is made up of a list of
> contiguous chunks. So there are very few cases where data can be
> zero-copied, it must instead be copied and "compacted".
> >>
> >> The split_blocks option *helps* work around this. It allows each column
> in the Pandas DataFrame to be its own allocation. However, each individual
> column must still be contiguous. If you try zero_copy_only with
> split_blocks, you'll get a different error message, this is because the
> columns of your Arrow Table have more than one chunk. If you create a small
> in-memory Table with only one column with one chunk, zero_copy_only +
> split_blocks will work!
> >>
> >> split_blocks with self_destruct works in this case still because
> self_destruct will still copy data, it will just also try to free the Arrow
> data as each column is converted. (Note that to minimize the memory usage,
> you should also pass use_threads=False. In that case, the maximum memory
> overhead should be one column's worth.)
> >>
> >> -David
> >>
> >> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
> >>
> >> Thank you for the new example.
> >>
> >> # Why is it 2x?
> >>
> >> This is essentially a "peak RAM" usage of the operation.  Given that
> >> split_blocks helped I think we can attribute this doubling to the
> >> pandas conversion.
> >>
> >> # Why doesn't the memory get returned?
> >>
> >> It does, it just doesn't do so immediately.  If I put a 5 second sleep
> >> before I print the memory I see that the RSS shrinks down.  This is
> >> how jemalloc is configured in Arrow (actually I think it is 1 second)
> >> for releasing RSS after reaching peak consumption.
> >>
> >> BEFORE mem_size: 0.082276352gb
> >> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
> >> AFTER-ARROW: 3.281625024gb
> >> ---five second sleep---
> >> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
> >> AFTER-SLEEP-ARROW: 3.281625024gb
> >>
> >> # Why didn't switching to the system allocator help?
> >>
> >> The problem isn't "the dynamic allocator is allocating more than it
> >> needs".  There is a point in this process where ~6GB are actually
> >> needed.  The system allocator either also holds on to that RSS for a
> >> little bit or the RSS numbers themselves take a little bit of time to
> >> update.  I'm not entirely sure.
> >>
> >> # Why isn't this a zero-copy conversion to pandas?
> >>
> >> That's a good question, I don't know the details.  If I try manually
> >> doing the conversion with zero_copy_only I get the error "Cannot do
> >> zero copy conversion into multi-column DataFrame block"
> >>
> >> # What is up with the numpy.ndarray objects in the heap?
> >>
> >> I'm pretty sure guppy3 is double-counting.  Note that the total size
> >> is ~20GB.  I've been able to reproduce this in cases where the heap is
> >> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
> >> once even managed to generate this:
> >>
> >> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
> >> AFTER-SLEEP-ARROW: 0.0gb
> >> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
> >> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >>      0     57   0 6563250864  49 6563250864  49
> pandas.core.series.Series
> >>      1    133   0 3339213718  25 9902464582  74 numpy.ndarray
> >>      2      1   0 3339197360  25 13241661942  99
> pandas.core.frame.DataFrame
> >>
> >> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
> >>
> >> I did see some strange behavior when working with the
> >> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
> >> can work around this by deleting the reader.
> >>
> >> # Can I return the data immediately / I don't want to use 2x memory
> consumption
> >>
> >> I think split_blocks and self_destruct is the best answer at the
> >> moment.  self_destruct has remained in the code since at least 1.0.0
> >> so perhaps it is time we remove the "experimental" flag and maybe
> >> replace it with a "caution" or "danger" flag (as it causes the table
> >> to become unusable afterwards).
> >>
> >> Jemalloc has some manual facilities to purge dirty memory and we
> >> expose some of them with
> >> pyarrow.default_memory_pool().release_unused() but that doesn't seem
> >> to be helping in this situation.  Either the excess memory is in the
> >> non-jemalloc pool or the jemalloc command can't quite release this
> >> memory, or the RSS stats are just stale.  I'm not entirely sure.
> >>
> >> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >> >
> >> > Slightly related, I have some other code that opens up an arrow file
> using a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch
> to a pandas dataframe. After this conversion is done, and I inspect the
> heap, I always see the following:
> >> >
> >> > hpy().heap()
> >> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
> >> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of
> class)
> >> >      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
> >> >      1      1   0 9939585700  49 19879186734  99
> pandas.core.frame.DataFrame
> >> >      2      1   0 185786680   1 20064973414 100
> pandas.core.indexes.datetimes.DatetimeIndex
> >> >
> >> > Specifically the numpy.ndarray. It only shows up after the conversion
> and it does not seem to go away. It also seems to be roughly the same size
> as the dataframe itself.
> >> >
> >> > - Arun
> >> >
> >> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >> >>
> >> >> Just to follow up on this, is there a way to manually force the
> arrow pool to de-allocate? My usecase is essentially having multiple
> processes in a Pool or via Slurm read from an arrow file, do some work, and
> then exit. Issue is that the 2x memory consumption reduces the bandwidth on
> the machine to effectively half.
> >> >>
> >> >> Thank You,
> >> >> Arun
> >> >>
> >> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >> >>>
> >> >>> Additionally, I tested with my actual data, and did not see memory
> savings.
> >> >>>
> >> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >> >>>>
> >> >>>> Hi Joris,
> >> >>>>
> >> >>>> Thank you for the explanation. The 2x memory consumption on
> conversion makes sense if there is a copy, but it does seem like it
> persists longer than it should. Might be because of python's GC policies?
> >> >>>> I tried out your recommendations but they did not seem to work.
> However, I did notice an experimental option on `to_pandas`,
> `self_destruct`, which seems to address the issue I'm facing. Sadly, that
> itself did not work either... but, combined with `split_blocks=True`, I am
> seeing memory savings:
> >> >>>>
> >> >>>> import pandas as pd
> >> >>>> import numpy as np
> >> >>>> import pyarrow as pa
> >> >>>> from pyarrow import feather
> >> >>>> import os
> >> >>>> import psutil
> >> >>>> pa.set_memory_pool(pa.system_memory_pool())
> >> >>>> DATA_FILE = 'test.arrow'
> >> >>>>
> >> >>>> def setup():
> >> >>>>   np.random.seed(0)
> >> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
> columns=list([f'{i}' for i in range(57)]))
> >> >>>>   df.to_feather(DATA_FILE)
> >> >>>>   print(f'wrote {DATA_FILE}')
> >> >>>>   import sys
> >> >>>>   sys.exit()
> >> >>>>
> >> >>>> if __name__ == "__main__":
> >> >>>>   # setup()
> >> >>>>   process = psutil.Process(os.getpid())
> >> >>>>   path = DATA_FILE
> >> >>>>
> >> >>>>   mem_size = process.memory_info().rss / 1e9
> >> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>
> >> >>>>   feather_table = feather.read_table(path)
> >> >>>>   # df = feather_table.to_pandas(split_blocks=True)
> >> >>>>   # df = feather_table.to_pandas()
> >> >>>>   df = feather_table.to_pandas(self_destruct=True,
> split_blocks=True)
> >> >>>>
> >> >>>>   mem_size = process.memory_info().rss / 1e9
> >> >>>>   df_size = df.memory_usage().sum() / 1e9
> >> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >> >>>>
> >> >>>>
> >> >>>> OUTPUT(to_pandas()):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281625024gb
> >> >>>>
> >> >>>> OUTPUT (to_pandas(split_blocks=True)):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281627712gb
> >> >>>>
> >> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> >> >>>> BEFORE mem_size: 0.091795456gb
> >> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> >> >>>> ARROW: 3.281627712gb
> >> >>>>
> >> >>>> I'm guessing since this feature is experimental, it might either
> go away, or might have strange behaviors. Is there anything I should look
> out for, or is there some alternative to reproduce these results?
> >> >>>>
> >> >>>> Thank You,
> >> >>>> Arun
> >> >>>>
> >> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >> >>>>>
> >> >>>>> Hi Arun, Weston,
> >> >>>>>
> >> >>>>> I didn't try running the script locally, but a quick note: the
> >> >>>>> `feather.read_feather` function reads the Feather file into an
> Arrow
> >> >>>>> table ànd directly converts it to a pandas DataFrame. A memory
> >> >>>>> consumption 2x the size of the dataframe sounds not that
> unexpected to
> >> >>>>> me: most of the time, when converting an arrow table to a pandas
> >> >>>>> DataFrame, the data will be copied to accommodate for pandas'
> specific
> >> >>>>> internal memory layout (at least numeric columns will be combined
> >> >>>>> together in 2D arrays).
> >> >>>>>
> >> >>>>> To verify if this is the cause, you might want to do either of:
> >> >>>>> - use `feather.read_table` instead of `feather.read_feather`,
> which
> >> >>>>> will read the file as an Arrow table instead (and don't do any
> >> >>>>> conversion to pandas)
> >> >>>>> - if you want to include the conversion to pandas, also use
> >> >>>>> `read_table` and do the conversion to pandas explicitly with a
> >> >>>>> `to_pandas()` call on the result. In that case, you can specify
> >> >>>>> `split_blocks=True` to use more zero-copy conversion in the
> >> >>>>> arrow->pandas conversion
> >> >>>>>
> >> >>>>> Joris
> >> >>>>>
> >> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com>
> wrote:
> >> >>>>> >
> >> >>>>> > Hi Wes,
> >> >>>>> >
> >> >>>>> > Sorry for the late reply on this, but I think I got a
> reproducible test case:
> >> >>>>> >
> >> >>>>> > import pandas as pd
> >> >>>>> > import numpy as np
> >> >>>>> > import pyarrow as pa
> >> >>>>> > from pyarrow import feather
> >> >>>>> > import os
> >> >>>>> > import psutil
> >> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
> >> >>>>> > DATA_FILE = 'test.arrow'
> >> >>>>> >
> >> >>>>> > def setup():
> >> >>>>> >   np.random.seed(0)
> >> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546,
> 57)), columns=list([f'i_{i}' for i in range(57)]))
> >> >>>>> >   df.to_feather(DATA_FILE)
> >> >>>>> >   print(f'wrote {DATA_FILE}')
> >> >>>>> >   import sys
> >> >>>>> >   sys.exit()
> >> >>>>> >
> >> >>>>> > if __name__ == "__main__":
> >> >>>>> >   # setup()
> >> >>>>> >   process = psutil.Process(os.getpid())
> >> >>>>> >   path = DATA_FILE
> >> >>>>> >
> >> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >
> >> >>>>> >   df = feather.read_feather(path)
> >> >>>>> >
> >> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >   df_size = df.memory_usage().sum() / 1e9
> >> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >> >>>>> >
> >> >>>>> > OUTPUT:
> >> >>>>> > BEFORE mem_size: 0.091795456gb
> >> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> >> >>>>> > ARROW: 3.281625024gb
> >> >>>>> >
> >> >>>>> > Let me know if you're able to see similar results.
> >> >>>>> >
> >> >>>>> > Thanks,
> >> >>>>> > Arun
> >> >>>>> >
> >> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <
> weston.pace@gmail.com> wrote:
> >> >>>>> >>
> >> >>>>> >> I get more or less the same results as you for the provided
> setup data
> >> >>>>> >> (exact same #'s for arrow & df_size and slightly different for
> RSS
> >> >>>>> >> which is to be expected).  The fact that the arrow size is
> much lower
> >> >>>>> >> than the dataframe size is not too surprising to me.  If a
> column
> >> >>>>> >> can't be zero copied then it's memory will disappear from the
> arrow
> >> >>>>> >> pool (I think).  Plus, object columns will have overhead in
> pandas
> >> >>>>> >> that they do not have in Arrow.
> >> >>>>> >>
> >> >>>>> >> The df_size issue for me seems to be tied to string columns.
> I think
> >> >>>>> >> pandas is overestimating how much size is needed there (many
> of my
> >> >>>>> >> strings are similar and I wonder if some kind of object
> sharing is
> >> >>>>> >> happening).  But we can table this for another time.
> >> >>>>> >>
> >> >>>>> >> I tried writing my feather file with your parameters and it
> didn't
> >> >>>>> >> have much impact on any of the numbers.
> >> >>>>> >>
> >> >>>>> >> Since the arrow size for you is expected (nearly the same as
> the
> >> >>>>> >> df_size) I'm not sure what to investigate next.  The memory
> does not
> >> >>>>> >> seem to be retained by Arrow.  Is there any chance you could
> create a
> >> >>>>> >> reproducible test case using randomly generated numpy data
> (then you
> >> >>>>> >> could share that setup function)?
> >> >>>>> >>
> >> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com>
> wrote:
> >> >>>>> >> >
> >> >>>>> >> > Hi Wes,
> >> >>>>> >> >
> >> >>>>> >> > I'm not including the setup() call when I encounter the
> issue. I just kept it in there for ease of reproducibility. Memory usage is
> indeed higher when it is included, but that isn't surprising.
> >> >>>>> >> >
> >> >>>>> >> > I tried switching over to the system allocator but there is
> no change.
> >> >>>>> >> >
> >> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >> >>>>> >> >
> >> >>>>> >> > I updated my script to also include the Arrow bytes
> allocated and it gave me the following:
> >> >>>>> >> >
> >> >>>>> >> > MVE:
> >> >>>>> >> > import pandas as pd
> >> >>>>> >> > import pyarrow as pa
> >> >>>>> >> > from pyarrow import feather
> >> >>>>> >> > import os
> >> >>>>> >> > import psutil
> >> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> > def setup():
> >> >>>>> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >> >>>>> >> >   df.to_feather('test.csv')
> >> >>>>> >> >
> >> >>>>> >> > if __name__ == "__main__":
> >> >>>>> >> >   # setup()
> >> >>>>> >> >   process = psutil.Process(os.getpid())
> >> >>>>> >> >   path = 'test.csv'
> >> >>>>> >> >
> >> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >> >
> >> >>>>> >> >   df = feather.read_feather(path)
> >> >>>>> >> >
> >> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated()
> / 1e9}gb')
> >> >>>>> >> >
> >> >>>>> >> > Output with my data:
> >> >>>>> >> > BEFORE mem_size: 0.08761344gb
> >> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >> >>>>> >> > ARROW: 3.080121792gb
> >> >>>>> >> >
> >> >>>>> >> > Output with Provided Setup Data:
> >> >>>>> >> > BEFORE mem_size: 0.09179136gb
> >> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >> >>>>> >> > ARROW: 0.00029664gb
> >> >>>>> >> >
> >> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes
> are distinct and non-overlapping, but it seems strange that the output with
> the provided data has the Arrow bytes allocated at ~0GB whereas the one
> with my data has the allocated data approximately equal to the dataframe
> size. I'm not sure if it affects anything but my file was written with the
> following:
> >> >>>>> >> >
> >> >>>>> >> > import pyarrow.lib as ext
> >> >>>>> >> > import pyarrow
> >> >>>>> >> > COMPRESSION_LEVEL = 19
> >> >>>>> >> > COMPRESSION_ALGO = 'zstd'
> >> >>>>> >> > KILOBYTE = 1 << 10
> >> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >> >>>>> >> > CHUNK_SIZE = MEGABYTE
> >> >>>>> >> >
> >> >>>>> >> > table = pyarrow.Table.from_pandas(df,
> preserve_index=preserve_index)
> >> >>>>> >> > ext.write_feather(table, dest, compression=compression,
> compression_level=compression_level,chunksize=chunk_size, version=2)
> >> >>>>> >> >
> >> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm
> not sure why that would be so off for you. Going off the docs, it seems
> like it should be accurate. My Dataframe in question is [7196546 rows x 56
> columns] where each column is mostly a float or integer and datetime index.
> 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >> >>>>> >> >
> >> >>>>> >> > Thank You,
> >> >>>>> >> > Arun
> >> >>>>> >> >
> >> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <
> weston.pace@gmail.com> wrote:
> >> >>>>> >> >>
> >> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you
> including the
> >> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based
> CSV reader
> >> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes
> in memory
> >> >>>>> >> >> in case it needs to re-convert them later (because it
> realizes the
> >> >>>>> >> >> data type for the column is different).  I'm not sure if
> Panda's CSV
> >> >>>>> >> >> reader is similar.
> >> >>>>> >> >>
> >> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow
> to hold on
> >> >>>>> >> >> to a bit more memory and hold onto it (for a little while
> at least)
> >> >>>>> >> >> even after it is no longer used.  Even malloc will hold
> onto memory
> >> >>>>> >> >> sometimes due to fragmentation or other concerns.  You
> could try
> >> >>>>> >> >> changing to the system allocator
> >> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of
> your file)
> >> >>>>> >> >> to see if that makes a difference.
> >> >>>>> >> >>
> >> >>>>> >> >> I'm not sure your method of calculating the dataframe size
> is
> >> >>>>> >> >> reliable.  I don't actually know enough about pandas but
> when I tried
> >> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up
> reporting:
> >> >>>>> >> >>
> >> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >> >>>>> >> >>
> >> >>>>> >> >> which seems suspicious.
> >> >>>>> >> >>
> >> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1)
> didn't seem
> >> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow
> itself was
> >> >>>>> >> >> only using ~1.9GB and I will naively assume the difference
> between the
> >> >>>>> >> >> two is bloat caused by object wrappers when converting to
> pandas.
> >> >>>>> >> >>
> >> >>>>> >> >> Another thing you might try and measure is
> >> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will
> tell you how
> >> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not
> 6GB then it
> >> >>>>> >> >> is a pretty good guess that memory is being held somewhere
> else.
> >> >>>>> >> >>
> >> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <
> ajoseps@gmail.com> wrote:
> >> >>>>> >> >> >
> >> >>>>> >> >> > Hi Apache Arrow Members,
> >> >>>>> >> >> >
> >> >>>>> >> >> > My question is below but I've compiled a minimum
> reproducible example with a public dataset:
> >> >>>>> >> >> >
> >> >>>>> >> >> > import pandas as pd
> >> >>>>> >> >> > from pyarrow import feather
> >> >>>>> >> >> > import os
> >> >>>>> >> >> > import psutil
> >> >>>>> >> >> >
> >> >>>>> >> >> >
> >> >>>>> >> >> > def setup():
> >> >>>>> >> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >> >>>>> >> >> >   df.to_feather('test.csv')
> >> >>>>> >> >> >
> >> >>>>> >> >> > if __name__ == "__main__":
> >> >>>>> >> >> >   # setup()
> >> >>>>> >> >> >   process = psutil.Process(os.getpid())
> >> >>>>> >> >> >   path = 'test.csv'
> >> >>>>> >> >> >
> >> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >>>>> >> >> >
> >> >>>>> >> >> >   df = feather.read_feather(path)
> >> >>>>> >> >> >
> >> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size:
> {df_size}gb')
> >> >>>>> >> >> >
> >> >>>>> >> >> > I substituted my df with a sample csv. I had trouble
> finding a sample CSV of adequate size however, my dataset is ~3GB, and I
> see memory usage of close to 6GB.
> >> >>>>> >> >> >
> >> >>>>> >> >> > Output with My Data:
> >> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
> >> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >> >>>>> >> >> >
> >> >>>>> >> >> > It seems strange that the overall memory usage of the
> process is approx double of the size of the dataframe itself. Is there a
> reason for this, and is there a way to mitigate this?
> >> >>>>> >> >> >
> >> >>>>> >> >> > $ conda list pyarrow
> >> >>>>> >> >> > #
> >> >>>>> >> >> > # Name                    Version
>  Build  Channel
> >> >>>>> >> >> > pyarrow                   4.0.1
>  py37h0f64622_13_cpu    conda-forge
> >> >>>>> >> >> >
> >> >>>>> >> >> > Thank You,
> >> >>>>> >> >> > Arun Joseph
> >> >>>>> >> >> >
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> >
> >> >>>>> >> > --
> >> >>>>> >> > Arun Joseph
> >> >>>>> >> >
> >> >>>>> >
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > --
> >> >>>>> > Arun Joseph
> >> >>>>> >
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Arun Joseph
> >> >>>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Arun Joseph
> >> >>>
> >> >>
> >> >>
> >> >> --
> >> >> Arun Joseph
> >> >>
> >> >
> >> >
> >> > --
> >> > Arun Joseph
> >> >
> >>
> >>
> >
> >
> > --
> > Arun Joseph
> >
>
-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Thank you Wes and David for the in-depth responses. I also created a stack
overflow post
<https://stackoverflow.com/questions/70267522/does-conversion-from-arrow-format-to-pandas-dataframe-duplicate-data-on-the-heap>
on this with an example and outputs (written before I saw your responses,
but updated afterwards); I hope that is okay and/or useful. Otherwise I can
remove it.

# I'm pretty sure guppy3 is double-counting.
As for guppy double counting, that is really strange. I think you're
looking at the cumulative size re: the dataframe size. Although, I did
observe what you are describing when I was generating the test data as
well. In the stack overflow post, I have another example which prints out
the RSS and the guppy heap output, and it does not seem like there is double
counting for the normal run. I also included a sleep at the end before
recording the heap and RSS.

# I think split_blocks and self_destruct is the best answer at the moment.
self_destruct has remained in the code since at least 1.0.0 so perhaps it
is time we remove the "experimental" flag and maybe replace it with a
"caution" or "danger" flag (as it causes the table to become unusable
afterwards).
In terms of the closest immediate fix, split_blocks and self_destruct do
seem like the best choices, so yes, I agree. I'll be incorporating these
changes in my actual codebase. While they don't always work, there should
be some improvement.

# I did see some strange behavior when working with the
RecordBatchFileReader and I opened ARROW-15017 to resolve this but you can
work around this by deleting the reader.
By deleting the Reader, do you mean just doing a `del Reader` or `Reader =
None`?

# Note that to minimize the memory usage, you should also pass
use_threads=False
I will also try this out, thank you
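
For reference, a sketch of the combined call, using the test.arrow file from
the earlier scripts:

from pyarrow import feather

table = feather.read_table('test.arrow')

# split_blocks gives each column its own allocation, self_destruct frees the
# Arrow columns as they are converted (the table is unusable afterwards), and
# use_threads=False should keep the peak overhead to roughly one column.
df = table.to_pandas(split_blocks=True, self_destruct=True, use_threads=False)
del table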

On Tue, Dec 7, 2021 at 6:32 PM David Li <li...@apache.org> wrote:

> Just for edification (though I have limited understanding of the machinery
> here, someone more familiar with Pandas internals may have more
> insight/this may be wrong or very outdated!):
>
> zero_copy_only does not work for two reasons (well, one reason
> fundamentally): the representation in memory of a Pandas dataframe has been
> a dense, 2D NumPy array per column type. In other words, all data across
> all columns of the same type are contiguous in memory. (At least
> historically. My understanding is that this has changed/become more
> flexible relatively recently.) This is the representation that Arrow tries
> to generate by default. (See
> https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
>
> However, the Arrow table you have is not contiguous: each column is
> allocated separately, and for a Table, each column is made up of a list of
> contiguous chunks. So there are very few cases where data can be
> zero-copied, it must instead be copied and "compacted".
>
> The split_blocks option *helps* work around this. It allows each column in
> the Pandas DataFrame to be its own allocation. However, each individual
> column must still be contiguous. If you try zero_copy_only with
> split_blocks, you'll get a different error message, this is because the
> columns of your Arrow Table have more than one chunk. If you create a small
> in-memory Table with only one column with one chunk, zero_copy_only +
> split_blocks will work!
>
> split_blocks with self_destruct works in this case still because
> self_destruct will still copy data, it will just also try to free the Arrow
> data as each column is converted. (Note that to minimize the memory usage,
> you should also pass use_threads=False. In that case, the maximum memory
> overhead should be one column's worth.)
>
> -David
>
> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
>
> Thank you for the new example.
>
> # Why is it 2x?
>
> This is essentially a "peak RAM" usage of the operation.  Given that
> split_blocks helped I think we can attribute this doubling to the
> pandas conversion.
>
> # Why doesn't the memory get returned?
>
> It does, it just doesn't do so immediately.  If I put a 5 second sleep
> before I print the memory I see that the RSS shrinks down.  This is
> how jemalloc is configured in Arrow (actually I think it is 1 second)
> for releasing RSS after reaching peak consumption.
>
> BEFORE mem_size: 0.082276352gb
> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
> AFTER-ARROW: 3.281625024gb
> ---five second sleep---
> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
> AFTER-SLEEP-ARROW: 3.281625024gb
>
> # Why didn't switching to the system allocator help?
>
> The problem isn't "the dynamic allocator is allocating more than it
> needs".  There is a point in this process where ~6GB are actually
> needed.  The system allocator either also holds on to that RSS for a
> little bit or the RSS numbers themselves take a little bit of time to
> update.  I'm not entirely sure.
>
> # Why isn't this a zero-copy conversion to pandas?
>
> That's a good question, I don't know the details.  If I try manually
> doing the conversion with zero_copy_only I get the error "Cannot do
> zero copy conversion into multi-column DataFrame block"
>
> # What is up with the numpy.ndarray objects in the heap?
>
> I'm pretty sure guppy3 is double-counting.  Note that the total size
> is ~20GB.  I've been able to reproduce this in cases where the heap is
> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
> once even managed to generate this:
>
> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
> AFTER-SLEEP-ARROW: 0.0gb
> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0     57   0 6563250864  49 6563250864  49 pandas.core.series.Series
>      1    133   0 3339213718  25 9902464582  74 numpy.ndarray
>      2      1   0 3339197360  25 13241661942  99 pandas.core.frame.DataFrame
>
> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
>
> I did see some strange behavior when working with the
> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
> can work around this by deleting the reader.
>
> # Can I return the data immediately / I don't want to use 2x memory
> consumption
>
> I think split_blocks and self_destruct is the best answer at the
> moment.  self_destruct has remained in the code since at least 1.0.0
> so perhaps it is time we remove the "experimental" flag and maybe
> replace it with a "caution" or "danger" flag (as it causes the table
> to become unusable afterwards).
>
> Jemalloc has some manual facilities to purge dirty memory and we
> expose some of them with
> pyarrow.default_memory_pool().release_unused() but that doesn't seem
> to be helping in this situation.  Either the excess memory is in the
> non-jemalloc pool or the jemalloc command can't quite release this
> memory, or the RSS stats are just stale.  I'm not entirely sure.
>
> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Slightly related, I have some other code that opens up an arrow file
> using a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch
> to a pandas dataframe. After this conversion is done, and I inspect the
> heap, I always see the following:
> >
> > hpy().heap()
> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
> >      1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
> >      2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
> >
> > Specifically the numpy.ndarray. It only shows up after the conversion
> and it does not seem to go away. It also seems to be roughly the same size
> as the dataframe itself.
> >
> > - Arun
> >
> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com> wrote:
> >>
> >> Just to follow up on this, is there a way to manually force the arrow
> pool to de-allocate? My usecase is essentially having multiple processes in
> a Pool or via Slurm read from an arrow file, do some work, and then exit.
> Issue is that the 2x memory consumption reduces the bandwidth on the
> machine to effectively half.
> >>
> >> Thank You,
> >> Arun
> >>
> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>
> >>> Additionally, I tested with my actual data, and did not see memory
> savings.
> >>>
> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>>
> >>>> Hi Joris,
> >>>>
> >>>> Thank you for the explanation. The 2x memory consumption on
> conversion makes sense if there is a copy, but it does seem like it
> persists longer than it should. Might be because of python's GC policies?
> >>>> I tried out your recommendations but they did not seem to work.
> However, I did notice an experimental option on `to_pandas`,
> `self_destruct`, which seems to address the issue I'm facing. Sadly, that
> itself did not work either... but, combined with `split_blocks=True`, I am
> seeing memory savings:
> >>>>
> >>>> import pandas as pd
> >>>> import numpy as np
> >>>> import pyarrow as pa
> >>>> from pyarrow import feather
> >>>> import os
> >>>> import psutil
> >>>> pa.set_memory_pool(pa.system_memory_pool())
> >>>> DATA_FILE = 'test.arrow'
> >>>>
> >>>> def setup():
> >>>>   np.random.seed(0)
> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
> columns=list([f'{i}' for i in range(57)]))
> >>>>   df.to_feather(DATA_FILE)
> >>>>   print(f'wrote {DATA_FILE}')
> >>>>   import sys
> >>>>   sys.exit()
> >>>>
> >>>> if __name__ == "__main__":
> >>>>   # setup()
> >>>>   process = psutil.Process(os.getpid())
> >>>>   path = DATA_FILE
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>
> >>>>   feather_table = feather.read_table(path)
> >>>>   # df = feather_table.to_pandas(split_blocks=True)
> >>>>   # df = feather_table.to_pandas()
> >>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   df_size = df.memory_usage().sum() / 1e9
> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>
> >>>>
> >>>> OUTPUT(to_pandas()):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> >>>> ARROW: 3.281625024gb
> >>>>
> >>>> OUTPUT (to_pandas(split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> I'm guessing since this feature is experimental, it might either go
> away, or might have strange behaviors. Is there anything I should look out
> for, or is there some alternative to reproduce these results?
> >>>>
> >>>> Thank You,
> >>>> Arun
> >>>>
> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
> >>>>>
> >>>>> Hi Arun, Weston,
> >>>>>
> >>>>> I didn't try running the script locally, but a quick note: the
> >>>>> `feather.read_feather` function reads the Feather file into an Arrow
> >>>>> table and directly converts it to a pandas DataFrame. A memory
> >>>>> consumption 2x the size of the dataframe sounds not that unexpected
> to
> >>>>> me: most of the time, when converting an arrow table to a pandas
> >>>>> DataFrame, the data will be copied to accommodate for pandas'
> specific
> >>>>> internal memory layout (at least numeric columns will be combined
> >>>>> together in 2D arrays).
> >>>>>
> >>>>> To verify if this is the cause, you might want to do either of:
> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which
> >>>>> will read the file as an Arrow table instead (and don't do any
> >>>>> conversion to pandas)
> >>>>> - if you want to include the conversion to pandas, also use
> >>>>> `read_table` and do the conversion to pandas explicitly with a
> >>>>> `to_pandas()` call on the result. In that case, you can specify
> >>>>> `split_blocks=True` to use more zero-copy conversion in the
> >>>>> arrow->pandas conversion
> >>>>>
> >>>>> Joris
> >>>>>
> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
> >>>>> >
> >>>>> > Hi Wes,
> >>>>> >
> >>>>> > Sorry for the late reply on this, but I think I got a reproducible
> test case:
> >>>>> >
> >>>>> > import pandas as pd
> >>>>> > import numpy as np
> >>>>> > import pyarrow as pa
> >>>>> > from pyarrow import feather
> >>>>> > import os
> >>>>> > import psutil
> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> > DATA_FILE = 'test.arrow'
> >>>>> >
> >>>>> > def setup():
> >>>>> >   np.random.seed(0)
> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
> columns=list([f'i_{i}' for i in range(57)]))
> >>>>> >   df.to_feather(DATA_FILE)
> >>>>> >   print(f'wrote {DATA_FILE}')
> >>>>> >   import sys
> >>>>> >   sys.exit()
> >>>>> >
> >>>>> > if __name__ == "__main__":
> >>>>> >   # setup()
> >>>>> >   process = psutil.Process(os.getpid())
> >>>>> >   path = DATA_FILE
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >
> >>>>> >   df = feather.read_feather(path)
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   df_size = df.memory_usage().sum() / 1e9
> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>> >
> >>>>> > OUTPUT:
> >>>>> > BEFORE mem_size: 0.091795456gb
> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> >>>>> > ARROW: 3.281625024gb
> >>>>> >
> >>>>> > Let me know if you're able to see similar results.
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Arun
> >>>>> >
> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
> wrote:
> >>>>> >>
> >>>>> >> I get more or less the same results as you for the provided setup
> data
> >>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
> >>>>> >> which is to be expected).  The fact that the arrow size is much
> lower
> >>>>> >> than the dataframe size is not too surprising to me.  If a column
> >>>>> >> can't be zero copied then its memory will disappear from the
> arrow
> >>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
> >>>>> >> that they do not have in Arrow.
> >>>>> >>
> >>>>> >> The df_size issue for me seems to be tied to string columns.  I
> think
> >>>>> >> pandas is overestimating how much size is needed there (many of my
> >>>>> >> strings are similar and I wonder if some kind of object sharing is
> >>>>> >> happening).  But we can table this for another time.
> >>>>> >>
> >>>>> >> I tried writing my feather file with your parameters and it didn't
> >>>>> >> have much impact on any of the numbers.
> >>>>> >>
> >>>>> >> Since the arrow size for you is expected (nearly the same as the
> >>>>> >> df_size) I'm not sure what to investigate next.  The memory does
> not
> >>>>> >> seem to be retained by Arrow.  Is there any chance you could
> create a
> >>>>> >> reproducible test case using randomly generated numpy data (then
> you
> >>>>> >> could share that setup function)?
> >>>>> >>
> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com>
> wrote:
> >>>>> >> >
> >>>>> >> > Hi Wes,
> >>>>> >> >
> >>>>> >> > I'm not including the setup() call when I encounter the issue.
> I just kept it in there for ease of reproducibility. Memory usage is indeed
> higher when it is included, but that isn't surprising.
> >>>>> >> >
> >>>>> >> > I tried switching over to the system allocator but there is no
> change.
> >>>>> >> >
> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >>>>> >> >
> >>>>> >> > I updated my script to also include the Arrow bytes allocated
> and it gave me the following:
> >>>>> >> >
> >>>>> >> > MVE:
> >>>>> >> > import pandas as pd
> >>>>> >> > import pyarrow as pa
> >>>>> >> > from pyarrow import feather
> >>>>> >> > import os
> >>>>> >> > import psutil
> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > def setup():
> >>>>> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >>>>> >> >   df.to_feather('test.csv')
> >>>>> >> >
> >>>>> >> > if __name__ == "__main__":
> >>>>> >> >   # setup()
> >>>>> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >   path = 'test.csv'
> >>>>> >> >
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >
> >>>>> >> >   df = feather.read_feather(path)
> >>>>> >> >
> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >>>>> >> >
> >>>>> >> > Output with my data:
> >>>>> >> > BEFORE mem_size: 0.08761344gb
> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >>>>> >> > ARROW: 3.080121792gb
> >>>>> >> >
> >>>>> >> > Output with Provided Setup Data:
> >>>>> >> > BEFORE mem_size: 0.09179136gb
> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >>>>> >> > ARROW: 0.00029664gb
> >>>>> >> >
> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes
> are distinct and non-overlapping, but it seems strange that the output with
> the provided data has the Arrow bytes allocated at ~0GB whereas the one
> with my data has the allocated data approximately equal to the dataframe
> size. I'm not sure if it affects anything but my file was written with the
> following:
> >>>>> >> >
> >>>>> >> > import pyarrow.lib as ext
> >>>>> >> > import pyarrow
> >>>>> >> > COMPRESSION_LEVEL = 19
> >>>>> >> > COMPRESSION_ALGO = 'zstd'
> >>>>> >> > KILOBYTE = 1 << 10
> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >>>>> >> > CHUNK_SIZE = MEGABYTE
> >>>>> >> >
> >>>>> >> > table = pyarrow.Table.from_pandas(df,
> preserve_index=preserve_index)
> >>>>> >> > ext.write_feather(table, dest, compression=compression,
> compression_level=compression_level,chunksize=chunk_size, version=2)
> >>>>> >> >
> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm
> not sure why that would be so off for you. Going off the docs, it seems
> like it should be accurate. My Dataframe in question is [7196546 rows x 56
> columns] where each column is mostly a float or integer and datetime index.
> 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >>>>> >> >
> >>>>> >> > Thank You,
> >>>>> >> > Arun
> >>>>> >> >
> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <
> weston.pace@gmail.com> wrote:
> >>>>> >> >>
> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you
> including the
> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV
> reader
> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes in
> memory
> >>>>> >> >> in case it needs to re-convert them later (because it realizes
> the
> >>>>> >> >> data type for the column is different).  I'm not sure if
> Panda's CSV
> >>>>> >> >> reader is similar.
> >>>>> >> >>
> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to
> hold on
> >>>>> >> >> to a bit more memory and hold onto it (for a little while at
> least)
> >>>>> >> >> even after it is no longer used.  Even malloc will hold onto
> memory
> >>>>> >> >> sometimes due to fragmentation or other concerns.  You could
> try
> >>>>> >> >> changing to the system allocator
> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of
> your file)
> >>>>> >> >> to see if that makes a difference.
> >>>>> >> >>
> >>>>> >> >> I'm not sure your method of calculating the dataframe size is
> >>>>> >> >> reliable.  I don't actually know enough about pandas but when
> I tried
> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up
> reporting:
> >>>>> >> >>
> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >>>>> >> >>
> >>>>> >> >> which seems suspicious.
> >>>>> >> >>
> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't
> seem
> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow
> itself was
> >>>>> >> >> only using ~1.9GB and I will naively assume the difference
> between the
> >>>>> >> >> two is bloat caused by object wrappers when converting to
> pandas.
> >>>>> >> >>
> >>>>> >> >> Another thing you might try and measure is
> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell
> you how
> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB
> then it
> >>>>> >> >> is a pretty good guess that memory is being held somewhere
> else.
> >>>>> >> >>
> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >>>>> >> >> >
> >>>>> >> >> > Hi Apache Arrow Members,
> >>>>> >> >> >
> >>>>> >> >> > My question is below but I've compiled a minimum
> reproducible example with a public dataset:
> >>>>> >> >> >
> >>>>> >> >> > import pandas as pd
> >>>>> >> >> > from pyarrow import feather
> >>>>> >> >> > import os
> >>>>> >> >> > import psutil
> >>>>> >> >> >
> >>>>> >> >> >
> >>>>> >> >> > def setup():
> >>>>> >> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >>>>> >> >> >   df.to_feather('test.csv')
> >>>>> >> >> >
> >>>>> >> >> > if __name__ == "__main__":
> >>>>> >> >> >   # setup()
> >>>>> >> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >> >   path = 'test.csv'
> >>>>> >> >> >
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >> >
> >>>>> >> >> >   df = feather.read_feather(path)
> >>>>> >> >> >
> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >> >
> >>>>> >> >> > I substituted my df with a sample csv. I had trouble finding
> a sample CSV of adequate size however, my dataset is ~3GB, and I see memory
> usage of close to 6GB.
> >>>>> >> >> >
> >>>>> >> >> > Output with My Data:
> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >>>>> >> >> >
> >>>>> >> >> > It seems strange that the overall memory usage of the
> process is approx double of the size of the dataframe itself. Is there a
> reason for this, and is there a way to mitigate this?
> >>>>> >> >> >
> >>>>> >> >> > $ conda list pyarrow
> >>>>> >> >> > #
> >>>>> >> >> > # Name                    Version                   Build
> Channel
> >>>>> >> >> > pyarrow                   4.0.1
> py37h0f64622_13_cpu    conda-forge
> >>>>> >> >> >
> >>>>> >> >> > Thank You,
> >>>>> >> >> > Arun Joseph
> >>>>> >> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > --
> >>>>> >> > Arun Joseph
> >>>>> >> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Arun Joseph
> >>>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Arun Joseph
> >>>>
> >>>
> >>>
> >>> --
> >>> Arun Joseph
> >>>
> >>
> >>
> >> --
> >> Arun Joseph
> >>
> >
> >
> > --
> > Arun Joseph
> >
>
>
>

-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by David Li <li...@apache.org>.
Just for edification (though I have limited understanding of the machinery here, someone more familiar with Pandas internals may have more insight/this may be wrong or very outdated!):

zero_copy_only does not work for two reasons (well, one reason fundamentally): the representation in memory of a Pandas dataframe has been a dense, 2D NumPy array per column type. In other words, all data across all columns of the same type are contiguous in memory. (At least historically. My understanding is that this has changed/become more flexible relatively recently.) This is the representation that Arrow tries to generate by default. (See https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)

However, the Arrow table you have is not contiguous: each column is allocated separately, and for a Table, each column is made up of a list of contiguous chunks. So there are very few cases where data can be zero-copied, it must instead be copied and "compacted".
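
(As a rough illustration of that chunking, with a placeholder file name:
you can inspect the chunks and, at the cost of a copy, flatten them.)

from pyarrow import feather

table = feather.read_table('test.arrow')   # placeholder path
print(table.column(0).num_chunks)          # often > 1 for a file-backed table
combined = table.combine_chunks()          # copies into one chunk per column
print(combined.column(0).num_chunks)       # 1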

The split_blocks option *helps* work around this. It allows each column in the Pandas DataFrame to be its own allocation. However, each individual column must still be contiguous. If you try zero_copy_only with split_blocks, you'll get a different error message, this is because the columns of your Arrow Table have more than one chunk. If you create a small in-memory Table with only one column with one chunk, zero_copy_only + split_blocks will work!
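
(A tiny sketch of that single-chunk case; the data here is made up purely
for illustration.)

import pyarrow as pa

# One column, one chunk, no nulls: a zero-copy conversion is possible.
table = pa.table({'x': pa.array(range(10), type=pa.int64())})
assert table.column('x').num_chunks == 1

df = table.to_pandas(split_blocks=True, zero_copy_only=True)
print(df['x'].values)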

split_blocks with self_destruct works in this case still because self_destruct will still copy data, it will just also try to free the Arrow data as each column is converted. (Note that to minimize the memory usage, you should also pass use_threads=False. In that case, the maximum memory overhead should be one column's worth.)

-David

On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
> Thank you for the new example.
> 
> # Why is it 2x?
> 
> This is essentially a "peak RAM" usage of the operation.  Given that
> split_blocks helped I think we can attribute this doubling to the
> pandas conversion.
> 
> # Why doesn't the memory get returned?
> 
> It does, it just doesn't do so immediately.  If I put a 5 second sleep
> before I print the memory I see that the RSS shrinks down.  This is
> how jemalloc is configured in Arrow (actually I think it is 1 second)
> for releasing RSS after reaching peak consumption.
> 
> BEFORE mem_size: 0.082276352gb
> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
> AFTER-ARROW: 3.281625024gb
> ---five second sleep---
> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
> AFTER-SLEEP-ARROW: 3.281625024gb
> 
> # Why didn't switching to the system allocator help?
> 
> The problem isn't "the dynamic allocator is allocating more than it
> needs".  There is a point in this process where ~6GB are actually
> needed.  The system allocator either also holds on to that RSS for a
> little bit or the RSS numbers themselves take a little bit of time to
> update.  I'm not entirely sure.
> 
> # Why isn't this a zero-copy conversion to pandas?
> 
> That's a good question, I don't know the details.  If I try manually
> doing the conversion with zero_copy_only I get the error "Cannot do
> zero copy conversion into multi-column DataFrame block"
> 
> # What is up with the numpy.ndarray objects in the heap?
> 
> I'm pretty sure guppy3 is double-counting.  Note that the total size
> is ~20GB.  I've been able to reproduce this in cases where the heap is
> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
> once even managed to generate this:
> 
> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
> AFTER-SLEEP-ARROW: 0.0gb
> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0     57   0 6563250864  49 6563250864  49 pandas.core.series.Series
>      1    133   0 3339213718  25 9902464582  74 numpy.ndarray
>      2      1   0 3339197360  25 13241661942  99 pandas.core.frame.DataFrame
> 
> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
> 
> I did see some strange behavior when working with the
> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
> can work around this by deleting the reader.
> 
> # Can I return the data immediately / I don't want to use 2x memory consumption
> 
> I think split_blocks and self_destruct is the best answer at the
> moment.  self_destruct has remained in the code since at least 1.0.0
> so perhaps it is time we remove the "experimental" flag and maybe
> replace it with a "caution" or "danger" flag (as it causes the table
> to become unusable afterwards).
> 
> Jemalloc has some manual facilities to purge dirty memory and we
> expose some of them with
> pyarrow.default_memory_pool().release_unused() but that doesn't seem
> to be helping in this situation.  Either the excess memory is in the
> non-jemalloc pool or the jemalloc command can't quite release this
> memory, or the RSS stats are just stale.  I'm not entirely sure.
> 
> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Slightly related, I have some other code that opens up an arrow file using a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch to a pandas dataframe. After this conversion is done, and I inspect the heap, I always see the following:
> >
> > hpy().heap()
> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
> >      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
> >      1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
> >      2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
> >
> > Specifically the numpy.ndarray. It only shows up after the conversion and it does not seem to go away. It also seems to be roughly the same size as the dataframe itself.
> >
> > - Arun
> >
> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com> wrote:
> >>
> >> Just to follow up on this, is there a way to manually force the arrow pool to de-allocate? My usecase is essentially having multiple processes in a Pool or via Slurm read from an arrow file, do some work, and then exit. Issue is that the 2x memory consumption reduces the bandwidth on the machine to effectively half.
> >>
> >> Thank You,
> >> Arun
> >>
> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>
> >>> Additionally, I tested with my actual data, and did not see memory savings.
> >>>
> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>>
> >>>> Hi Joris,
> >>>>
> >>>> Thank you for the explanation. The 2x memory consumption on conversion makes sense if there is a copy, but it does seem like it persists longer than it should. Might be because of python's GC policies?
> >>>> I tried out your recommendations but they did not seem to work. However, I did notice an experimental option on `to_pandas`, `self_destruct`, which seems to address the issue I'm facing. Sadly, that itself did not work either... but, combined with `split_blocks=True`, I am seeing memory savings:
> >>>>
> >>>> import pandas as pd
> >>>> import numpy as np
> >>>> import pyarrow as pa
> >>>> from pyarrow import feather
> >>>> import os
> >>>> import psutil
> >>>> pa.set_memory_pool(pa.system_memory_pool())
> >>>> DATA_FILE = 'test.arrow'
> >>>>
> >>>> def setup():
> >>>>   np.random.seed(0)
> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
> >>>>   df.to_feather(DATA_FILE)
> >>>>   print(f'wrote {DATA_FILE}')
> >>>>   import sys
> >>>>   sys.exit()
> >>>>
> >>>> if __name__ == "__main__":
> >>>>   # setup()
> >>>>   process = psutil.Process(os.getpid())
> >>>>   path = DATA_FILE
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>
> >>>>   feather_table = feather.read_table(path)
> >>>>   # df = feather_table.to_pandas(split_blocks=True)
> >>>>   # df = feather_table.to_pandas()
> >>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
> >>>>
> >>>>   mem_size = process.memory_info().rss / 1e9
> >>>>   df_size = df.memory_usage().sum() / 1e9
> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >>>>
> >>>>
> >>>> OUTPUT(to_pandas()):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> >>>> ARROW: 3.281625024gb
> >>>>
> >>>> OUTPUT (to_pandas(split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> >>>> BEFORE mem_size: 0.091795456gb
> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> >>>> ARROW: 3.281627712gb
> >>>>
> >>>> I'm guessing since this feature is experimental, it might either go away, or might have strange behaviors. Is there anything I should look out for, or is there some alternative to reproduce these results?
> >>>>
> >>>> Thank You,
> >>>> Arun
> >>>>
> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <jo...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Arun, Weston,
> >>>>>
> >>>>> I didn't try running the script locally, but a quick note: the
> >>>>> `feather.read_feather` function reads the Feather file into an Arrow
> >>>>> table and directly converts it to a pandas DataFrame. A memory
> >>>>> consumption 2x the size of the dataframe sounds not that unexpected to
> >>>>> me: most of the time, when converting an arrow table to a pandas
> >>>>> DataFrame, the data will be copied to accommodate for pandas' specific
> >>>>> internal memory layout (at least numeric columns will be combined
> >>>>> together in 2D arrays).
> >>>>>
> >>>>> To verify if this is the cause, you might want to do either of:
> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which
> >>>>> will read the file as an Arrow table instead (and don't do any
> >>>>> conversion to pandas)
> >>>>> - if you want to include the conversion to pandas, also use
> >>>>> `read_table` and do the conversion to pandas explicitly with a
> >>>>> `to_pandas()` call on the result. In that case, you can specify
> >>>>> `split_blocks=True` to use more zero-copy conversion in the
> >>>>> arrow->pandas conversion
> >>>>>
> >>>>> Joris
> >>>>>
> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
> >>>>> >
> >>>>> > Hi Wes,
> >>>>> >
> >>>>> > Sorry for the late reply on this, but I think I got a reproducible test case:
> >>>>> >
> >>>>> > import pandas as pd
> >>>>> > import numpy as np
> >>>>> > import pyarrow as pa
> >>>>> > from pyarrow import feather
> >>>>> > import os
> >>>>> > import psutil
> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> > DATA_FILE = 'test.arrow'
> >>>>> >
> >>>>> > def setup():
> >>>>> >   np.random.seed(0)
> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
> >>>>> >   df.to_feather(DATA_FILE)
> >>>>> >   print(f'wrote {DATA_FILE}')
> >>>>> >   import sys
> >>>>> >   sys.exit()
> >>>>> >
> >>>>> > if __name__ == "__main__":
> >>>>> >   # setup()
> >>>>> >   process = psutil.Process(os.getpid())
> >>>>> >   path = DATA_FILE
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >
> >>>>> >   df = feather.read_feather(path)
> >>>>> >
> >>>>> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >   df_size = df.memory_usage().sum() / 1e9
> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >>>>> >
> >>>>> > OUTPUT:
> >>>>> > BEFORE mem_size: 0.091795456gb
> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> >>>>> > ARROW: 3.281625024gb
> >>>>> >
> >>>>> > Let me know if you're able to see similar results.
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Arun
> >>>>> >
> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com> wrote:
> >>>>> >>
> >>>>> >> I get more or less the same results as you for the provided setup data
> >>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
> >>>>> >> which is to be expected).  The fact that the arrow size is much lower
> >>>>> >> than the dataframe size is not too surprising to me.  If a column
> >>>>> >> can't be zero copied then its memory will disappear from the arrow
> >>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
> >>>>> >> that they do not have in Arrow.
> >>>>> >>
> >>>>> >> The df_size issue for me seems to be tied to string columns.  I think
> >>>>> >> pandas is overestimating how much size is needed there (many of my
> >>>>> >> strings are similar and I wonder if some kind of object sharing is
> >>>>> >> happening).  But we can table this for another time.
> >>>>> >>
> >>>>> >> I tried writing my feather file with your parameters and it didn't
> >>>>> >> have much impact on any of the numbers.
> >>>>> >>
> >>>>> >> Since the arrow size for you is expected (nearly the same as the
> >>>>> >> df_size) I'm not sure what to investigate next.  The memory does not
> >>>>> >> seem to be retained by Arrow.  Is there any chance you could create a
> >>>>> >> reproducible test case using randomly generated numpy data (then you
> >>>>> >> could share that setup function)?
> >>>>> >>
> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
> >>>>> >> >
> >>>>> >> > Hi Wes,
> >>>>> >> >
> >>>>> >> > I'm not including the setup() call when I encounter the issue. I just kept it in there for ease of reproducibility. Memory usage is indeed higher when it is included, but that isn't surprising.
> >>>>> >> >
> >>>>> >> > I tried switching over to the system allocator but there is no change.
> >>>>> >> >
> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >>>>> >> >
> >>>>> >> > I updated my script to also include the Arrow bytes allocated and it gave me the following:
> >>>>> >> >
> >>>>> >> > MVE:
> >>>>> >> > import pandas as pd
> >>>>> >> > import pyarrow as pa
> >>>>> >> > from pyarrow import feather
> >>>>> >> > import os
> >>>>> >> > import psutil
> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > def setup():
> >>>>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >>>>> >> >   df.to_feather('test.csv')
> >>>>> >> >
> >>>>> >> > if __name__ == "__main__":
> >>>>> >> >   # setup()
> >>>>> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >   path = 'test.csv'
> >>>>> >> >
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >
> >>>>> >> >   df = feather.read_feather(path)
> >>>>> >> >
> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >>>>> >> >
> >>>>> >> > Output with my data:
> >>>>> >> > BEFORE mem_size: 0.08761344gb
> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >>>>> >> > ARROW: 3.080121792gb
> >>>>> >> >
> >>>>> >> > Output with Provided Setup Data:
> >>>>> >> > BEFORE mem_size: 0.09179136gb
> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >>>>> >> > ARROW: 0.00029664gb
> >>>>> >> >
> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are distinct and non-overlapping, but it seems strange that the output with the provided data has the Arrow bytes allocated at ~0GB whereas the one with my data has the allocated data approximately equal to the dataframe size. I'm not sure if it affects anything but my file was written with the following:
> >>>>> >> >
> >>>>> >> > import pyarrow.lib as ext
> >>>>> >> > import pyarrow
> >>>>> >> > COMPRESSION_LEVEL = 19
> >>>>> >> > COMPRESSION_ALGO = 'zstd'
> >>>>> >> > KILOBYTE = 1 << 10
> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >>>>> >> > CHUNK_SIZE = MEGABYTE
> >>>>> >> >
> >>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> >>>>> >> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level,chunksize=chunk_size, version=2)
> >>>>> >> >
> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm not sure why that would be so off for you. Going off the docs, it seems like it should be accurate. My Dataframe in question is [7196546 rows x 56 columns] where each column is mostly a float or integer and datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >>>>> >> >
> >>>>> >> > Thank You,
> >>>>> >> > Arun
> >>>>> >> >
> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com> wrote:
> >>>>> >> >>
> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you including the
> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes in memory
> >>>>> >> >> in case it needs to re-convert them later (because it realizes the
> >>>>> >> >> data type for the column is different).  I'm not sure if Panda's CSV
> >>>>> >> >> reader is similar.
> >>>>> >> >>
> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
> >>>>> >> >> to a bit more memory and hold onto it (for a little while at least)
> >>>>> >> >> even after it is no longer used.  Even malloc will hold onto memory
> >>>>> >> >> sometimes due to fragmentation or other concerns.  You could try
> >>>>> >> >> changing to the system allocator
> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> >>>>> >> >> to see if that makes a difference.
> >>>>> >> >>
> >>>>> >> >> I'm not sure your method of calculating the dataframe size is
> >>>>> >> >> reliable.  I don't actually know enough about pandas but when I tried
> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
> >>>>> >> >>
> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >>>>> >> >>
> >>>>> >> >> which seems suspicious.
> >>>>> >> >>
> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
> >>>>> >> >> only using ~1.9GB and I will naively assume the difference between the
> >>>>> >> >> two is bloat caused by object wrappers when converting to pandas.
> >>>>> >> >>
> >>>>> >> >> Another thing you might try and measure is
> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
> >>>>> >> >> is a pretty good guess that memory is being held somewhere else.
> >>>>> >> >>
> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >>>>> >> >> >
> >>>>> >> >> > Hi Apache Arrow Members,
> >>>>> >> >> >
> >>>>> >> >> > My question is below but I've compiled a minimum reproducible example with a public dataset:
> >>>>> >> >> >
> >>>>> >> >> > import pandas as pd
> >>>>> >> >> > from pyarrow import feather
> >>>>> >> >> > import os
> >>>>> >> >> > import psutil
> >>>>> >> >> >
> >>>>> >> >> >
> >>>>> >> >> > def setup():
> >>>>> >> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >>>>> >> >> >   df.to_feather('test.csv')
> >>>>> >> >> >
> >>>>> >> >> > if __name__ == "__main__":
> >>>>> >> >> >   # setup()
> >>>>> >> >> >   process = psutil.Process(os.getpid())
> >>>>> >> >> >   path = 'test.csv'
> >>>>> >> >> >
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >>>>> >> >> >
> >>>>> >> >> >   df = feather.read_feather(path)
> >>>>> >> >> >
> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >>>>> >> >> >
> >>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size however, my dataset is ~3GB, and I see memory usage of close to 6GB.
> >>>>> >> >> >
> >>>>> >> >> > Output with My Data:
> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >>>>> >> >> >
> >>>>> >> >> > It seems strange that the overall memory usage of the process is approx double of the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
> >>>>> >> >> >
> >>>>> >> >> > $ conda list pyarrow
> >>>>> >> >> > #
> >>>>> >> >> > # Name                    Version                   Build  Channel
> >>>>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
> >>>>> >> >> >
> >>>>> >> >> > Thank You,
> >>>>> >> >> > Arun Joseph
> >>>>> >> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > --
> >>>>> >> > Arun Joseph
> >>>>> >> >
> >>>>> >
> >>>>> >
> >>>>> >
> >>>>> > --
> >>>>> > Arun Joseph
> >>>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Arun Joseph
> >>>>
> >>>
> >>>
> >>> --
> >>> Arun Joseph
> >>>
> >>
> >>
> >> --
> >> Arun Joseph
> >>
> >
> >
> > --
> > Arun Joseph
> >
> 

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Weston Pace <we...@gmail.com>.
Thank you for the new example.

# Why is it 2x?

This is essentially a "peak RAM" usage of the operation.  Given that
split_blocks helped I think we can attribute this doubling to the
pandas conversion.

# Why doesn't the memory get returned?

It does, it just doesn't do so immediately.  If I put a 5 second sleep
before I print the memory I see that the RSS shrinks down.  This is
how jemalloc is configured in Arrow (actually I think it is 1 second)
for releasing RSS after reaching peak consumption.

BEFORE mem_size: 0.082276352gb
AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
AFTER-ARROW: 3.281625024gb
---five second sleep---
AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
AFTER-SLEEP-ARROW: 3.281625024gb
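
(A sketch of that measurement; the 5 second sleep is arbitrary, and
jemalloc_set_decay_ms only has an effect when the jemalloc pool is the
active allocator.)

import os
import time
import psutil
import pyarrow as pa
from pyarrow import feather

try:
    # Ask jemalloc to return dirty pages to the OS sooner (0 = as soon as possible).
    pa.jemalloc_set_decay_ms(0)
except Exception:
    # Arrow built without jemalloc, or a different pool is in use.
    pass

process = psutil.Process(os.getpid())
df = feather.read_feather('test.arrow')   # placeholder path; keep df alive

print(f'AFTER: {process.memory_info().rss / 1e9}gb')
time.sleep(5)   # give the allocator time to release RSS back to the OS
print(f'AFTER-SLEEP: {process.memory_info().rss / 1e9}gb')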

# Why didn't switching to the system allocator help?

The problem isn't "the dynamic allocator is allocating more than it
needs".  There is a point in this process where ~6GB are actually
needed.  The system allocator either also holds on to that RSS for a
little bit or the RSS numbers themselves take a little bit of time to
update.  I'm not entirely sure.

# Why isn't this a zero-copy conversion to pandas?

That's a good question, I don't know the details.  If I try manually
doing the conversion with zero_copy_only I get the error "Cannot do
zero copy conversion into multi-column DataFrame block"

# What is up with the numpy.ndarray objects in the heap?

I'm pretty sure guppy3 is double-counting.  Note that the total size
is ~20GB.  I've been able to reproduce this in cases where the heap is
3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
once even managed to generate this:

AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
AFTER-SLEEP-ARROW: 0.0gb
Partition of a set of 212560 objects. Total size = 13328742559 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0     57   0 6563250864  49 6563250864  49 pandas.core.series.Series
     1    133   0 3339213718  25 9902464582  74 numpy.ndarray
     2      1   0 3339197360  25 13241661942  99 pandas.core.frame.DataFrame

The RSS is 3.44GB but guppy reports the dataframe as 13GB.

I did see some strange behavior when working with the
RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
can work around this by deleting the reader.
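
(A sketch of that workaround; the path is a placeholder, and
pa.ipc.open_file returns a RecordBatchFileReader.)

import pyarrow as pa
import pyarrow.ipc

# Open the IPC/Feather file and read everything into one table.
reader = pa.ipc.open_file(pa.memory_map('test.arrow'))   # placeholder path
table = reader.read_all()

# Workaround: drop the reader so it no longer holds onto the source buffers.
del reader

df = table.to_pandas(split_blocks=True, self_destruct=True)
del table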

# Can I return the data immediately / I don't want to use 2x memory consumption

I think split_blocks and self_destruct is the best answer at the
moment.  self_destruct has remained in the code since at least 1.0.0
so perhaps it is time we remove the "experimental" flag and maybe
replace it with a "caution" or "danger" flag (as it causes the table
to become unusable afterwards).

Jemalloc has some manual facilities to purge dirty memory and we
expose some of them with
pyarrow.default_memory_pool().release_unused() but that doesn't seem
to be helping in this situation.  Either the excess memory is in the
non-jemalloc pool or the jemalloc command can't quite release this
memory, or the RSS stats are just stale.  I'm not entirely sure.
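
(For completeness, a sketch of that call; as noted, it may not actually
shrink RSS in this situation.)

import os
import psutil
import pyarrow as pa

process = psutil.Process(os.getpid())
pool = pa.default_memory_pool()

print(pool.backend_name, f'RSS before: {process.memory_info().rss / 1e9}gb')
pool.release_unused()   # ask the pool to return cached-but-unused memory to the OS
print(pool.backend_name, f'RSS after: {process.memory_info().rss / 1e9}gb')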

On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <aj...@gmail.com> wrote:
>
> Slightly related, I have some other code that opens up an arrow file using a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch to a pandas dataframe. After this conversion is done, and I inspect the heap, I always see the following:
>
> hpy().heap()
> Partition of a set of 351136 objects. Total size = 20112096840 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
>      1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
>      2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
>
> Specifically the numpy.ndarray. It only shows up after the conversion and it does not seem to go away. It also seems to be roughly the same size as the dataframe itself.
>
> - Arun
>
> On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com> wrote:
>>
>> Just to follow up on this, is there a way to manually force the arrow pool to de-allocate? My usecase is essentially having multiple processes in a Pool or via Slurm read from an arrow file, do some work, and then exit. Issue is that the 2x memory consumption reduces the bandwidth on the machine to effectively half.
>>
>> Thank You,
>> Arun
>>
>> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:
>>>
>>> Additionally, I tested with my actual data, and did not see memory savings.
>>>
>>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
>>>>
>>>> Hi Joris,
>>>>
>>>> Thank you for the explanation. The 2x memory consumption on conversion makes sense if there is a copy, but it does seem like it persists longer than it should. Might be because of python's GC policies?
>>>> I tried out your recommendations but they did not seem to work. However, I did notice an experimental option on `to_pandas`, `self_destruct`, which seems to address the issue I'm facing. Sadly, that itself did not work either... but, combined with `split_blocks=True`, I am seeing memory savings:
>>>>
>>>> import pandas as pd
>>>> import numpy as np
>>>> import pyarrow as pa
>>>> from pyarrow import feather
>>>> import os
>>>> import psutil
>>>> pa.set_memory_pool(pa.system_memory_pool())
>>>> DATA_FILE = 'test.arrow'
>>>>
>>>> def setup():
>>>>   np.random.seed(0)
>>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
>>>>   df.to_feather(DATA_FILE)
>>>>   print(f'wrote {DATA_FILE}')
>>>>   import sys
>>>>   sys.exit()
>>>>
>>>> if __name__ == "__main__":
>>>>   # setup()
>>>>   process = psutil.Process(os.getpid())
>>>>   path = DATA_FILE
>>>>
>>>>   mem_size = process.memory_info().rss / 1e9
>>>>   print(f'BEFORE mem_size: {mem_size}gb')
>>>>
>>>>   feather_table = feather.read_table(path)
>>>>   # df = feather_table.to_pandas(split_blocks=True)
>>>>   # df = feather_table.to_pandas()
>>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>>>>
>>>>   mem_size = process.memory_info().rss / 1e9
>>>>   df_size = df.memory_usage().sum() / 1e9
>>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>>
>>>>
>>>> OUTPUT(to_pandas()):
>>>> BEFORE mem_size: 0.091795456gb
>>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>>>> ARROW: 3.281625024gb
>>>>
>>>> OUTPUT (to_pandas(split_blocks=True)):
>>>> BEFORE mem_size: 0.091795456gb
>>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>>>> ARROW: 3.281627712gb
>>>>
>>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>>>> BEFORE mem_size: 0.091795456gb
>>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>>>> ARROW: 3.281627712gb
>>>>
>>>> I'm guessing since this feature is experimental, it might either go away, or might have strange behaviors. Is there anything I should look out for, or is there some alternative to reproduce these results?
>>>>
>>>> Thank You,
>>>> Arun
>>>>
>>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <jo...@gmail.com> wrote:
>>>>>
>>>>> Hi Arun, Weston,
>>>>>
>>>>> I didn't try running the script locally, but a quick note: the
>>>>> `feather.read_feather` function reads the Feather file into an Arrow
>>>>> table and directly converts it to a pandas DataFrame. A memory
>>>>> consumption 2x the size of the dataframe sounds not that unexpected to
>>>>> me: most of the time, when converting an arrow table to a pandas
>>>>> DataFrame, the data will be copied to accommodate for pandas' specific
>>>>> internal memory layout (at least numeric columns will be combined
>>>>> together in 2D arrays).
>>>>>
>>>>> To verify if this is the cause, you might want to do either of:
>>>>> - use `feather.read_table` instead of `feather.read_feather`, which
>>>>> will read the file as an Arrow table instead (and don't do any
>>>>> conversion to pandas)
>>>>> - if you want to include the conversion to pandas, also use
>>>>> `read_table` and do the conversion to pandas explicitly with a
>>>>> `to_pandas()` call on the result. In that case, you can specify
>>>>> `split_blocks=True` to use more zero-copy conversion in the
>>>>> arrow->pandas conversion
>>>>>
>>>>> Joris
>>>>>
>>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
>>>>> >
>>>>> > Hi Wes,
>>>>> >
>>>>> > Sorry for the late reply on this, but I think I got a reproducible test case:
>>>>> >
>>>>> > import pandas as pd
>>>>> > import numpy as np
>>>>> > import pyarrow as pa
>>>>> > from pyarrow import feather
>>>>> > import os
>>>>> > import psutil
>>>>> > pa.set_memory_pool(pa.system_memory_pool())
>>>>> > DATA_FILE = 'test.arrow'
>>>>> >
>>>>> > def setup():
>>>>> >   np.random.seed(0)
>>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
>>>>> >   df.to_feather(DATA_FILE)
>>>>> >   print(f'wrote {DATA_FILE}')
>>>>> >   import sys
>>>>> >   sys.exit()
>>>>> >
>>>>> > if __name__ == "__main__":
>>>>> >   # setup()
>>>>> >   process = psutil.Process(os.getpid())
>>>>> >   path = DATA_FILE
>>>>> >
>>>>> >   mem_size = process.memory_info().rss / 1e9
>>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>>> >
>>>>> >   df = feather.read_feather(path)
>>>>> >
>>>>> >   mem_size = process.memory_info().rss / 1e9
>>>>> >   df_size = df.memory_usage().sum() / 1e9
>>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>>> >
>>>>> > OUTPUT:
>>>>> > BEFORE mem_size: 0.091795456gb
>>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>>>>> > ARROW: 3.281625024gb
>>>>> >
>>>>> > Let me know if you're able to see similar results.
>>>>> >
>>>>> > Thanks,
>>>>> > Arun
>>>>> >
>>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com> wrote:
>>>>> >>
>>>>> >> I get more or less the same results as you for the provided setup data
>>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
>>>>> >> which is to be expected).  The fact that the arrow size is much lower
>>>>> >> than the dataframe size is not too surprising to me.  If a column
>>>>> >> can't be zero copied then its memory will disappear from the arrow
>>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
>>>>> >> that they do not have in Arrow.
>>>>> >>
>>>>> >> The df_size issue for me seems to be tied to string columns.  I think
>>>>> >> pandas is overestimating how much size is needed there (many of my
>>>>> >> strings are similar and I wonder if some kind of object sharing is
>>>>> >> happening).  But we can table this for another time.
>>>>> >>
>>>>> >> I tried writing my feather file with your parameters and it didn't
>>>>> >> have much impact on any of the numbers.
>>>>> >>
>>>>> >> Since the arrow size for you is expected (nearly the same as the
>>>>> >> df_size) I'm not sure what to investigate next.  The memory does not
>>>>> >> seem to be retained by Arrow.  Is there any chance you could create a
>>>>> >> reproducible test case using randomly generated numpy data (then you
>>>>> >> could share that setup function)?
>>>>> >>
>>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
>>>>> >> >
>>>>> >> > Hi Wes,
>>>>> >> >
>>>>> >> > I'm not including the setup() call when I encounter the issue. I just kept it in there for ease of reproducibility. Memory usage is indeed higher when it is included, but that isn't surprising.
>>>>> >> >
>>>>> >> > I tried switching over to the system allocator but there is no change.
>>>>> >> >
>>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>>>>> >> >
>>>>> >> > I updated my script to also include the Arrow bytes allocated and it gave me the following:
>>>>> >> >
>>>>> >> > MVE:
>>>>> >> > import pandas as pd
>>>>> >> > import pyarrow as pa
>>>>> >> > from pyarrow import feather
>>>>> >> > import os
>>>>> >> > import psutil
>>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>>>>> >> >
>>>>> >> >
>>>>> >> > def setup():
>>>>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>>>>> >> >   df.to_feather('test.csv')
>>>>> >> >
>>>>> >> > if __name__ == "__main__":
>>>>> >> >   # setup()
>>>>> >> >   process = psutil.Process(os.getpid())
>>>>> >> >   path = 'test.csv'
>>>>> >> >
>>>>> >> >   mem_size = process.memory_info().rss / 1e9
>>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>>> >> >
>>>>> >> >   df = feather.read_feather(path)
>>>>> >> >
>>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>>> >> >   mem_size = process.memory_info().rss / 1e9
>>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>>> >> >
>>>>> >> > Output with my data:
>>>>> >> > BEFORE mem_size: 0.08761344gb
>>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>>>>> >> > ARROW: 3.080121792gb
>>>>> >> >
>>>>> >> > Output with Provided Setup Data:
>>>>> >> > BEFORE mem_size: 0.09179136gb
>>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>>>>> >> > ARROW: 0.00029664gb
>>>>> >> >
>>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are distinct and non-overlapping, but it seems strange that the output with the provided data has the Arrow bytes allocated at ~0GB whereas the one with my data has the allocated data approximately equal to the dataframe size. I'm not sure if it affects anything but my file was written with the following:
>>>>> >> >
>>>>> >> > import pyarrow.lib as ext
>>>>> >> > import pyarrow
>>>>> >> > COMPRESSION_LEVEL = 19
>>>>> >> > COMPRESSION_ALGO = 'zstd'
>>>>> >> > KILOBYTE = 1 << 10
>>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>>>>> >> > CHUNK_SIZE = MEGABYTE
>>>>> >> >
>>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>>>>> >> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level,chunksize=chunk_size, version=2)
>>>>> >> >
>>>>> >> > As to the discrepancy around calculating dataframe size. I'm not sure why that would be so off for you. Going off the docs, it seems like it should be accurate. My Dataframe in question is [7196546 rows x 56 columns] where each column is mostly a float or integer and datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>>>>> >> >
>>>>> >> > Thank You,
>>>>> >> > Arun
>>>>> >> >
>>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com> wrote:
>>>>> >> >>
>>>>> >> >> 2x overshoot of memory does seem a little high.  Are you including the
>>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
>>>>> >> >> will require 2-3x memory usage because it buffers the bytes in memory
>>>>> >> >> in case it needs to re-convert them later (because it realizes the
>>>>> >> >> data type for the column is different).  I'm not sure if Panda's CSV
>>>>> >> >> reader is similar.
>>>>> >> >>
>>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
>>>>> >> >> to a bit more memory and hold onto it (for a little while at least)
>>>>> >> >> even after it is no longer used.  Even malloc will hold onto memory
>>>>> >> >> sometimes due to fragmentation or other concerns.  You could try
>>>>> >> >> changing to the system allocator
>>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
>>>>> >> >> to see if that makes a difference.
>>>>> >> >>
>>>>> >> >> I'm not sure your method of calculating the dataframe size is
>>>>> >> >> reliable.  I don't actually know enough about pandas but when I tried
>>>>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
>>>>> >> >>
>>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>>>>> >> >>
>>>>> >> >> which seems suspicious.
>>>>> >> >>
>>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>>>>> >> >> only using ~1.9GB and I will naively assume the difference between the
>>>>> >> >> two is bloat caused by object wrappers when converting to pandas.
>>>>> >> >>
>>>>> >> >> Another thing you might try and measure is
>>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
>>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
>>>>> >> >> is a pretty good guess that memory is being held somewhere else.
>>>>> >> >>
>>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
>>>>> >> >> >
>>>>> >> >> > Hi Apache Arrow Members,
>>>>> >> >> >
>>>>> >> >> > My question is below but I've compiled a minimum reproducible example with a public dataset:
>>>>> >> >> >
>>>>> >> >> > import pandas as pd
>>>>> >> >> > from pyarrow import feather
>>>>> >> >> > import os
>>>>> >> >> > import psutil
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > def setup():
>>>>> >> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>>>>> >> >> >   df.to_feather('test.csv')
>>>>> >> >> >
>>>>> >> >> > if __name__ == "__main__":
>>>>> >> >> >   # setup()
>>>>> >> >> >   process = psutil.Process(os.getpid())
>>>>> >> >> >   path = 'test.csv'
>>>>> >> >> >
>>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>>> >> >> >
>>>>> >> >> >   df = feather.read_feather(path)
>>>>> >> >> >
>>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>>> >> >> >
>>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size however, my dataset is ~3GB, and I see memory usage of close to 6GB.
>>>>> >> >> >
>>>>> >> >> > Output with My Data:
>>>>> >> >> > BEFORE mem_size: 0.088891392gb
>>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>>>>> >> >> >
>>>>> >> >> > It seems strange that the overall memory usage of the process is approx double of the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
>>>>> >> >> >
>>>>> >> >> > $ conda list pyarrow
>>>>> >> >> > #
>>>>> >> >> > # Name                    Version                   Build  Channel
>>>>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>>>>> >> >> >
>>>>> >> >> > Thank You,
>>>>> >> >> > Arun Joseph
>>>>> >> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > --
>>>>> >> > Arun Joseph
>>>>> >> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Arun Joseph
>>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Arun Joseph
>>>>
>>>
>>>
>>> --
>>> Arun Joseph
>>>
>>
>>
>> --
>> Arun Joseph
>>
>
>
> --
> Arun Joseph
>

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Slightly related: I have some other code that opens an arrow file with a
`pyarrow.ipc.RecordBatchFileReader` and then converts the record batches to a
pandas dataframe. After the conversion is done and I inspect the heap, I
always see the following:

hpy().heap()
Partition of a set of 351136 objects. Total size = 20112096840 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0    121   0 9939601034  49 9939601034  49 numpy.ndarray
     1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
     2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex

Specifically, the numpy.ndarray entry stands out: it only shows up after the
conversion and does not seem to go away, and it appears to be roughly the
same size as the dataframe itself.
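
A minimal sketch of the flow I'm describing (the file name here is just a
placeholder for my real feather V2 file), with a check of how much memory
Arrow still holds after the conversion versus what the dataframe reports:

import pyarrow as pa

# Placeholder path; my real file is written with feather V2 + zstd.
with pa.OSFile('test.arrow', 'rb') as source:
    reader = pa.ipc.open_file(source)   # returns a RecordBatchFileReader
    table = reader.read_all()           # pyarrow.Table backed by Arrow buffers

df = table.to_pandas()                  # default conversion copies into pandas blocks
del table                               # drop the Arrow-side reference

print('arrow bytes still allocated:', pa.total_allocated_bytes())
print('dataframe bytes            :', df.memory_usage(deep=True).sum())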

- Arun

On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <aj...@gmail.com> wrote:

> Just to follow up on this, is there a way to manually force the arrow pool
> to de-allocate? My usecase is essentially having multiple processes in a
> Pool or via Slurm read from an arrow file, do some work, and then exit.
> Issue is that the 2x memory consumption reduces the bandwidth on the
> machine to effectively half.
>
> Thank You,
> Arun
>
> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:
>
>> Additionally, I tested with my actual data, and did not see memory
>> savings.
>>
>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
>>
>>> Hi Joris,
>>>
>>> Thank you for the explanation. The 2x memory consumption on conversion
>>> makes sense if there is a copy, but it does seem like it persists longer
>>> than it should. Might be because of python's GC policies?
>>> I tried out your recommendations but they did not seem to work. However,
>>> I did notice an experimental option on `to_pandas`, `self_destruct`, which
>>> seems to address the issue I'm facing. Sadly, that itself did not work
>>> either... but, combined with `split_blocks=True`, I am seeing memory
>>> savings:
>>>
>>> import pandas as pd
>>> import numpy as np
>>> import pyarrow as pa
>>> from pyarrow import feather
>>> import os
>>> import psutil
>>> pa.set_memory_pool(pa.system_memory_pool())
>>> DATA_FILE = 'test.arrow'
>>>
>>> def setup():
>>>   np.random.seed(0)
>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
>>> columns=list([f'{i}' for i in range(57)]))
>>>   df.to_feather(DATA_FILE)
>>>   print(f'wrote {DATA_FILE}')
>>>   import sys
>>>   sys.exit()
>>>
>>> if __name__ == "__main__":
>>>   # setup()
>>>   process = psutil.Process(os.getpid())
>>>   path = DATA_FILE
>>>
>>>   mem_size = process.memory_info().rss / 1e9
>>>   print(f'BEFORE mem_size: {mem_size}gb')
>>>
>>>   feather_table = feather.read_table(path)
>>>   # df = feather_table.to_pandas(split_blocks=True)
>>>   # df = feather_table.to_pandas()
>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>>>
>>>   mem_size = process.memory_info().rss / 1e9
>>>   df_size = df.memory_usage().sum() / 1e9
>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>
>>>
>>> OUTPUT(to_pandas()):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>>> ARROW: 3.281625024gb
>>>
>>> OUTPUT (to_pandas(split_blocks=True)):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>>> ARROW: 3.281627712gb
>>>
>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>>> ARROW: 3.281627712gb
>>>
>>> I'm guessing since this feature is experimental, it might either go
>>> away, or might have strange behaviors. Is there anything I should look out
>>> for, or is there some alternative to reproduce these results?
>>>
>>> Thank You,
>>> Arun
>>>
>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
>>> jorisvandenbossche@gmail.com> wrote:
>>>
>>>>> Hi Arun, Weston,
>>>>
>>>> I didn't try running the script locally, but a quick note: the
>>>> `feather.read_feather` function reads the Feather file into an Arrow
>>>> table ànd directly converts it to a pandas DataFrame. A memory
>>>> consumption 2x the size of the dataframe sounds not that unexpected to
>>>> me: most of the time, when converting an arrow table to a pandas
>>>> DataFrame, the data will be copied to accommodate for pandas' specific
>>>> internal memory layout (at least numeric columns will be combined
>>>> together in 2D arrays).
>>>>
>>>> To verify if this is the cause, you might want to do either of:
>>>> - use `feather.read_table` instead of `feather.read_feather`, which
>>>> will read the file as an Arrow table instead (and don't do any
>>>> conversion to pandas)
>>>> - if you want to include the conversion to pandas, also use
>>>> `read_table` and do the conversion to pandas explicitly with a
>>>> `to_pandas()` call on the result. In that case, you can specify
>>>> `split_blocks=True` to use more zero-copy conversion in the
>>>> arrow->pandas conversion
>>>>
>>>> Joris
>>>>
>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
>>>> >
>>>> > Hi Wes,
>>>> >
>>>> > Sorry for the late reply on this, but I think I got a reproducible
>>>> test case:
>>>> >
>>>> > import pandas as pd
>>>> > import numpy as np
>>>> > import pyarrow as pa
>>>> > from pyarrow import feather
>>>> > import os
>>>> > import psutil
>>>> > pa.set_memory_pool(pa.system_memory_pool())
>>>> > DATA_FILE = 'test.arrow'
>>>> >
>>>> > def setup():
>>>> >   np.random.seed(0)
>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
>>>> columns=list([f'i_{i}' for i in range(57)]))
>>>> >   df.to_feather(DATA_FILE)
>>>> >   print(f'wrote {DATA_FILE}')
>>>> >   import sys
>>>> >   sys.exit()
>>>> >
>>>> > if __name__ == "__main__":
>>>> >   # setup()
>>>> >   process = psutil.Process(os.getpid())
>>>> >   path = DATA_FILE
>>>> >
>>>> >   mem_size = process.memory_info().rss / 1e9
>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >
>>>> >   df = feather.read_feather(path)
>>>> >
>>>> >   mem_size = process.memory_info().rss / 1e9
>>>> >   df_size = df.memory_usage().sum() / 1e9
>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
>>>> 1e9}gb')
>>>> >
>>>> > OUTPUT:
>>>> > BEFORE mem_size: 0.091795456gb
>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>>>> > ARROW: 3.281625024gb
>>>> >
>>>> > Let me know if you're able to see similar results.
>>>> >
>>>> > Thanks,
>>>> > Arun
>>>> >
>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> I get more or less the same results as you for the provided setup
>>>> data
>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
>>>> >> which is to be expected).  The fact that the arrow size is much lower
>>>> >> than the dataframe size is not too surprising to me.  If a column
>>>> >> can't be zero copied then it's memory will disappear from the arrow
>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
>>>> >> that they do not have in Arrow.
>>>> >>
>>>> >> The df_size issue for me seems to be tied to string columns.  I think
>>>> >> pandas is overestimating how much size is needed there (many of my
>>>> >> strings are similar and I wonder if some kind of object sharing is
>>>> >> happening).  But we can table this for another time.
>>>> >>
>>>> >> I tried writing my feather file with your parameters and it didn't
>>>> >> have much impact on any of the numbers.
>>>> >>
>>>> >> Since the arrow size for you is expected (nearly the same as the
>>>> >> df_size) I'm not sure what to investigate next.  The memory does not
>>>> >> seem to be retained by Arrow.  Is there any chance you could create a
>>>> >> reproducible test case using randomly generated numpy data (then you
>>>> >> could share that setup function)?
>>>> >>
>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com>
>>>> wrote:
>>>> >> >
>>>> >> > Hi Wes,
>>>> >> >
>>>> >> > I'm not including the setup() call when I encounter the issue. I
>>>> just kept it in there for ease of reproducibility. Memory usage is indeed
>>>> higher when it is included, but that isn't surprising.
>>>> >> >
>>>> >> > I tried switching over to the system allocator but there is no
>>>> change.
>>>> >> >
>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>>>> >> >
>>>> >> > I updated my script to also include the Arrow bytes allocated and
>>>> it gave me the following:
>>>> >> >
>>>> >> > MVE:
>>>> >> > import pandas as pd
>>>> >> > import pyarrow as pa
>>>> >> > from pyarrow import feather
>>>> >> > import os
>>>> >> > import psutil
>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>>>> >> >
>>>> >> >
>>>> >> > def setup():
>>>> >> >   df = pd.read_csv('
>>>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>>>> ')
>>>> >> >   df.to_feather('test.csv')
>>>> >> >
>>>> >> > if __name__ == "__main__":
>>>> >> >   # setup()
>>>> >> >   process = psutil.Process(os.getpid())
>>>> >> >   path = 'test.csv'
>>>> >> >
>>>> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >> >
>>>> >> >   df = feather.read_feather(path)
>>>> >> >
>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>> >> >   mem_size = process.memory_info().rss / 1e10
>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
>>>> 1e9}gb')
>>>> >> >
>>>> >> > Output with my data:
>>>> >> > BEFORE mem_size: 0.08761344gb
>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>>>> >> > ARROW: 3.080121792gb
>>>> >> >
>>>> >> > Output with Provided Setup Data:
>>>> >> > BEFORE mem_size: 0.09179136gb
>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>>>> >> > ARROW: 0.00029664gb
>>>> >> >
>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
>>>> distinct and non-overlapping, but it seems strange that the output with the
>>>> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
>>>> data has the allocated data approximately equal to the dataframe size. I'm
>>>> not sure if it affects anything but my file was written with the following:
>>>> >> >
>>>> >> > import pyarrow.lib as ext
>>>> >> > import pyarrow
>>>> >> > COMPRESSION_LEVEL = 19
>>>> >> > COMPRESSION_ALGO = 'zstd'
>>>> >> > KILOBYTE = 1 << 10
>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>>>> >> > CHUNK_SIZE = MEGABYTE
>>>> >> >
>>>> >> > table = pyarrow.Table.from_pandas(df,
>>>> preserve_index=preserve_index)
>>>> >> > ext.write_feather(table, dest, compression=compression,
>>>> compression_level=compression_level,chunksize=chunk_size, version=2)
>>>> >> >
>>>> >> > As to the discrepancy around calculating dataframe size. I'm not
>>>> sure why that would be so off for you. Going off the docs, it seems like it
>>>> should be accurate. My Dataframe in question is [7196546 rows x 56 columns]
>>>> where each column is mostly a float or integer and datetime index. 7196546
>>>> * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>>>> >> >
>>>> >> > Thank You,
>>>> >> > Arun
>>>> >> >
>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com>
>>>> wrote:
>>>> >> >>
>>>> >> >> 2x overshoot of memory does seem a little high.  Are you
>>>> including the
>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV
>>>> reader
>>>> >> >> will require 2-3x memory usage because it buffers the bytes in
>>>> memory
>>>> >> >> in case it needs to re-convert them later (because it realizes the
>>>> >> >> data type for the column is different).  I'm not sure if Panda's
>>>> CSV
>>>> >> >> reader is similar.
>>>> >> >>
>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
>>>> on
>>>> >> >> to a bit more memory and hold onto it (for a little while at
>>>> least)
>>>> >> >> even after it is no longer used.  Even malloc will hold onto
>>>> memory
>>>> >> >> sometimes due to fragmentation or other concerns.  You could try
>>>> >> >> changing to the system allocator
>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your
>>>> file)
>>>> >> >> to see if that makes a difference.
>>>> >> >>
>>>> >> >> I'm not sure your method of calculating the dataframe size is
>>>> >> >> reliable.  I don't actually know enough about pandas but when I
>>>> tried
>>>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
>>>> >> >>
>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>>>> >> >>
>>>> >> >> which seems suspicious.
>>>> >> >>
>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't
>>>> seem
>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself
>>>> was
>>>> >> >> only using ~1.9GB and I will naively assume the difference
>>>> between the
>>>> >> >> two is bloat caused by object wrappers when converting to pandas.
>>>> >> >>
>>>> >> >> Another thing you might try and measure is
>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you
>>>> how
>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB
>>>> then it
>>>> >> >> is a pretty good guess that memory is being held somewhere else.
>>>> >> >>
>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
>>>> wrote:
>>>> >> >> >
>>>> >> >> > Hi Apache Arrow Members,
>>>> >> >> >
>>>> >> >> > My question is below but I've compiled a minimum reproducible
>>>> example with a public dataset:
>>>> >> >> >
>>>> >> >> > import pandas as pd
>>>> >> >> > from pyarrow import feather
>>>> >> >> > import os
>>>> >> >> > import psutil
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > def setup():
>>>> >> >> >   df = pd.read_csv('
>>>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>>>> ')
>>>> >> >> >   df.to_feather('test.csv')
>>>> >> >> >
>>>> >> >> > if __name__ == "__main__":
>>>> >> >> >   # setup()
>>>> >> >> >   process = psutil.Process(os.getpid())
>>>> >> >> >   path = 'test.csv'
>>>> >> >> >
>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >> >> >
>>>> >> >> >   df = feather.read_feather(path)
>>>> >> >> >
>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >> >> >
>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a
>>>> sample CSV of adequate size however, my dataset is ~3GB, and I see memory
>>>> usage of close to 6GB.
>>>> >> >> >
>>>> >> >> > Output with My Data:
>>>> >> >> > BEFORE mem_size: 0.088891392gb
>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>>>> >> >> >
>>>> >> >> > It seems strange that the overall memory usage of the process
>>>> is approx double of the size of the dataframe itself. Is there a reason for
>>>> this, and is there a way to mitigate this?
>>>> >> >> >
>>>> >> >> > $ conda list pyarrow
>>>> >> >> > #
>>>> >> >> > # Name                    Version                   Build
>>>> Channel
>>>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
>>>>   conda-forge
>>>> >> >> >
>>>> >> >> > Thank You,
>>>> >> >> > Arun Joseph
>>>> >> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Arun Joseph
>>>> >> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Arun Joseph
>>>> >
>>>>
>>>
>>>
>>> --
>>> Arun Joseph
>>>
>>>
>>
>> --
>> Arun Joseph
>>
>>
>
> --
> Arun Joseph
>
>

-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Just to follow up on this: is there a way to manually force the arrow pool
to de-allocate? My use case is essentially having multiple processes, in a
Pool or via Slurm, each read from an arrow file, do some work, and then exit.
The issue is that the 2x memory consumption effectively halves how many of
these processes a machine can run at once.
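
To make that concrete, here is roughly what each worker does, plus the
cleanup I have been experimenting with (the path and the "work" step are
placeholders; I am also not sure release_unused() is the right lever or
available in every pyarrow version, hence the hasattr guard):

import gc
import multiprocessing as mp

import pyarrow as pa
from pyarrow import feather

DATA_FILE = 'test.arrow'  # placeholder path

def worker(_):
    table = feather.read_table(DATA_FILE)
    df = table.to_pandas(split_blocks=True, self_destruct=True)
    result = float(df.iloc[:, 0].sum())   # stand-in for the real work
    del df, table
    gc.collect()
    pool = pa.default_memory_pool()
    if hasattr(pool, 'release_unused'):   # ask the allocator to hand memory back to the OS
        pool.release_unused()
    return result

if __name__ == '__main__':
    with mp.Pool(4) as p:
        print(p.map(worker, range(4)))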

Thank You,
Arun

On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <aj...@gmail.com> wrote:

> Additionally, I tested with my actual data, and did not see memory savings.
>
> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:
>
>> Hi Joris,
>>
>> Thank you for the explanation. The 2x memory consumption on conversion
>> makes sense if there is a copy, but it does seem like it persists longer
>> than it should. Might be because of python's GC policies?
>> I tried out your recommendations but they did not seem to work. However,
>> I did notice an experimental option on `to_pandas`, `self_destruct`, which
>> seems to address the issue I'm facing. Sadly, that itself did not work
>> either... but, combined with `split_blocks=True`, I am seeing memory
>> savings:
>>
>> import pandas as pd
>> import numpy as np
>> import pyarrow as pa
>> from pyarrow import feather
>> import os
>> import psutil
>> pa.set_memory_pool(pa.system_memory_pool())
>> DATA_FILE = 'test.arrow'
>>
>> def setup():
>>   np.random.seed(0)
>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
>> columns=list([f'{i}' for i in range(57)]))
>>   df.to_feather(DATA_FILE)
>>   print(f'wrote {DATA_FILE}')
>>   import sys
>>   sys.exit()
>>
>> if __name__ == "__main__":
>>   # setup()
>>   process = psutil.Process(os.getpid())
>>   path = DATA_FILE
>>
>>   mem_size = process.memory_info().rss / 1e9
>>   print(f'BEFORE mem_size: {mem_size}gb')
>>
>>   feather_table = feather.read_table(path)
>>   # df = feather_table.to_pandas(split_blocks=True)
>>   # df = feather_table.to_pandas()
>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>>
>>   mem_size = process.memory_info().rss / 1e9
>>   df_size = df.memory_usage().sum() / 1e9
>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>
>>
>> OUTPUT(to_pandas()):
>> BEFORE mem_size: 0.091795456gb
>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>> ARROW: 3.281625024gb
>>
>> OUTPUT (to_pandas(split_blocks=True)):
>> BEFORE mem_size: 0.091795456gb
>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>> ARROW: 3.281627712gb
>>
>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>> BEFORE mem_size: 0.091795456gb
>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>> ARROW: 3.281627712gb
>>
>> I'm guessing since this feature is experimental, it might either go away,
>> or might have strange behaviors. Is there anything I should look out for,
>> or is there some alternative to reproduce these results?
>>
>> Thank You,
>> Arun
>>
>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
>> jorisvandenbossche@gmail.com> wrote:
>>
>>> Hi Arun, Weston,
>>>
>>> I didn't try running the script locally, but a quick note: the
>>> `feather.read_feather` function reads the Feather file into an Arrow
>>> table ànd directly converts it to a pandas DataFrame. A memory
>>> consumption 2x the size of the dataframe sounds not that unexpected to
>>> me: most of the time, when converting an arrow table to a pandas
>>> DataFrame, the data will be copied to accommodate for pandas' specific
>>> internal memory layout (at least numeric columns will be combined
>>> together in 2D arrays).
>>>
>>> To verify if this is the cause, you might want to do either of:
>>> - use `feather.read_table` instead of `feather.read_feather`, which
>>> will read the file as an Arrow table instead (and don't do any
>>> conversion to pandas)
>>> - if you want to include the conversion to pandas, also use
>>> `read_table` and do the conversion to pandas explicitly with a
>>> `to_pandas()` call on the result. In that case, you can specify
>>> `split_blocks=True` to use more zero-copy conversion in the
>>> arrow->pandas conversion
>>>
>>> Joris
>>>
>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
>>> >
>>> > Hi Wes,
>>> >
>>> > Sorry for the late reply on this, but I think I got a reproducible
>>> test case:
>>> >
>>> > import pandas as pd
>>> > import numpy as np
>>> > import pyarrow as pa
>>> > from pyarrow import feather
>>> > import os
>>> > import psutil
>>> > pa.set_memory_pool(pa.system_memory_pool())
>>> > DATA_FILE = 'test.arrow'
>>> >
>>> > def setup():
>>> >   np.random.seed(0)
>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
>>> columns=list([f'i_{i}' for i in range(57)]))
>>> >   df.to_feather(DATA_FILE)
>>> >   print(f'wrote {DATA_FILE}')
>>> >   import sys
>>> >   sys.exit()
>>> >
>>> > if __name__ == "__main__":
>>> >   # setup()
>>> >   process = psutil.Process(os.getpid())
>>> >   path = DATA_FILE
>>> >
>>> >   mem_size = process.memory_info().rss / 1e9
>>> >   print(f'BEFORE mem_size: {mem_size}gb')
>>> >
>>> >   df = feather.read_feather(path)
>>> >
>>> >   mem_size = process.memory_info().rss / 1e9
>>> >   df_size = df.memory_usage().sum() / 1e9
>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>> >
>>> > OUTPUT:
>>> > BEFORE mem_size: 0.091795456gb
>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>>> > ARROW: 3.281625024gb
>>> >
>>> > Let me know if you're able to see similar results.
>>> >
>>> > Thanks,
>>> > Arun
>>> >
>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
>>> wrote:
>>> >>
>>> >> I get more or less the same results as you for the provided setup data
>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
>>> >> which is to be expected).  The fact that the arrow size is much lower
>>> >> than the dataframe size is not too surprising to me.  If a column
>>> >> can't be zero copied then it's memory will disappear from the arrow
>>> >> pool (I think).  Plus, object columns will have overhead in pandas
>>> >> that they do not have in Arrow.
>>> >>
>>> >> The df_size issue for me seems to be tied to string columns.  I think
>>> >> pandas is overestimating how much size is needed there (many of my
>>> >> strings are similar and I wonder if some kind of object sharing is
>>> >> happening).  But we can table this for another time.
>>> >>
>>> >> I tried writing my feather file with your parameters and it didn't
>>> >> have much impact on any of the numbers.
>>> >>
>>> >> Since the arrow size for you is expected (nearly the same as the
>>> >> df_size) I'm not sure what to investigate next.  The memory does not
>>> >> seem to be retained by Arrow.  Is there any chance you could create a
>>> >> reproducible test case using randomly generated numpy data (then you
>>> >> could share that setup function)?
>>> >>
>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com>
>>> wrote:
>>> >> >
>>> >> > Hi Wes,
>>> >> >
>>> >> > I'm not including the setup() call when I encounter the issue. I
>>> just kept it in there for ease of reproducibility. Memory usage is indeed
>>> higher when it is included, but that isn't surprising.
>>> >> >
>>> >> > I tried switching over to the system allocator but there is no
>>> change.
>>> >> >
>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>>> >> >
>>> >> > I updated my script to also include the Arrow bytes allocated and
>>> it gave me the following:
>>> >> >
>>> >> > MVE:
>>> >> > import pandas as pd
>>> >> > import pyarrow as pa
>>> >> > from pyarrow import feather
>>> >> > import os
>>> >> > import psutil
>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>>> >> >
>>> >> >
>>> >> > def setup():
>>> >> >   df = pd.read_csv('
>>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>>> ')
>>> >> >   df.to_feather('test.csv')
>>> >> >
>>> >> > if __name__ == "__main__":
>>> >> >   # setup()
>>> >> >   process = psutil.Process(os.getpid())
>>> >> >   path = 'test.csv'
>>> >> >
>>> >> >   mem_size = process.memory_info().rss / 1e9
>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>> >> >
>>> >> >   df = feather.read_feather(path)
>>> >> >
>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>> >> >   mem_size = process.memory_info().rss / 1e10
>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
>>> 1e9}gb')
>>> >> >
>>> >> > Output with my data:
>>> >> > BEFORE mem_size: 0.08761344gb
>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>>> >> > ARROW: 3.080121792gb
>>> >> >
>>> >> > Output with Provided Setup Data:
>>> >> > BEFORE mem_size: 0.09179136gb
>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>>> >> > ARROW: 0.00029664gb
>>> >> >
>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
>>> distinct and non-overlapping, but it seems strange that the output with the
>>> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
>>> data has the allocated data approximately equal to the dataframe size. I'm
>>> not sure if it affects anything but my file was written with the following:
>>> >> >
>>> >> > import pyarrow.lib as ext
>>> >> > import pyarrow
>>> >> > COMPRESSION_LEVEL = 19
>>> >> > COMPRESSION_ALGO = 'zstd'
>>> >> > KILOBYTE = 1 << 10
>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>>> >> > CHUNK_SIZE = MEGABYTE
>>> >> >
>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>>> >> > ext.write_feather(table, dest, compression=compression,
>>> compression_level=compression_level,chunksize=chunk_size, version=2)
>>> >> >
>>> >> > As to the discrepancy around calculating dataframe size. I'm not
>>> sure why that would be so off for you. Going off the docs, it seems like it
>>> should be accurate. My Dataframe in question is [7196546 rows x 56 columns]
>>> where each column is mostly a float or integer and datetime index. 7196546
>>> * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>>> >> >
>>> >> > Thank You,
>>> >> > Arun
>>> >> >
>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> 2x overshoot of memory does seem a little high.  Are you including
>>> the
>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV
>>> reader
>>> >> >> will require 2-3x memory usage because it buffers the bytes in
>>> memory
>>> >> >> in case it needs to re-convert them later (because it realizes the
>>> >> >> data type for the column is different).  I'm not sure if Panda's
>>> CSV
>>> >> >> reader is similar.
>>> >> >>
>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
>>> on
>>> >> >> to a bit more memory and hold onto it (for a little while at least)
>>> >> >> even after it is no longer used.  Even malloc will hold onto memory
>>> >> >> sometimes due to fragmentation or other concerns.  You could try
>>> >> >> changing to the system allocator
>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your
>>> file)
>>> >> >> to see if that makes a difference.
>>> >> >>
>>> >> >> I'm not sure your method of calculating the dataframe size is
>>> >> >> reliable.  I don't actually know enough about pandas but when I
>>> tried
>>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
>>> >> >>
>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>>> >> >>
>>> >> >> which seems suspicious.
>>> >> >>
>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>>> >> >> only using ~1.9GB and I will naively assume the difference between
>>> the
>>> >> >> two is bloat caused by object wrappers when converting to pandas.
>>> >> >>
>>> >> >> Another thing you might try and measure is
>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you
>>> how
>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB then
>>> it
>>> >> >> is a pretty good guess that memory is being held somewhere else.
>>> >> >>
>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
>>> wrote:
>>> >> >> >
>>> >> >> > Hi Apache Arrow Members,
>>> >> >> >
>>> >> >> > My question is below but I've compiled a minimum reproducible
>>> example with a public dataset:
>>> >> >> >
>>> >> >> > import pandas as pd
>>> >> >> > from pyarrow import feather
>>> >> >> > import os
>>> >> >> > import psutil
>>> >> >> >
>>> >> >> >
>>> >> >> > def setup():
>>> >> >> >   df = pd.read_csv('
>>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>>> ')
>>> >> >> >   df.to_feather('test.csv')
>>> >> >> >
>>> >> >> > if __name__ == "__main__":
>>> >> >> >   # setup()
>>> >> >> >   process = psutil.Process(os.getpid())
>>> >> >> >   path = 'test.csv'
>>> >> >> >
>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>> >> >> >
>>> >> >> >   df = feather.read_feather(path)
>>> >> >> >
>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>> >> >> >
>>> >> >> > I substituted my df with a sample csv. I had trouble finding a
>>> sample CSV of adequate size however, my dataset is ~3GB, and I see memory
>>> usage of close to 6GB.
>>> >> >> >
>>> >> >> > Output with My Data:
>>> >> >> > BEFORE mem_size: 0.088891392gb
>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>>> >> >> >
>>> >> >> > It seems strange that the overall memory usage of the process is
>>> approx double of the size of the dataframe itself. Is there a reason for
>>> this, and is there a way to mitigate this?
>>> >> >> >
>>> >> >> > $ conda list pyarrow
>>> >> >> > #
>>> >> >> > # Name                    Version                   Build
>>> Channel
>>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
>>> conda-forge
>>> >> >> >
>>> >> >> > Thank You,
>>> >> >> > Arun Joseph
>>> >> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Arun Joseph
>>> >> >
>>> >
>>> >
>>> >
>>> > --
>>> > Arun Joseph
>>> >
>>>
>>
>>
>> --
>> Arun Joseph
>>
>>
>
> --
> Arun Joseph
>
>

-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Additionally, I tested with my actual data, and did not see memory savings.

On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <aj...@gmail.com> wrote:

> Hi Joris,
>
> Thank you for the explanation. The 2x memory consumption on conversion
> makes sense if there is a copy, but it does seem like it persists longer
> than it should. Might be because of python's GC policies?
> I tried out your recommendations but they did not seem to work. However, I
> did notice an experimental option on `to_pandas`, `self_destruct`, which
> seems to address the issue I'm facing. Sadly, that itself did not work
> either... but, combined with `split_blocks=True`, I am seeing memory
> savings:
>
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> from pyarrow import feather
> import os
> import psutil
> pa.set_memory_pool(pa.system_memory_pool())
> DATA_FILE = 'test.arrow'
>
> def setup():
>   np.random.seed(0)
>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)),
> columns=list([f'{i}' for i in range(57)]))
>   df.to_feather(DATA_FILE)
>   print(f'wrote {DATA_FILE}')
>   import sys
>   sys.exit()
>
> if __name__ == "__main__":
>   # setup()
>   process = psutil.Process(os.getpid())
>   path = DATA_FILE
>
>   mem_size = process.memory_info().rss / 1e9
>   print(f'BEFORE mem_size: {mem_size}gb')
>
>   feather_table = feather.read_table(path)
>   # df = feather_table.to_pandas(split_blocks=True)
>   # df = feather_table.to_pandas()
>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>
>   mem_size = process.memory_info().rss / 1e9
>   df_size = df.memory_usage().sum() / 1e9
>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>
>
> OUTPUT(to_pandas()):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> ARROW: 3.281625024gb
>
> OUTPUT (to_pandas(split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> I'm guessing since this feature is experimental, it might either go away,
> or might have strange behaviors. Is there anything I should look out for,
> or is there some alternative to reproduce these results?
>
> Thank You,
> Arun
>
> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
>> Hi Arun, Weston,
>>
>> I didn't try running the script locally, but a quick note: the
>> `feather.read_feather` function reads the Feather file into an Arrow
>> table ànd directly converts it to a pandas DataFrame. A memory
>> consumption 2x the size of the dataframe sounds not that unexpected to
>> me: most of the time, when converting an arrow table to a pandas
>> DataFrame, the data will be copied to accommodate for pandas' specific
>> internal memory layout (at least numeric columns will be combined
>> together in 2D arrays).
>>
>> To verify if this is the cause, you might want to do either of:
>> - use `feather.read_table` instead of `feather.read_feather`, which
>> will read the file as an Arrow table instead (and don't do any
>> conversion to pandas)
>> - if you want to include the conversion to pandas, also use
>> `read_table` and do the conversion to pandas explicitly with a
>> `to_pandas()` call on the result. In that case, you can specify
>> `split_blocks=True` to use more zero-copy conversion in the
>> arrow->pandas conversion
>>
>> Joris
>>
>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
>> >
>> > Hi Wes,
>> >
>> > Sorry for the late reply on this, but I think I got a reproducible test
>> case:
>> >
>> > import pandas as pd
>> > import numpy as np
>> > import pyarrow as pa
>> > from pyarrow import feather
>> > import os
>> > import psutil
>> > pa.set_memory_pool(pa.system_memory_pool())
>> > DATA_FILE = 'test.arrow'
>> >
>> > def setup():
>> >   np.random.seed(0)
>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
>> columns=list([f'i_{i}' for i in range(57)]))
>> >   df.to_feather(DATA_FILE)
>> >   print(f'wrote {DATA_FILE}')
>> >   import sys
>> >   sys.exit()
>> >
>> > if __name__ == "__main__":
>> >   # setup()
>> >   process = psutil.Process(os.getpid())
>> >   path = DATA_FILE
>> >
>> >   mem_size = process.memory_info().rss / 1e9
>> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >
>> >   df = feather.read_feather(path)
>> >
>> >   mem_size = process.memory_info().rss / 1e9
>> >   df_size = df.memory_usage().sum() / 1e9
>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >
>> > OUTPUT:
>> > BEFORE mem_size: 0.091795456gb
>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>> > ARROW: 3.281625024gb
>> >
>> > Let me know if you're able to see similar results.
>> >
>> > Thanks,
>> > Arun
>> >
>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
>> wrote:
>> >>
>> >> I get more or less the same results as you for the provided setup data
>> >> (exact same #'s for arrow & df_size and slightly different for RSS
>> >> which is to be expected).  The fact that the arrow size is much lower
>> >> than the dataframe size is not too surprising to me.  If a column
>> >> can't be zero copied then it's memory will disappear from the arrow
>> >> pool (I think).  Plus, object columns will have overhead in pandas
>> >> that they do not have in Arrow.
>> >>
>> >> The df_size issue for me seems to be tied to string columns.  I think
>> >> pandas is overestimating how much size is needed there (many of my
>> >> strings are similar and I wonder if some kind of object sharing is
>> >> happening).  But we can table this for another time.
>> >>
>> >> I tried writing my feather file with your parameters and it didn't
>> >> have much impact on any of the numbers.
>> >>
>> >> Since the arrow size for you is expected (nearly the same as the
>> >> df_size) I'm not sure what to investigate next.  The memory does not
>> >> seem to be retained by Arrow.  Is there any chance you could create a
>> >> reproducible test case using randomly generated numpy data (then you
>> >> could share that setup function)?
>> >>
>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
>> >> >
>> >> > Hi Wes,
>> >> >
>> >> > I'm not including the setup() call when I encounter the issue. I
>> just kept it in there for ease of reproducibility. Memory usage is indeed
>> higher when it is included, but that isn't surprising.
>> >> >
>> >> > I tried switching over to the system allocator but there is no
>> change.
>> >> >
>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >> >
>> >> > I updated my script to also include the Arrow bytes allocated and it
>> gave me the following:
>> >> >
>> >> > MVE:
>> >> > import pandas as pd
>> >> > import pyarrow as pa
>> >> > from pyarrow import feather
>> >> > import os
>> >> > import psutil
>> >> > pa.set_memory_pool(pa.system_memory_pool())
>> >> >
>> >> >
>> >> > def setup():
>> >> >   df = pd.read_csv('
>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>> ')
>> >> >   df.to_feather('test.csv')
>> >> >
>> >> > if __name__ == "__main__":
>> >> >   # setup()
>> >> >   process = psutil.Process(os.getpid())
>> >> >   path = 'test.csv'
>> >> >
>> >> >   mem_size = process.memory_info().rss / 1e9
>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >> >
>> >> >   df = feather.read_feather(path)
>> >> >
>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >   mem_size = process.memory_info().rss / 1e10
>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
>> 1e9}gb')
>> >> >
>> >> > Output with my data:
>> >> > BEFORE mem_size: 0.08761344gb
>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> >> > ARROW: 3.080121792gb
>> >> >
>> >> > Output with Provided Setup Data:
>> >> > BEFORE mem_size: 0.09179136gb
>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> >> > ARROW: 0.00029664gb
>> >> >
>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
>> distinct and non-overlapping, but it seems strange that the output with the
>> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
>> data has the allocated data approximately equal to the dataframe size. I'm
>> not sure if it affects anything but my file was written with the following:
>> >> >
>> >> > import pyarrow.lib as ext
>> >> > import pyarrow
>> >> > COMPRESSION_LEVEL = 19
>> >> > COMPRESSION_ALGO = 'zstd'
>> >> > KILOBYTE = 1 << 10
>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>> >> > CHUNK_SIZE = MEGABYTE
>> >> >
>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> >> > ext.write_feather(table, dest, compression=compression,
>> compression_level=compression_level,chunksize=chunk_size, version=2)
>> >> >
>> >> > As to the discrepancy around calculating dataframe size. I'm not
>> sure why that would be so off for you. Going off the docs, it seems like it
>> should be accurate. My Dataframe in question is [7196546 rows x 56 columns]
>> where each column is mostly a float or integer and datetime index. 7196546
>> * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>> >> >
>> >> > Thank You,
>> >> > Arun
>> >> >
>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com>
>> wrote:
>> >> >>
>> >> >> 2x overshoot of memory does seem a little high.  Are you including
>> the
>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
>> >> >> will require 2-3x memory usage because it buffers the bytes in
>> memory
>> >> >> in case it needs to re-convert them later (because it realizes the
>> >> >> data type for the column is different).  I'm not sure if Panda's CSV
>> >> >> reader is similar.
>> >> >>
>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
>> >> >> to a bit more memory and hold onto it (for a little while at least)
>> >> >> even after it is no longer used.  Even malloc will hold onto memory
>> >> >> sometimes due to fragmentation or other concerns.  You could try
>> >> >> changing to the system allocator
>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your
>> file)
>> >> >> to see if that makes a difference.
>> >> >>
>> >> >> I'm not sure your method of calculating the dataframe size is
>> >> >> reliable.  I don't actually know enough about pandas but when I
>> tried
>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
>> >> >>
>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >> >>
>> >> >> which seems suspicious.
>> >> >>
>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>> >> >> only using ~1.9GB and I will naively assume the difference between
>> the
>> >> >> two is bloat caused by object wrappers when converting to pandas.
>> >> >>
>> >> >> Another thing you might try and measure is
>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you
>> how
>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB then
>> it
>> >> >> is a pretty good guess that memory is being held somewhere else.
>> >> >>
>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
>> wrote:
>> >> >> >
>> >> >> > Hi Apache Arrow Members,
>> >> >> >
>> >> >> > My question is below but I've compiled a minimum reproducible
>> example with a public dataset:
>> >> >> >
>> >> >> > import pandas as pd
>> >> >> > from pyarrow import feather
>> >> >> > import os
>> >> >> > import psutil
>> >> >> >
>> >> >> >
>> >> >> > def setup():
>> >> >> >   df = pd.read_csv('
>> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
>> ')
>> >> >> >   df.to_feather('test.csv')
>> >> >> >
>> >> >> > if __name__ == "__main__":
>> >> >> >   # setup()
>> >> >> >   process = psutil.Process(os.getpid())
>> >> >> >   path = 'test.csv'
>> >> >> >
>> >> >> >   mem_size = process.memory_info().rss / 1e9
>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >> >> >
>> >> >> >   df = feather.read_feather(path)
>> >> >> >
>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >> >   mem_size = process.memory_info().rss / 1e9
>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >> >
>> >> >> > I substituted my df with a sample csv. I had trouble finding a
>> sample CSV of adequate size however, my dataset is ~3GB, and I see memory
>> usage of close to 6GB.
>> >> >> >
>> >> >> > Output with My Data:
>> >> >> > BEFORE mem_size: 0.088891392gb
>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >> >> >
>> >> >> > It seems strange that the overall memory usage of the process is
>> approx double of the size of the dataframe itself. Is there a reason for
>> this, and is there a way to mitigate this?
>> >> >> >
>> >> >> > $ conda list pyarrow
>> >> >> > #
>> >> >> > # Name                    Version                   Build  Channel
>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
>> conda-forge
>> >> >> >
>> >> >> > Thank You,
>> >> >> > Arun Joseph
>> >> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Arun Joseph
>> >> >
>> >
>> >
>> >
>> > --
>> > Arun Joseph
>> >
>>
>
>
> --
> Arun Joseph
>
>

-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Hi Joris,

Thank you for the explanation. The 2x memory consumption on conversion
makes sense if there is a copy, but the extra memory does seem to persist
longer than it should. Might that be because of Python's GC policies?
I tried out your recommendations but, on their own, they did not seem to
help. However, I did notice an experimental option on `to_pandas`,
`self_destruct`, which seems to address the issue I'm facing. Sadly, that
option by itself did not work either, but combined with `split_blocks=True`
I am seeing memory savings:

import pandas as pd
import numpy as np
import pyarrow as pa
from pyarrow import feather
import os
import psutil
pa.set_memory_pool(pa.system_memory_pool())
DATA_FILE = 'test.arrow'

def setup():
  np.random.seed(0)
  df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)),
                    columns=[f'{i}' for i in range(57)])
  df.to_feather(DATA_FILE)
  print(f'wrote {DATA_FILE}')
  import sys
  sys.exit()

if __name__ == "__main__":
  # setup()
  process = psutil.Process(os.getpid())
  path = DATA_FILE

  mem_size = process.memory_info().rss / 1e9
  print(f'BEFORE mem_size: {mem_size}gb')

  feather_table = feather.read_table(path)
  # df = feather_table.to_pandas(split_blocks=True)
  # df = feather_table.to_pandas()
  df = feather_table.to_pandas(self_destruct=True, split_blocks=True)

  mem_size = process.memory_info().rss / 1e9
  df_size = df.memory_usage().sum() / 1e9
  print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
  print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')


OUTPUT(to_pandas()):
BEFORE mem_size: 0.091795456gb
AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
ARROW: 3.281625024gb

OUTPUT (to_pandas(split_blocks=True)):
BEFORE mem_size: 0.091795456gb
AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
ARROW: 3.281627712gb

OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
BEFORE mem_size: 0.091795456gb
AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
ARROW: 3.281627712gb

I'm guessing that since this feature is experimental, it might either go away
or have strange behaviors. Is there anything I should look out for, or is
there some alternative way to reproduce these results?
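
For now I'm planning to isolate the pattern in a small helper of my own
(this is my wrapper, not a pyarrow API), so the Table can never be touched
after the self-destructing conversion, which I understand to be the main
documented caveat of `self_destruct`:

from pyarrow import feather

def read_feather_low_copy(path):
    # Keep the Table scoped to this function: after to_pandas(self_destruct=True)
    # the table must not be used again, so only the DataFrame is handed back.
    table = feather.read_table(path)
    df = table.to_pandas(self_destruct=True, split_blocks=True)
    del table
    return df

df = read_feather_low_copy(DATA_FILE)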

Thank You,
Arun

On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Arun, Weston,
>
> I didn't try running the script locally, but a quick note: the
> `feather.read_feather` function reads the Feather file into an Arrow
> table ànd directly converts it to a pandas DataFrame. A memory
> consumption 2x the size of the dataframe sounds not that unexpected to
> me: most of the time, when converting an arrow table to a pandas
> DataFrame, the data will be copied to accommodate for pandas' specific
> internal memory layout (at least numeric columns will be combined
> together in 2D arrays).
>
> To verify if this is the cause, you might want to do either of:
> - use `feather.read_table` instead of `feather.read_feather`, which
> will read the file as an Arrow table instead (and don't do any
> conversion to pandas)
> - if you want to include the conversion to pandas, also use
> `read_table` and do the conversion to pandas explicitly with a
> `to_pandas()` call on the result. In that case, you can specify
> `split_blocks=True` to use more zero-copy conversion in the
> arrow->pandas conversion
>
> Joris
>
> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > Sorry for the late reply on this, but I think I got a reproducible test
> case:
> >
> > import pandas as pd
> > import numpy as np
> > import pyarrow as pa
> > from pyarrow import feather
> > import os
> > import psutil
> > pa.set_memory_pool(pa.system_memory_pool())
> > DATA_FILE = 'test.arrow'
> >
> > def setup():
> >   np.random.seed(0)
> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
> columns=list([f'i_{i}' for i in range(57)]))
> >   df.to_feather(DATA_FILE)
> >   print(f'wrote {DATA_FILE}')
> >   import sys
> >   sys.exit()
> >
> > if __name__ == "__main__":
> >   # setup()
> >   process = psutil.Process(os.getpid())
> >   path = DATA_FILE
> >
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'BEFORE mem_size: {mem_size}gb')
> >
> >   df = feather.read_feather(path)
> >
> >   mem_size = process.memory_info().rss / 1e9
> >   df_size = df.memory_usage().sum() / 1e9
> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >
> > OUTPUT:
> > BEFORE mem_size: 0.091795456gb
> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> > ARROW: 3.281625024gb
> >
> > Let me know if you're able to see similar results.
> >
> > Thanks,
> > Arun
> >
> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com>
> wrote:
> >>
> >> I get more or less the same results as you for the provided setup data
> >> (exact same #'s for arrow & df_size and slightly different for RSS
> >> which is to be expected).  The fact that the arrow size is much lower
> >> than the dataframe size is not too surprising to me.  If a column
> >> can't be zero copied then it's memory will disappear from the arrow
> >> pool (I think).  Plus, object columns will have overhead in pandas
> >> that they do not have in Arrow.
> >>
> >> The df_size issue for me seems to be tied to string columns.  I think
> >> pandas is overestimating how much size is needed there (many of my
> >> strings are similar and I wonder if some kind of object sharing is
> >> happening).  But we can table this for another time.
> >>
> >> I tried writing my feather file with your parameters and it didn't
> >> have much impact on any of the numbers.
> >>
> >> Since the arrow size for you is expected (nearly the same as the
> >> df_size) I'm not sure what to investigate next.  The memory does not
> >> seem to be retained by Arrow.  Is there any chance you could create a
> >> reproducible test case using randomly generated numpy data (then you
> >> could share that setup function)?
> >>
> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
> >> >
> >> > Hi Wes,
> >> >
> >> > I'm not including the setup() call when I encounter the issue. I just
> kept it in there for ease of reproducibility. Memory usage is indeed higher
> when it is included, but that isn't surprising.
> >> >
> >> > I tried switching over to the system allocator but there is no change.
> >> >
> >> > I've updated to Arrow 6.0.1 as well and there is no change.
> >> >
> >> > I updated my script to also include the Arrow bytes allocated and it
> gave me the following:
> >> >
> >> > MVE:
> >> > import pandas as pd
> >> > import pyarrow as pa
> >> > from pyarrow import feather
> >> > import os
> >> > import psutil
> >> > pa.set_memory_pool(pa.system_memory_pool())
> >> >
> >> >
> >> > def setup():
> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >> >   df.to_feather('test.csv')
> >> >
> >> > if __name__ == "__main__":
> >> >   # setup()
> >> >   process = psutil.Process(os.getpid())
> >> >   path = 'test.csv'
> >> >
> >> >   mem_size = process.memory_info().rss / 1e9
> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >
> >> >   df = feather.read_feather(path)
> >> >
> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >   mem_size = process.memory_info().rss / 1e10
> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() /
> 1e9}gb')
> >> >
> >> > Output with my data:
> >> > BEFORE mem_size: 0.08761344gb
> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> >> > ARROW: 3.080121792gb
> >> >
> >> > Output with Provided Setup Data:
> >> > BEFORE mem_size: 0.09179136gb
> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> >> > ARROW: 0.00029664gb
> >> >
> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
> distinct and non-overlapping, but it seems strange that the output with the
> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
> data has the allocated data approximately equal to the dataframe size. I'm
> not sure if it affects anything but my file was written with the following:
> >> >
> >> > import pyarrow.lib as ext
> >> > import pyarrow
> >> > COMPRESSION_LEVEL = 19
> >> > COMPRESSION_ALGO = 'zstd'
> >> > KILOBYTE = 1 << 10
> >> > MEGABYTE = KILOBYTE * KILOBYTE
> >> > CHUNK_SIZE = MEGABYTE
> >> >
> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> >> > ext.write_feather(table, dest, compression=compression,
> compression_level=compression_level,chunksize=chunk_size, version=2)
> >> >
> >> > As to the discrepancy around calculating dataframe size. I'm not sure
> why that would be so off for you. Going off the docs, it seems like it
> should be accurate. My Dataframe in question is [7196546 rows x 56 columns]
> where each column is mostly a float or integer and datetime index. 7196546
> * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >> >
> >> > Thank You,
> >> > Arun
> >> >
> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com>
> wrote:
> >> >>
> >> >> 2x overshoot of memory does seem a little high.  Are you including
> the
> >> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
> >> >> will require 2-3x memory usage because it buffers the bytes in memory
> >> >> in case it needs to re-convert them later (because it realizes the
> >> >> data type for the column is different).  I'm not sure if pandas' CSV
> >> >> reader is similar.
> >> >>
> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
> >> >> to a bit more memory and hold onto it (for a little while at least)
> >> >> even after it is no longer used.  Even malloc will hold onto memory
> >> >> sometimes due to fragmentation or other concerns.  You could try
> >> >> changing to the system allocator
> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> >> >> to see if that makes a difference.
> >> >>
> >> >> I'm not sure your method of calculating the dataframe size is
> >> >> reliable.  I don't actually know enough about pandas but when I tried
> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
> >> >>
> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >> >>
> >> >> which seems suspicious.
> >> >>
> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
> >> >> only using ~1.9GB and I will naively assume the difference between
> the
> >> >> two is bloat caused by object wrappers when converting to pandas.
> >> >>
> >> >> Another thing you might try and measure is
> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
> >> >> is a pretty good guess that memory is being held somewhere else.
> >> >>
> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com>
> wrote:
> >> >> >
> >> >> > Hi Apache Arrow Members,
> >> >> >
> >> >> > My question is below but I've compiled a minimum reproducible
> example with a public dataset:
> >> >> >
> >> >> > import pandas as pd
> >> >> > from pyarrow import feather
> >> >> > import os
> >> >> > import psutil
> >> >> >
> >> >> >
> >> >> > def setup():
> >> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >> >> >   df.to_feather('test.csv')
> >> >> >
> >> >> > if __name__ == "__main__":
> >> >> >   # setup()
> >> >> >   process = psutil.Process(os.getpid())
> >> >> >   path = 'test.csv'
> >> >> >
> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >> >
> >> >> >   df = feather.read_feather(path)
> >> >> >
> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >> >   mem_size = process.memory_info().rss / 1e9
> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >> >
> >> >> > I substituted my df with a sample csv. I had trouble finding a
> sample CSV of adequate size however, my dataset is ~3GB, and I see memory
> usage of close to 6GB.
> >> >> >
> >> >> > Output with My Data:
> >> >> > BEFORE mem_size: 0.088891392gb
> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >> >> >
> >> >> > It seems strange that the overall memory usage of the process is
> approx double of the size of the dataframe itself. Is there a reason for
> this, and is there a way to mitigate this?
> >> >> >
> >> >> > $ conda list pyarrow
> >> >> > #
> >> >> > # Name                    Version                   Build  Channel
> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
> conda-forge
> >> >> >
> >> >> > Thank You,
> >> >> > Arun Joseph
> >> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Arun Joseph
> >> >
> >
> >
> >
> > --
> > Arun Joseph
> >
>


-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Arun, Weston,

I didn't try running the script locally, but a quick note: the
`feather.read_feather` function reads the Feather file into an Arrow
table and directly converts it to a pandas DataFrame. Memory
consumption of 2x the size of the dataframe does not sound that
unexpected to me: most of the time, when converting an Arrow table to a
pandas DataFrame, the data will be copied to accommodate pandas'
specific internal memory layout (at least the numeric columns will be
combined together in 2D arrays).
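
As a small illustration of that consolidated layout (a sketch that pokes
at pandas' internal block manager, so the exact attributes may differ
between pandas versions):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["a", "b"])

# Both float64 columns live in a single consolidated 2D block, which is why
# building this layout from separate per-column Arrow arrays needs a copy.
print(df._mgr.nblocks)                   # 1
print(df._mgr.blocks[0].values.shape)    # (2, 3): one array holds both columns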

To verify whether this is the cause, you can do either of the following
(a short sketch follows after this list):
- use `feather.read_table` instead of `feather.read_feather`, which
reads the file as an Arrow table and does not do any conversion to
pandas
- if you want to include the conversion to pandas, also use
`read_table` and do the conversion explicitly with a `to_pandas()` call
on the result. In that case, you can pass `split_blocks=True` to get
more zero-copy behaviour in the Arrow->pandas conversion
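
A minimal sketch of both options (using the 'test.arrow'-style path from
this thread; `self_destruct=True` is an additional, experimental flag not
mentioned above that lets Arrow free its buffers while converting):

import pyarrow as pa
from pyarrow import feather

# Option 1: stop at the Arrow table, no pandas conversion
table = feather.read_table('test.arrow')
print(f'Arrow holds {pa.total_allocated_bytes() / 1e9}gb')

# Option 2: convert explicitly, keeping one block per column
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table  # after self_destruct the table must not be used again
print(f'Arrow holds {pa.total_allocated_bytes() / 1e9}gb after conversion')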

Joris

On Mon, 6 Dec 2021 at 15:05, Arun Joseph <aj...@gmail.com> wrote:
>
> Hi Wes,
>
> Sorry for the late reply on this, but I think I got a reproducible test case:
>
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> from pyarrow import feather
> import os
> import psutil
> pa.set_memory_pool(pa.system_memory_pool())
> DATA_FILE = 'test.arrow'
>
> def setup():
>   np.random.seed(0)
>   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
>   df.to_feather(DATA_FILE)
>   print(f'wrote {DATA_FILE}')
>   import sys
>   sys.exit()
>
> if __name__ == "__main__":
>   # setup()
>   process = psutil.Process(os.getpid())
>   path = DATA_FILE
>
>   mem_size = process.memory_info().rss / 1e9
>   print(f'BEFORE mem_size: {mem_size}gb')
>
>   df = feather.read_feather(path)
>
>   mem_size = process.memory_info().rss / 1e9
>   df_size = df.memory_usage().sum() / 1e9
>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>
> OUTPUT:
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
> ARROW: 3.281625024gb
>
> Let me know if you're able to see similar results.
>
> Thanks,
> Arun
>
> On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com> wrote:
>>
>> I get more or less the same results as you for the provided setup data
>> (exact same #'s for arrow & df_size and slightly different for RSS
>> which is to be expected).  The fact that the arrow size is much lower
>> than the dataframe size is not too surprising to me.  If a column
>> can't be zero copied then its memory will disappear from the arrow
>> pool (I think).  Plus, object columns will have overhead in pandas
>> that they do not have in Arrow.
>>
>> The df_size issue for me seems to be tied to string columns.  I think
>> pandas is overestimating how much size is needed there (many of my
>> strings are similar and I wonder if some kind of object sharing is
>> happening).  But we can table this for another time.
>>
>> I tried writing my feather file with your parameters and it didn't
>> have much impact on any of the numbers.
>>
>> Since the arrow size for you is expected (nearly the same as the
>> df_size) I'm not sure what to investigate next.  The memory does not
>> seem to be retained by Arrow.  Is there any chance you could create a
>> reproducible test case using randomly generated numpy data (then you
>> could share that setup function)?
>>
>> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
>> >
>> > Hi Wes,
>> >
>> > I'm not including the setup() call when I encounter the issue. I just kept it in there for ease of reproducibility. Memory usage is indeed higher when it is included, but that isn't surprising.
>> >
>> > I tried switching over to the system allocator but there is no change.
>> >
>> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >
>> > I updated my script to also include the Arrow bytes allocated and it gave me the following:
>> >
>> > MVE:
>> > import pandas as pd
>> > import pyarrow as pa
>> > from pyarrow import feather
>> > import os
>> > import psutil
>> > pa.set_memory_pool(pa.system_memory_pool())
>> >
>> >
>> > def setup():
>> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >   df.to_feather('test.csv')
>> >
>> > if __name__ == "__main__":
>> >   # setup()
>> >   process = psutil.Process(os.getpid())
>> >   path = 'test.csv'
>> >
>> >   mem_size = process.memory_info().rss / 1e9
>> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >
>> >   df = feather.read_feather(path)
>> >
>> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >   mem_size = process.memory_info().rss / 1e9
>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >
>> > Output with my data:
>> > BEFORE mem_size: 0.08761344gb
>> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> > ARROW: 3.080121792gb
>> >
>> > Output with Provided Setup Data:
>> > BEFORE mem_size: 0.09179136gb
>> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> > ARROW: 0.00029664gb
>> >
>> > I'm assuming that the df and the arrow bytes allocated/sizes are distinct and non-overlapping, but it seems strange that the output with the provided data has the Arrow bytes allocated at ~0GB whereas the one with my data has the allocated data approximately equal to the dataframe size. I'm not sure if it affects anything but my file was written with the following:
>> >
>> > import pyarrow.lib as ext
>> > import pyarrow
>> > COMPRESSION_LEVEL = 19
>> > COMPRESSION_ALGO = 'zstd'
>> > KILOBYTE = 1 << 10
>> > MEGABYTE = KILOBYTE * KILOBYTE
>> > CHUNK_SIZE = MEGABYTE
>> >
>> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level,chunksize=chunk_size, version=2)
>> >
>> > As to the discrepancy around calculating dataframe size. I'm not sure why that would be so off for you. Going off the docs, it seems like it should be accurate. My Dataframe in question is [7196546 rows x 56 columns] where each column is mostly a float or integer and datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>> >
>> > Thank You,
>> > Arun
>> >
>> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com> wrote:
>> >>
>> >> 2x overshoot of memory does seem a little high.  Are you including the
>> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
>> >> will require 2-3x memory usage because it buffers the bytes in memory
>> >> in case it needs to re-convert them later (because it realizes the
>> >> data type for the column is different).  I'm not sure if pandas' CSV
>> >> reader is similar.
>> >>
>> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
>> >> to a bit more memory and hold onto it (for a little while at least)
>> >> even after it is no longer used.  Even malloc will hold onto memory
>> >> sometimes due to fragmentation or other concerns.  You could try
>> >> changing to the system allocator
>> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
>> >> to see if that makes a difference.
>> >>
>> >> I'm not sure your method of calculating the dataframe size is
>> >> reliable.  I don't actually know enough about pandas but when I tried
>> >> your experiment with my own 1.9G CSV file it ended up reporting:
>> >>
>> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >>
>> >> which seems suspicious.
>> >>
>> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>> >> only using ~1.9GB and I will naively assume the difference between the
>> >> two is bloat caused by object wrappers when converting to pandas.
>> >>
>> >> Another thing you might try and measure is
>> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
>> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
>> >> is a pretty good guess that memory is being held somewhere else.
>> >>
>> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
>> >> >
>> >> > Hi Apache Arrow Members,
>> >> >
>> >> > My question is below but I've compiled a minimum reproducible example with a public dataset:
>> >> >
>> >> > import pandas as pd
>> >> > from pyarrow import feather
>> >> > import os
>> >> > import psutil
>> >> >
>> >> >
>> >> > def setup():
>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >> >   df.to_feather('test.csv')
>> >> >
>> >> > if __name__ == "__main__":
>> >> >   # setup()
>> >> >   process = psutil.Process(os.getpid())
>> >> >   path = 'test.csv'
>> >> >
>> >> >   mem_size = process.memory_info().rss / 1e9
>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >> >
>> >> >   df = feather.read_feather(path)
>> >> >
>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >   mem_size = process.memory_info().rss / 1e9
>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >
>> >> > I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size however, my dataset is ~3GB, and I see memory usage of close to 6GB.
>> >> >
>> >> > Output with My Data:
>> >> > BEFORE mem_size: 0.088891392gb
>> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >> >
>> >> > It seems strange that the overall memory usage of the process is approx double of the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
>> >> >
>> >> > $ conda list pyarrow
>> >> > #
>> >> > # Name                    Version                   Build  Channel
>> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>> >> >
>> >> > Thank You,
>> >> > Arun Joseph
>> >> >
>> >
>> >
>> >
>> > --
>> > Arun Joseph
>> >
>
>
>
> --
> Arun Joseph
>

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Hi Wes,

Sorry for the late reply on this, but I think I got a reproducible test
case:

import pandas as pd
import numpy as np
import pyarrow as pa
from pyarrow import feather
import os
import psutil
pa.set_memory_pool(pa.system_memory_pool())
DATA_FILE = 'test.arrow'

def setup():
  np.random.seed(0)
  df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)),
                    columns=list([f'i_{i}' for i in range(57)]))
  df.to_feather(DATA_FILE)
  print(f'wrote {DATA_FILE}')
  import sys
  sys.exit()

if __name__ == "__main__":
  # setup()
  process = psutil.Process(os.getpid())
  path = DATA_FILE

  mem_size = process.memory_info().rss / 1e9
  print(f'BEFORE mem_size: {mem_size}gb')

  df = feather.read_feather(path)

  mem_size = process.memory_info().rss / 1e9
  df_size = df.memory_usage().sum() / 1e9
  print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
  print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')

OUTPUT:
BEFORE mem_size: 0.091795456gb
AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
ARROW: 3.281625024gb
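
The numbers above roughly account for two resident copies of the data
(quick check, reusing the variables from the script):

# pandas blocks + Arrow buffers + ~0.09gb baseline, vs. the 6.76gb RSS above
arrow_gb = pa.default_memory_pool().bytes_allocated() / 1e9
print(f'accounted for: {df_size + arrow_gb + 0.09}gb')   # ~6.65gb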

Let me know if you're able to see similar results.

Thanks,
Arun

On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <we...@gmail.com> wrote:

> I get more or less the same results as you for the provided setup data
> (exact same #'s for arrow & df_size and slightly different for RSS
> which is to be expected).  The fact that the arrow size is much lower
> than the dataframe size is not too surprising to me.  If a column
> can't be zero copied then its memory will disappear from the arrow
> pool (I think).  Plus, object columns will have overhead in pandas
> that they do not have in Arrow.
>
> The df_size issue for me seems to be tied to string columns.  I think
> pandas is overestimating how much size is needed there (many of my
> strings are similar and I wonder if some kind of object sharing is
> happening).  But we can table this for another time.
>
> I tried writing my feather file with your parameters and it didn't
> have much impact on any of the numbers.
>
> Since the arrow size for you is expected (nearly the same as the
> df_size) I'm not sure what to investigate next.  The memory does not
> seem to be retained by Arrow.  Is there any chance you could create a
> reproducible test case using randomly generated numpy data (then you
> could share that setup function)?
>
> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I'm not including the setup() call when I encounter the issue. I just
> kept it in there for ease of reproducibility. Memory usage is indeed higher
> when it is included, but that isn't surprising.
> >
> > I tried switching over to the system allocator but there is no change.
> >
> > I've updated to Arrow 6.0.1 as well and there is no change.
> >
> > I updated my script to also include the Arrow bytes allocated and it
> gave me the following:
> >
> > MVE:
> > import pandas as pd
> > import pyarrow as pa
> > from pyarrow import feather
> > import os
> > import psutil
> > pa.set_memory_pool(pa.system_memory_pool())
> >
> >
> > def setup():
> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >   df.to_feather('test.csv')
> >
> > if __name__ == "__main__":
> >   # setup()
> >   process = psutil.Process(os.getpid())
> >   path = 'test.csv'
> >
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'BEFORE mem_size: {mem_size}gb')
> >
> >   df = feather.read_feather(path)
> >
> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
> >
> > Output with my data:
> > BEFORE mem_size: 0.08761344gb
> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> > ARROW: 3.080121792gb
> >
> > Output with Provided Setup Data:
> > BEFORE mem_size: 0.09179136gb
> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> > ARROW: 0.00029664gb
> >
> > I'm assuming that the df and the arrow bytes allocated/sizes are
> distinct and non-overlapping, but it seems strange that the output with the
> provided data has the Arrow bytes allocated at ~0GB whereas the one with my
> data has the allocated data approximately equal to the dataframe size. I'm
> not sure if it affects anything but my file was written with the following:
> >
> > import pyarrow.lib as ext
> > import pyarrow
> > COMPRESSION_LEVEL = 19
> > COMPRESSION_ALGO = 'zstd'
> > KILOBYTE = 1 << 10
> > MEGABYTE = KILOBYTE * KILOBYTE
> > CHUNK_SIZE = MEGABYTE
> >
> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> > ext.write_feather(table, dest, compression=compression,
> compression_level=compression_level,chunksize=chunk_size, version=2)
> >
> > As to the discrepancy around calculating dataframe size. I'm not sure
> why that would be so off for you. Going off the docs, it seems like it
> should be accurate. My Dataframe in question is [7196546 rows x 56 columns]
> where each column is mostly a float or integer and datetime index. 7196546
> * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
> >
> > Thank You,
> > Arun
> >
> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com>
> wrote:
> >>
> >> 2x overshoot of memory does seem a little high.  Are you including the
> >> "setup" part when you encounter that?  Arrow's file-based CSV reader
> >> will require 2-3x memory usage because it buffers the bytes in memory
> >> in case it needs to re-convert them later (because it realizes the
> >> data type for the column is different).  I'm not sure if pandas' CSV
> >> reader is similar.
> >>
> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
> >> to a bit more memory and hold onto it (for a little while at least)
> >> even after it is no longer used.  Even malloc will hold onto memory
> >> sometimes due to fragmentation or other concerns.  You could try
> >> changing to the system allocator
> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> >> to see if that makes a difference.
> >>
> >> I'm not sure your method of calculating the dataframe size is
> >> reliable.  I don't actually know enough about pandas but when I tried
> >> your experiment with my own 1.9G CSV file it ended up reporting:
> >>
> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
> >>
> >> which seems suspicious.
> >>
> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
> >> only using ~1.9GB and I will naively assume the difference between the
> >> two is bloat caused by object wrappers when converting to pandas.
> >>
> >> Another thing you might try and measure is
> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
> >> much memory Arrow itself is hanging onto.  If that is not 6GB then it
> >> is a pretty good guess that memory is being held somewhere else.
> >>
> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >> >
> >> > Hi Apache Arrow Members,
> >> >
> >> > My question is below but I've compiled a minimum reproducible example
> with a public dataset:
> >> >
> >> > import pandas as pd
> >> > from pyarrow import feather
> >> > import os
> >> > import psutil
> >> >
> >> >
> >> > def setup():
> >> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >> >   df.to_feather('test.csv')
> >> >
> >> > if __name__ == "__main__":
> >> >   # setup()
> >> >   process = psutil.Process(os.getpid())
> >> >   path = 'test.csv'
> >> >
> >> >   mem_size = process.memory_info().rss / 1e9
> >> >   print(f'BEFORE mem_size: {mem_size}gb')
> >> >
> >> >   df = feather.read_feather(path)
> >> >
> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >> >   mem_size = process.memory_info().rss / 1e9
> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >> >
> >> > I substituted my df with a sample csv. I had trouble finding a sample
> CSV of adequate size however, my dataset is ~3GB, and I see memory usage of
> close to 6GB.
> >> >
> >> > Output with My Data:
> >> > BEFORE mem_size: 0.088891392gb
> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >> >
> >> > It seems strange that the overall memory usage of the process is
> approx double of the size of the dataframe itself. Is there a reason for
> this, and is there a way to mitigate this?
> >> >
> >> > $ conda list pyarrow
> >> > #
> >> > # Name                    Version                   Build  Channel
> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu
> conda-forge
> >> >
> >> > Thank You,
> >> > Arun Joseph
> >> >
> >
> >
> >
> > --
> > Arun Joseph
> >
>


-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Weston Pace <we...@gmail.com>.
I get more or less the same results as you for the provided setup data
(exact same #'s for arrow & df_size and slightly different for RSS
which is to be expected).  The fact that the arrow size is much lower
than the dataframe size is not too surprising to me.  If a column
can't be zero copied then its memory will disappear from the arrow
pool (I think).  Plus, object columns will have overhead in pandas
that they do not have in Arrow.

The df_size issue for me seems to be tied to string columns.  I think
pandas is overestimating how much size is needed there (many of my
strings are similar and I wonder if some kind of object sharing is
happening).  But we can table this for another time.
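
A small sketch of how shared string objects could inflate that estimate
(made-up data; the reported numbers are approximate):

import pandas as pd

shared = pd.Series(["spam"] * 1_000_000)    # one str object referenced 1M times
unique = pd.Series([f"spam{i}" for i in range(1_000_000)])

# memory_usage(deep=True) sums sys.getsizeof() per element, so the shared
# series is counted as if every row owned its own copy of "spam".
print(shared.memory_usage(deep=True) / 1e6)   # reported ~60MB, real footprint ~8MB
print(unique.memory_usage(deep=True) / 1e6)   # a similar report, but here it's real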

I tried writing my feather file with your parameters and it didn't
have much impact on any of the numbers.

Since the arrow size for you is expected (nearly the same as the
df_size) I'm not sure what to investigate next.  The memory does not
seem to be retained by Arrow.  Is there any chance you could create a
reproducible test case using randomly generated numpy data (then you
could share that setup function)?

On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <aj...@gmail.com> wrote:
>
> Hi Wes,
>
> I'm not including the setup() call when I encounter the issue. I just kept it in there for ease of reproducibility. Memory usage is indeed higher when it is included, but that isn't surprising.
>
> I tried switching over to the system allocator but there is no change.
>
> I've updated to Arrow 6.0.1 as well and there is no change.
>
> I updated my script to also include the Arrow bytes allocated and it gave me the following:
>
> MVE:
> import pandas as pd
> import pyarrow as pa
> from pyarrow import feather
> import os
> import psutil
> pa.set_memory_pool(pa.system_memory_pool())
>
>
> def setup():
>   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>   df.to_feather('test.csv')
>
> if __name__ == "__main__":
>   # setup()
>   process = psutil.Process(os.getpid())
>   path = 'test.csv'
>
>   mem_size = process.memory_info().rss / 1e9
>   print(f'BEFORE mem_size: {mem_size}gb')
>
>   df = feather.read_feather(path)
>
>   df_size = df.memory_usage(deep=True).sum() / 1e9
>   mem_size = process.memory_info().rss / 1e9
>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>
> Output with my data:
> BEFORE mem_size: 0.08761344gb
> AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
> ARROW: 3.080121792gb
>
> Output with Provided Setup Data:
> BEFORE mem_size: 0.09179136gb
> AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
> ARROW: 0.00029664gb
>
> I'm assuming that the df and the arrow bytes allocated/sizes are distinct and non-overlapping, but it seems strange that the output with the provided data has the Arrow bytes allocated at ~0GB whereas the one with my data has the allocated data approximately equal to the dataframe size. I'm not sure if it affects anything but my file was written with the following:
>
> import pyarrow.lib as ext
> import pyarrow
> COMPRESSION_LEVEL = 19
> COMPRESSION_ALGO = 'zstd'
> KILOBYTE = 1 << 10
> MEGABYTE = KILOBYTE * KILOBYTE
> CHUNK_SIZE = MEGABYTE
>
> table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
> ext.write_feather(table, dest, compression=compression, compression_level=compression_level,chunksize=chunk_size, version=2)
>
> As to the discrepancy around calculating dataframe size. I'm not sure why that would be so off for you. Going off the docs, it seems like it should be accurate. My Dataframe in question is [7196546 rows x 56 columns] where each column is mostly a float or integer and datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>
> Thank You,
> Arun
>
> On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com> wrote:
>>
>> 2x overshoot of memory does seem a little high.  Are you including the
>> "setup" part when you encounter that?  Arrow's file-based CSV reader
>> will require 2-3x memory usage because it buffers the bytes in memory
>> in case it needs to re-convert them later (because it realizes the
>> data type for the column is different).  I'm not sure if pandas' CSV
>> reader is similar.
>>
>> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
>> to a bit more memory and hold onto it (for a little while at least)
>> even after it is no longer used.  Even malloc will hold onto memory
>> sometimes due to fragmentation or other concerns.  You could try
>> changing to the system allocator
>> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
>> to see if that makes a difference.
>>
>> I'm not sure your method of calculating the dataframe size is
>> reliable.  I don't actually know enough about pandas but when I tried
>> your experiment with my own 1.9G CSV file it ended up reporting:
>>
>> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>>
>> which seems suspicious.
>>
>> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>> only using ~1.9GB and I will naively assume the difference between the
>> two is bloat caused by object wrappers when converting to pandas.
>>
>> Another thing you might try and measure is
>> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
>> much memory Arrow itself is hanging onto.  If that is not 6GB then it
>> is a pretty good guess that memory is being held somewhere else.
>>
>> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
>> >
>> > Hi Apache Arrow Members,
>> >
>> > My question is below but I've compiled a minimum reproducible example with a public dataset:
>> >
>> > import pandas as pd
>> > from pyarrow import feather
>> > import os
>> > import psutil
>> >
>> >
>> > def setup():
>> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >   df.to_feather('test.csv')
>> >
>> > if __name__ == "__main__":
>> >   # setup()
>> >   process = psutil.Process(os.getpid())
>> >   path = 'test.csv'
>> >
>> >   mem_size = process.memory_info().rss / 1e9
>> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >
>> >   df = feather.read_feather(path)
>> >
>> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >   mem_size = process.memory_info().rss / 1e9
>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >
>> > I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size however, my dataset is ~3GB, and I see memory usage of close to 6GB.
>> >
>> > Output with My Data:
>> > BEFORE mem_size: 0.088891392gb
>> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >
>> > It seems strange that the overall memory usage of the process is approx double of the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
>> >
>> > $ conda list pyarrow
>> > #
>> > # Name                    Version                   Build  Channel
>> > pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>> >
>> > Thank You,
>> > Arun Joseph
>> >
>
>
>
> --
> Arun Joseph
>

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Arun Joseph <aj...@gmail.com>.
Hi Wes,

I'm not including the setup() call when I encounter the issue. I just kept
it in there for ease of reproducibility. Memory usage is indeed higher when
it is included, but that isn't surprising.

I tried switching over to the system allocator but there is no change.

I've updated to Arrow 6.0.1 as well and there is no change.

I updated my script to also include the Arrow bytes allocated and it gave
me the following:

MVE:
import pandas as pd
import pyarrow as pa
from pyarrow import feather
import os
import psutil
pa.set_memory_pool(pa.system_memory_pool())


def setup():
  df = pd.read_csv('
https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
')
  df.to_feather('test.csv')

if __name__ == "__main__":
  # setup()
  process = psutil.Process(os.getpid())
  path = 'test.csv'

  mem_size = process.memory_info().rss / 1e9
  print(f'BEFORE mem_size: {mem_size}gb')

  df = feather.read_feather(path)

  df_size = df.memory_usage(deep=True).sum() / 1e9
  mem_size = process.memory_info().rss / 1e9
  print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
  print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')

Output with my data:
BEFORE mem_size: 0.08761344gb
AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
ARROW: 3.080121792gb

Output with Provided Setup Data:
BEFORE mem_size: 0.09179136gb
AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
ARROW: 0.00029664gb

I'm assuming that the df and the arrow bytes allocated/sizes are distinct
and non-overlapping, but it seems strange that the output with the provided
data has the Arrow bytes allocated at ~0GB whereas the one with my data has
the allocated data approximately equal to the dataframe size. I'm not sure
if it affects anything but my file was written with the following:

import pyarrow.lib as ext
import pyarrow
COMPRESSION_LEVEL = 19
COMPRESSION_ALGO = 'zstd'
KILOBYTE = 1 << 10
MEGABYTE = KILOBYTE * KILOBYTE
CHUNK_SIZE = MEGABYTE

table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
ext.write_feather(table, dest, compression=compression,
                  compression_level=compression_level, chunksize=chunk_size,
                  version=2)
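
For reference, the same write should be expressible through the public
feather API rather than pyarrow.lib (a sketch, assuming the zstd/level-19/
1MiB settings and the df/dest/preserve_index variables above):

import pyarrow as pa
from pyarrow import feather

table = pa.Table.from_pandas(df, preserve_index=preserve_index)
feather.write_feather(table, dest, compression='zstd', compression_level=19,
                      chunksize=1 << 20, version=2)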

As to the discrepancy around calculating dataframe size, I'm not sure why
that would be so off for you. Going off the docs
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html>,
it seems like it should be accurate. My DataFrame in question is [7196546
rows x 56 columns], where the columns are mostly floats or integers, with a
datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB, which roughly aligns.

Thank You,
Arun

On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <we...@gmail.com> wrote:

> 2x overshoot of memory does seem a little high.  Are you including the
> "setup" part when you encounter that?  Arrow's file-based CSV reader
> will require 2-3x memory usage because it buffers the bytes in memory
> in case it needs to re-convert them later (because it realizes the
> data type for the column is different).  I'm not sure if pandas' CSV
> reader is similar.
>
> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
> to a bit more memory and hold onto it (for a little while at least)
> even after it is no longer used.  Even malloc will hold onto memory
> sometimes due to fragmentation or other concerns.  You could try
> changing to the system allocator
> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> to see if that makes a difference.
>
> I'm not sure your method of calculating the dataframe size is
> reliable.  I don't actually know enough about pandas but when I tried
> your experiment with my own 1.9G CSV file it ended up reporting:
>
> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>
> which seems suspicious.
>
> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
> only using ~1.9GB and I will naively assume the difference between the
> two is bloat caused by object wrappers when converting to pandas.
>
> Another thing you might try and measure is
> `pa.default_memory_pool().bytes_allocated()`.  This will tell you how
> much memory Arrow itself is hanging onto.  If that is not 6GB then it
> is a pretty good guess that memory is being held somewhere else.
>
> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
> >
> > Hi Apache Arrow Members,
> >
> > My question is below but I've compiled a minimum reproducible example
> with a public dataset:
> >
> > import pandas as pd
> > from pyarrow import feather
> > import os
> > import psutil
> >
> >
> > def setup():
> >   df = pd.read_csv('
> https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv
> ')
> >   df.to_feather('test.csv')
> >
> > if __name__ == "__main__":
> >   # setup()
> >   process = psutil.Process(os.getpid())
> >   path = 'test.csv'
> >
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'BEFORE mem_size: {mem_size}gb')
> >
> >   df = feather.read_feather(path)
> >
> >   df_size = df.memory_usage(deep=True).sum() / 1e9
> >   mem_size = process.memory_info().rss / 1e9
> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >
> > I substituted my df with a sample csv. I had trouble finding a sample
> CSV of adequate size however, my dataset is ~3GB, and I see memory usage of
> close to 6GB.
> >
> > Output with My Data:
> > BEFORE mem_size: 0.088891392gb
> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >
> > It seems strange that the overall memory usage of the process is approx
> double of the size of the dataframe itself. Is there a reason for this, and
> is there a way to mitigate this?
> >
> > $ conda list pyarrow
> > #
> > # Name                    Version                   Build  Channel
> > pyarrow                   4.0.1           py37h0f64622_13_cpu
> conda-forge
> >
> > Thank You,
> > Arun Joseph
> >
>


-- 
Arun Joseph

Re: [Python] Why does reading an arrow file cause almost double the memory consumption?

Posted by Weston Pace <we...@gmail.com>.
2x overshoot of memory does seem a little high.  Are you including the
"setup" part when you encounter that?  Arrow's file-based CSV reader
will require 2-3x memory usage because it buffers the bytes in memory
in case it needs to re-convert them later (because it realizes the
data type for the column is different).  I'm not sure if pandas' CSV
reader is similar.

Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold onto
a bit more memory (for a little while at least) even after it is no
longer used.  Even malloc will hold onto memory
sometimes due to fragmentation or other concerns.  You could try
changing to the system allocator
(pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
to see if that makes a difference.
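
A sketch of what that looks like, plus a jemalloc-specific knob (both
should be in pyarrow's public API, but check the docs for your version):

import pyarrow as pa

print(pa.default_memory_pool().backend_name)   # e.g. 'jemalloc' on many builds
pa.set_memory_pool(pa.system_memory_pool())    # fall back to plain malloc/free

# If you stay on jemalloc, you can ask it to return freed pages to the OS sooner:
# pa.jemalloc_set_decay_ms(0)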

I'm not sure your method of calculating the dataframe size is
reliable.  I don't actually know enough about pandas but when I tried
your experiment with my own 1.9G CSV file it ended up reporting:

AFTER mem_size: 2.348068864gb df_size: 4.519898461gb

which seems suspicious.

Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
all that unexpected.  There was 2.348GB of usage.  Arrow itself was
only using ~1.9GB and I will naively assume the difference between the
two is bloat caused by object wrappers when converting to pandas.

Another thing you might try and measure is
`pa.default_memory_pool().bytes_allocated()`.  This will tell you how
much memory Arrow itself is hanging onto.  If that is not 6GB then it
is a pretty good guess that memory is being held somewhere else.

On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <aj...@gmail.com> wrote:
>
> Hi Apache Arrow Members,
>
> My question is below but I've compiled a minimum reproducible example with a public dataset:
>
> import pandas as pd
> from pyarrow import feather
> import os
> import psutil
>
>
> def setup():
>   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>   df.to_feather('test.csv')
>
> if __name__ == "__main__":
>   # setup()
>   process = psutil.Process(os.getpid())
>   path = 'test.csv'
>
>   mem_size = process.memory_info().rss / 1e9
>   print(f'BEFORE mem_size: {mem_size}gb')
>
>   df = feather.read_feather(path)
>
>   df_size = df.memory_usage(deep=True).sum() / 1e9
>   mem_size = process.memory_info().rss / 1e9
>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>
> I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size however, my dataset is ~3GB, and I see memory usage of close to 6GB.
>
> Output with My Data:
> BEFORE mem_size: 0.088891392gb
> AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>
> It seems strange that the overall memory usage of the process is approx double of the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
>
> $ conda list pyarrow
> #
> # Name                    Version                   Build  Channel
> pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>
> Thank You,
> Arun Joseph
>