Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/03/25 18:26:00 UTC
[jira] [Commented] (ARROW-16028) Memory leak in `fragment.to_table`
[ https://issues.apache.org/jira/browse/ARROW-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512531#comment-17512531 ]
Will Jones commented on ARROW-16028:
------------------------------------
Are you sure the data you are trying to load isn't just too big for the memory on your machine?
{quote}What is really weird is if we put a debug point in the loop and *load* just {*}one fragment{*}.
{quote}
FYI in the API you are using, {{dataset.fragments}} [returns the materialized list of fragments|https://github.com/apache/arrow/blob/5a5f4ce326194750422ef6f053469ed1912ce69f/python/pyarrow/parquet.py#L1806-L1808], not an iterator, so you are actually loading all the fragments in that call, not just one. Instead, you should try using the newer datasets API and the associated {{dataset.get_fragments()}} method, which does return an iterator:
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("path in bucket", filesystem=fs)

# note: get_fragments() takes a ds.Expression via the `filter` keyword
for fragment in dataset.get_fragments(filter=some_filters):
    # do something with fragment
{code}
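If the goal is to keep peak memory bounded, you could also stream record batches instead of materializing whole tables. A minimal runnable sketch (the tiny local dataset written here is just a stand-in for your S3 path and filesystem, which I obviously don't have access to):

{code:python}
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Write a small throwaway dataset so this sketch runs end to end;
# in your case the source would be the bucket path plus `filesystem=fs`.
tmpdir = tempfile.mkdtemp()
pq.write_table(pa.table({"a": range(1000), "b": range(1000)}),
               f"{tmpdir}/part-0.parquet")

dataset = ds.dataset(tmpdir, format="parquet")

# to_batches() yields RecordBatches one at a time, so only one batch
# needs to be resident in memory at any point.
total_rows = 0
for batch in dataset.to_batches(columns=["a"]):
    total_rows += batch.num_rows

print(total_rows)  # 1000
{code}

You lose the convenience of a single concatenated table, but you never hold more than one batch of each fragment at once.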
{quote}It loads, but something *keeps eating memory after load* until there is none left.
{quote}
How are you measuring memory usage? Many tools, like Activity Monitor or Task Manager, have a certain lag, so it's normal to see them register increases in memory *after* a memory-hungry operation occurs.
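For a view of what Arrow itself has allocated, rather than what the OS reports for the whole process, you can check {{pyarrow.total_allocated_bytes()}} before and after the operation. A small sketch (exact byte counts will vary with the memory-pool backend):

{code:python}
import pyarrow as pa

before = pa.total_allocated_bytes()
arr = pa.array(range(1_000_000))   # allocated through Arrow's default pool
during = pa.total_allocated_bytes()
del arr                            # refcount hits zero; the buffer is freed
after = pa.total_allocated_bytes()

# `during` is higher than `before`; after the del, the count drops back,
# even though process RSS shown by the OS may not shrink immediately.
{code}

If that counter keeps climbing after your reads complete, that points at Arrow holding the memory; if it drops but the OS number stays high, you're likely just seeing allocator/OS caching.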
> Memory leak in `fragment.to_table`
> ----------------------------------
>
> Key: ARROW-16028
> URL: https://issues.apache.org/jira/browse/ARROW-16028
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1
> Reporter: ondrej metelka
> Priority: Major
>
> This "pseudo" code ends with OOM.
>
> {code:python}
> import fsspec
> import pyarrow
> import pyarrow.parquet as pq
>
> fs = fsspec.filesystem(
>     "s3",
>     default_cache_type="none",
>     default_fill_cache=False,
>     **our_storage_options,
> )
>
> dataset = pq.ParquetDataset(
>     "path in bucket",
>     filesystem=fs,
>     filters=some_filters,
>     use_legacy_dataset=False,
> )
>
> # this ends with OOM
> dataset.read(columns=columns_to_read)
>
> # and this too
> tables = []
> for fragment in dataset.fragments:
>     tables.append(fragment.to_table(columns=columns_to_read))
> all_data = pyarrow.lib.concat_tables(tables) {code}
> What is really weird is that if we put a debug point in the loop and *load* just {*}one fragment{*}, it loads, but something *keeps eating memory after load* until there is none left.
> We are trying to read a parquet table that has several files under desired partitions. Each fragment has tens of columns and tens of millions of rows.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)