Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/03/25 18:26:00 UTC
[jira] [Commented] (ARROW-16028) Memory leak in `fragment.to_table`
[ https://issues.apache.org/jira/browse/ARROW-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512531#comment-17512531 ]
Will Jones commented on ARROW-16028:
------------------------------------
Are you sure the data you are trying to load isn't just too big for the memory on your machine?
{quote}What is really weird is if we put a debug point in the loop and *load* just {*}one fragment{*}.
{quote}
FYI in the API you are using, {{dataset.fragments}} [returns the materialized list of fragments|https://github.com/apache/arrow/blob/5a5f4ce326194750422ef6f053469ed1912ce69f/python/pyarrow/parquet.py#L1806-L1808], not an iterator, so you are actually loading all the fragments in that call, not just one. Instead, you should try using the newer datasets API and the associated {{dataset.get_fragments()}} method, which does return an iterator:
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("path in bucket", filesystem=fs)

# note: get_fragments() takes a ds.Expression via the `filter` keyword
for fragment in dataset.get_fragments(filter=some_filters):
    # do something with fragment
{code}
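If the goal is to keep peak memory bounded, you could also stream record batches instead of materializing whole tables. A minimal runnable sketch (the tiny local dataset written here is just a stand-in for your S3 path and filesystem, which I obviously don't have access to):

{code:python}
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Write a small throwaway dataset so this sketch runs end to end;
# in your case the source would be the bucket path plus `filesystem=fs`.
tmpdir = tempfile.mkdtemp()
pq.write_table(pa.table({"a": range(1000), "b": range(1000)}),
               f"{tmpdir}/part-0.parquet")

dataset = ds.dataset(tmpdir, format="parquet")

# to_batches() yields RecordBatches one at a time, so only one batch
# needs to be resident in memory at any point.
total_rows = 0
for batch in dataset.to_batches(columns=["a"]):
    total_rows += batch.num_rows

print(total_rows)  # 1000
{code}

You lose the convenience of a single concatenated table, but you never hold more than one batch of each fragment at once.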
{quote}It loads, but something *keeps eating memory after load* until there is none left.
{quote}
How are you measuring memory usage? Many tools, like Activity Monitor or Task Manager, have a certain lag, so it's normal to see them register increases in memory *after* a memory-hungry operation occurs.
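For a view of what Arrow itself has allocated, rather than what the OS reports for the whole process, you can check {{pyarrow.total_allocated_bytes()}} before and after the operation. A small sketch (exact byte counts will vary with the memory-pool backend):

{code:python}
import pyarrow as pa

before = pa.total_allocated_bytes()
arr = pa.array(range(1_000_000))   # allocated through Arrow's default pool
during = pa.total_allocated_bytes()
del arr                            # refcount hits zero; the buffer is freed
after = pa.total_allocated_bytes()

# `during` is higher than `before`; after the del, the count drops back,
# even though process RSS shown by the OS may not shrink immediately.
{code}

If that counter keeps climbing after your reads complete, that points at Arrow holding the memory; if it drops but the OS number stays high, you're likely just seeing allocator/OS caching.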
> Memory leak in `fragment.to_table`
> ----------------------------------
>
> Key: ARROW-16028
> URL: https://issues.apache.org/jira/browse/ARROW-16028
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 6.0.1
> Reporter: ondrej metelka
> Priority: Major
>
> This "pseudo" code ends with OOM.
>
> {code:python}
> import fsspec
> import pyarrow
> import pyarrow.parquet as pq
>
> fs = fsspec.filesystem(
>     "s3",
>     default_cache_type="none",
>     default_fill_cache=False,
>     **our_storage_options,
> )
>
> dataset = pq.ParquetDataset(
>     "path in bucket",
>     filesystem=fs,
>     filters=some_filters,
>     use_legacy_dataset=False,
> )
>
> # this ends with OOM
> dataset.read(columns=columns_to_read)
>
> # and this too
> tables = []
> for fragment in dataset.fragments:
>     tables.append(fragment.to_table(columns=columns_to_read))
> all_data = pyarrow.lib.concat_tables(tables) {code}
> What is really weird is that if we put a debug point in the loop and *load* just {*}one fragment{*}, it loads, but something *keeps eating memory after load* until there is none left.
> We are trying to read a parquet table that has several files under desired partitions. Each fragment has tens of columns and tens of millions of rows.
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)