Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/06/07 13:56:00 UTC

[jira] [Commented] (ARROW-12983) Very large memory consumption when building a table

    [ https://issues.apache.org/jira/browse/ARROW-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358619#comment-17358619 ] 

David Li commented on ARROW-12983:
----------------------------------

I can confirm this, and I think I can provide an explanation.

When we convert an array, we append values until we get an out-of-capacity error, then finish the current chunk and start a new one, recursively. However, the retry mistakenly doesn't account for the values already converted! Hence, _if_ the values don't fit in one chunk, we get stuck in an infinite loop, converting the same values over and over and producing an ever-growing list of chunks (sketched below).

Here's the code in question: [https://github.com/apache/arrow/blob/maint-4.0.x/cpp/src/arrow/util/converter.h#L305-L318]
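
To illustrate, here's a minimal pseudo-Python sketch of the pattern (the actual code is C++, linked above; {{Chunker}}, {{CapacityError}}, and {{extend}} are made-up names for illustration, not Arrow APIs):

{code:python}
class CapacityError(Exception):
    pass

class Chunker:
    """Toy stand-in for the C++ converter; a chunk holds at most 4 values."""
    CAPACITY = 4

    def __init__(self):
        self.current = []

    def append(self, value):
        if len(self.current) >= self.CAPACITY:
            raise CapacityError
        self.current.append(value)

    def finish_chunk(self):
        chunk, self.current = self.current, []
        return chunk

def extend(chunker, values, chunks):
    for i, value in enumerate(values):
        try:
            chunker.append(value)
        except CapacityError:
            chunks.append(chunker.finish_chunk())
            # BUG (what 4.0.x effectively does): retry *all* of `values`,
            # including the ones already converted. If the input can't fit
            # in one chunk, this never terminates and `chunks` grows forever.
            return extend(chunker, values, chunks)
            # FIX: retry only the values that weren't converted yet:
            # return extend(chunker, values[i:], chunks)
    chunks.append(chunker.finish_chunk())

chunks = []
# extend(Chunker(), list(range(10)), chunks)
# With the buggy retry this appends [0, 1, 2, 3] over and over until it hits
# RecursionError; with the fix it yields [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]].
{code}

The repro below is essentially guaranteed to hit this: ~3 GB of string data (1M rows x 3,000 bytes) can't fit in a single String array, whose 32-bit offsets cap each chunk at 2 GiB, so at least one chunk rollover happens. Until this is fixed, converting the column in batches that each stay well under that limit should avoid the overflow path entirely, e.g. (untested sketch):

{code:python}
import pyarrow as pa

# Convert in 100k-row batches (~300 MB of strings each, well under 2 GiB),
# then assemble the chunks manually instead of relying on auto-rollover.
batch_size = 100_000
batches = [pa.array(_col[i:i + batch_size])
           for i in range(0, len(_col), batch_size)]
table = pa.Table.from_arrays([pa.chunked_array(batches)], names=['col'])
{code}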

> Very large memory consumption when building a table
> ---------------------------------------------------
>
>                 Key: ARROW-12983
>                 URL: https://issues.apache.org/jira/browse/ARROW-12983
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Laurent Mazare
>            Priority: Major
>
> _Apologies if this is a duplicate; I haven't found anything related._
> When creating an Arrow table via the Python API, the following code runs out of memory after using all the available resources on a box with 512 GB of RAM. This happens with pyarrow 4.0.0 and 4.0.1. However, when running the same code with pyarrow 3.0.0, memory usage only reaches about 5 GB (which seems like the right ballpark for the table size).
> The code generates a table with a single string column of 1M rows, each string 3,000 characters long.
> I'm not sure whether the issue is Python-specific; I haven't tried replicating it from the C++ API.
>  
> {code:python}
> import string
> import numpy as np
> import pyarrow as pa
> print(pa.__version__)
> np.random.seed(42)
> alphabet = list(string.ascii_uppercase)
> # Build 1M rows: 1,000 distinct 3,000-char strings, each repeated 1,000 times.
> _col = []
> for _n in range(1000):
>     k = ''.join(np.random.choice(alphabet, 3000))
>     _col += [k] * 1000
> table = pa.Table.from_pydict({'col': _col})
> {code}


