Posted to jira@arrow.apache.org by "Laurent Mazare (Jira)" <ji...@apache.org> on 2021/06/05 12:38:00 UTC

[jira] [Created] (ARROW-12983) Very large memory consumption when building a table

Laurent Mazare created ARROW-12983:
--------------------------------------

             Summary: Very large memory consumption when building a table
                 Key: ARROW-12983
                 URL: https://issues.apache.org/jira/browse/ARROW-12983
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 4.0.1, 4.0.0
            Reporter: Laurent Mazare


_Apologies if this is a duplicate; I haven't found anything related._

When creating an Arrow table via the Python API, the following code runs out of memory after using all the available resources on a box with 512GB of RAM. This happens with pyarrow 4.0.0 and 4.0.1. However, when running the same code with pyarrow 3.0.0, the memory usage only reaches 5GB (which seems like the appropriate ballpark for the table size).
The code generates a table with a single string column with 1 million rows, each string being 3000 characters long.
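
As a rough sanity check on the "appropriate ballpark" above, here is a back-of-the-envelope estimate of the column size. This is only a sketch: the helper name is hypothetical, and it assumes one byte per character (the strings are ASCII) plus one 32-bit offset per row as in Arrow's string layout, ignoring the validity bitmap and any chunking.

```python
def estimate_string_column_bytes(rows, chars_per_row, offset_bytes=4):
    """Rough size of an Arrow string column holding ASCII data.

    Hypothetical helper for illustration only. Assumes one byte per
    character and (rows + 1) offsets of `offset_bytes` each; ignores
    the validity bitmap and any splitting into chunks.
    """
    data_bytes = rows * chars_per_row
    offsets_bytes = (rows + 1) * offset_bytes
    return data_bytes + offsets_bytes


# 1 million rows of 3000 characters each, as in the report.
estimate = estimate_string_column_bytes(1_000_000, 3000)
print(f"{estimate / 1e9:.2f} GB")
```

Under these assumptions the data itself comes to about 3GB, so the 5GB seen with pyarrow 3.0.0 is within a small factor of the raw table size.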

I'm not sure whether the issue is Python-related; I haven't tried replicating it from the C++ API.

 
{code:python}
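import string
import numpy as np
import pyarrow as pa

print(pa.__version__)
np.random.seed(42)

alphabet = list(string.ascii_uppercase)

# Build 1000 distinct random strings of 3000 characters,
# each repeated 1000 times, for 1 million rows in total.
_col = []
for _n in range(1000):
  k = ''.join(np.random.choice(alphabet, 3000))
  _col += [k] * 1000

# On pyarrow 4.0.0 and 4.0.1 this call exhausts memory; on 3.0.0 it peaks around 5GB.
table = pa.Table.from_pydict({'col': _col})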
{code}
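
To compare versions without exhausting a machine, the same construction could be run at a smaller scale while reading Arrow's memory pool counters. The following is a sketch rather than a verified diagnosis: `build_column` and `measure_build` are hypothetical helpers, the strings come from the standard `random` module instead of numpy (so they differ from the ones above), and the pool's peak counter may be approximate or unavailable depending on the allocator.

```python
import random
import string


def build_column(distinct=1000, repeats=1000, length=3000, seed=42):
    """Return distinct * repeats strings, each `length` characters long.

    Mirrors the shape of the column in the report, but uses the stdlib
    random module, so the actual strings are not the same.
    """
    rng = random.Random(seed)
    column = []
    for _ in range(distinct):
        value = "".join(rng.choices(string.ascii_uppercase, k=length))
        column.extend([value] * repeats)
    return column


def measure_build(column):
    """Build a one-column table and report Arrow memory pool statistics."""
    # Imported lazily so build_column() can be used without pyarrow installed.
    import pyarrow as pa

    pool = pa.default_memory_pool()
    before = pool.bytes_allocated()
    table = pa.Table.from_pydict({"col": column})
    return {
        "pyarrow_version": pa.__version__,
        # Size of the table's buffers as reported by Arrow.
        "table_nbytes": table.nbytes,
        # Bytes still held by the pool once the table is built.
        "pool_bytes_held": pool.bytes_allocated() - before,
        # Peak pool allocation so far in this process (may be approximate).
        "pool_peak_bytes": pool.max_memory(),
    }
```

Calling `measure_build(build_column(distinct=100, repeats=100, length=3000))` under pyarrow 3.0.0 and again under 4.0.1 would show whether the peak grows out of proportion to `table_nbytes`; the sizes can then be increased gradually towards the full 1 million rows.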



--
This message was sent by Atlassian Jira
(v8.3.4#803005)