Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/06/07 14:21:00 UTC

[jira] [Commented] (ARROW-12983) [C++] Converter::Extend gets stuck in infinite loop causing OOM if values don't fit in single chunk

    [ https://issues.apache.org/jira/browse/ARROW-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358629#comment-17358629 ] 

Weston Pace commented on ARROW-12983:
-------------------------------------

I think it's slightly different.  The converter's Extend method might better be named AddRange.  It looks like...
 # Reserve space for all values
 # Go through values and append them
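The two steps above, as a toy Python sketch (the class, the fixed capacity, and the method names are all made up for illustration, not Arrow's actual C++ converter API):

{code:python}
class CapacityError(Exception):
    pass


class Converter:
    """A child converter that holds at most `capacity` values per chunk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = []

    def reserve(self, n):
        # Step 1 of extend: make sure n more values fit in the current chunk.
        if len(self.values) + n > self.capacity:
            raise CapacityError()

    def append(self, value):
        self.reserve(1)
        self.values.append(value)

    def extend(self, values):
        # "AddRange": reserve space for *all* the values, then append them.
        self.reserve(len(values))
        for v in values:
            self.append(v)

    def reset(self):
        # Hand back the accumulated values and start a fresh chunk.
        chunk, self.values = self.values, []
        return chunk
{code}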

In the past there was a Python converter and a chunker (which was not a converter).  With [https://github.com/apache/arrow/commit/245564c9e01d5587772c54a6ade11cbdcb1893db] the chunker itself became a converter.

The chunker's append logic is
 # Try and append to child
 # If child hit capacity error then create a chunk with the current values, reset child, continue
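
Continuing the same toy sketch (names still made up), that append loop is roughly:

{code:python}
class Chunker:
    def __init__(self, child):
        self.child = child      # a Converter from the sketch above
        self.chunks = []

    def append(self, value):
        while True:
            try:
                self.child.append(value)
                return
            except CapacityError:
                # Child hit capacity: cut a chunk from its current values,
                # reset it, and retry the same value against the fresh chunk.
                self.chunks.append(self.child.reset())
{code}

Every CapacityError here produces a non-empty chunk, and the retried append lands in a freshly reset child, so this loop always makes progress.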

 

This append logic works.  The chunker's extend (add range) logic is...
 # Try and extend child
 # If child hit capacity error then create a chunk with current values, reset child, continue

This extend logic doesn't work.  The child fails at the first step of extend (reserving space for all the values) because the whole range can never fit in a single chunk, so it never actually adds any items.  The chunk that gets created is therefore empty, the child is reset, and the exact same extend call is retried, which fails in exactly the same way.  Repeat until there are enough empty chunks to run out of memory.
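
In terms of the toy sketch, the extend path would look roughly like this, which shows why it can never make progress once the values don't fit in a single chunk:

{code:python}
def chunker_extend(chunker, values):
    # Same shape as the append loop, but delegating to the child's
    # extend (AddRange).
    while True:
        try:
            chunker.child.extend(values)
            return
        except CapacityError:
            # extend() failed at reserve(), before appending anything,
            # so the chunk cut here is *empty* -- and the identical call
            # is simply retried with the same values.
            chunker.chunks.append(chunker.child.reset())

# e.g. chunker_extend(Chunker(Converter(capacity=10)), list(range(100)))
# never returns; it just keeps accumulating empty chunks.
{code}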

> [C++] Converter::Extend gets stuck in infinite loop causing OOM if values don't fit in single chunk
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12983
>                 URL: https://issues.apache.org/jira/browse/ARROW-12983
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Laurent Mazare
>            Assignee: David Li
>            Priority: Major
>
> _Apologies if this is a duplicate, I haven't found anything related_
> When creating an arrow table via the python api, the following code runs out of memory after using all the available resources on a box with 512GB of ram. This happens with pyarrow 4.0.0 and 4.0.1. However when running the same code with pyarrow 3.0.0, the memory usage only reaches 5GB (which seems like the appropriate ballpark for the table size).
>  The code generates a table with a single string column with 1m rows, each string being 3000 characters long.
> Not sure whether the issue is python related or not, I haven't tried replicating it from the C++ api.
>  
> {code:python}
> import os, string
> import numpy as np
> import pyarrow as pa
> print(pa.__version__)
> np.random.seed(42)
> alphabet = list(string.ascii_uppercase)
> _col = []
> for _n in range(1000):
>   k = ''.join(np.random.choice(alphabet, 3000))
>   _col += [k] * 1000
> table = pa.Table.from_pydict({'col': _col})
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)