Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/06/24 02:26:00 UTC

[jira] [Commented] (ARROW-4099) [Python] Pretty printing very large ChunkedArray objects can use unbounded memory

    [ https://issues.apache.org/jira/browse/ARROW-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870756#comment-16870756 ] 

Wes McKinney commented on ARROW-4099:
-------------------------------------

What we probably need to do is implement a global size bound on the output of {{PrettyPrint}} so that we bail out early when we hit a particular limit (e.g. around a megabyte). This would be a fairly significant refactor of {{src/arrow/pretty_print.cc}}, since many functions there write directly into {{std::ostream}} without any size bookkeeping. This isn't causing enough of a user problem to require us to fix it right now.
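
To make the idea concrete, here is a rough Python sketch of a size-bounded sink. It is not the code in {{pretty_print.cc}}, and {{BoundedWriter}}, {{OutputLimitReached}}, and {{bounded_pretty_print}} are hypothetical names made up for illustration. The point is only that every write goes through one place that keeps count, so the formatter can stop as soon as the limit is reached.

```python
import io


class OutputLimitReached(Exception):
    """Raised once a bounded sink has accepted as many characters as it allows."""


class BoundedWriter:
    """Text sink that counts what is written and stops at a fixed limit.

    Hypothetical helper for illustration only; not part of Arrow.
    """

    def __init__(self, limit=1 << 20):
        self.limit = limit
        self.written = 0
        self._buf = io.StringIO()

    def write(self, text):
        remaining = self.limit - self.written
        if len(text) > remaining:
            # Keep only what still fits, then tell the caller to stop.
            self._buf.write(text[:remaining])
            self.written = self.limit
            raise OutputLimitReached
        self._buf.write(text)
        self.written += len(text)

    def getvalue(self):
        return self._buf.getvalue()


def bounded_pretty_print(values, limit=1 << 20):
    """Format values one per line, bailing out once the limit is hit."""
    out = BoundedWriter(limit)
    try:
        out.write("[\n")
        for value in values:
            out.write(f"  {value!r},\n")
        out.write("]")
    except OutputLimitReached:
        return out.getvalue() + "\n... (output truncated)"
    return out.getvalue()
```

Because the loop stops at the first failed write, the remaining values are never formatted. A single huge value is still formatted in full before being cut, so this bounds the accumulated output rather than the cost of any one element.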

> [Python] Pretty printing very large ChunkedArray objects can use unbounded memory
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-4099
>                 URL: https://issues.apache.org/jira/browse/ARROW-4099
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.14.0
>
>
> In working on ARROW-2970, I have the following dataset:
> {code}
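> import numpy as np
> import pyarrow as pa
>
> # One 1-byte value followed by 2048 values of 1 MiB each (roughly 2 GiB in total)
> values = [b'x'] + [
>     b'x' * (1 << 20)
> ] * 2 * (1 << 10)
> arr = np.array(values)
> arrow_arr = pa.array(arr)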
> {code}
> The object {{arrow_arr}} has 129 chunks, and each element is 1 MB of binary data. The repr for this object is over 600 MB:
> {code}
> In [10]: rep = repr(arrow_arr)
> In [11]: len(rep)
> Out[11]: 637536258
> {code}
> There are probably a number of failsafes we can implement to avoid badness in these pathological cases. They may not happen often, but given the kinds of bug reports we are seeing, people do have datasets that look like this.
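
One possible failsafe of the kind mentioned in the description is to abbreviate each long value and show only a window of elements. The following is a minimal pure-Python sketch under that assumption. {{abbreviate}} and {{windowed_repr}} are hypothetical helpers, not pyarrow functions, and the defaults are arbitrary.

```python
def abbreviate(value, max_len=64):
    """Return a repr of a bytes value that shows at most max_len bytes."""
    if len(value) <= max_len:
        return repr(value)
    head = value[:max_len]
    return f"{head!r}... ({len(value)} bytes total)"


def windowed_repr(values, window=3, max_len=64):
    """Show the first and last `window` values and elide the middle."""
    n = len(values)
    if n <= 2 * window:
        shown = [abbreviate(v, max_len) for v in values]
    else:
        shown = [abbreviate(v, max_len) for v in values[:window]]
        shown.append(f"... ({n - 2 * window} more)")
        shown.extend(abbreviate(v, max_len) for v in values[-window:])
    return "[\n  " + ",\n  ".join(shown) + "\n]"


# Same shape as the dataset above, kept as a plain Python list of bytes.
values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
rep = windowed_repr(values)
```

With these defaults the output size depends on {{window}} and {{max_len}} rather than on the size of the data, which is the property the repr above lacks.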


