Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/04/11 04:12:00 UTC

[jira] [Updated] (ARROW-15583) [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors

     [ https://issues.apache.org/jira/browse/ARROW-15583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-15583:
-----------------------------------
    Labels: pull-request-available substrait  (was: substrait)

> [C++] The Substrait consumer could potentially use a massive amount of RAM if the producer uses large anchors
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15583
>                 URL: https://issues.apache.org/jira/browse/ARROW-15583
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Sanjiban Sengupta
>            Priority: Major
>              Labels: pull-request-available, substrait
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In Substrait, a function is referred to by a "fully qualified name", which consists of a URI and a function name.  For example, the URI for the "add" function is something like {{https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml}}.  To avoid serializing these long names multiple times in the plan, the producer should pick an anchor value (an int32 in protobuf) for each name and use that everywhere, with a single lookup table at the top level of the plan mapping each anchor back to its name.
> To avoid map lookups, the Arrow C++ consumer currently assumes that this lookup table will be small enough to be stored in a vector indexed by anchor:
> {noformat}
> {
>   "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#add",
>   "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml#subtract"
> }
> {noformat}
> However, this assumes that a plan is going to use numbers like 0, 1, 2, ... N to create its anchors.  Nothing prevents a producer from using whatever numbers it wants (e.g. a pointer value).  If the producer uses a very large anchor value, the C++ Substrait consumer will create a lookup table that consists mostly of blank entries, which could waste a lot of memory.
> We could either request that the Substrait spec enforce small anchors, or change the extension set handling in the C++ consumer to use an unordered_map.
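
The trade-off described above can be sketched roughly as follows. This is a minimal illustration, not the actual Arrow extension set code: the class names DenseAnchorTable and SparseAnchorTable and their methods are hypothetical, and anchors are modelled here as unsigned 32-bit integers, which is an assumption. The dense table grows to the largest anchor it has seen, while the map-based table only stores the entries that exist.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical dense table: the anchor is used directly as a vector index,
// so the vector must grow to (largest anchor + 1) slots.
class DenseAnchorTable {
 public:
  void Set(std::uint32_t anchor, std::string name) {
    if (anchor >= slots_.size()) {
      // Every slot below `anchor` is allocated, even if it stays blank.
      slots_.resize(static_cast<std::size_t>(anchor) + 1);
    }
    slots_[anchor] = std::move(name);
  }

  // Number of slots allocated, blank ones included.
  std::size_t slot_count() const { return slots_.size(); }

 private:
  std::vector<std::string> slots_;
};

// Hypothetical sparse table: memory is proportional to the number of
// anchors actually registered, regardless of how large their values are.
class SparseAnchorTable {
 public:
  void Set(std::uint32_t anchor, std::string name) {
    names_[anchor] = std::move(name);
  }

  // Returns nullptr when the anchor was never registered.
  const std::string* Get(std::uint32_t anchor) const {
    auto it = names_.find(anchor);
    return it == names_.end() ? nullptr : &it->second;
  }

  // Number of entries stored.
  std::size_t entry_count() const { return names_.size(); }

 private:
  std::unordered_map<std::uint32_t, std::string> names_;
};
```

With a single anchor of 1000000, the dense table allocates 1000001 slots, whereas the sparse table holds one entry. The cost of the map is a hash lookup per access instead of a direct index.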



--
This message was sent by Atlassian Jira
(v8.20.1#820001)