Posted to jira@arrow.apache.org by "Apache Arrow JIRA Bot (Jira)" <ji...@apache.org> on 2022/12/15 17:53:00 UTC

[jira] [Commented] (ARROW-16389) [C++] Support hash-join on larger than memory datasets

    [ https://issues.apache.org/jira/browse/ARROW-16389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648178#comment-17648178 ] 

Apache Arrow JIRA Bot commented on ARROW-16389:
-----------------------------------------------

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned per [project policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment]. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [C++] Support hash-join on larger than memory datasets
> ------------------------------------------------------
>
>                 Key: ARROW-16389
>                 URL: https://issues.apache.org/jira/browse/ARROW-16389
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Sasha Krassovsky
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> The current implementation of the hash-join node queues in memory the hash table, the entire build-side input, and the entire probe-side input (i.e. the entire dataset).  This means the current implementation will run out of memory and crash if the input dataset is larger than the memory available on the system.
> By spilling to disk when memory starts to fill up, the hash-join node can process datasets larger than the available memory on the machine.
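One classic way to realize the spilling described above is a Grace-style partitioned hash join: both inputs are first partitioned by a hash of the key so that each partition can be spilled independently, and then each partition is joined on its own with only that partition's hash table resident in memory. The sketch below illustrates the idea only; the names (`Row`, `GraceHashJoin`, `kNumPartitions`) are hypothetical, in-memory vectors stand in for spill files, and this is not the actual Arrow C++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Grace-hash-join sketch (illustrative, not Arrow code): partition both
// inputs by hash(key) so that each partition's build side fits in memory.
// In a real engine each partition would be spilled to a file on disk;
// here a std::vector per partition stands in for the spill file.

struct Row {
  int key;
  std::string payload;
};

constexpr std::size_t kNumPartitions = 8;  // illustrative fan-out

static std::size_t PartitionOf(int key) {
  return std::hash<int>{}(key) % kNumPartitions;
}

std::vector<std::pair<std::string, std::string>> GraceHashJoin(
    const std::vector<Row>& build, const std::vector<Row>& probe) {
  // Phase 1: partition both sides ("spill" each partition separately).
  std::vector<std::vector<Row>> build_parts(kNumPartitions);
  std::vector<std::vector<Row>> probe_parts(kNumPartitions);
  for (const Row& r : build) build_parts[PartitionOf(r.key)].push_back(r);
  for (const Row& r : probe) probe_parts[PartitionOf(r.key)].push_back(r);

  // Phase 2: join one partition at a time, so only one partition's
  // hash table is ever resident in memory at once.
  std::vector<std::pair<std::string, std::string>> out;
  for (std::size_t p = 0; p < kNumPartitions; ++p) {
    std::unordered_multimap<int, std::string> table;
    for (const Row& r : build_parts[p]) table.emplace(r.key, r.payload);
    for (const Row& r : probe_parts[p]) {
      auto range = table.equal_range(r.key);
      for (auto it = range.first; it != range.second; ++it) {
        out.emplace_back(it->second, r.payload);  // (build, probe) match
      }
    }
  }
  return out;
}
```

Because rows with equal keys always hash to the same partition, joining partition by partition produces the same result as a single in-memory hash join, while peak memory is bounded by the largest single partition rather than the whole build side.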



--
This message was sent by Atlassian Jira
(v8.20.10#820010)