Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/13 18:32:00 UTC

[jira] [Comment Edited] (ARROW-16178) [C++] Add a ThreadLocalState concept built on thread local

    [ https://issues.apache.org/jira/browse/ARROW-16178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521868#comment-17521868 ] 

Weston Pace edited comment on ARROW-16178 at 4/13/22 6:31 PM:
--------------------------------------------------------------

Yes.  Reconciliation is not always needed, but you are correct about the scratch spaces.  For example, assume we have an int32 array {{x}} and an expression {{x < 20 && x > 10}}, and we are implementing a filter node.

This translates roughly to...

{noformat}
const auto& x = batch.get("x");
auto y = call("lt", {x, 20});              // y = x < 20
auto z = call("gt", {x, 10});              // z = x > 10
auto filter = call("and", {y, z});         // filter = y && z
auto out = call("take", {batch, filter});  // keep the rows where filter is true
{noformat}

On each call we have to heap-allocate {{y}}, {{z}}, {{filter}}, and {{out}}.  However, if our max batch size is defined (let's say 10k rows), then we can preallocate {{y}}, {{z}}, and {{filter}} at plan creation time.  In fact, since this node has only a single output, we can even preallocate {{out}}.
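
The preallocation idea can be sketched in plain C++.  This is only an illustration under stated assumptions: {{FilterScratch}} and {{FilterBatch}} are hypothetical names, and {{std::vector}} stands in for Arrow buffers.  It is not how the Arrow kernels are written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scratch space for the expression x < 20 && x > 10.
// Every buffer is sized once, for the largest batch the plan can produce.
struct FilterScratch {
  explicit FilterScratch(std::size_t max_batch_size)
      : y(max_batch_size),
        z(max_batch_size),
        filter(max_batch_size),
        out(max_batch_size) {}

  std::vector<std::uint8_t> y;       // x < 20
  std::vector<std::uint8_t> z;       // x > 10
  std::vector<std::uint8_t> filter;  // y && z
  std::vector<std::int32_t> out;     // values that passed the filter
};

// Filters one batch using only the preallocated buffers.  Returns the number
// of values written to the front of scratch.out.
std::size_t FilterBatch(const std::vector<std::int32_t>& x,
                        FilterScratch& scratch) {
  const std::size_t n = x.size();
  assert(n <= scratch.out.size());  // batch must not exceed the max batch size

  for (std::size_t i = 0; i < n; ++i) scratch.y[i] = x[i] < 20;
  for (std::size_t i = 0; i < n; ++i) scratch.z[i] = x[i] > 10;
  for (std::size_t i = 0; i < n; ++i) {
    scratch.filter[i] = scratch.y[i] && scratch.z[i];
  }

  std::size_t num_selected = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (scratch.filter[i]) scratch.out[num_selected++] = x[i];
  }
  return num_selected;
}
```

A {{FilterScratch}} would be created once at plan creation time and passed to {{FilterBatch}} for every batch.  If several threads can run the node at the same time, each thread needs its own {{FilterScratch}}, which is where the thread-local state comes in.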

At the moment, for the filter node, this discussion is largely theoretical: the kernel infrastructure isn't yet ready to receive a preallocated output buffer.  In the hash-join node, however (and possibly the aggregate node), this sort of pattern is actively being used, and there is a need for a more efficient way of accessing the thread-local data.



> [C++] Add a ThreadLocalState concept built on thread local
> ----------------------------------------------------------
>
>                 Key: ARROW-16178
>                 URL: https://issues.apache.org/jira/browse/ARROW-16178
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The ThreadLocalState is tied to an executor and, on creation, creates a state for every thread in the executor.  In order to quickly access a particular thread's state, we need a way to get a thread index (the index of the thread in the executor).  Historically we used ThreadIndexer; this JIRA introduces a new approach using thread-local storage.
> Similar to the ThreadIndexer, this thread local state concept will fail when the capacity is resized during a run.
> Similar to the ThreadIndexer, this concept won't work too well for serial execution until ARROW-15732 is resolved.
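
As a rough illustration of the description above, the sketch below keeps one state per worker and looks it up through a {{thread_local}} index.  Everything here is an assumption made for the example: {{RunOnWorkers}} is a toy stand-in for an executor, and {{ThreadLocalStateSketch}} and {{tls_thread_index}} are hypothetical names, not the Arrow classes or the code in the pull request.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Marks a thread that does not belong to the toy executor.
constexpr std::size_t kUnassigned = static_cast<std::size_t>(-1);

// Each worker thread stores its own index here when it starts.
thread_local std::size_t tls_thread_index = kUnassigned;

// Toy stand-in for an executor: starts num_workers threads, gives each one
// its index, runs the task on every worker, and waits for all of them.
void RunOnWorkers(std::size_t num_workers, const std::function<void()>& task) {
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_workers; ++i) {
    workers.emplace_back([i, &task] {
      tls_thread_index = i;
      task();
    });
  }
  for (auto& worker : workers) worker.join();
}

// Holds one State per worker.  The capacity is fixed at construction, so the
// lookup breaks if the number of workers grows afterwards.
template <typename State>
class ThreadLocalStateSketch {
 public:
  explicit ThreadLocalStateSketch(std::size_t capacity) : states_(capacity) {}

  // Returns the calling worker's state: one thread_local read plus an index.
  State& Get() {
    assert(tls_thread_index < states_.size());
    return states_[tls_thread_index];
  }

  // Exposes every state so results can be reconciled after the run.
  std::vector<State>& states() { return states_; }

 private:
  std::vector<State> states_;
};
```

For example, a counter per worker could be created with {{ThreadLocalStateSketch<long> counters(4)}}, incremented inside {{RunOnWorkers(4, ...)}} via {{counters.Get()}}, and summed over {{counters.states()}} once all workers have finished.  A thread that was not started by the toy executor has no index, which mirrors the serial-execution limitation noted in the description.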


