Posted to issues@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/04/23 20:42:00 UTC

[jira] [Created] (ARROW-12523) [C++] [Dataset] Remove buffering from AsyncScanner

Weston Pace created ARROW-12523:
-----------------------------------

             Summary: [C++] [Dataset] Remove buffering from AsyncScanner
                 Key: ARROW-12523
                 URL: https://issues.apache.org/jira/browse/ARROW-12523
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


The MakeEnumeratedGenerator operator buffers one block so that it can properly mark a block as "last" (i.e. when it receives an EOF it releases the buffered block, marked as last, and then releases the EOF).
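
To illustrate the one-item lookahead described above, here is a minimal synchronous sketch. It is not Arrow's implementation (the real operator is asynchronous and future-based); Enumerated, Generator, FromVector and MakeEnumeratedSketch are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in types: a pull-based generator returns std::nullopt at EOF.
template <typename T>
struct Enumerated {
  T value;
  int index;
  bool last;
};

template <typename T>
using Generator = std::function<std::optional<T>()>;

// Test helper (hypothetical): yields the items of a vector, then EOF.
template <typename T>
Generator<T> FromVector(std::vector<T> items) {
  auto data = std::make_shared<std::vector<T>>(std::move(items));
  auto pos = std::make_shared<std::size_t>(0);
  return [data, pos]() -> std::optional<T> {
    if (*pos >= data->size()) return std::nullopt;
    return (*data)[(*pos)++];
  };
}

// Holds back one item so that, when the source reports EOF, the held item
// can be released with last == true.
template <typename T>
Generator<Enumerated<T>> MakeEnumeratedSketch(Generator<T> source) {
  struct State {
    Generator<T> source;
    std::optional<T> held;  // the single buffered item
    int next_index = 0;
    bool primed = false;
  };
  auto state = std::make_shared<State>();
  state->source = std::move(source);
  return [state]() -> std::optional<Enumerated<T>> {
    if (!state->primed) {
      state->held = state->source();  // fill the one-item buffer
      state->primed = true;
    }
    if (!state->held) return std::nullopt;  // EOF (or empty source)
    // Pull item X before releasing item X-1: this is the buffering step.
    std::optional<T> next = state->source();
    Enumerated<T> out{std::move(*state->held), state->next_index++,
                      !next.has_value()};
    state->held = std::move(next);
    return out;
  };
}
```

Note that each call pulls item X from the source before handing item X-1 downstream, so the consumer always works on a block one step behind the one most recently read.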

However, this adds complexity (this is very evident in the testing for unordered scan) and could potentially disrupt cache locality.  For example, a thread will receive batch X, parse and decode batch X, and then filter and project batch X-1.

We could push the responsibility of tagging the last batch/fragment into the readers themselves.  Alternatively, we could release an empty "last" batch which serves as a token to the later resequencer (think of it as an end-of-fragment token in addition to the end-of-scan token we already have).
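
The end-of-fragment token idea could look roughly like the sketch below. It is only an illustration under assumptions: TaggedBatch, EmitFragment and FragmentTracker are hypothetical names, and the choice to store the data-batch count in the token's index is an assumption of this example, not something the issue specifies.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical batch type: the token is an empty batch with end_of_fragment set.
struct TaggedBatch {
  int fragment;
  int index;              // position within the fragment
  std::vector<int> rows;  // stand-in for real batch data
  bool end_of_fragment;
};

// Emits every data batch immediately (no lookahead buffer), then an empty
// token whose index equals the number of data batches in the fragment.
std::vector<TaggedBatch> EmitFragment(
    int fragment, const std::vector<std::vector<int>>& batches) {
  std::vector<TaggedBatch> out;
  for (std::size_t i = 0; i < batches.size(); ++i) {
    out.push_back(TaggedBatch{fragment, static_cast<int>(i), batches[i], false});
  }
  out.push_back(
      TaggedBatch{fragment, static_cast<int>(batches.size()), {}, true});
  return out;
}

// Resequencer-side bookkeeping for one fragment: it is complete once the
// token and all data batches have arrived, in any order.
class FragmentTracker {
 public:
  // Returns true when the fragment is known to be complete.
  bool Observe(const TaggedBatch& batch) {
    if (batch.end_of_fragment) {
      expected_ = batch.index;
    } else {
      ++seen_;
    }
    return expected_.has_value() && seen_ == *expected_;
  }

 private:
  std::optional<int> expected_;  // data-batch count, known once the token arrives
  int seen_ = 0;
};
```

With this shape no data batch has to be held back upstream; the cost is one extra empty batch per fragment and a small amount of state in the resequencer.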



--
This message was sent by Atlassian Jira
(v8.3.4#803005)