Posted to issues@kudu.apache.org by "Dan Burkert (JIRA)" <ji...@apache.org> on 2017/12/19 00:13:00 UTC

[jira] [Created] (KUDU-2243) CFile Reader improvements

Dan Burkert created KUDU-2243:
---------------------------------

             Summary: CFile Reader improvements
                 Key: KUDU-2243
                 URL: https://issues.apache.org/jira/browse/KUDU-2243
             Project: Kudu
          Issue Type: Improvement
          Components: cfile
    Affects Versions: 1.6.0
            Reporter: Dan Burkert


I've done a pretty thorough review of all the CFile reader code over the last few days in order to make a targeted bug fix, and I've got some ideas for how we can simplify it.  I'd like to get others' thoughts.

* To reduce confusion between CFile data blocks and FS manager blocks, I think we should rename every reference to CFile data blocks in code and docs to 'cblock'.

* Much of the complexity of the CFileIterator is due to its complex public API, which requires separate {{Seek(idx) -> Prepare(nrows) -> Scan(output buf, predicates)}} calls.  Additionally, the Prepare step can materialize many cblocks, which then need to be put in a queue. I think all of this could be simplified by changing the API to {{Seek(idx) -> Scan(nrows, output buf, predicates)}}, and having the CFile iterator cache only the most-recently-materialized cblock (instead of the queue). For really big scan batches, this will change the internal scan/materialize pattern from materializing all cblocks up front and then copying, to interleaving the materialization and copying of cblocks.  Since cblocks are usually much bigger (256 KiB) than scan batches (100 cells), most scans touch only one or two cblocks per batch either way, so I think it won't actually lead to measurably different behavior.

* {{QueueCurrentDataBlock}} and {{ReadCurrentDataBlock}} should drop {{Current}} from their names.
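
To make the {{Seek(idx) -> Scan(nrows, output buf, predicates)}} idea concrete, here is a rough sketch of the shape I have in mind. It is a toy and not the real CFileIterator: {{ToyCFileIterator}}, {{Materialize}}, and {{materialize_count}} are hypothetical names, cblocks are modeled as in-memory vectors of ints, "materializing" is just a copy standing in for read and decode, and predicates are left out entirely. The only point is to show a single cached cblock replacing the queue, with materialization and copying interleaved inside Scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of a column iterator. Every name here is hypothetical and only
// illustrates the proposed Seek -> Scan call pattern.
class ToyCFileIterator {
 public:
  explicit ToyCFileIterator(std::vector<std::vector<int>> cblocks)
      : cblocks_(std::move(cblocks)) {}

  // Positions the iterator at ordinal row `idx`.
  // Returns false (and leaves the position unchanged) if idx is out of range.
  bool Seek(size_t idx) {
    size_t first_row = 0;
    for (size_t b = 0; b < cblocks_.size(); ++b) {
      if (idx < first_row + cblocks_[b].size()) {
        block_idx_ = b;
        offset_ = idx - first_row;
        return true;
      }
      first_row += cblocks_[b].size();
    }
    return false;
  }

  // Appends up to `nrows` values to `out`, starting at the current position.
  // Cblocks are materialized one at a time as the copy reaches them, so there
  // is no separate Prepare step and no queue. Returns the number of rows copied.
  size_t Scan(size_t nrows, std::vector<int>* out) {
    size_t copied = 0;
    while (copied < nrows && block_idx_ < cblocks_.size()) {
      const std::vector<int>& block = Materialize(block_idx_);
      const size_t n = std::min(nrows - copied, block.size() - offset_);
      out->insert(out->end(), block.begin() + offset_,
                  block.begin() + offset_ + n);
      copied += n;
      offset_ += n;
      if (offset_ == block.size()) {  // Finished this cblock; move to the next.
        ++block_idx_;
        offset_ = 0;
      }
    }
    return copied;
  }

  // Number of times a cblock had to be materialized (cache misses).
  int materialize_count() const { return materialize_count_; }

 private:
  // Returns the materialized form of cblock `b`, reusing the single cached
  // cblock when it is the one requested.
  const std::vector<int>& Materialize(size_t b) {
    assert(b < cblocks_.size());
    if (!has_cached_ || cached_idx_ != b) {
      cached_ = cblocks_[b];  // Stand-in for reading and decoding the cblock.
      cached_idx_ = b;
      has_cached_ = true;
      ++materialize_count_;
    }
    return cached_;
  }

  std::vector<std::vector<int>> cblocks_;
  size_t block_idx_ = 0;  // Cblock containing the current position.
  size_t offset_ = 0;     // Row offset within that cblock.
  std::vector<int> cached_;
  size_t cached_idx_ = 0;
  bool has_cached_ = false;
  int materialize_count_ = 0;
};
```

In this sketch a caller would do something like {{it.Seek(1)}} followed by repeated {{it.Scan(100, &out)}} calls. A batch that ends in the middle of a cblock leaves that cblock cached, so the next Scan picks up where the previous one stopped without materializing it again.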



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)