Posted to issues@kudu.apache.org by "Dan Burkert (JIRA)" <ji...@apache.org> on 2018/01/25 19:21:00 UTC
[jira] [Commented] (KUDU-2243) CFile Reader improvements

    [ https://issues.apache.org/jira/browse/KUDU-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339682#comment-16339682 ] 

Dan Burkert commented on KUDU-2243:
-----------------------------------

We discussed the concept renaming on Slack and came to the following conclusions:

CFile should be renamed to Chunk:
 * CFile has connotations that it's a file, but in reality a CFile is 1:1 with the fs::BlockManager Block.
 * CFile maps very closely to Parquet's Column Chunk Abstraction ('Column chunk: A chunk of the data for a particular column')
 * We'd therefore have column chunks, ad-hoc index chunks, bloom chunks, and delta chunks

cfile block/cblock should be renamed to page:
 * As the unit of encoding and compression, and the smallest indivisible on-disk container, it maps very well to the classical database concept of a page.
 * It maps well to Parquet's concept of a page ('Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which is interleaved in a column chunk.')

The current fs block manager block abstraction will remain, to which the 'block' term will unambiguously refer.

> CFile Reader improvements
> -------------------------
>
>                 Key: KUDU-2243
>                 URL: https://issues.apache.org/jira/browse/KUDU-2243
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile
>    Affects Versions: 1.6.0
>            Reporter: Dan Burkert
>            Priority: Major
>
> I've done a pretty thorough review of all the CFile reader code over the last few days in order to make a targeted bug fix, and I've got some ideas for how we can simplify it.  I'd like to get others' thoughts.
> * To reduce confusion between CFile data blocks and FS manager blocks, I think we should change all references in code and docs of CFile data blocks to 'cblock'.
> * Much of the complexity of the CFileIterator is due to its complex public API, which requires separate {{Seek(idx) -> Prepare(nrows) -> Scan(output buf, predicates)}} calls.  Additionally, the Prepare step can materialize many blocks, which then need to be put in a queue. I think all of this could be simplified by changing the API to be {{Seek(idx) -> Scan(nrows, output buf, predicates)}}, and having the CFile iterator cache only the most-recently-materialized block (instead of the queue). For really big scan batches, this will change the internal scan/materialize pattern from materializing all cblocks up front and then copying, to interleaving the materializing and copying of cblocks.  Since cblocks are usually much bigger (256 KiB) than scan batches (100 cells), I think it won't actually lead to measurably different behavior.
> * {{QueueCurrentDataBlock}} and {{ReadCurrentDataBlock}} should drop {{Current}}.
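
The proposed {{Seek(idx) -> Scan(nrows, output buf, predicates)}} shape could look roughly like the sketch below. This is not Kudu's actual code: SketchChunk, SketchIterator, Materialize, and materialize_count are hypothetical names invented for illustration, pages are plain in-memory vectors standing in for encoded cblocks, and predicates are left out entirely. It only shows the idea of caching the most-recently-materialized cblock and interleaving materialization with copying.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column chunk: an ordered list of pages
// (cblocks), each holding a run of consecutive cells.
struct SketchChunk {
  std::vector<std::vector<int32_t>> pages;
};

// Hypothetical iterator with the two-step API: Seek(idx) -> Scan(nrows, out).
class SketchIterator {
 public:
  explicit SketchIterator(const SketchChunk* chunk) : chunk_(chunk) {
    assert(chunk != nullptr);
  }

  // Positions the iterator at row ordinal row_idx.
  // Returns false (leaving the position unchanged) if row_idx is out of range.
  bool Seek(size_t row_idx) {
    size_t first_row = 0;
    for (size_t p = 0; p < chunk_->pages.size(); ++p) {
      const size_t n = chunk_->pages[p].size();
      if (row_idx < first_row + n) {
        page_idx_ = p;
        offset_in_page_ = row_idx - first_row;
        return true;
      }
      first_row += n;
    }
    return false;
  }

  // Appends up to nrows cells to *out and returns the number appended.
  // Pages are materialized one at a time, interleaved with the copying,
  // so there is no queue of prepared pages.
  size_t Scan(size_t nrows, std::vector<int32_t>* out) {
    size_t copied = 0;
    while (copied < nrows && page_idx_ < chunk_->pages.size()) {
      const std::vector<int32_t>& page = Materialize(page_idx_);
      const size_t n = std::min(nrows - copied, page.size() - offset_in_page_);
      const auto first = page.begin() + static_cast<std::ptrdiff_t>(offset_in_page_);
      out->insert(out->end(), first, first + static_cast<std::ptrdiff_t>(n));
      copied += n;
      offset_in_page_ += n;
      if (offset_in_page_ == page.size()) {  // page exhausted, move to the next
        ++page_idx_;
        offset_in_page_ = 0;
      }
    }
    return copied;
  }

  // Number of times a page had to be materialized (cache misses).
  size_t materialize_count() const { return materialize_count_; }

 private:
  // Returns the materialized page p, reusing the single cached page if possible.
  // The vector copy stands in for decoding and decompressing a cblock.
  const std::vector<int32_t>& Materialize(size_t p) {
    if (!cache_valid_ || cached_page_idx_ != p) {
      cached_page_ = chunk_->pages[p];
      cached_page_idx_ = p;
      cache_valid_ = true;
      ++materialize_count_;
    }
    return cached_page_;
  }

  const SketchChunk* chunk_;
  size_t page_idx_ = 0;
  size_t offset_in_page_ = 0;
  std::vector<int32_t> cached_page_;
  size_t cached_page_idx_ = 0;
  bool cache_valid_ = false;
  size_t materialize_count_ = 0;
};
```

With small scan batches and large pages, consecutive Scan calls keep hitting the same cached page, so a single cached cblock should be enough in the common case; only a batch that spans several pages pays for more than one materialization.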


