Posted to issues@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2020/02/03 18:29:01 UTC
[jira] [Comment Edited] (ARROW-4333) [C++] Sketch out design for kernels and "query" execution in compute layer

    [ https://issues.apache.org/jira/browse/ARROW-4333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767264#comment-16767264 ] 

Francois Saint-Jacques edited comment on ARROW-4333 at 2/3/20 6:28 PM:
-----------------------------------------------------------------------

References I suggest you check:
 - [The Design and Implementation of Modern Column-Oriented Database Systems, 2012|http://db.csail.mit.edu/pubs/abadi-column-stores.pdf]: This is a long read, but worth the investment. It gives a broad overview of what columnar databases are and what makes them fast.
 - [MonetDB/X100: Hyper-Pipelining Query Execution, 2005|https://pdfs.semanticscholar.org/2e84/4872e32a4a4e94e229a9a9e70ac47d710252.pdf]: The foundational paper on how to implement a fast query engine for analytics on modern hardware.
 - [Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask, 2018|http://www.vldb.org/pvldb/vol11/p2209-kersten.pdf]: This is an update to the paper you linked, which studies Compilation vs Vectorization.
 - [Vectorization vs. Compilation in Query Execution, 2011|https://15721.courses.cs.cmu.edu/spring2019/papers/21-vectorization2/p5-sompolski.pdf]
 - [Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last, 2017|https://pdfs.semanticscholar.org/1a21/509b67d3ed06cd7062f2f9b7e5b0b32a32e6.pdf]
 - [Make the Most out of Your SIMD Investments: Counter Control Flow Divergence in Compiled Query Pipelines, 2018|http://db.in.tum.de/~lang/papers/simd_divergence.pdf], talks about using AVX-512 masked instructions.

Anything by:
 - [Daniel J. Abadi|https://www.semanticscholar.org/author/Daniel-J.-Abadi/2254232]: part of the team that wrote C-Store, which became Vertica. He has written less about columnar execution in recent years.
 - [Peter Boncz|https://www.semanticscholar.org/author/Peter-A.-Boncz/1687211]: behind MonetDB and Vectorwise.
 - [Thomas Neumann|https://www.semanticscholar.org/author/Thomas-Neumann/1706846]: behind [HyPer|https://hyper-db.de/], which was bought by Tableau.
 - [Andrew Pavlo|https://www.semanticscholar.org/author/Andrew-Pavlo/1774210]: teaches database courses at CMU.

Amazing video lectures from courses at CMU. You can ignore most of the storage layer, concurrency, and transaction topics; we're interested in the execution engine, vectorization, and compilation.
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQKwIAV3es0U6t]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjbjOyrcqgE6_lCV6xvzffSN]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY2xvwxuKjZT5qFH0sQga8_]
 * [https://www.youtube.com/playlist?list=PLSE8ODhjZXjY0GMWN4X8FIkYNfiu8_Wl9]

The [CMU 15-721 Advanced Database Systems schedule|https://15721.courses.cs.cmu.edu/spring2019/schedule.html] is usually a good source of papers.

Relevant code bases:
 - [Impala|https://github.com/apache/impala/tree/master/be/src]
 - [ClickHouse|https://github.com/yandex/ClickHouse/tree/master/dbms/src]  
 - [MapD|https://github.com/omnisci/mapd-core/tree/master]
 - [Supersonic|https://github.com/google/supersonic]
 - [Peloton|https://github.com/cmu-db/peloton]


> [C++] Sketch out design for kernels and "query" execution in compute layer
> --------------------------------------------------------------------------
>
>                 Key: ARROW-4333
>                 URL: https://issues.apache.org/jira/browse/ARROW-4333
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Micah Kornfield
>            Priority: Major
>              Labels: analytics
>
> It would be good to formalize the design of kernels and the controlling query execution layer (e.g. volcano batch model?) to understand the following:
> Contracts for kernels:
>  * Thread safety of kernels?
>  * When should kernels allocate memory vs. expect preallocated memory?  How to communicate requirements for a kernel's memory allocation?
>  * How to communicate whether a kernel's execution is parallelizable across a ChunkedArray?  How to determine if the order of execution across a ChunkedArray is important?
>  * How to communicate when it is safe to re-use the same buffers as input and output to the same kernel?
> What does the threading model look like for the higher level of control?  Where should synchronization happen?
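
One way to make the contract questions in the issue description concrete is a small C++ sketch in which each kernel declares its requirements up front and the controlling layer reads them before scheduling. This is only an illustration under stated assumptions, not Arrow's actual API or a proposed design: every name here (KernelContract, Kernel, Schedule, ChooseSchedule, RunSerial, MakeAddOne, MakeCumulativeSum) is hypothetical, and a plain std::vector of integers stands in for one chunk of a ChunkedArray.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for one chunk of a ChunkedArray (assumption for this sketch).
using Chunk = std::vector<std::int64_t>;

// Hypothetical contract a kernel advertises to the execution layer.
struct KernelContract {
  bool thread_safe = false;                // may exec run concurrently on different chunks?
  bool order_dependent = true;             // must chunks be visited in order?
  bool needs_preallocated_output = false;  // does the caller size the output first?
  bool allows_in_place = false;            // may input and output share a buffer?
};

struct Kernel {
  KernelContract contract;
  std::function<void(const Chunk& in, Chunk* out)> exec;
};

enum class Schedule { kSerialInOrder, kParallelAnyOrder };

// The controlling layer picks a schedule purely from the declared contract.
inline Schedule ChooseSchedule(const KernelContract& c) {
  return (c.thread_safe && !c.order_dependent) ? Schedule::kParallelAnyOrder
                                               : Schedule::kSerialInOrder;
}

// Serial driver: always valid, and honors the preallocation flag.
inline std::vector<Chunk> RunSerial(const Kernel& kernel,
                                    const std::vector<Chunk>& chunks) {
  std::vector<Chunk> out(chunks.size());
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    if (kernel.contract.needs_preallocated_output) {
      out[i].resize(chunks[i].size());
    }
    kernel.exec(chunks[i], &out[i]);
  }
  return out;
}

// Stateless element-wise kernel: safe to run on chunks in any order.
inline Kernel MakeAddOne() {
  Kernel k;
  k.contract.thread_safe = true;
  k.contract.order_dependent = false;
  k.contract.needs_preallocated_output = true;
  k.contract.allows_in_place = true;
  k.exec = [](const Chunk& in, Chunk* out) {
    for (std::size_t i = 0; i < in.size(); ++i) (*out)[i] = in[i] + 1;
  };
  return k;
}

// Stateful kernel: carries a running total across chunks, so order matters.
inline Kernel MakeCumulativeSum() {
  Kernel k;  // default contract: not thread-safe, order-dependent
  k.exec = [total = std::int64_t{0}](const Chunk& in, Chunk* out) mutable {
    out->clear();  // this kernel allocates its own output
    for (std::int64_t v : in) {
      total += v;
      out->push_back(total);
    }
  };
  return k;
}
```

With these definitions, ChooseSchedule(MakeAddOne().contract) yields Schedule::kParallelAnyOrder, while the cumulative-sum kernel falls back to Schedule::kSerialInOrder because its result depends on the chunks that came before. Where synchronization should happen in a real parallel driver is left open here, as it is in the issue.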


