Posted to github@arrow.apache.org by "r4ntix (via GitHub)" <gi...@apache.org> on 2023/04/15 13:35:08 UTC

[GitHub] [arrow-datafusion] r4ntix commented on issue #5942: Poor reported performance of DataFusion against DuckDB and Hyper

r4ntix commented on issue #5942:
URL: https://github.com/apache/arrow-datafusion/issues/5942#issuecomment-1509831466

   > I wonder if running [parquet-layout](https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-layout.rs) against the parquet file might prove insightful.
   > 
   > DataFusion is currently limited to row group level parallelism, and there certainly are parquet writers that write very large row groups which would cause issues for this - [apache/arrow#34280](https://github.com/apache/arrow/issues/34280). Longer-term I would like to eventually get back to #2504 but that is not likely in the next couple of months.
   
   The flexibility of the Parquet format lets different writers use very different file-generation strategies: the data in a Parquet file can be spread over row groups and pages using whatever encoding and compression the writer or user wants.
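   
   As a quick alternative to the `parquet-layout` tool mentioned above (which also reports page-level detail), here is a minimal sketch of inspecting the row-group layout of a file with the Rust `parquet` crate; the file path is a placeholder:
   
   ```rust
   // Print the row-group structure of a Parquet file so we can see how a
   // particular writer laid out the data.
   use parquet::file::reader::{FileReader, SerializedFileReader};
   use std::fs::File;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Placeholder path: point this at the TPC-H file under test.
       let file = File::open("lineitem.parquet")?;
       let reader = SerializedFileReader::new(file)?;
       let metadata = reader.metadata();
   
       println!("row groups: {}", metadata.num_row_groups());
       for (i, rg) in metadata.row_groups().iter().enumerate() {
           println!(
               "row group {}: {} rows, {} bytes, {} column chunks",
               i,
               rg.num_rows(),
               rg.total_byte_size(),
               rg.num_columns()
           );
       }
       Ok(())
   }
   ```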
   
   If the physical layout of the Parquet file affects how different query engines `scan` it, should we introduce a standard TPC-H Parquet file and re-run the performance comparison?
   
   I also saw this issue discussed in this paper: https://dl.gi.de/bitstream/handle/20.500.12116/40316/B3-1.pdf?sequence=1&isAllowed=y
   
   > we look at three different Parquet writers to show how much Parquet files differ even though they store the same data. Parquet Writer Comparison:
   > | Generator          | Rows per Row Group | Pages per Row Group | File Sizes (SF1, SF10, SF100) |
   > | ------------------ | ------------------ | ------------------- | -------------------------- |
   > | Spark              | 3,000,000          | 150                 | 192 MB, 2.1 GB, 20 GB      |
   > | Spark uncompressed | 3,000,000          | 150                 | 333 MB, 3.3 GB, 33 GB      |
   > | DuckDB             | 100,352            | 1                   | 281 MB, 2.8 GB, 28 GB      |
   > | Arrow              | 67,108,864         | 15 - 1800           | 189 MB, 2.0 GB, 20 GB      |
   >
   > For each generator, we measure the number of rows and the number of pages that are stored per row group. The Spark and DuckDB Parquet writers store a fixed number of elements per page and a fixed number of pages per row group. Since Parquet does not force synchronization between the column chunks, there are writers such as Arrow that do not store the same number of elements per page. Arrow uses a fixed data page size between roughly 0.5 MB and 1 MB. For DuckDB and Spark, the page sizes vary from 0.5 MB to 6 MB. 
   >
   > Even though we only cover three different Parquet writers, we have already observed two extremes. DuckDB and Arrow do not take advantage of the hierarchical data layout: DuckDB will only use one page per row group, and Arrow stores the entire dataset in one row group for scale factor 1 and 10 since each row group stores 67 million rows.
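   
   If we do standardize the benchmark file, one way to avoid these extremes is to make the layout explicit at write time. A hedged sketch using the arrow-rs `ArrowWriter`; the 1M-row row groups, ~1 MB page limit, and the single `l_orderkey` column are illustrative assumptions, not a concrete proposal:
   
   ```rust
   // Write a Parquet file with an explicit row-group size and data-page size
   // limit, so every engine scans the same physical layout.
   use std::fs::File;
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Int64Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Illustrative single-column schema standing in for the TPC-H lineitem table.
       let schema = Arc::new(Schema::new(vec![Field::new(
           "l_orderkey",
           DataType::Int64,
           false,
       )]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int64Array::from_iter_values(0..1_000_000)) as ArrayRef],
       )?;
   
       let props = WriterProperties::builder()
           .set_compression(Compression::SNAPPY)
           .set_max_row_group_size(1_000_000)     // rows per row group (illustrative)
           .set_data_page_size_limit(1024 * 1024) // ~1 MB pages (illustrative)
           .build();
   
       let file = File::create("lineitem_standard.parquet")?;
       let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
       writer.write(&batch)?;
       writer.close()?;
       Ok(())
   }
   ```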


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org