
Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/06/30 18:58:00 UTC

[jira] [Commented] (ARROW-16944) [C++] Create macro-benchmarks of file format readers

    [ https://issues.apache.org/jira/browse/ARROW-16944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561153#comment-17561153 ] 

David Li commented on ARROW-16944:
----------------------------------

It would be a good idea to gather some "real world" datasets to use. NYC Taxi is an obvious one; ARROW-9612 and the associated discussion suggest Wikipedia and US election data as well.

> [C++] Create macro-benchmarks of file format readers
> ----------------------------------------------------
>
>                 Key: ARROW-16944
>                 URL: https://issues.apache.org/jira/browse/ARROW-16944
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: David Li
>            Priority: Major
>
> Currently we have (some) microbenchmarks, but measuring performance of our various readers (CSV, JSON, IPC, Parquet, ORC) over "real world" files would also be interesting and hopefully more illustrative of the use cases we actually care about. Such benchmarks may be expensive, though.
> Ideally, we would do this in a variety of scenarios: in-memory (to focus on CPU optimization), on-disk (though such measurements would likely be extremely noisy?), and over the network (perhaps with something like Minio + Toxiproxy to try to have a consistent, reproducible setup) so that we can also judge the I/O characteristics of the readers.
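
As a rough illustration of the in-memory scenario described above, here is a minimal sketch of a timing harness that runs a reader over a buffer already held in memory and reports the median wall-clock time and throughput. Everything in it is an assumption made for illustration: BenchResult, BenchmarkInMemory, and CountRows are hypothetical names, not Arrow APIs, and CountRows is only a stand-in for a real reader. An actual macro-benchmark would presumably wrap one of the Arrow readers (CSV, JSON, IPC, Parquet, ORC) in the callable instead.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: median wall-clock time and derived throughput.
struct BenchResult {
  double median_seconds;
  double bytes_per_second;
};

// Runs `read_fn` over an in-memory buffer `repetitions` times and reports
// the median duration (the upper middle element for even counts).
// Keeping the data in memory removes disk and network noise, so the
// measurement mostly reflects CPU cost.
inline BenchResult BenchmarkInMemory(
    const std::string& buffer,
    const std::function<std::size_t(const std::string&)>& read_fn,
    int repetitions) {
  if (repetitions <= 0) {
    throw std::invalid_argument("repetitions must be positive");
  }
  std::vector<double> seconds;
  seconds.reserve(static_cast<std::size_t>(repetitions));
  // Accumulate results in a volatile so the compiler cannot discard the calls.
  volatile std::size_t sink = 0;
  for (int i = 0; i < repetitions; ++i) {
    const auto start = std::chrono::steady_clock::now();
    sink = sink + read_fn(buffer);
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  std::sort(seconds.begin(), seconds.end());
  const double median = seconds[seconds.size() / 2];
  // Guard against a zero reading on very small inputs or coarse clocks.
  const double throughput =
      median > 0.0 ? static_cast<double>(buffer.size()) / median : 0.0;
  return BenchResult{median, throughput};
}

// Stand-in "reader" (an assumption, not a real parser): counts the
// newline characters in a CSV-like buffer.
inline std::size_t CountRows(const std::string& csv) {
  return static_cast<std::size_t>(std::count(csv.begin(), csv.end(), '\n'));
}
```

A call such as BenchmarkInMemory(data, CountRows, 10) would time ten passes over data. Reporting the median rather than the mean is one way to damp outliers; the on-disk and network scenarios would need additional setup that this sketch does not attempt.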



--
This message was sent by Atlassian Jira
(v8.20.10#820010)