Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/02/19 20:44:00 UTC

[jira] [Comment Edited] (ARROW-4313) Define general benchmark database schema

    [ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772315#comment-16772315 ] 

Wes McKinney edited comment on ARROW-4313 at 2/19/19 8:43 PM:
--------------------------------------------------------------

I'm involved in many projects, so I haven't been able to follow the discussion closely enough to see where there is disagreement or conflict.

From my perspective, I want the following in the short term:

* A general-purpose database schema, preferably for PostgreSQL, which can be used to easily provision a new benchmark database
* A script for running the C++ benchmarks and inserting the results into _any instance_ of that database. This script should capture hardware information as well as any additional information that is known about the environment (OS, third-party library versions -- e.g. so we can see whether upgrading a dependency, like gRPC for example, causes a performance problem). The script should not be coupled to a particular instance of the database, and it should work in an air-gapped environment. (A rough sketch of both pieces follows below.)
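
To make the target concrete, here is a minimal sketch of what the schema-plus-collector pair could look like. It is an illustration, not a proposal: the benchmark_run table and its columns are placeholders, psycopg2 is just one possible PostgreSQL driver, and the only firm dependency is Google Benchmark's --benchmark_format=json output.

    import json
    import platform
    import subprocess

    import psycopg2  # any PostgreSQL driver would do; psycopg2 is assumed here

    # Placeholder DDL -- the real column set is to be settled on the list.
    DDL = """
    CREATE TABLE IF NOT EXISTS benchmark_run (
        run_id        SERIAL PRIMARY KEY,
        run_timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
        git_hash      TEXT NOT NULL,
        machine_name  TEXT NOT NULL,
        cpu_model     TEXT,
        benchmark     TEXT NOT NULL,
        real_time     DOUBLE PRECISION,  -- in the report's time_unit (ns by default)
        cpu_time      DOUBLE PRECISION,
        iterations    BIGINT
    );
    """

    def collect(benchmark_binary, dsn, git_hash):
        # Google Benchmark emits {"context": ..., "benchmarks": [...]}
        # when run with --benchmark_format=json.
        report = json.loads(
            subprocess.check_output([benchmark_binary, "--benchmark_format=json"]))
        conn = psycopg2.connect(dsn)  # "any instance": the DSN is the only coupling
        with conn, conn.cursor() as cur:
            cur.execute(DDL)
            for bench in report["benchmarks"]:
                cur.execute(
                    """INSERT INTO benchmark_run
                         (git_hash, machine_name, cpu_model, benchmark,
                          real_time, cpu_time, iterations)
                       VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                    (git_hash, platform.node(), platform.processor(),
                     bench["name"], bench.get("real_time"),
                     bench.get("cpu_time"), bench.get("iterations")))
        conn.close()

Hardware details beyond the portable platform module fields (cache sizes, throttling state, GPU) would need per-OS probes, which is why capturing them belongs in this script rather than in the database layer.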

I think we should work as quickly as possible to have a working version of both of these, to validate that we are on the right track. If we try to come up with the "perfect database schema" and punt the benchmark collector script until later, we could be waiting a long time.

Ideally the database schema can accommodate results from benchmark execution frameworks other than Google Benchmark for C++, so that we could write an adapter script to export data from ASV (for Python) into the same database. A rough sketch of such an adapter follows.
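
This sketch assumes ASV's on-disk layout of one JSON results file per run under results/<machine>/ and reuses the same hypothetical benchmark_run table as above; the result payload format has varied across ASV versions, so the parsing below only handles the simplest case.

    import json
    import pathlib

    import psycopg2

    def export_asv_results(results_dir, dsn):
        conn = psycopg2.connect(dsn)
        with conn, conn.cursor() as cur:
            for path in pathlib.Path(results_dir).glob("*/*.json"):
                doc = json.loads(path.read_text())
                if "results" not in doc:  # skip machine.json and friends
                    continue
                for name, timing in doc["results"].items():
                    # Simplest case: a single time in seconds. Parameterized
                    # benchmarks nest further and are skipped in this sketch.
                    if isinstance(timing, (int, float)):
                        cur.execute(
                            """INSERT INTO benchmark_run
                                 (git_hash, machine_name, benchmark, real_time)
                               VALUES (%s, %s, %s, %s)""",
                            (doc.get("commit_hash"),
                             path.parent.name,  # ASV names the dir after the machine
                             name,
                             timing * 1e9))  # ASV reports seconds; store ns
        conn.close()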

[~aregm] this does not seem to be out of line with the requirements you listed, unless I am misunderstanding. I would rather not be too involved in the details right now unless the project stalls out for some reason and needs me to help push it through to completion.


> Define general benchmark database schema
> ----------------------------------------
>
>                 Key: ARROW-4313
>                 URL: https://issues.apache.org/jira/browse/ARROW-4313
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Benchmarking
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>          Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> Some possible attributes that the benchmark database should track, to permit heterogeneity of hardware and programming languages:
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> See the discussion on the mailing list: https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
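
For concreteness, one way those attributes could map onto PostgreSQL tables is sketched below. This is an illustration only -- it is not the benchmark-data-model.erdplus attached to this issue, and every name is a placeholder -- which splits the per-machine attributes out of the flat benchmark_run sketch used earlier in this thread.

    # Illustrative DDL only -- not the attached benchmark-data-model.erdplus.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS machine (
        machine_id        SERIAL PRIMARY KEY,
        machine_name      TEXT UNIQUE NOT NULL,  -- the "user id"
        cpu_model         TEXT,
        cpu_frequency_hz  BIGINT,   -- nominal clock; overclocking possible
        l1_cache_bytes    INTEGER,
        l2_cache_bytes    INTEGER,
        l3_cache_bytes    INTEGER,
        cpu_throttling    BOOLEAN,  -- NULL when it cannot be determined
        ram_bytes         BIGINT,
        gpu_model         TEXT      -- NULL if no GPU
    );

    CREATE TABLE IF NOT EXISTS benchmark_result (
        result_id       SERIAL PRIMARY KEY,
        machine_id      INTEGER NOT NULL REFERENCES machine,
        run_timestamp   TIMESTAMPTZ NOT NULL,
        git_hash        TEXT NOT NULL,
        benchmark_name  TEXT NOT NULL,
        languages       TEXT[],            -- e.g. C++ and Python
        time_value      DOUBLE PRECISION,
        mean            DOUBLE PRECISION,  -- NULL if unavailable
        stddev          DOUBLE PRECISION   -- NULL if unavailable
    );
    """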



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)