Posted to github@arrow.apache.org by "waitingkuo (via GitHub)" <gi...@apache.org> on 2023/02/15 14:34:39 UTC
[GitHub] [arrow-datafusion] waitingkuo commented on issue #5276: Update benchmmarks on clickbench
waitingkuo commented on issue #5276:
URL: https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1431466976
The code to reproduce the benchmark is here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
To update the data shown on the ClickBench website, update the JSON file in the results folder and send a PR: https://github.com/ClickHouse/ClickBench/tree/main/datafusion/results
The result currently listed on the website was benchmarked with DataFusion v11, which was released around 6 months ago.
To improve:
1. Use DataFusion v18 to rerun the benchmark code and send the PR.
2. At the time I wrote the benchmark code, DataFusion didn't support some features. For example, it didn't support reading the schema from Parquet, so for some data types such as timestamp we needed to load the column as a string and then cast it to timestamp explicitly. For example:
This is the original SQL query:
```sql
SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM ...
```
https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L41
This is the modified version for DataFusion:
```sql
SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM
```
https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql#L41
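As a rough illustration of what the `::INT::DATE` cast in the modified query does: assuming `EventDate` is stored as an integer count of days since 1970-01-01 (which is how Arrow's Date32 type represents dates), the cast reinterprets that integer as a calendar date. Below is a minimal Python sketch of the same arithmetic. The helper name `days_to_date` and the sample values are hypothetical and are not part of DataFusion or ClickBench.

```python
from datetime import date, timedelta

# Assumption: the column holds days since the Unix epoch,
# matching Arrow's Date32 representation.
EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Convert a days-since-epoch integer to a calendar date."""
    return EPOCH + timedelta(days=days)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the benchmark data.
    for raw in (0, 15887, 15901):
        print(raw, "->", days_to_date(raw).isoformat())
```

If the new DataFusion version reads the column type directly from the Parquet schema, this explicit conversion may no longer be needed in the query, which is what the verification step is meant to find out.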
I did modify some queries so that they work in DataFusion. To fix this, we need to verify whether the original queries work with the new DataFusion version. If so, we could update the queries. If not, we could file an issue to improve this.
I'll do the first option soon (update to v18); it should be a quick improvement.
The second approach will take more time. Contributions are welcome. I'll get back to this if no one is working on it.