Posted to github@arrow.apache.org by "waitingkuo (via GitHub)" <gi...@apache.org> on 2023/02/15 14:34:39 UTC

[GitHub] [arrow-datafusion] waitingkuo commented on issue #5276: Update benchmmarks on clickbench

waitingkuo commented on issue #5276:
URL: https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1431466976

   The code to reproduce the benchmark is here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion
   To update the data on the ClickBench website, update the JSON file in the results folder and send a PR: https://github.com/ClickHouse/ClickBench/tree/main/datafusion/results
   
   The current result listed on the website was benchmarked with DataFusion v11, which was released around 6 months ago.
   
   To improve:
   
   1. Use DataFusion v18 to rerun the benchmark code and send the PR.
   
   2. At the time I wrote the benchmark code, DataFusion didn't support some features. For example, it didn't support taking the schema from Parquet, so for some data types such as timestamp, we needed to load the column as a string and then cast it to a timestamp explicitly. For example:
    
   This is the original SQL query:
   ```sql
   SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM ...
   ```
   https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L41
   
   This is the modified version for DataFusion:
   ```sql
   SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM
   ```
   https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql#L41
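
To illustrate what the `"EventDate"::INT::DATE` cast chain in the modified query is doing, here is a minimal Python sketch. It assumes the column is stored as an integer count of days since 1970-01-01, which is not stated above, and that casting an integer to a date interprets it as that day count. The helper names `days_to_date` and `cast_int_then_date` are hypothetical and exist only for this illustration; they are not part of DataFusion or ClickBench.

```python
from datetime import date, timedelta

# Assumption: the raw column value is a count of days since the Unix epoch.
EPOCH = date(1970, 1, 1)


def days_to_date(days: int) -> date:
    """Interpret an integer as days since 1970-01-01 (the ::DATE step)."""
    return EPOCH + timedelta(days=days)


def cast_int_then_date(raw) -> date:
    """Mimic the two-step cast: first to an integer (::INT), then to a date (::DATE)."""
    return days_to_date(int(raw))


if __name__ == "__main__":
    # A raw value as it might be read from the file, before any cast.
    print(cast_int_then_date(15887))
```

If a newer DataFusion reads the column type directly from the Parquet schema, a conversion like this would no longer need to be spelled out in the query, which is what the verification step would have to confirm.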
   
   I modified some queries so that they work in DataFusion. To fix this, we need to verify whether the new DataFusion can run the original queries. If so, we can update the queries. If not, we can file an issue to improve DataFusion.
   
   I'll do the first option soon (update to v18); it should be a quick improvement.
   The second approach will take more time. Contributions are welcome. I'll get back to this if no one else is working on it.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org