You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/19 00:20:03 UTC

[GitHub] [arrow-datafusion] isidentical opened a new pull request, #3889: Allow enabling collection of statistics during TPC-H benchmarks

isidentical opened a new pull request, #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3888 .
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   As described in #3888, it is useful to see a benchmark with/without statistics to determine what sort of effect it will have.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   A new flag, `--enable-statistics` to turn on statistic collection.
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#issuecomment-1284534986

   Thanks for resolving conflicts @Dandandan 🙏🏻 By the way, just as a question (or maybe something we might be interested to discuss in general), are there any plans/efforts to do continuous benchmarking to ensure that no regression goes unnoticed? (I was very interested in @andygrove 's benchmarking of ballista/spark, so maybe even doing it both for ballista/datafusion).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan merged pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
Dandandan merged PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#discussion_r999591291


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
     /// Path to output directory where JSON summary file should be written to
     #[structopt(parse(from_os_str), short = "o", long = "output")]
     output_path: Option<PathBuf>,
+
+    /// Whether to disable collection of statistics (and cost based optimizations) or not.
+    #[structopt(short = "S", long = "disable-statistics")]

Review Comment:
   There is no current "continuous benchmarking" in place for this. If that would be the case, it would make sense to run the benchmarks both with and without collecting statistics.
   
   Anyway, good that we can enable / disable them 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#discussion_r998887827


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
     /// Path to output directory where JSON summary file should be written to
     #[structopt(parse(from_os_str), short = "o", long = "output")]
     output_path: Option<PathBuf>,
+
+    /// Whether to disable collection of statistics (and cost based optimizations) or not.
+    #[structopt(short = "S", long = "disable-statistics")]

Review Comment:
   It might be good to keep defaults the same as datafusion-cli (statistics disabled)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
Dandandan commented on PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#issuecomment-1284516419

   Thanks @isidentical 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#discussion_r999432821


##########
benchmarks/src/bin/tpch.rs:
##########
@@ -98,6 +98,10 @@ struct DataFusionBenchmarkOpt {
     /// Path to output directory where JSON summary file should be written to
     #[structopt(parse(from_os_str), short = "o", long = "output")]
     output_path: Option<PathBuf>,
+
+    /// Whether to disable collection of statistics (and cost based optimizations) or not.
+    #[structopt(short = "S", long = "disable-statistics")]

Review Comment:
   I was also thinking about it but then decided against it since the default is already `true`. Do you know whether we have any places where we continuously run benchmarks where making the default `false` would change results? If it wouldn't cause a regression on places where we compare benchmarks across datafusion commits/versions, I think it should be an easy to change. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3889: Allow configuring collection of statistics during TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3889:
URL: https://github.com/apache/arrow-datafusion/pull/3889#issuecomment-1283276230

   There are currently only two physical plans that differ when we pass `--disable-stats`, which is I guess the most reliable indication that they are the only plans that we actually use statistics. Here are the results from them in regards to on/off execution (which looks amazing, thanks to all the previous work on hash build side switching by @Dandandan and others)
   
   ![image](https://user-images.githubusercontent.com/47358913/196578430-f72a47cb-1f73-49b4-ae6d-db3f3530f571.png)
   
   I'll try to inspect other plans to see what else we can do (or if there is anything obvious we are missing, except filters which are in progress in #3868). But I'd expect the majority of the speed-ups in here to come from global join ordering (unlike only locals we have) with #3843 so no high hopes to see a magic bullet just yet 😇 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org