You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/24 17:19:23 UTC

[GitHub] [arrow] andygrove opened a new pull request #8760: ARROW-10712: Add tests to TPC-H benchmarks

andygrove opened a new pull request #8760:
URL: https://github.com/apache/arrow/pull/8760


   This adds tests to the TPC-H benchmark suite that only run if the `TPCH_DATA` environment variable exists and points to a directory containing `tbl` files.
   
   This is useful when running tests locally and adds a mechanism that we could leverage in CI. We could have a docker image containint the data generator and generate the SF=1 data set before running the cargo tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#discussion_r534324711



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -353,3 +353,36 @@ fn get_schema(table: &str) -> Schema {
         _ => unimplemented!(),
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::env;
+
+    #[tokio::test]
+    async fn q1() -> Result<()> {
+        verify_query(1).await
+    }
+
+    #[tokio::test]
+    async fn q12() -> Result<()> {
+        verify_query(12).await
+    }
+
+    async fn verify_query(n: usize) -> Result<()> {

Review comment:
       I think it would help to document here the expectations (e.g. copy the description from the PR as comments).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 edited a comment on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 edited a comment on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-745805422


   @andygrove 
   
   here is my suggested approach which has already caught an issue:
   https://github.com/apache/arrow/compare/master...seddonm1:test-tpch?expand=1
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-748668410


   Closing this since it is superseded by other work


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-744076299


   Hi @andygrove 
   So I looked at this over the weekend. I was thinking that we could just embed the expected TPC-H answers (given the deterministic inputs we have chosen) and store them as Parquet. This is similar to how the Databricks TPC-H works: https://github.com/databricks/tpch-dbgen/tree/master/answers. 
   
   I have produced results with Spark with single partition parquets (snappy) and that would require somewhere around 3.5MB. We could also do a limit `n` (which is what it looks like databricks have done) and just check the answers are contained in the result set to reduce data requirements.
   
   I have also had trouble generating the parquet files with your program. Here is my alternative way to generate the test dataset which needs to be run from within the `tpch-dbgen` directory (or you can change the volume mount):
   
   ```bash
   docker run \
   --rm \
   --volume $(pwd):/tpch:Z \
   --env "ETL_CONF_ENV=production" \
   --env "CONF_NUM_PARITIONS=10" \
   --env "INPUT_PATH=/tpch/tbl" \
   --env "OUTPUT_PATH=/tpch/parquet" \
   --env "SCHEMA_PATH=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/schema" \
   --entrypoint="" \
   --publish 4040:4040 \
   ghcr.io/tripl-ai/arc:arc_3.6.2_spark_3.0.1_scala_2.12_hadoop_3.2.0_1.10.0 \
   bin/spark-submit \
   --master local\[\*\] \
   --driver-memory 4G \
   --driver-java-options "-XX:+UseG1GC" \
   --class ai.tripl.arc.ARC \
   /opt/spark/jars/arc.jar \
   --etl.config.uri=https://raw.githubusercontent.com/tripl-ai/arc-starter/master/examples/tpch/tpch.ipynb
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-740950757


   Thanks @seddonm1 that would be great if you have the time. I was really just planning on addressing feedback. Feel free to push to this PR or create a new one to replace this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove closed pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

andygrove closed pull request #8760:
URL: https://github.com/apache/arrow/pull/8760


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-747011899


   Thanks @andygrove I will raise it there. I am hoping Decimal support can land before we need to commit parquet files to the repo.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-740658677


   I will be getting back to this PR in the next day or two.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-733126013


   https://issues.apache.org/jira/browse/ARROW-10712


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Dandandan commented on a change in pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#discussion_r530428982



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -353,3 +353,36 @@ fn get_schema(table: &str) -> Schema {
         _ => unimplemented!(),
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::env;
+
+    #[tokio::test]
+    async fn q1() -> Result<()> {
+        verify_query(1).await
+    }
+
+    #[tokio::test]
+    async fn q12() -> Result<()> {
+        verify_query(12).await
+    }
+
+    async fn verify_query(n: usize) -> Result<()> {

Review comment:
       maybe also test the mem_table / some other options by adding at as parameter in `verify_query` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-745805422


   @andygrove 
   
   here is my suggested approach which seems like it might already be catching issues:
   https://github.com/seddonm1/arrow/commit/7b671e4a0c4b21658aa7214346454920dddb1c5a
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on a change in pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on a change in pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#discussion_r536897356



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -383,6 +383,9 @@ mod tests {
             };
             benchmark(opt).await?
         }

Review comment:
       You have a duplicate `}`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on a change in pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#discussion_r534323934



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -353,3 +353,36 @@ fn get_schema(table: &str) -> Schema {
         _ => unimplemented!(),
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::env;
+
+    #[tokio::test]
+    async fn q1() -> Result<()> {
+        verify_query(1).await
+    }
+
+    #[tokio::test]
+    async fn q12() -> Result<()> {
+        verify_query(12).await
+    }
+
+    async fn verify_query(n: usize) -> Result<()> {
+        if let Ok(path) = env::var("TPCH_DATA") {
+            let opt = BenchmarkOpt {
+                query: n,
+                debug: false,
+                iterations: 1,
+                concurrency: 2,
+                batch_size: 4096,
+                path: PathBuf::from(path),
+                file_format: "tbl".to_string(),
+                mem_table: false,
+            };
+            benchmark(opt).await?
+        }
+        Ok(())

Review comment:
       ```suggestion
           } else {
               println!("TPCH_DATA environment variable not set, skipping test")
           }
           Ok(())
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-740934197


   I can help with this if you can describe your plans.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] andygrove commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

andygrove commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-746470112


   Thanks @seddonm1 I like this approach. We have a separate repo (arrow-testing) that is a git submodule of the main arrow repo where we check in files like this. There is sometimes some pushback on adding large files here (quite rightly) so I think it might be a good idea to raise this proposal on the mailing list first.
   
   There might be value here for other implementations (like C++) that are building query engines.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] seddonm1 commented on pull request #8760: ARROW-10712: [Rust] Add tests to TPC-H benchmarks

Posted by GitBox <gi...@apache.org>.

seddonm1 commented on pull request #8760:
URL: https://github.com/apache/arrow/pull/8760#issuecomment-740954932


   @andygrove No worries. Hopefully I can help on some of these easier tasks to free you up for the harder ones.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org