You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/06 05:15:59 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #8845: ARROW-10815: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

seddonm1 opened a new pull request #8845:
URL: https://github.com/apache/arrow/pull/8845


   Changes:
   - can execute query 5 and 6.
   - add all tables.
   - add all queries with query validation parameters provided.
   - make basic modifications to queries 5 and 6 to allow supported syntax.
   - queries 1 and 12 retain existing modifications.
   
   this was an easy way to begin understanding some of the code base and 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8845:
URL: https://github.com/apache/arrow/pull/8845


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#discussion_r537115746



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -145,59 +145,849 @@ async fn benchmark(opt: BenchmarkOpt) -> Result<()> {
 
 fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<LogicalPlan> {
     match query {
+
+        // original
+        // 1 => ctx.create_logical_plan(
+        //     format!(
+        //     "select
+        //         l_returnflag,
+        //         l_linestatus,
+        //         sum(l_quantity) as sum_qty,
+        //         sum(l_extendedprice) as sum_base_price,
+        //         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+        //         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+        //         avg(l_quantity) as avg_qty,
+        //         avg(l_extendedprice) as avg_price,
+        //         avg(l_discount) as avg_disc,
+        //         count(*) as count_order
+        //     from
+        //         lineitem
+        //     where
+        //         l_shipdate <= date '1998-12-01' - interval '{DELTA}' day (3)
+        //     group by
+        //         l_returnflag,
+        //         l_linestatus
+        //     order by
+        //         l_returnflag,
+        //         l_linestatus;",
+        //     DELTA="90").as_ref()
+        // ),
         1 => ctx.create_logical_plan(
+            format!(
             "select
-                    l_returnflag,
-                    l_linestatus,
-                    sum(l_quantity),
-                    sum(l_extendedprice),
-                    sum(l_extendedprice * (1 - l_discount)),
-                    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
-                    avg(l_quantity),
-                    avg(l_extendedprice),
-                    avg(l_discount),
-                    count(*)
-                from
-                    lineitem
-                where
-                    l_shipdate <= '1998-12-01'
-                group by
-                    l_returnflag,
-                    l_linestatus
-                order by
-                    l_returnflag,
-                    l_linestatus",
+                l_returnflag,
+                l_linestatus,
+                sum(l_quantity) as sum_qty,
+                sum(l_extendedprice) as sum_base_price,
+                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+                avg(l_quantity) as avg_qty,
+                avg(l_extendedprice) as avg_price,
+                avg(l_discount) as avg_disc,
+                count(*) as count_order
+            from
+                lineitem
+            where
+                l_shipdate <= '1998-09-02'
+            group by
+                l_returnflag,
+                l_linestatus
+            order by
+                l_returnflag,
+                l_linestatus;",
+            ).as_ref()
+        ),
+
+        2 => ctx.create_logical_plan(
+            format!(
+            "select
+                s_acctbal,
+                s_name,
+                n_name,
+                p_partkey,
+                p_mfgr,
+                s_address,
+                s_phone,
+                s_comment
+            from
+                part,
+                supplier,
+                partsupp,
+                nation,
+                region
+            where
+                p_partkey = ps_partkey
+                and s_suppkey = ps_suppkey
+                and p_size = {SIZE}

Review comment:
       I have updated PR without `format!` and with the TPC-H specified validation parameters hard coded.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] removed a comment on pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
github-actions[bot] removed a comment on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739457414


   https://issues.apache.org/jira/browse/ARROW-10815


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8845: ARROW-10815: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739457414


   https://issues.apache.org/jira/browse/ARROW-10815


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739515127


   It looks like formatting is off. See the Rust README for instructions on running cargo fmt.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739458417


   https://issues.apache.org/jira/browse/ARROW-10820


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739491824


   Awesome contribution, thanks @seddonm1  :) I added one comment


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#discussion_r537111215



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -145,59 +145,849 @@ async fn benchmark(opt: BenchmarkOpt) -> Result<()> {
 
 fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<LogicalPlan> {
     match query {
+
+        // original
+        // 1 => ctx.create_logical_plan(
+        //     format!(
+        //     "select
+        //         l_returnflag,
+        //         l_linestatus,
+        //         sum(l_quantity) as sum_qty,
+        //         sum(l_extendedprice) as sum_base_price,
+        //         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+        //         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+        //         avg(l_quantity) as avg_qty,
+        //         avg(l_extendedprice) as avg_price,
+        //         avg(l_discount) as avg_disc,
+        //         count(*) as count_order
+        //     from
+        //         lineitem
+        //     where
+        //         l_shipdate <= date '1998-12-01' - interval '{DELTA}' day (3)
+        //     group by
+        //         l_returnflag,
+        //         l_linestatus
+        //     order by
+        //         l_returnflag,
+        //         l_linestatus;",
+        //     DELTA="90").as_ref()
+        // ),
         1 => ctx.create_logical_plan(
+            format!(
             "select
-                    l_returnflag,
-                    l_linestatus,
-                    sum(l_quantity),
-                    sum(l_extendedprice),
-                    sum(l_extendedprice * (1 - l_discount)),
-                    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
-                    avg(l_quantity),
-                    avg(l_extendedprice),
-                    avg(l_discount),
-                    count(*)
-                from
-                    lineitem
-                where
-                    l_shipdate <= '1998-12-01'
-                group by
-                    l_returnflag,
-                    l_linestatus
-                order by
-                    l_returnflag,
-                    l_linestatus",
+                l_returnflag,
+                l_linestatus,
+                sum(l_quantity) as sum_qty,
+                sum(l_extendedprice) as sum_base_price,
+                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+                avg(l_quantity) as avg_qty,
+                avg(l_extendedprice) as avg_price,
+                avg(l_discount) as avg_disc,
+                count(*) as count_order
+            from
+                lineitem
+            where
+                l_shipdate <= '1998-09-02'
+            group by
+                l_returnflag,
+                l_linestatus
+            order by
+                l_returnflag,
+                l_linestatus;",
+            ).as_ref()
+        ),
+
+        2 => ctx.create_logical_plan(
+            format!(
+            "select
+                s_acctbal,
+                s_name,
+                n_name,
+                p_partkey,
+                p_mfgr,
+                s_address,
+                s_phone,
+                s_comment
+            from
+                part,
+                supplier,
+                partsupp,
+                nation,
+                region
+            where
+                p_partkey = ps_partkey
+                and s_suppkey = ps_suppkey
+                and p_size = {SIZE}

Review comment:
       I am happy to change if you would like. 
   My rationale:
   - we can retain the information of which arguments are parameterised as per the TPC-H specification.
   - we can retain 1:1 mapping of their query specification.
   - the `format!` macro is compile time and should have very little performance impact




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#issuecomment-739458179


   Sorry, put the wrong Jira ID when copying format 🤦 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#discussion_r537021940



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -145,59 +145,849 @@ async fn benchmark(opt: BenchmarkOpt) -> Result<()> {
 
 fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<LogicalPlan> {
     match query {
+
+        // original
+        // 1 => ctx.create_logical_plan(
+        //     format!(
+        //     "select
+        //         l_returnflag,
+        //         l_linestatus,
+        //         sum(l_quantity) as sum_qty,
+        //         sum(l_extendedprice) as sum_base_price,
+        //         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+        //         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+        //         avg(l_quantity) as avg_qty,
+        //         avg(l_extendedprice) as avg_price,
+        //         avg(l_discount) as avg_disc,
+        //         count(*) as count_order
+        //     from
+        //         lineitem
+        //     where
+        //         l_shipdate <= date '1998-12-01' - interval '{DELTA}' day (3)
+        //     group by
+        //         l_returnflag,
+        //         l_linestatus
+        //     order by
+        //         l_returnflag,
+        //         l_linestatus;",
+        //     DELTA="90").as_ref()
+        // ),
         1 => ctx.create_logical_plan(
+            format!(
             "select
-                    l_returnflag,
-                    l_linestatus,
-                    sum(l_quantity),
-                    sum(l_extendedprice),
-                    sum(l_extendedprice * (1 - l_discount)),
-                    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
-                    avg(l_quantity),
-                    avg(l_extendedprice),
-                    avg(l_discount),
-                    count(*)
-                from
-                    lineitem
-                where
-                    l_shipdate <= '1998-12-01'
-                group by
-                    l_returnflag,
-                    l_linestatus
-                order by
-                    l_returnflag,
-                    l_linestatus",
+                l_returnflag,
+                l_linestatus,
+                sum(l_quantity) as sum_qty,
+                sum(l_extendedprice) as sum_base_price,
+                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+                avg(l_quantity) as avg_qty,
+                avg(l_extendedprice) as avg_price,
+                avg(l_discount) as avg_disc,
+                count(*) as count_order
+            from
+                lineitem
+            where
+                l_shipdate <= '1998-09-02'
+            group by
+                l_returnflag,
+                l_linestatus
+            order by
+                l_returnflag,
+                l_linestatus;",
+            ).as_ref()
+        ),
+
+        2 => ctx.create_logical_plan(
+            format!(
+            "select
+                s_acctbal,
+                s_name,
+                n_name,
+                p_partkey,
+                p_mfgr,
+                s_address,
+                s_phone,
+                s_comment
+            from
+                part,
+                supplier,
+                partsupp,
+                nation,
+                region
+            where
+                p_partkey = ps_partkey
+                and s_suppkey = ps_suppkey
+                and p_size = {SIZE}

Review comment:
       Why would we put the constants here in `format!` macro. Wouldn't it be better / easier to keep them as is, as they are directly coupled to the query anyway?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8845: ARROW-10820: [Rust] [DataFusion] Complete TPC-H Benchmark Queries

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8845:
URL: https://github.com/apache/arrow/pull/8845#discussion_r537055905



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -145,59 +145,849 @@ async fn benchmark(opt: BenchmarkOpt) -> Result<()> {
 
 fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<LogicalPlan> {
     match query {
+
+        // original
+        // 1 => ctx.create_logical_plan(
+        //     format!(
+        //     "select
+        //         l_returnflag,
+        //         l_linestatus,
+        //         sum(l_quantity) as sum_qty,
+        //         sum(l_extendedprice) as sum_base_price,
+        //         sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+        //         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+        //         avg(l_quantity) as avg_qty,
+        //         avg(l_extendedprice) as avg_price,
+        //         avg(l_discount) as avg_disc,
+        //         count(*) as count_order
+        //     from
+        //         lineitem
+        //     where
+        //         l_shipdate <= date '1998-12-01' - interval '{DELTA}' day (3)
+        //     group by
+        //         l_returnflag,
+        //         l_linestatus
+        //     order by
+        //         l_returnflag,
+        //         l_linestatus;",
+        //     DELTA="90").as_ref()
+        // ),
         1 => ctx.create_logical_plan(
+            format!(
             "select
-                    l_returnflag,
-                    l_linestatus,
-                    sum(l_quantity),
-                    sum(l_extendedprice),
-                    sum(l_extendedprice * (1 - l_discount)),
-                    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
-                    avg(l_quantity),
-                    avg(l_extendedprice),
-                    avg(l_discount),
-                    count(*)
-                from
-                    lineitem
-                where
-                    l_shipdate <= '1998-12-01'
-                group by
-                    l_returnflag,
-                    l_linestatus
-                order by
-                    l_returnflag,
-                    l_linestatus",
+                l_returnflag,
+                l_linestatus,
+                sum(l_quantity) as sum_qty,
+                sum(l_extendedprice) as sum_base_price,
+                sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
+                sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
+                avg(l_quantity) as avg_qty,
+                avg(l_extendedprice) as avg_price,
+                avg(l_discount) as avg_disc,
+                count(*) as count_order
+            from
+                lineitem
+            where
+                l_shipdate <= '1998-09-02'
+            group by
+                l_returnflag,
+                l_linestatus
+            order by
+                l_returnflag,
+                l_linestatus;",
+            ).as_ref()
+        ),
+
+        2 => ctx.create_logical_plan(
+            format!(
+            "select
+                s_acctbal,
+                s_name,
+                n_name,
+                p_partkey,
+                p_mfgr,
+                s_address,
+                s_phone,
+                s_comment
+            from
+                part,
+                supplier,
+                partsupp,
+                nation,
+                region
+            where
+                p_partkey = ps_partkey
+                and s_suppkey = ps_suppkey
+                and p_size = {SIZE}

Review comment:
       I think I agree. If we ever wanted to run these queries with different parameters we would want to use the tools provided by TPC to generate the queries to a text file. I'm not sure there is value in having them as parameters here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org