You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 17:56:29 UTC

[GitHub] [arrow] save-buffer opened a new pull request, #12843: [C++] TPC-H generator fix and new test

save-buffer opened a new pull request, #12843:
URL: https://github.com/apache/arrow/pull/12843

   We'd occasionally get a hang in the multi-column generators. This was due to us creating an insufficient number of tasks, so we'd never actually finish outputting. This fixes it. 
   This PR also adds a test to exercise it: it generates TPC-H SF 1 all in one plan and verifies the result. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r850045973


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -570,6 +606,46 @@ TEST(TpchNode, Region) {
   ASSERT_EQ(num_rows, 5);
 }
 
+TEST(TpchNode, Region) {
+  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Region));
+  VerifyRegion(res);
+}
+
+TEST(TpchNode, AllTables) {
+  constexpr double kScaleFactor = 0.05;
+  constexpr int kNumTables = 8;
+  std::array<TableNodeFn, kNumTables> tables = {
+      &TpchGen::Supplier, &TpchGen::Part,     &TpchGen::PartSupp, &TpchGen::Customer,
+      &TpchGen::Orders,   &TpchGen::Lineitem, &TpchGen::Nation,   &TpchGen::Region,
+  };
+  using VerifyFn = void(const std::vector<ExecBatch>&, double);
+  std::array<VerifyFn*, kNumTables> verify_fns = {
+      &VerifySupplier, &VerifyPart,     &VerifyPartSupp, &VerifyCustomer,
+      &VerifyOrders,   &VerifyLineitem, &VerifyNation,   &VerifyRegion,
+  };
+
+  std::array<AsyncGenerator<util::optional<ExecBatch>>, kNumTables> gens;
+  std::array<std::vector<ExecBatch>, kNumTables> batches;
+  ExecContext ctx(default_memory_pool(), arrow::internal::GetCpuThreadPool());
+  ASSERT_OK_AND_ASSIGN(std::shared_ptr<ExecPlan> plan, ExecPlan::Make(&ctx));
+  ASSERT_OK_AND_ASSIGN(std::unique_ptr<TpchGen> gen,
+                       TpchGen::Make(plan.get(), kScaleFactor));
+  for (int i = 0; i < kNumTables; i++) {
+    ASSERT_OK(AddTableAndSinkToPlan(*plan, *gen, gens[i], tables[i]));
+  }
+
+  ASSERT_OK(plan->Validate());
+  ASSERT_OK(plan->StartProducing());
+  plan->finished().Wait();

Review Comment:
   ```suggestion
     ASSERT_OK(plan->finished().status());
   ```
   Or there is an `ASSERT_FINISHES_OK(plan->finished())` but I think you might need an extra include for that.



##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -333,24 +343,15 @@ void CountModifiedComments(const Datum& d, int* good_count, int* bad_count) {
   }
 }
 
-TEST(TpchNode, ScaleFactor) {

Review Comment:
   All of these methods moving around makes it a little harder to follow changes.  Can we maybe pick an order (e.g. ScaleFactor, each table in alphabetical order, AllTables) so we can stick with it?  Or is there some meaning behind this new order?



##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -365,18 +366,28 @@ TEST(TpchNode, Supplier) {
   }
   ASSERT_EQ(seen_suppkey.size(), kExpectedRows);
   ASSERT_EQ(num_rows, kExpectedRows);
-  ASSERT_EQ(good_count, 5);
-  ASSERT_EQ(bad_count, 5);
+  ASSERT_EQ(good_count, static_cast<int64_t>(5 * scale_factor));
+  ASSERT_EQ(bad_count, static_cast<int64_t>(5 * scale_factor));
 }
 
-TEST(TpchNode, Part) {
-  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Part));
+TEST(TpchNode, ScaleFactor) {
+  constexpr double kScaleFactor = 0.01;
+  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Supplier, kScaleFactor));
+  VerifySupplier(res, kScaleFactor);
+}

Review Comment:
   Since `AllTables` is now testing a different scale factor for all of the tables I think this test is probably redundant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098280499

   @save-buffer Note you can also run the thread sanitizer build locally using `archery docker run ubuntu-cpp-thread-sanitizer` (or you can pass `-DARROW_USE_TSAN=ON` to CMake).
   
   See https://arrow.apache.org/docs/developers/continuous_integration/archery.html for setting up Archery.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098554412

   Revision: e37595074472fd6931a0e7558235bef618a93163
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1854](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1854)
   
   |Task|Status|
   |----|------|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1854-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1854-github-test-ubuntu-20.04-cpp-thread-sanitizer)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098553671

   I'd be fine with that too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098306611

   Revision: 82f3b575d52062a66f4daf4555332846d619e45a
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1847](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1847)
   
   |Task|Status|
   |----|------|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1847-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1847-github-test-ubuntu-20.04-cpp-thread-sanitizer)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098482351

   Revision: 716ece57f70987515a364893f27163ac958b98cb
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1850](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1850)
   
   |Task|Status|
   |----|------|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1850-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1850-github-test-ubuntu-20.04-cpp-thread-sanitizer)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1104930157

   @github-actions crossbow submit *sanitize*


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1093286010

   Relabled to ARROW-16148 as this does not fit the criteria for a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes) and I had created the JIRA earlier anyways.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r855503777


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -570,6 +600,46 @@ TEST(TpchNode, Region) {
   ASSERT_EQ(num_rows, 5);
 }
 
+TEST(TpchNode, Region) {
+  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Region));
+  VerifyRegion(res);
+}
+
+TEST(TpchNode, AllTables) {
+  constexpr double kScaleFactor = 0.05;
+  constexpr int kNumTables = 8;
+  std::array<TableNodeFn, kNumTables> tables = {
+      &TpchGen::Supplier, &TpchGen::Part,     &TpchGen::PartSupp, &TpchGen::Customer,
+      &TpchGen::Orders,   &TpchGen::Lineitem, &TpchGen::Nation,   &TpchGen::Region,
+  };
+  using VerifyFn = void(const std::vector<ExecBatch>&, double);
+  std::array<VerifyFn*, kNumTables> verify_fns = {
+      &VerifySupplier, &VerifyPart,     &VerifyPartSupp, &VerifyCustomer,
+      &VerifyOrders,   &VerifyLineitem, &VerifyNation,   &VerifyRegion,
+  };
+
+  std::array<AsyncGenerator<util::optional<ExecBatch>>, kNumTables> gens;
+  std::array<std::vector<ExecBatch>, kNumTables> batches;

Review Comment:
   Good point - did that. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098480581

   I reduced the workload size for tsan since it was such an easy change. I think the main bottleneck is the verification, particularly of LineItem, so I guess to speed it up we could make the verification multithreaded (but also more complicated). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1093285752

   https://issues.apache.org/jira/browse/ARROW-16148


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1104602082

   Seems like valgrind issues are fixed with this latest commit. The run failed because of a timeout in hash-join-node-test which is not affected by this PR. 
   Thanks to @westonpace for investigating it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1104978093

   Revision: 957b994714b04225c04ad890c42e8b87b0d347c4
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1894](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1894)
   
   |Task|Status|
   |----|------|
   |test-fedora-r-clang-sanitizer|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-1894-azure-test-fedora-r-clang-sanitizer)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-1894-azure-test-fedora-r-clang-sanitizer)|
   |test-ubuntu-18.04-r-sanitizer|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-1894-azure-test-ubuntu-18.04-r-sanitizer)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-1894-azure-test-ubuntu-18.04-r-sanitizer)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-1894-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-1894-github-test-ubuntu-20.04-cpp-thread-sanitizer)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: [C++] TPC-H generator fix and new test

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1093140666

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098279184

   @github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1108055590

   Benchmark runs are scheduled for baseline = 8bd5514f52bf9cc542a389edaf697cbc2c97b752 and contender = 24b48083f102ef9366adb0797f5f8fa53c188a7b. 24b48083f102ef9366adb0797f5f8fa53c188a7b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/3da722dffc1c425fbb1baacf246902af...763de36c34e14ad5a19ea6e51b0258cd/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6f7e0408739e4939b45cf0c079f6cbc7...e420b9dcfa84435cb2c646b6114f9630/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6eb4313b43a24b69b8f6e95ac540e12e...c2f7bb0573764be4a1355f54b8d69c18/)
   [Finished :arrow_down:1.72% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b33fa74d1d544a3b8dd5724f670dd4c4...d254f2a9be8742e68e526b9fc590604c/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/581| `24b48083` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/569| `24b48083` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/567| `24b48083` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/579| `24b48083` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/580| `8bd5514f` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/568| `8bd5514f` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/566| `8bd5514f` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/578| `8bd5514f` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r854945088


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -570,6 +600,46 @@ TEST(TpchNode, Region) {
   ASSERT_EQ(num_rows, 5);
 }
 
+TEST(TpchNode, Region) {
+  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Region));
+  VerifyRegion(res);
+}
+
+TEST(TpchNode, AllTables) {
+  constexpr double kScaleFactor = 0.05;
+  constexpr int kNumTables = 8;
+  std::array<TableNodeFn, kNumTables> tables = {
+      &TpchGen::Supplier, &TpchGen::Part,     &TpchGen::PartSupp, &TpchGen::Customer,
+      &TpchGen::Orders,   &TpchGen::Lineitem, &TpchGen::Nation,   &TpchGen::Region,
+  };
+  using VerifyFn = void(const std::vector<ExecBatch>&, double);
+  std::array<VerifyFn*, kNumTables> verify_fns = {
+      &VerifySupplier, &VerifyPart,     &VerifyPartSupp, &VerifyCustomer,
+      &VerifyOrders,   &VerifyLineitem, &VerifyNation,   &VerifyRegion,
+  };
+
+  std::array<AsyncGenerator<util::optional<ExecBatch>>, kNumTables> gens;
+  std::array<std::vector<ExecBatch>, kNumTables> batches;

Review Comment:
   You can probably move `batches` inside the verification loop and then it doesn't need to be `std::array`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r850055862


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -365,18 +366,28 @@ TEST(TpchNode, Supplier) {
   }
   ASSERT_EQ(seen_suppkey.size(), kExpectedRows);
   ASSERT_EQ(num_rows, kExpectedRows);
-  ASSERT_EQ(good_count, 5);
-  ASSERT_EQ(bad_count, 5);
+  ASSERT_EQ(good_count, static_cast<int64_t>(5 * scale_factor));
+  ASSERT_EQ(bad_count, static_cast<int64_t>(5 * scale_factor));
 }
 
-TEST(TpchNode, Part) {
-  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Part));
+TEST(TpchNode, ScaleFactor) {
+  constexpr double kScaleFactor = 0.01;
+  ASSERT_OK_AND_ASSIGN(auto res, GenerateTable(&TpchGen::Supplier, kScaleFactor));
+  VerifySupplier(res, kScaleFactor);
+}

Review Comment:
   Good point, I'll get rid of it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098490647

   Given it's a test I think you could probably safely drop the scale factor even further to 0.1 (0.05 for the all tables test).  I'm not sure we need to limit the scale factor decrease to TSAN builds either.  Given that we are constantly running CI our test time is pretty limited.  As long as each node is generating a few batches of data I think we have decent coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098769469

   Revision: 3ec0934739591ad6590711ac89c99096fc75f4ba
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1855](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1855)
   
   |Task|Status|
   |----|------|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-1855-azure-test-conda-cpp-valgrind)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-1855-azure-test-conda-cpp-valgrind)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1099211302

   @save-buffer Do you want to tackle the Valgrind failure as part of this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1093285777

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098451532

   Perhaps we can generally try to make the test faster (which will benefit the TSAN builds but also other builds)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1104546361

   @github-actions crossbow submit test-conda-cpp-valgrind


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace closed pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup
URL: https://github.com/apache/arrow/pull/12843


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1105143580

   @jonkeane Does someone generally take a look at R sanitizer failures and open JIRAs for them?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098481457

   @github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098448448

   It passes in `251103 ms total` on my desktop (i9-9900K). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r850764806


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -333,24 +343,15 @@ void CountModifiedComments(const Datum& d, int* good_count, int* bad_count) {
   }
 }
 
-TEST(TpchNode, ScaleFactor) {

Review Comment:
   Great.  As long as there is some kind of order we can stick with going forwards I'm content.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1104547059

   Revision: bf68edde42072abf57d438f478d1970e00799a8f
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-1889](https://github.com/ursacomputing/crossbow/branches/all?query=actions-1889)
   
   |Task|Status|
   |----|------|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-1889-azure-test-conda-cpp-valgrind)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-1889-azure-test-conda-cpp-valgrind)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jonkeane commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
jonkeane commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1105208102

   You can tag @assignUser on these to take a look + create an issue. Oddly(?), the test-fedora-r-clang-sanitizer failure didn't show up in the nightly last night — and I don't see anything obvious in the commits since then that would trigger it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098553741

   @github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098768657

   @github-actions crossbow submit test-conda-cpp-valgrind


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098481317

   It now passes in `95053 ms total` on my desktop with tsan enabled


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098482240

   I think bot doesn't listen to me... 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on a diff in pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on code in PR #12843:
URL: https://github.com/apache/arrow/pull/12843#discussion_r850057113


##########
cpp/src/arrow/compute/exec/tpch_node_test.cc:
##########
@@ -333,24 +343,15 @@ void CountModifiedComments(const Datum& d, int* good_count, int* bad_count) {
   }
 }
 
-TEST(TpchNode, ScaleFactor) {

Review Comment:
   The order is the order in which the tables appear in the spec: Supplier, Part, Partsupp, Customer, Orders, Lineitem, Nation, Region. 
   ScaleFactor was put under VerifySupplier since it was generating the Supplier table, but I've gotten rid of it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] save-buffer commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
save-buffer commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1098447447

   OK it seems that the timeout was not caused by the bug that this PR is fixing (which could cause a real hang even without TSAN), it's just caused by TSAN being slow. Not sure what to do about it, maybe we can exclude this test from tsan?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] assignUser commented on pull request #12843: ARROW-16148: [C++] TPC-H generator cleanup

Posted by GitBox <gi...@apache.org>.
assignUser commented on PR #12843:
URL: https://github.com/apache/arrow/pull/12843#issuecomment-1106673435

   @jonkeane  I ran `test-fedora-r-clang-sanitizer` locally from the current master HEAD (which contains this PR) with no error,  so I am guessing it was fixed somewhere along the line since the rebase 11 days ago?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org