You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/23 00:11:04 UTC

[GitHub] [arrow] westonpace opened a new pull request, #12967: ARROW-16294: [C++] Improve performance of parquet readahead

westonpace opened a new pull request, #12967:
URL: https://github.com/apache/arrow/pull/12967

    * Turns out the batch size we were slicing internally for parquet (in the TableBatchReader) was not the batch size from the scanner.  I added batch slicing in file_parquet to leave the parquet reader itself more or less unchanged here (and it's not clear it would make more sense to use the smaller batch size slicing inside the reader anyways).
    * Changed parquet readahead so it reads ahead more than one row group.  It now tries to keep `batch_size * batch_readahead` reads in flight.  Users that want more parallel reads can increase batch readahead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12967:
URL: https://github.com/apache/arrow/pull/12967#discussion_r856744758


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -1051,33 +1051,66 @@ class RowGroupGenerator {
   using RecordBatchGenerator =
       ::arrow::AsyncGenerator<std::shared_ptr<::arrow::RecordBatch>>;
 
+  struct ReadRequest {
+    ::arrow::Future<RecordBatchGenerator> read;
+    int64_t num_rows;
+  };
+
   explicit RowGroupGenerator(std::shared_ptr<FileReaderImpl> arrow_reader,
                              ::arrow::internal::Executor* cpu_executor,
-                             std::vector<int> row_groups, std::vector<int> column_indices)
+                             std::vector<int> row_groups, std::vector<int> column_indices,
+                             int64_t min_rows_in_flight)
       : arrow_reader_(std::move(arrow_reader)),
         cpu_executor_(cpu_executor),
         row_groups_(std::move(row_groups)),
         column_indices_(std::move(column_indices)),
-        index_(0) {}
+        min_rows_in_flight_(min_rows_in_flight),
+        rows_in_flight_(0),
+        index_(0),
+        readahead_index_(0) {}
 
   ::arrow::Future<RecordBatchGenerator> operator()() {
     if (index_ >= row_groups_.size()) {
       return ::arrow::AsyncGeneratorEnd<RecordBatchGenerator>();
     }
-    int row_group = row_groups_[index_++];
+    index_++;
+    FillReadahead();
+    ReadRequest next = std::move(in_flight_reads_.front());
+    DCHECK(!in_flight_reads_.empty());

Review Comment:
   Seems tests are all failing here. I guess it's because in the header we have `0` as the default value of `min_rows_in_flight`, we should rename the parameter in the header, update the default, and also validate the parameter value/force it to be at least 1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1116788576

   Benchmark runs are scheduled for baseline = f03f090d0c9fd6c85e046e2790c5d443729f6b30 and contender = 24f372297c53049ed6abc3188c8084cf495169cf. 24f372297c53049ed6abc3188c8084cf495169cf is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ff17c0841dca4fd3b667cce2d07784ca...78f3239f53434e95aece324ffc5305e8/)
   [Failed :arrow_down:0.31% :arrow_up:0.04%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/5c3faad45fd8472ca0c21f08440ad2ae...10f23e23227d4698bc2c1d91142aa049/)
   [Finished :arrow_down:0.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/03942feae38e403bae18268c05002a2c...3ae2720ddadc4ee6ba4875bd9097f238/)
   [Finished :arrow_down:0.2% :arrow_up:0.08%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0c5784c6cd034ae9bd81c44ab42138ed...ae89c8a319c84b40a050d5457964617b/)
   Buildkite builds:
   [Finished] [`24f37229` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/656)
   [Finished] [`24f37229` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/654)
   [Finished] [`24f37229` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/644)
   [Finished] [`24f37229` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/660)
   [Finished] [`f03f090d` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/655)
   [Failed] [`f03f090d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/653)
   [Finished] [`f03f090d` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/643)
   [Finished] [`f03f090d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/659)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1107058194

   https://issues.apache.org/jira/browse/ARROW-16294


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1107058208

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12967:
URL: https://github.com/apache/arrow/pull/12967#discussion_r860134358


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -1051,33 +1051,71 @@ class RowGroupGenerator {
   using RecordBatchGenerator =
       ::arrow::AsyncGenerator<std::shared_ptr<::arrow::RecordBatch>>;
 
+  struct ReadRequest {
+    ::arrow::Future<RecordBatchGenerator> read;
+    int64_t num_rows;
+  };
+
   explicit RowGroupGenerator(std::shared_ptr<FileReaderImpl> arrow_reader,
                              ::arrow::internal::Executor* cpu_executor,
-                             std::vector<int> row_groups, std::vector<int> column_indices)
+                             std::vector<int> row_groups, std::vector<int> column_indices,
+                             int64_t min_rows_in_flight)
       : arrow_reader_(std::move(arrow_reader)),
         cpu_executor_(cpu_executor),
         row_groups_(std::move(row_groups)),
         column_indices_(std::move(column_indices)),
-        index_(0) {}
+        min_rows_in_flight_(min_rows_in_flight),
+        rows_in_flight_(0),
+        index_(0),
+        readahead_index_(0) {}
 
   ::arrow::Future<RecordBatchGenerator> operator()() {
     if (index_ >= row_groups_.size()) {
       return ::arrow::AsyncGeneratorEnd<RecordBatchGenerator>();
     }
-    int row_group = row_groups_[index_++];
+    index_++;
+    FillReadahead();
+    ReadRequest next = std::move(in_flight_reads_.front());
+    DCHECK(!in_flight_reads_.empty());
+    in_flight_reads_.pop();
+    rows_in_flight_ -= next.num_rows;
+    return next.read;
+  }
+
+ private:
+  void FillReadahead() {
+    if (min_rows_in_flight_ == 0) {
+      // No readahead, fetch the batch when it is asked for
+      FetchNext();
+    } else {
+      while (readahead_index_ < row_groups_.size() &&
+             rows_in_flight_ < min_rows_in_flight_) {
+        FetchNext();
+      }
+    }
+  }
+
+  void FetchNext() {
+    int row_group_index = readahead_index_++;

Review Comment:
   Seems MSVC is still unhappy here:
   
   ```
   C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): error C2220: warning treated as error - no 'object' file generated
   C:/projects/arrow/cpp/src/parquet/arrow/reader.cc(1099): warning C4267: 'initializing': conversion from 'size_t' to 'int', possible loss of data
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12967:
URL: https://github.com/apache/arrow/pull/12967#discussion_r856742557


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -1113,15 +1146,19 @@ class RowGroupGenerator {
   ::arrow::internal::Executor* cpu_executor_;
   std::vector<int> row_groups_;
   std::vector<int> column_indices_;
+  int64_t min_rows_in_flight_;
+  std::queue<ReadRequest> in_flight_reads_;
+  int64_t rows_in_flight_;
   size_t index_;
+  size_t readahead_index_;
 };
 
 ::arrow::Result<::arrow::AsyncGenerator<std::shared_ptr<::arrow::RecordBatch>>>
 FileReaderImpl::GetRecordBatchGenerator(std::shared_ptr<FileReader> reader,
                                         const std::vector<int> row_group_indices,
                                         const std::vector<int> column_indices,
                                         ::arrow::internal::Executor* cpu_executor,
-                                        int batch_readahead) {
+                                        int rows_to_readahead) {

Review Comment:
   Should we rename this in the header too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1107058202

   :warning: Ticket **has no components in JIRA**, make sure you assign one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1116788669

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/5c3faad45fd8472ca0c21f08440ad2ae...10f23e23227d4698bc2c1d91142aa049/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm closed pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead
URL: https://github.com/apache/arrow/pull/12967


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #12967:
URL: https://github.com/apache/arrow/pull/12967#discussion_r859988079


##########
cpp/src/arrow/compute/exec/source_node.cc:
##########
@@ -15,6 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include <iostream>

Review Comment:
   No.  I've removed it.  Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12967:
URL: https://github.com/apache/arrow/pull/12967#issuecomment-1111349618

   > Looks like there's a lint failure, and MSVC is unhappy as usual:
   
   The dangers of thinking I can fix an MSVC complaint using Github's edit feature :laughing: 
   
   I think this is good now.  I changed the variable itself to int64_t instead of adding static_cast since it is now a "rows" variable and not a "batches" variable.
   
   The behavior for 0 readahead is now "fetch batch when asked for" instead of defaulting to something like 1 row (which probably would have been the same thing anyways but I think this is more clear about our intentions).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12967: ARROW-16294: [C++] Improve performance of parquet readahead

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12967:
URL: https://github.com/apache/arrow/pull/12967#discussion_r859704170


##########
cpp/src/arrow/compute/exec/source_node.cc:
##########
@@ -15,6 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include <iostream>

Review Comment:
   Did you mean to include this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org