You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/15 19:37:51 UTC

[GitHub] [arrow] lidavidm opened a new pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

lidavidm opened a new pull request #10060:
URL: https://github.com/apache/arrow/pull/10060


   This implements a CountRows method for scanner. It will ask the fragment if it can count rows using only metadata, and otherwise project away columns and count the resulting rows.
   
   Originally, I thought we did not need a special optimization for the metadata-only case, because the Parquet reader will skip I/O and fabricate empty batches if you ask it to read no columns. However, in benchmarking, the overhead of the rest of the pipeline was still significant and so I implemented the optimization after all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r623122752



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2838,6 +2838,13 @@ cdef class Scanner(_Weakrefable):
             result = self.scanner.Head(num_rows)
         return pyarrow_wrap_table(GetResultValue(result))
 
+    def count_rows(self):
+        """Count rows matching the scanner filter."""

Review comment:
       I was going to comment that adding a similar function to `Fragment` might also be nice, but I assume the user can now just do `fragment.scanner().count_rows()` instead of `fragment.count_rows()` ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832859169


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 95194db6858b40b3dc586c2f41c3451dd7d42273. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...7c0bd37156f24266bf8efc504de21419/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...4ffd165775264dae83f48b4d4d2db430/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...f09d1666437646628b1fad4677dbabad/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...2bf22edba9c74a0e83768fe546feb33c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-821323392


   Note this is somewhat of a regression for CSV files/if you call dim.Dataset in R as now we'll have to scan files instead of just immediately returning NA. We do have some options:
   - We could add an option to just fail if a "cheap" count can't be performed, so that R could fall back to reporting just NA.
   - We could optimize the CSV case like the IPC and Parquet ones. This should be possible when `newlines_in_values` is not set and needs some consideration for `ignore_empty_lines`. This may or may not not actually be all that much cheaper than loading the data. 
   
   Also, I need to refactor this to pass around a ScanOptions instead of an IOContext.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-821321573


   And a quick comparison (in Python) for IPC files:
   
   ```
   # Without special CountRows method
   count_rows
   43.33s 43.32s 43.24s 43.41s 43.43s 
   Mean  : 43.35s
   Median: 43.33s
   count_rows (parallel)
   16.21s 16.22s 16.25s 16.17s 16.19s 
   Mean  : 16.21s
   Median: 16.21s
   
   # With special CountRows method
   count_rows
   16.96s 6.90s 7.23s 7.28s 7.28s 
   Mean  : 9.13s
   Median: 7.28s
   count_rows (parallel)
   7.32s 7.20s 7.30s 7.26s 7.32s 
   Mean  : 7.28s
   Median: 7.30s
   
   # Baseline for comparison
   count_fragments: sum(fragment.to_table(columns=[]).num_rows for fragment in ds.get_fragments())
   47.26s 44.03s 44.00s 42.70s 43.18s 
   Mean  : 44.23s
   Median: 44.00s
   count_files: sum(pyarrow.feather.read_table(filename, columns=[]).num_rows for filename in ds.files)
   19.05s 19.22s 19.95s 19.82s 19.91s 
   Mean  : 19.59s
   Median: 19.82s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832859169


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 95194db6858b40b3dc586c2f41c3451dd7d42273. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...7c0bd37156f24266bf8efc504de21419/)
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...4ffd165775264dae83f48b4d4d2db430/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...f09d1666437646628b1fad4677dbabad/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...2bf22edba9c74a0e83768fe546feb33c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626579413



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +385,23 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+util::optional<Future<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       It works for some predicates - so now it's implemented. It'll naturally get better as the expression machinery improves.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832899252


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 06c105461c94db9aaca4958856463c2daa7b580d. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...af6be8f6c15d494586aa8d0b2e8e774d/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...fdc35f38d7b14c3abec2b6d465c30182/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...07187315259547f0b0f73fff64067fc4/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...44ebe703a1054a60939d4ab7968128cf/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832851021


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 8b03ba2e1473f653513a24ee43a8b3b367b437b6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...799368aecbc24dd68a47cfc7e575114d/)
   [Finished :arrow_down:66.67% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...75d88ba5359045c8a897183e5302cb8f/)
   [Finished :arrow_down:5.8% :arrow_up:1.45%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...3a5d5279216540358fce4e0e1388048b/)
   [Finished :arrow_down:6.75% :arrow_up:6.55%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...4097bacbd70c4cb78229e3585d2dd39c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832859169


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 95194db6858b40b3dc586c2f41c3451dd7d42273. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...7c0bd37156f24266bf8efc504de21419/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...4ffd165775264dae83f48b4d4d2db430/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...f09d1666437646628b1fad4677dbabad/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...2bf22edba9c74a0e83768fe546feb33c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626533245



##########
File path: cpp/src/arrow/dataset/scanner.cc
##########
@@ -488,10 +507,37 @@ Result<AsyncGenerator<EnumeratedRecordBatchGenerator>> FragmentsToBatches(
   return MakeMappedGenerator<EnumeratedRecordBatchGenerator>(
       std::move(enumerated_fragment_gen),
       [scanner](const Enumerated<std::shared_ptr<Fragment>>& fragment) {
-        return FragmentToBatches(scanner, fragment);
+        return FragmentToBatches(scanner, fragment, scanner->options());
       });
 }
 
+Result<AsyncGenerator<AsyncGenerator<util::optional<int64_t>>>> FragmentsToRowCount(
+    std::shared_ptr<AsyncScanner> scanner, FragmentGenerator fragment_gen) {
+  // Must use optional<int64_t> to avoid breaking the pipeline on empty batches
+  auto enumerated_fragment_gen = MakeEnumeratedGenerator(std::move(fragment_gen));
+  auto options = std::make_shared<ScanOptions>(*scanner->options());
+  RETURN_NOT_OK(SetProjection(options.get(), std::vector<std::string>()));
+  auto count_fragment_fn =
+      [scanner, options](const Enumerated<std::shared_ptr<Fragment>>& fragment)
+      -> Result<AsyncGenerator<util::optional<int64_t>>> {
+    auto count = fragment.value->CountRows(options->filter, options);
+    // Fast path
+    if (count.has_value()) {
+      return MakeSingleFutureGenerator(count.value().Then(
+          [](int64_t val) { return util::make_optional<int64_t>(val); }));

Review comment:
       I think this relates to ARROW-12655




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832859169


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 95194db6858b40b3dc586c2f41c3451dd7d42273. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...7c0bd37156f24266bf8efc504de21419/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...4ffd165775264dae83f48b4d4d2db430/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...f09d1666437646628b1fad4677dbabad/)
   [Finished :arrow_down:7.37% :arrow_up:7.27%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...2bf22edba9c74a0e83768fe546feb33c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r625986262



##########
File path: cpp/src/arrow/dataset/scanner.cc
##########
@@ -488,10 +507,37 @@ Result<AsyncGenerator<EnumeratedRecordBatchGenerator>> FragmentsToBatches(
   return MakeMappedGenerator<EnumeratedRecordBatchGenerator>(
       std::move(enumerated_fragment_gen),
       [scanner](const Enumerated<std::shared_ptr<Fragment>>& fragment) {
-        return FragmentToBatches(scanner, fragment);
+        return FragmentToBatches(scanner, fragment, scanner->options());
       });
 }
 
+Result<AsyncGenerator<AsyncGenerator<util::optional<int64_t>>>> FragmentsToRowCount(
+    std::shared_ptr<AsyncScanner> scanner, FragmentGenerator fragment_gen) {
+  // Must use optional<int64_t> to avoid breaking the pipeline on empty batches
+  auto enumerated_fragment_gen = MakeEnumeratedGenerator(std::move(fragment_gen));
+  auto options = std::make_shared<ScanOptions>(*scanner->options());
+  RETURN_NOT_OK(SetProjection(options.get(), std::vector<std::string>()));
+  auto count_fragment_fn =
+      [scanner, options](const Enumerated<std::shared_ptr<Fragment>>& fragment)
+      -> Result<AsyncGenerator<util::optional<int64_t>>> {
+    auto count = fragment.value->CountRows(options->filter, options);
+    // Fast path
+    if (count.has_value()) {
+      return MakeSingleFutureGenerator(count.value().Then(
+          [](int64_t val) { return util::make_optional<int64_t>(val); }));

Review comment:
       ```suggestion
         return MakeSingleFutureGenerator(count->Then(util::make_optional<int64_t>));
   ```

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -169,6 +174,15 @@ Result<RecordBatchGenerator> FileFragment::ScanBatchesAsync(
   return format_->ScanBatchesAsync(options, self);
 }
 
+util::optional<Future<int64_t>> FileFragment::CountRows(
+    Expression predicate, std::shared_ptr<ScanOptions> options) {
+  ARROW_ASSIGN_OR_RAISE(
+      predicate, SimplifyWithGuarantee(std::move(predicate), partition_expression_));
+  if (!predicate.IsSatisfiable()) return Future<int64_t>::MakeFinished(0);
+  auto self = internal::checked_pointer_cast<FileFragment>(shared_from_this());
+  return format()->CountRows(self, predicate, std::move(options));

Review comment:
       ```suggestion
     return format()->CountRows(self, std::move(predicate), std::move(options));
   ```

##########
File path: cpp/src/arrow/dataset/file_ipc.cc
##########
@@ -173,6 +173,25 @@ Result<ScanTaskIterator> IpcFileFormat::ScanFile(
   return IpcScanTaskIterator::Make(options, fragment);
 }
 
+util::optional<Future<int64_t>> IpcFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  ARROW_ASSIGN_OR_RAISE(predicate, SimplifyWithGuarantee(std::move(predicate),
+                                                         file->partition_expression()));
+  if (!predicate.IsSatisfiable()) {
+    return Future<int64_t>::MakeFinished(0);
+  } else if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       Nit: redundant `else`
   ```suggestion
     }
     
     if (FieldsInExpression(predicate).size() > 0) {
   ```

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +385,23 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+util::optional<Future<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       Would it be worthwhile to try pushing the predicate down to row groups here? If every row group happens to pass or fail the filter whole, we could recover the metadata-only fast path (bailing if any row group requires non trivial filtering)

##########
File path: cpp/src/arrow/dataset/file_base.cc
##########
@@ -169,6 +174,15 @@ Result<RecordBatchGenerator> FileFragment::ScanBatchesAsync(
   return format_->ScanBatchesAsync(options, self);
 }
 
+util::optional<Future<int64_t>> FileFragment::CountRows(
+    Expression predicate, std::shared_ptr<ScanOptions> options) {
+  ARROW_ASSIGN_OR_RAISE(
+      predicate, SimplifyWithGuarantee(std::move(predicate), partition_expression_));
+  if (!predicate.IsSatisfiable()) return Future<int64_t>::MakeFinished(0);

Review comment:
       Nit: I think this is long enough to warrant braces
   ```suggestion
     if (!predicate.IsSatisfiable()) {
       return Future<int64_t>::MakeFinished(0);
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832851021


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 8b03ba2e1473f653513a24ee43a8b3b367b437b6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...799368aecbc24dd68a47cfc7e575114d/)
   [Finished :arrow_down:66.67% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...75d88ba5359045c8a897183e5302cb8f/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...3a5d5279216540358fce4e0e1388048b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...4097bacbd70c4cb78229e3585d2dd39c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832708280


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 379f4ead1c31ad8451f07105b10525dcf36919ac. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...25b12bc74ddc40d19ae22b9f870260a9/)
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...52ddabc8230f4df08a7d988f0dc04522/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...0a3dfe11d1e047e5933797ee32602bae/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...e002106cc5594c2783d5353ad0dfa241/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832851021


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 8b03ba2e1473f653513a24ee43a8b3b367b437b6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...799368aecbc24dd68a47cfc7e575114d/)
   [Finished :arrow_down:66.67% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...75d88ba5359045c8a897183e5302cb8f/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...3a5d5279216540358fce4e0e1388048b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...4097bacbd70c4cb78229e3585d2dd39c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626788257



##########
File path: cpp/src/arrow/dataset/file_parquet_test.cc
##########
@@ -245,13 +245,41 @@ TEST_F(TestParquetFileFormat, CountRowsPredicatePushdown) {
     ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(expected),
                               fragment->CountRows(predicate, options));
 
-    // N.B. SimplifyWithGuarantee can't handle simplifying (i64 == 1) against (i64 <= 1 &
-    // i64 >= 1) right now, but this works
     predicate = and_(less_equal(field_ref("i64"), literal(i)),
                      greater_equal(field_ref("i64"), literal(i)));
     ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
     ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(i),
                               fragment->CountRows(predicate, options));
+
+    predicate = equal(field_ref("i64"), literal(i));
+    ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(i),
+                              fragment->CountRows(predicate, options));
+  }
+
+  // Ensure nulls are properly handled
+  {
+    auto dataset_schema = schema({field("i64", int64())});
+    auto null_batch = RecordBatchFromJSON(dataset_schema, R"([
+[null],
+[null],
+[null]
+])");
+    auto batch = RecordBatchFromJSON(dataset_schema, R"([
+[1],
+[2]
+])");
+    ASSERT_OK_AND_ASSIGN(auto reader,
+                         RecordBatchReader::Make({null_batch, batch}, dataset_schema));
+    auto source = GetFileSource(reader.get());
+    auto fragment = MakeFragment(*source);
+    ASSERT_OK_AND_ASSIGN(
+        auto predicate,
+        greater_equal(field_ref("i64"), literal(1)).Bind(*dataset_schema));
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(2),
+                              fragment->CountRows(predicate, options));
+    // N.B. SimplifyWithGuarantee can't handle not(is_null) so trying to count with
+    // is_null doesn't work

Review comment:
       I filed ARROW-12659.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832708280


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 379f4ead1c31ad8451f07105b10525dcf36919ac. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...25b12bc74ddc40d19ae22b9f870260a9/)
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...52ddabc8230f4df08a7d988f0dc04522/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...0a3dfe11d1e047e5933797ee32602bae/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...e002106cc5594c2783d5353ad0dfa241/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832851021


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 8b03ba2e1473f653513a24ee43a8b3b367b437b6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...799368aecbc24dd68a47cfc7e575114d/)
   [Finished :arrow_down:66.67% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...75d88ba5359045c8a897183e5302cb8f/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...3a5d5279216540358fce4e0e1388048b/)
   [Finished :arrow_down:6.75% :arrow_up:6.55%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...4097bacbd70c4cb78229e3585d2dd39c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] nealrichardson commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

nealrichardson commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-828836251


   > Note this is somewhat of a regression for CSV files/if you call dim.Dataset in R as now we'll have to scan files instead of just immediately returning NA.
   
   I don't consider this a regression: you're going from "not implemented" to "implemented but not fast". 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832859169


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 95194db6858b40b3dc586c2f41c3451dd7d42273. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...7c0bd37156f24266bf8efc504de21419/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...4ffd165775264dae83f48b4d4d2db430/)
   [Finished :arrow_down:4.35% :arrow_up:2.9%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...f09d1666437646628b1fad4677dbabad/)
   [Finished :arrow_down:7.37% :arrow_up:7.27%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...2bf22edba9c74a0e83768fe546feb33c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832899252


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 06c105461c94db9aaca4958856463c2daa7b580d. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...af6be8f6c15d494586aa8d0b2e8e774d/)
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...fdc35f38d7b14c3abec2b6d465c30182/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...07187315259547f0b0f73fff64067fc4/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...44ebe703a1054a60939d4ab7968128cf/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-820683818


   https://issues.apache.org/jira/browse/ARROW-9697


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832892652


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = a38d7f0f4bb5269798c3fdb7625afc094a1389e5. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...dadc005b5b3341128837b7be5b8523b5/)
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...45aa8cd56b7f477aad1a9f5e77d41ceb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...a5125417389a48df8cab0b25fc0dd349/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...70519292edc7418398da80728454a343/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626785561



##########
File path: cpp/src/arrow/dataset/file_parquet_test.cc
##########
@@ -245,13 +245,41 @@ TEST_F(TestParquetFileFormat, CountRowsPredicatePushdown) {
     ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(expected),
                               fragment->CountRows(predicate, options));
 
-    // N.B. SimplifyWithGuarantee can't handle simplifying (i64 == 1) against (i64 <= 1 &
-    // i64 >= 1) right now, but this works
     predicate = and_(less_equal(field_ref("i64"), literal(i)),
                      greater_equal(field_ref("i64"), literal(i)));
     ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
     ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(i),
                               fragment->CountRows(predicate, options));
+
+    predicate = equal(field_ref("i64"), literal(i));
+    ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(i),
+                              fragment->CountRows(predicate, options));
+  }
+
+  // Ensure nulls are properly handled
+  {
+    auto dataset_schema = schema({field("i64", int64())});
+    auto null_batch = RecordBatchFromJSON(dataset_schema, R"([
+[null],
+[null],
+[null]
+])");
+    auto batch = RecordBatchFromJSON(dataset_schema, R"([
+[1],
+[2]
+])");
+    ASSERT_OK_AND_ASSIGN(auto reader,
+                         RecordBatchReader::Make({null_batch, batch}, dataset_schema));
+    auto source = GetFileSource(reader.get());
+    auto fragment = MakeFragment(*source);
+    ASSERT_OK_AND_ASSIGN(
+        auto predicate,
+        greater_equal(field_ref("i64"), literal(1)).Bind(*dataset_schema));
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(2),
+                              fragment->CountRows(predicate, options));
+    // N.B. SimplifyWithGuarantee can't handle not(is_null) so trying to count with
+    // is_null doesn't work

Review comment:
       Interesting, could you add a follow up for supporting this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626623312



##########
File path: cpp/src/arrow/dataset/file_ipc.cc
##########
@@ -173,6 +173,20 @@ Result<ScanTaskIterator> IpcFileFormat::ScanFile(
   return IpcScanTaskIterator::Make(options, fragment);
 }
 
+Future<util::optional<int64_t>> IpcFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       Nit: let's add a `ExpressionHasNoFieldRefs()` or `FieldsInExpressionCount()` function so we're not constructing a vector of strings just to check it for emptiness

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +386,55 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+Future<util::optional<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+
+  auto evaluate = [=]() -> Result<util::optional<int64_t>> {
+    if (FieldsInExpression(predicate).size() > 0) {
+      ARROW_ASSIGN_OR_RAISE(auto expressions,
+                            parquet_file->TestRowGroups(std::move(predicate)));
+      int64_t rows = 0;
+      for (size_t i = 0; i < parquet_file->row_groups_->size(); i++) {
+        // Row group can definitely be eliminated
+        if (!expressions[i].IsSatisfiable()) continue;
+        // Unknown, bail out of fast path
+        if (FieldsInExpression(expressions[i]).size() > 0) return util::nullopt;
+
+        compute::ExecContext exec_context{options->pool};
+        ARROW_ASSIGN_OR_RAISE(
+            Datum mask, ExecuteScalarExpression(expressions[i], Datum(), &exec_context));
+        // Evaluated to something other than a scalar
+        if (!mask.is_scalar()) return util::nullopt;
+        const auto& mask_scalar = mask.scalar_as<BooleanScalar>();
+        // Evaluated to something other than a Boolean
+        if (!mask_scalar.is_valid) return util::nullopt;
+        // Row group can definitely be eliminated
+        if (!mask_scalar.value) continue;

Review comment:
       `TestRowGroups()` should return expressions which are as constant-folded as possible, and `!IsSatisfiable()` will catch `literal(false)` and (should, anyway) anything which is guaranteed to evaluate to null, so I think this can be simplified to:
   ```suggestion
           // Unless the row group is entirely included, bail out of fast path
           if (expressions[i] != literal(true)) return util::nullopt;
   ```

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +386,55 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+Future<util::optional<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+
+  auto evaluate = [=]() -> Result<util::optional<int64_t>> {

Review comment:
       Nit: long lambdas are less readable than a free function




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832899252


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 06c105461c94db9aaca4958856463c2daa7b580d. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...af6be8f6c15d494586aa8d0b2e8e774d/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...fdc35f38d7b14c3abec2b6d465c30182/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...07187315259547f0b0f73fff64067fc4/)
   [Finished :arrow_down:6.91% :arrow_up:8.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...44ebe703a1054a60939d4ab7968128cf/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626088532



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +385,23 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+util::optional<Future<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       It should be doable. However, it'll require rethinking the API a bit: currently, it assumes we know up front whether we can use the fast path. But to push down the predicate, we may need to load the footer, so it'll have to go from optional<Future<int>> to Future<optional<int>>. I think that shouldn't penalize the slow path really though since either the Future can be constructed right away, or it'll do work that we need to do later in the scan anyways (namely load the footer).

##########
File path: cpp/src/arrow/dataset/scanner.cc
##########
@@ -488,10 +507,37 @@ Result<AsyncGenerator<EnumeratedRecordBatchGenerator>> FragmentsToBatches(
   return MakeMappedGenerator<EnumeratedRecordBatchGenerator>(
       std::move(enumerated_fragment_gen),
       [scanner](const Enumerated<std::shared_ptr<Fragment>>& fragment) {
-        return FragmentToBatches(scanner, fragment);
+        return FragmentToBatches(scanner, fragment, scanner->options());
       });
 }
 
+Result<AsyncGenerator<AsyncGenerator<util::optional<int64_t>>>> FragmentsToRowCount(
+    std::shared_ptr<AsyncScanner> scanner, FragmentGenerator fragment_gen) {
+  // Must use optional<int64_t> to avoid breaking the pipeline on empty batches
+  auto enumerated_fragment_gen = MakeEnumeratedGenerator(std::move(fragment_gen));
+  auto options = std::make_shared<ScanOptions>(*scanner->options());
+  RETURN_NOT_OK(SetProjection(options.get(), std::vector<std::string>()));
+  auto count_fragment_fn =
+      [scanner, options](const Enumerated<std::shared_ptr<Fragment>>& fragment)
+      -> Result<AsyncGenerator<util::optional<int64_t>>> {
+    auto count = fragment.value->CountRows(options->filter, options);
+    // Fast path
+    if (count.has_value()) {
+      return MakeSingleFutureGenerator(count.value().Then(
+          [](int64_t val) { return util::make_optional<int64_t>(val); }));

Review comment:
       Unfortunately GCC isn't happy with this one:
   
   ```
   In file included from /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/util/async_generator.h:25,
                    from /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/dataset/scanner.h:36,
                    from /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/dataset/scanner.cc:18:
   /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/util/future.h: In substitution of 'template<class Signature> using ForSignature = arrow::detail::ContinueFuture::ForReturn<typename std::result_of<_Signature>::type> [with Signature = nonstd::optional_lite::optional<long int> (&(const long int&))(long int&&)]':
   /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/util/future.h:491:13:   required from here
   /home/lidavidm/Code/upstream/arrow-9697/cpp/src/arrow/util/future.h:67:9: error: no type named 'type' in 'class std::result_of<nonstd::optional_lite::optional<long int> (&(const long int&))(long int&&)>'
      67 |   using ForSignature = ForReturn<result_of_t<Signature>>;
         |         ^~~~~~~~~~~~
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832892652


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = a38d7f0f4bb5269798c3fdb7625afc094a1389e5. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...dadc005b5b3341128837b7be5b8523b5/)
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...45aa8cd56b7f477aad1a9f5e77d41ceb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...a5125417389a48df8cab0b25fc0dd349/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...70519292edc7418398da80728454a343/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832707848


   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832892652


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = a38d7f0f4bb5269798c3fdb7625afc094a1389e5. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...dadc005b5b3341128837b7be5b8523b5/)
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...45aa8cd56b7f477aad1a9f5e77d41ceb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...a5125417389a48df8cab0b25fc0dd349/)
   [Finished :arrow_down:7.78% :arrow_up:7.37%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...70519292edc7418398da80728454a343/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626571972



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -385,6 +385,23 @@ Result<ScanTaskIterator> ParquetFileFormat::ScanFile(
   return MakeVectorIterator(std::move(tasks));
 }
 
+util::optional<Future<int64_t>> ParquetFileFormat::CountRows(
+    const std::shared_ptr<FileFragment>& file, Expression predicate,
+    std::shared_ptr<ScanOptions> options) {
+  auto parquet_file = internal::checked_pointer_cast<ParquetFileFragment>(file);
+  if (FieldsInExpression(predicate).size() > 0) {

Review comment:
       Ah, it turns out it doesn't work because right now, we can only definitively reject a row group, but can't decide whether a row group should definitively be included in the row count, because `SimplifyWithGuarantee((i64 == 1), (i64 <= 1) & (i64 >= 1))` simplifies to `i64 == 1`, not to `true`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-820684070


   R Benchmark PR (because R had a special case for this before): https://github.com/ursacomputing/arrowbench/pull/24
   Note I still need to plumb that through into Conbench.
   
   Results:
   ```
   Before:
       process      real start_mem_bytes end_mem_bytes max_mem_bytes gc_level0 gc_level1 gc_level2
   1 0.4612792 0.4754918      2669531136    8582832128   24671309824         2         0         0
   
   After:
      process     real start_mem_bytes end_mem_bytes max_mem_bytes gc_level0 gc_level1 gc_level2
   1 1.571199 0.204026      2480492544    3325677568    3325677568         0         0         0
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832851021


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 8b03ba2e1473f653513a24ee43a8b3b367b437b6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...799368aecbc24dd68a47cfc7e575114d/)
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...75d88ba5359045c8a897183e5302cb8f/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...3a5d5279216540358fce4e0e1388048b/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...4097bacbd70c4cb78229e3585d2dd39c/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626783124



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -592,18 +625,34 @@ Result<std::vector<int>> ParquetFileFragment::FilterRowGroups(
     }
   }
 
-  std::vector<int> row_groups;
+  std::vector<compute::Expression> row_groups(row_groups_->size());
   for (size_t i = 0; i < row_groups_->size(); ++i) {
     ARROW_ASSIGN_OR_RAISE(auto row_group_predicate,
                           SimplifyWithGuarantee(predicate, statistics_expressions_[i]));
-    if (row_group_predicate.IsSatisfiable()) {
-      row_groups.push_back(row_groups_->at(i));
-    }
+    row_groups[i] = std::move(row_group_predicate);
   }
-
   return row_groups;
 }
 
+Result<util::optional<int64_t>> ParquetFileFragment::TryCountRows(
+    compute::Expression predicate) {
+  DCHECK_NE(metadata_, nullptr);
+  if (ExpressionHasFieldRefs(predicate)) {
+    ARROW_ASSIGN_OR_RAISE(auto expressions, TestRowGroups(std::move(predicate)));
+    int64_t rows = 0;
+    for (size_t i = 0; i < row_groups_->size(); i++) {
+      // Unless the row group is entirely included, bail out of fast path
+      if (expressions[i] == compute::literal(false)) continue;

Review comment:
       Got it, thanks. I added this case and the case above. However, you still can't count rows with a predicate like `is_null(i64)` because there's no information about columns being non-null. (And even if it's added, the expression simplification can't handle it because it extracts a table of what fields _are_ what values, but not what values fields _are not_.)
   
   I also fixed a small bug: in column statistics, the total number of rows is `num_values() + null_count()` (so emit `is_null` if `null_count() > 0 && num_values == 0`).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832708280


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 379f4ead1c31ad8451f07105b10525dcf36919ac. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...25b12bc74ddc40d19ae22b9f870260a9/)
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...52ddabc8230f4df08a7d988f0dc04522/)
   [Finished :arrow_down:5.8% :arrow_up:2.17%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...0a3dfe11d1e047e5933797ee32602bae/)
   [Finished :arrow_down:2.73% :arrow_up:2.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...e002106cc5594c2783d5353ad0dfa241/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832899252


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 06c105461c94db9aaca4958856463c2daa7b580d. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...af6be8f6c15d494586aa8d0b2e8e774d/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...fdc35f38d7b14c3abec2b6d465c30182/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...07187315259547f0b0f73fff64067fc4/)
   [Finished :arrow_down:6.91% :arrow_up:8.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...44ebe703a1054a60939d4ab7968128cf/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

bkietz commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r626746016



##########
File path: cpp/src/arrow/dataset/file_parquet_test.cc
##########
@@ -220,6 +220,41 @@ TEST_F(TestParquetFileFormat, WriteRecordBatchReaderCustomOptions) {
                     *actual_schema);
 }
 
+TEST_F(TestParquetFileFormat, CountRows) { TestCountRows(); }
+
+TEST_F(TestParquetFileFormat, CountRowsPredicatePushdown) {
+  constexpr int64_t kNumRowGroups = 16;
+  constexpr int64_t kTotalNumRows = kNumRowGroups * (kNumRowGroups + 1) / 2;
+
+  // See PredicatePushdown test below for a description of the generated data
+  auto reader = ArithmeticDatasetFixture::GetRecordBatchReader(kNumRowGroups);
+  auto source = GetFileSource(reader.get());
+  auto options = std::make_shared<ScanOptions>();
+
+  auto fragment = MakeFragment(*source);
+
+  ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(kTotalNumRows),
+                            fragment->CountRows(literal(true), options));
+
+  for (int i = 1; i <= kNumRowGroups; i++) {
+    SCOPED_TRACE(i);
+    // The row group for which all values in column i64 == i has i rows
+    auto predicate = less_equal(field_ref("i64"), literal(i));
+    ASSERT_OK_AND_ASSIGN(predicate, predicate.Bind(*reader->schema()));
+    auto expected = i * (i + 1) / 2;
+    ASSERT_FINISHES_OK_AND_EQ(util::make_optional<int64_t>(expected),
+                              fragment->CountRows(predicate, options));
+
+    // N.B. SimplifyWithGuarantee can't handle simplifying (i64 == 1) against (i64 <= 1 &
+    // i64 >= 1) right now, but this works

Review comment:
       Instead of catching a degenerate case like this, it seems more reasonable to detect `min->Equals(max)` in ColumnChunkStatisticsAsExpression and emit `equal(field_expr, min)` instead of `and_(greater_equal(field_expr, min), less_equal(field_expr, max))`

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -592,18 +625,34 @@ Result<std::vector<int>> ParquetFileFragment::FilterRowGroups(
     }
   }
 
-  std::vector<int> row_groups;
+  std::vector<compute::Expression> row_groups(row_groups_->size());
   for (size_t i = 0; i < row_groups_->size(); ++i) {
     ARROW_ASSIGN_OR_RAISE(auto row_group_predicate,
                           SimplifyWithGuarantee(predicate, statistics_expressions_[i]));
-    if (row_group_predicate.IsSatisfiable()) {
-      row_groups.push_back(row_groups_->at(i));
-    }
+    row_groups[i] = std::move(row_group_predicate);
   }
-
   return row_groups;
 }
 
+Result<util::optional<int64_t>> ParquetFileFragment::TryCountRows(
+    compute::Expression predicate) {
+  DCHECK_NE(metadata_, nullptr);
+  if (ExpressionHasFieldRefs(predicate)) {
+    ARROW_ASSIGN_OR_RAISE(auto expressions, TestRowGroups(std::move(predicate)));
+    int64_t rows = 0;
+    for (size_t i = 0; i < row_groups_->size(); i++) {
+      // Unless the row group is entirely included, bail out of fast path
+      if (expressions[i] == compute::literal(false)) continue;

Review comment:
       `expression[i]` *could* be simplified to a literal null, for example if `i64` happened to be null throughout a row group. This'd be a good unit test too
   ```suggestion
         // If a row group is entirely excluded, exclude its rows from the count
         if (!expressions[i].IsSatisfiable()) continue;
         // Unless the row group is entirely included, bail out of fast path
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] bkietz closed pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

bkietz closed pull request #10060:
URL: https://github.com/apache/arrow/pull/10060


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832892652


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = a38d7f0f4bb5269798c3fdb7625afc094a1389e5. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...dadc005b5b3341128837b7be5b8523b5/)
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...45aa8cd56b7f477aad1a9f5e77d41ceb/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...a5125417389a48df8cab0b25fc0dd349/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...70519292edc7418398da80728454a343/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832708280


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 379f4ead1c31ad8451f07105b10525dcf36919ac. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...25b12bc74ddc40d19ae22b9f870260a9/)
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...52ddabc8230f4df08a7d988f0dc04522/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...0a3dfe11d1e047e5933797ee32602bae/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...e002106cc5594c2783d5353ad0dfa241/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-829246035


   I split off the CSV work and filed ARROW-12598 as a follow-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832892652


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = a38d7f0f4bb5269798c3fdb7625afc094a1389e5. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...dadc005b5b3341128837b7be5b8523b5/)
   [Finished :arrow_down:33.33% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...45aa8cd56b7f477aad1a9f5e77d41ceb/)
   [Failed :arrow_down:4.35% :arrow_up:2.17%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...a5125417389a48df8cab0b25fc0dd349/)
   [Finished :arrow_down:7.78% :arrow_up:7.37%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...70519292edc7418398da80728454a343/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832899252


   Benchmark runs are scheduled for baseline = f4dc0560964bd205ee06002411f97b402e30e5a2 and contender = 06c105461c94db9aaca4958856463c2daa7b580d. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/a03d5f88d22c427f9875cdecbeed6d26...af6be8f6c15d494586aa8d0b2e8e774d/)
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4f0da124e5ec4d1e88b0177169f7b764...fdc35f38d7b14c3abec2b6d465c30182/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/d2e3042249804201b16fe946714cfd31...07187315259547f0b0f73fff64067fc4/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2bd460c5a2d24ccb9882ffcc6a1f8402...44ebe703a1054a60939d4ab7968128cf/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on a change in pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on a change in pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#discussion_r623139878



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2838,6 +2838,13 @@ cdef class Scanner(_Weakrefable):
             result = self.scanner.Head(num_rows)
         return pyarrow_wrap_table(GetResultValue(result))
 
+    def count_rows(self):
+        """Count rows matching the scanner filter."""

Review comment:
       They can, yes. However I could add the helper function to Fragment for consistency (and I think there are a number of other helper functions that I could also add).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-825622744


   Note that for CSV, this will need a bugfix from ARROW-12500 as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot edited a comment on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

ursabot edited a comment on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-832708280


   Benchmark runs are scheduled for baseline = a73fb8377e9f9e8e15b54e8d00d955ecf5d0334a and contender = 379f4ead1c31ad8451f07105b10525dcf36919ac. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:33.33%] [ec2-t3-large-us-east-2](https://conbench.ursa.dev/compare/runs/45a87097a5874d5794d95bc0d51619dd...25b12bc74ddc40d19ae22b9f870260a9/)
   [Finished :arrow_down:33.33% :arrow_up:33.33%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/08be8f963d754ca395aca799a8e43afd...52ddabc8230f4df08a7d988f0dc04522/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/96b19279fded48e3ade581e8ca487c04...0a3dfe11d1e047e5933797ee32602bae/)
   [Finished :arrow_down:2.73% :arrow_up:2.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b082b36a2daa46bebbc6433c880264af...e002106cc5594c2783d5353ad0dfa241/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] lidavidm commented on pull request #10060: ARROW-9697: [C++][Python][R][Dataset] Add CountRows for Scanner

Posted by GitBox <gi...@apache.org>.

lidavidm commented on pull request #10060:
URL: https://github.com/apache/arrow/pull/10060#issuecomment-828839144


   > > Note this is somewhat of a regression for CSV files/if you call dim.Dataset in R as now we'll have to scan files instead of just immediately returning NA.
   > 
   > I don't consider this a regression: you're going from "not implemented" to "implemented but not fast".
   
   Fair point. I was worried the slowness would be unexpected, if it's expected to be a cheap operation in R. I can split out the CSV optimization side and clean this up if we want to get this in sooner (& to keep the size of this down).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org