Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/08 18:04:22 UTC

[GitHub] [arrow] westonpace commented on issue #34494: [C++] How to handle the limit clause when scanning Parquet files using Scanner?

westonpace commented on issue #34494:
URL: https://github.com/apache/arrow/issues/34494#issuecomment-1460614234

   The scanner API doesn't have any sort of proper cancellation.  A scan must be fully consumed to stop all background work.
   
   However, you can move past the scanner and start using [Declaration](https://arrow.apache.org/docs/cpp/streaming_execution.html) directly (at this point the scanner is basically a front-end for the streaming execution engine).
   
   The plan [created by the scanner](https://github.com/apache/arrow/blob/apache-arrow-11.0.0/cpp/src/arrow/dataset/scanner.cc#L441-L447) is:
   
   ```
   compute::Declaration::Sequence(
   {
     {"scan", ScanNodeOptions{dataset_, scan_options_, sequence_fragments}},
     {"filter", compute::FilterNodeOptions{scan_options_->filter}},
     {"augmented_project",
       // exprs comes from the scan options also
       compute::ProjectNodeOptions{std::move(exprs), std::move(names)}}
     }
   )
   ```
   
   Starting with 12.0.0 (or the latest main) you can simply do:
   
   ```
   compute::Declaration::Sequence(
   {
     {"scan", ScanNodeOptions{dataset_, scan_options_, sequence_fragments}},
     {"filter", compute::FilterNodeOptions{scan_options_->filter}},
     {"project",
       compute::ProjectNodeOptions{std::move(exprs), std::move(names)}},
     // skip read_offset rows, then emit at most read_limit rows
     {"fetch", compute::FetchNodeOptions(read_offset, read_limit)}
   }
   )
   ```
   
   You can use `compute::DeclarationToTable` or `compute::DeclarationToReader` to process the declaration.  The "fetch" node can be used to implement paging (after a filter).  There may also be some paging options added to the scan node directly (for paging before filtering, which is more efficient) but it's not yet clear whether that will make it into 12.0.0.  If you need to work with 11.0.0 or earlier then your options are:
   
    1. Use ExecPlan directly (Declarations use ExecPlan under the hood, but working with it directly adds a bunch of complexity) and use a custom sink node (select_k_sink).  This does pretty much the same thing but in a more complex way.
    2. Use ExecPlan and call StopProducing on the plan once you have gotten enough data.
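
   To make the difference between paging after a filter and paging before it concrete, here is a rough stdlib-only sketch.  It is not Arrow code: `Rows`, `Filter`, `Fetch`, `FilterThenFetch` and `FetchThenFilter` are hypothetical names, and `Fetch` only models the assumption that a fetch step skips `offset` rows and then keeps at most `limit` rows.

   ```cpp
   #include <cstddef>
   #include <functional>
   #include <vector>

   // Hypothetical stand-in for a stream of rows; not an Arrow type.
   using Rows = std::vector<int>;

   // Models a "filter" step: keep only the rows that satisfy the predicate.
   Rows Filter(const Rows& in, const std::function<bool(int)>& pred) {
     Rows out;
     for (int v : in) {
       if (pred(v)) out.push_back(v);
     }
     return out;
   }

   // Models a "fetch" step (assumed semantics): skip `offset` rows, then
   // keep at most `limit` rows.
   Rows Fetch(const Rows& in, std::size_t offset, std::size_t limit) {
     Rows out;
     for (std::size_t i = offset; i < in.size() && out.size() < limit; ++i) {
       out.push_back(in[i]);
     }
     return out;
   }

   // Paging after the filter: offset and limit count rows that passed the filter.
   Rows FilterThenFetch(const Rows& in, const std::function<bool(int)>& pred,
                        std::size_t offset, std::size_t limit) {
     return Fetch(Filter(in, pred), offset, limit);
   }

   // Paging before the filter: offset and limit count raw rows, so fewer rows
   // reach the filter, but a page may come back with fewer matches.
   Rows FetchThenFilter(const Rows& in, const std::function<bool(int)>& pred,
                        std::size_t offset, std::size_t limit) {
     return Filter(Fetch(in, offset, limit), pred);
   }
   ```

   With the rows 1 through 10, an "is even" predicate, an offset of 1 and a limit of 2, `FilterThenFetch` returns 4 and 6 while `FetchThenFilter` returns only 2.  The two orderings are not interchangeable: paging before the filter does less work, but the offset and limit then refer to unfiltered rows.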
   

