Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/30 02:39:41 UTC

[GitHub] [arrow] westonpace commented on pull request #13426: [DRAFT] ARROW-16894: [C++] Add Benchmarks for Asof Join Node

westonpace commented on PR #13426:
URL: https://github.com/apache/arrow/pull/13426#issuecomment-1170685688

   Some initial questions and I'll take a more detailed look soon.
   
   > Here is a very primitive version of our Asof Join Benchmarks (asof_join_benchmark.cc). Our main goal is to benchmark on four qualities: the effect of table density (the frequency of rows, e.g a row every 2s as opposed to every 1h over some time range), table width (# of columns), tids (# of keys), and multi-table joins. We also have a baseline comparison benchmark with hash joins (which is currently in this file).
   
   * Does the table density get applied uniformly over all input columns?  In other words, do you worry about cases where one input is very dense and the others are not so dense?
   * When you say multi-table joins how many tables are you testing?  Or is that a parameter?
   * Varying the # of columns and # of keys is good.  Eventually, I would think, you will also need to worry about data types (probably more for payload columns than for key columns)
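
   For what it's worth, the relationship between density and row count is simple enough to pin down so the benchmark axes stay comparable across runs.  A minimal sketch (the function name and parameters here are illustrative, not from the PR):

    ```cpp
    #include <cassert>
    #include <cstdint>

    // One row every `period_s` seconds over a window of `range_s` seconds.
    // Higher density (a smaller period) means proportionally more rows.
    int64_t RowsForDensity(int64_t range_s, int64_t period_s) {
      return range_s / period_s;
    }

    int main() {
      // A row every 2 s vs. a row every 1 h, over a 24 h range.
      assert(RowsForDensity(24 * 3600, 2) == 43200);
      assert(RowsForDensity(24 * 3600, 3600) == 24);
      return 0;
    }
    ```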
   
   > I think this needs some work before it goes into arrow. We currently run this benchmark by generating .feather files with Python via bamboo-streaming's datagen.py to represent each table, and then reading them in through cpp (see make_arrow_ipc_reader_node). We perhaps want to write a utility that allows us to do this in cpp, while varying many of the metrics I've mentioned above, or finding a way to generate those files as part of the benchmark.
   
   Another potential approach is to create a custom source node that generates data.  We do something like this for our tpc-h benchmark.  This allows us to isolate the scanning from the actual compute operations.  However, this kind of requires you to write the data generator in C++ which is probably not ideal.
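
   To illustrate the idea (this is a hypothetical stand-in, not the actual Arrow source node API): instead of reading .feather files, the source synthesizes batches on demand, so the benchmark isolates the join from I/O entirely.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Hypothetical data-generating source.  Each "batch" is just a column of
    // monotonically increasing timestamps; a real version would also emit
    // key and payload columns.
    struct SyntheticSource {
      int64_t num_batches;
      int64_t rows_per_batch;
      int64_t next = 0;

      // Returns one batch, or an empty vector once exhausted.
      std::vector<int64_t> ReadNext() {
        if (next >= num_batches) return {};
        std::vector<int64_t> batch(rows_per_batch);
        for (int64_t i = 0; i < rows_per_batch; ++i) {
          batch[i] = next * rows_per_batch + i;  // timestamp column
        }
        ++next;
        return batch;
      }
    };

    int main() {
      SyntheticSource src{3, 4};
      int64_t total = 0;
      for (auto b = src.ReadNext(); !b.empty(); b = src.ReadNext()) {
        total += static_cast<int64_t>(b.size());
      }
      assert(total == 12);  // 3 batches x 4 rows
      return 0;
    }
    ```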
   
   > There are also quite a large number of BENCHMARK_CAPTURE statements, as an immediate workaround to some limitations in Google Benchmarks. I haven't found a great non-verbose way to pass in the parameters needed (strings and vectors) while also having readable titles and details about the benchmark being written to the output file. Let me know if you have any advice about this / know some one who does.
   
   I'll think about this when I take a closer look.  Google Benchmark is not necessarily the perfect tool for every situation, but maybe there is something we can do.
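
   One direction worth trying: build the readable names programmatically and register benchmarks in a loop rather than writing one BENCHMARK_CAPTURE per combination.  A sketch of the name-building part (the name scheme and parameters are made up for illustration):

    ```cpp
    #include <cassert>
    #include <string>

    // Compose a human-readable benchmark name from the parameter combination.
    std::string BenchName(int num_tables, int num_cols, int num_keys) {
      return "AsofJoin/tables:" + std::to_string(num_tables) +
             "/cols:" + std::to_string(num_cols) +
             "/keys:" + std::to_string(num_keys);
    }

    int main() {
      assert(BenchName(2, 20, 500) == "AsofJoin/tables:2/cols:20/keys:500");
      // With Google Benchmark, names like these can be fed to
      // benchmark::RegisterBenchmark(name, fn, params...) inside nested
      // loops over the parameter space, so each combination gets a readable
      // title without a separate BENCHMARK_CAPTURE statement.
      return 0;
    }
    ```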
   
   > While being processed, is a single source node dedicated a single thread?
   
   No.
   
   > How many threads can call InputReceived of the following node at once?
   
   That is mostly determined by the capacity of the executor, which for the CPU thread pool defaults to `std::thread::hardware_concurrency()` (e.g. one per core, or two per core if you have hyperthreading).  At the moment the number of concurrent calls can even exceed that, but this is an issue we are hoping to fix soon (limiting these calls to CPU thread pool tasks only).
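
   As a concrete reference point for that default (the helper name here is illustrative, not an Arrow API):

    ```cpp
    #include <cassert>
    #include <thread>

    // The default CPU thread pool capacity mirrors the hardware parallelism
    // the standard library reports.  hardware_concurrency() may return 0
    // when the value is unknown, so fall back to 1 in that case.
    unsigned DefaultExecutorCapacity() {
      unsigned n = std::thread::hardware_concurrency();
      return n == 0 ? 1 : n;
    }

    int main() {
      assert(DefaultExecutorCapacity() >= 1);
      return 0;
    }
    ```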
   
   > I was also wondering if you could clarify how the arrow threading engine would work for a node that has multiple inputs (an asof join / hash join ingesting from multiple source nodes, for example).
   
   Even if the asof join node had only a single source node, its InputReceived could still be called from many threads at once.  You can think of a source node as a parallel while loop over all the batches:
   
    ```cpp
    while (!source.empty()) {
      exec_context->executor->Spawn([this] {
        ExecBatch next_batch = ReadNext();
        output->InputReceived(next_batch);
      });
    }
    ```
   
   If there are multiple sources then they are all still submitting tasks to the same common thread pool, so you won't see any more threads.  Also, in many unit tests and small use cases the source doesn't have enough data for more than one task, so you often don't see the full scale of parallelism until you are testing larger datasets.
   
   There are some changes planned for the scheduler, but most of what I said above will remain more or less true.  The future scheduler could potentially prioritize one source above others (for example, it often makes sense with a hash-join node to prioritize the build-side input), so that is something to consider (for the as-of join node you probably want to read from all sources evenly, I think).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org