Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/26 15:59:09 UTC

[GitHub] [arrow] pitrou opened a new pull request #8777: ARROW-10569: [C++] Improve table filtering performance

pitrou opened a new pull request #8777:
URL: https://github.com/apache/arrow/pull/8777


   Instead of applying the same boolean filter to all N columns, first convert the filter to take indices. 
   Taking indices from a linear array is much faster, since it avoids bit-unpacking the boolean filter once per column.
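   The idea can be sketched in NumPy terms (this is an analogy to illustrate the optimization, not the actual Arrow C++ implementation; all names here are illustrative):

```python
import numpy as np

# Synthetic "table": several columns sharing the same boolean filter.
num_rows = 100_000
rng = np.random.default_rng(42)
columns = [rng.standard_normal(num_rows) for _ in range(5)]
mask = rng.integers(0, 100, size=num_rows) == 5  # ~1% selectivity

# Before: apply the boolean mask to every column
# (the mask is decoded once per column).
filtered_before = [col[mask] for col in columns]

# After: materialize take indices once, then gather each column.
indices = np.flatnonzero(mask)
filtered_after = [col[indices] for col in columns]

for a, b in zip(filtered_before, filtered_after):
    assert np.array_equal(a, b)
```

   The results are identical; the win is that the cost of decoding the boolean filter is paid once instead of once per column.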
   
   Using the Python benchmark setup from ARROW-10569:
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc
   import numpy as np
   
   num_rows = 10_000_000
   data = np.random.randn(num_rows)
   
   df = pd.DataFrame({'data{}'.format(i): data
                      for i in range(100)})
   
   df['key'] = np.random.randint(0, 100, size=num_rows)
   
   rb = pa.record_batch(df)
   t = pa.table(df)
   ```
   
   * before:
   ```python
   >>> %timeit rb.filter(pc.equal(rb[-1], 5))
   60 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   >>> %timeit t.filter(pc.equal(t[-1], 5))
   1.22 s ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   * after:
   ```python
   >>> %timeit rb.filter(pc.equal(rb[-1], 5))
   59.2 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   >>> %timeit t.filter(pc.equal(t[-1], 5))
   59.3 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8777: ARROW-10569: [C++] Improve table filtering performance

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8777:
URL: https://github.com/apache/arrow/pull/8777#discussion_r533972693



##########
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##########
@@ -1838,19 +1838,113 @@ Result<std::shared_ptr<Table>> FilterTable(const Table& table, const Datum& filt
   if (table.num_rows() != filter.length()) {
     return Status::Invalid("Filter inputs must all be the same length");
   }
+  if (table.num_rows() == 0) {
+    return Table::Make(table.schema(), table.columns(), 0);
+  }
+
+  // Fetch filter chunks
+  const auto& filter_opts = *static_cast<const FilterOptions*>(options);
+  ArrayDataVector filter_chunks;
+  switch (filter.kind()) {
+    case Datum::ARRAY:
+      filter_chunks.push_back(filter.array());
+      break;
+    case Datum::CHUNKED_ARRAY: {
+      const auto& chunked_array = filter.chunked_array();
+      filter_chunks.reserve(chunked_array->num_chunks());
+      for (const auto& filter_chunk : chunked_array->chunks()) {
+        filter_chunks.push_back(filter_chunk->data());
+      }
+    } break;
+    default:
+      return Status::NotImplemented("Filter should be array-like");
+  }
 
-  // The selection vector "trick" cannot currently be easily applied on Table
-  // because either the filter or the columns may be ChunkedArray, so we use
-  // Filter recursively on the columns for now until a more efficient
-  // implementation of Take with chunked data is available.
-  auto new_columns = table.columns();
-  for (auto& column : new_columns) {
+  // Instead of filtering each column with the boolean filter
+  // (which would be slow if the table has a large number of columns: ARROW-10569),
+  // convert each filter slice to take indices.
+  // We just have to be careful to choose the right filter slice size to match
+  // the table chunking.
+
+  const int num_columns = table.num_columns();

Review comment:
       Very good suggestion, thank you.







[GitHub] [arrow] github-actions[bot] commented on pull request #8777: ARROW-10569: [C++] Improve table filtering performance

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8777:
URL: https://github.com/apache/arrow/pull/8777#issuecomment-734382061


   https://issues.apache.org/jira/browse/ARROW-10569





[GitHub] [arrow] pitrou closed pull request #8777: ARROW-10569: [C++] Improve table filtering performance

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8777:
URL: https://github.com/apache/arrow/pull/8777


   





[GitHub] [arrow] wesm commented on pull request #8777: ARROW-10569: [C++] Improve table filtering performance

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8777:
URL: https://github.com/apache/arrow/pull/8777#issuecomment-737277230


   Looks good. We could add this benchmark to our ASV suite? Or a C++ benchmark? 





[GitHub] [arrow] bkietz commented on a change in pull request #8777: ARROW-10569: [C++] Improve table filtering performance

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8777:
URL: https://github.com/apache/arrow/pull/8777#discussion_r533624625



##########
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##########
@@ -1838,19 +1838,113 @@ Result<std::shared_ptr<Table>> FilterTable(const Table& table, const Datum& filt
   if (table.num_rows() != filter.length()) {
     return Status::Invalid("Filter inputs must all be the same length");
   }
+  if (table.num_rows() == 0) {
+    return Table::Make(table.schema(), table.columns(), 0);
+  }
+
+  // Fetch filter chunks
+  const auto& filter_opts = *static_cast<const FilterOptions*>(options);
+  ArrayDataVector filter_chunks;
+  switch (filter.kind()) {
+    case Datum::ARRAY:
+      filter_chunks.push_back(filter.array());
+      break;
+    case Datum::CHUNKED_ARRAY: {
+      const auto& chunked_array = filter.chunked_array();
+      filter_chunks.reserve(chunked_array->num_chunks());
+      for (const auto& filter_chunk : chunked_array->chunks()) {
+        filter_chunks.push_back(filter_chunk->data());
+      }
+    } break;
+    default:
+      return Status::NotImplemented("Filter should be array-like");
+  }
 
-  // The selection vector "trick" cannot currently be easily applied on Table
-  // because either the filter or the columns may be ChunkedArray, so we use
-  // Filter recursively on the columns for now until a more efficient
-  // implementation of Take with chunked data is available.
-  auto new_columns = table.columns();
-  for (auto& column : new_columns) {
+  // Instead of filtering each column with the boolean filter
+  // (which would be slow if the table has a large number of columns: ARROW-10569),
+  // convert each filter slice to take indices.
+  // We just have to be careful to choose the right filter slice size to match
+  // the table chunking.
+
+  const int num_columns = table.num_columns();

Review comment:
       It seems this could be implemented more simply using RechunkArraysConsistently; it could be used to ensure ahead of time that we always have a filter slice which is the same length as the batch to be filtered. Then sweep through those pairs, materializing take indices once per filter slice.
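       The rechunking idea can be sketched in plain Python (a hypothetical helper for illustration, not the actual `arrow::internal::RechunkArraysConsistently` API): consistent rechunking cuts all inputs at the union of their chunk boundaries, so each filter slice lines up with exactly one column chunk.

```python
def consistent_boundaries(*chunkings):
    """Given several chunkings (lists of chunk lengths summing to the
    same total), return one chunking cut at the union of all boundaries."""
    cuts = set()
    for lengths in chunkings:
        pos = 0
        for n in lengths:
            pos += n
            cuts.add(pos)
    out, prev = [], 0
    for c in sorted(cuts):
        out.append(c - prev)
        prev = c
    return out

# A filter chunked as [4, 4] and a column chunked as [3, 5]
# are both re-cut into [3, 1, 4], so slices pair up one-to-one.
print(consistent_boundaries([4, 4], [3, 5]))  # [3, 1, 4]
```

       With aligned slices, the take indices only need to be materialized once per filter slice and can be reused across all columns of that batch.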



