You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/12 18:49:53 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new pull request, #2218: Use `filter` instead of `take` to avoid using indices

Dandandan opened a new pull request, #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218

   # Which issue does this PR close?
   n/a
   
    # Rationale for this change
   We can use the more efficient `filter` kernel than building an indices array and using `take`.
   Referenced by @alamb in https://github.com/apache/arrow-rs/issues/1542
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Use filter instead of take. 
   
   This has the following benefits
   * This also uses the faster counting, and has some extra fast paths.
   * Avoids materializing the indices.
   * The `filter` kernel itself is faster / than `take`.
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848801910


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +47,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection

Review Comment:
   The upstream already performs this optimisation, see [here](https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/filter.rs#L403) and so you can probably elide counting the bits again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen merged pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
yjshen merged PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848814692


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +49,21 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter = match selection.null_count() {
+            0 => BooleanArray::from(selection.data().clone()),
+            _ => prep_null_mask_filter(selection),
+        };
+
+        let filter_count = filter
+            .values()
+            .count_set_bits_offset(selection.offset(), selection.len());
+
+        if filter_count == selection.len() {
             return self.evaluate(batch);

Review Comment:
   This logic is already handled upstream, I don't see a reason to duplicate it here?
   
   See https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/filter.rs#L331



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848831844


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -49,17 +49,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        let filter_count = selection
-            .values()
-            .count_set_bits_offset(selection.offset(), selection.len());
-
-        if selection.null_count() == 0 && filter_count == selection.len() {
-            return self.evaluate(batch);
-        }
-
         let tmp_batch = filter_record_batch(batch, &selection)?;
 
         let tmp_result = self.evaluate(&tmp_batch)?;
+        // All values from the `selection` filter are true.
+        if batch.columns().first().map(|x| x.len())

Review Comment:
   Another good suggestion :rofl:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848817420


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +47,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection

Review Comment:
   > also avoided the scatter path in case of all values being true.
   
   You could inspect the length of the returned filtered array instead of counting the bits twice?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
yjshen commented on PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#issuecomment-1097446057

   Thanks again for sharing this with me @Dandandan @tustvold @alamb. Amazing team ❤️


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848814692


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +49,21 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter = match selection.null_count() {
+            0 => BooleanArray::from(selection.data().clone()),
+            _ => prep_null_mask_filter(selection),
+        };
+
+        let filter_count = filter
+            .values()
+            .count_set_bits_offset(selection.offset(), selection.len());
+
+        if filter_count == selection.len() {
             return self.evaluate(batch);

Review Comment:
   This logic is already handled upstream, I don't see a reason to duplicate it here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
yjshen commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848981975


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,24 +47,13 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
-            return self.evaluate(batch);
-        }
-        let mut indices = vec![];
-        for (i, b) in selection.iter().enumerate() {
-            if let Some(true) = b {
-                indices.push(i as u64);
-            }
-        }
-        let indices = UInt64Array::from_iter_values(indices);
-        let tmp_columns = batch
-            .columns()
-            .iter()
-            .map(|c| take(c.as_ref(), &indices, None))
-            .collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;
-
-        let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
+        let tmp_batch = filter_record_batch(batch, selection)?;

Review Comment:
   Cool, you do have a great sense of code smell!
   
   TIL the new `filter` kernel, it's great to read.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848829549


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -49,17 +49,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        let filter_count = selection
-            .values()
-            .count_set_bits_offset(selection.offset(), selection.len());
-
-        if selection.null_count() == 0 && filter_count == selection.len() {
-            return self.evaluate(batch);
-        }
-
         let tmp_batch = filter_record_batch(batch, &selection)?;
 
         let tmp_result = self.evaluate(&tmp_batch)?;
+        // All values from the `selection` filter are true.
+        if batch.columns().first().map(|x| x.len())

Review Comment:
   `RecordBatch::num_rows()` might be simpler?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848828601


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +47,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r849440001


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,24 +47,13 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
-            return self.evaluate(batch);
-        }
-        let mut indices = vec![];
-        for (i, b) in selection.iter().enumerate() {
-            if let Some(true) = b {
-                indices.push(i as u64);
-            }
-        }
-        let indices = UInt64Array::from_iter_values(indices);
-        let tmp_columns = batch
-            .columns()
-            .iter()
-            .map(|c| take(c.as_ref(), &indices, None))
-            .collect::<ArrowResult<Vec<Arc<dyn Array>>>>()?;
-
-        let tmp_batch = RecordBatch::try_new(batch.schema(), tmp_columns)?;
+        let tmp_batch = filter_record_batch(batch, selection)?;

Review Comment:
   👃  👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848818818


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +49,21 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter = match selection.null_count() {
+            0 => BooleanArray::from(selection.data().clone()),
+            _ => prep_null_mask_filter(selection),
+        };
+
+        let filter_count = filter
+            .values()
+            .count_set_bits_offset(selection.offset(), selection.len());
+
+        if filter_count == selection.len() {
             return self.evaluate(batch);

Review Comment:
   See comment. The fasth path avoids calling `scatter` when the `selection` array is all-true. I reused some code from upstream, as it is not public yet.
   In theory, we could re-use the filter count calculation, but we can't currently wire them together.
   
   I am also OK with removing this path, for making the code a bit simpler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848816216


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +47,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection

Review Comment:
   I thought this as well, but the earlier fast path also avoided the `scatter` path in case of all values being true.
   The filter count can not easily be reused between this code and the `filter` kernel.
   
   I am also OK with removing this path, for making the code a bit simpler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2218: Use `filter` (filter_record_batch) instead of `take` to avoid using indices

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2218:
URL: https://github.com/apache/arrow-datafusion/pull/2218#discussion_r848823960


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -51,23 +47,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
         batch: &RecordBatch,
         selection: &BooleanArray,
     ) -> Result<ColumnarValue> {
-        if selection.iter().all(|b| b == Some(true)) {
+        let filter_count = selection

Review Comment:
   That is a good suggestion :+1: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org