Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/17 17:23:05 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13155: ARROW-16469: [Python] Table.filter and Dataset.filter

jorisvandenbossche commented on code in PR #13155:
URL: https://github.com/apache/arrow/pull/13155#discussion_r875076518


##########
python/pyarrow/_dataset.pyx:
##########
@@ -405,6 +405,27 @@ cdef class Dataset(_Weakrefable):
                                               use_threads=use_threads, coalesce_keys=coalesce_keys,
                                               output_type=InMemoryDataset)
 
+    def filter(self, expr):
+        """
+        Select rows from the Dataset.
+
+        The Dataset can be filtered based on a boolean :class:`Expression` filter.
+
+        Parameters
+        ----------
+        expr : Expression
+            The boolean :class:`Expression` to filter the table with.
+
+        Returns
+        -------
+        filtered : InMemoryDataset
+            An InMemoryDataset of the same schema, with only the rows selected
+            by applied filtering
+
+        """
+        return _pc()._exec_plan._filter_table(self, expr,

Review Comment:
   I think for the `Dataset` method we should instead add this filter to the Scanner, which already supports filtering (i.e. see this as a different way to express `dataset.scanner(filter=...)`, `dataset.to_table(filter=...)`, etc.).
   That would avoid materializing the full table before putting it back into an InMemoryDataset.



##########
python/pyarrow/table.pxi:
##########
@@ -2882,24 +2882,27 @@ cdef class Table(_PandasConvertible):
 
         return pyarrow_wrap_table(result)
 
-    def filter(self, mask, object null_selection_behavior="drop"):
+    def filter(self, mask_or_expr, object null_selection_behavior="drop"):
         """
         Select rows from the table.
 
-        See :func:`pyarrow.compute.filter` for full usage.
+        The Table can be filtered based on a mask, which will be passed to
+        :func:`pyarrow.compute.filter` to perform the filtering, or it can
+        be filtered through a boolean :class:`.Expression`
 
         Parameters
         ----------
-        mask : Array or array-like
-            The boolean mask to filter the table with.
+        mask_or_expr : Array or array-like or .Expression
+            The boolean mask or the :class:`.Expression` to filter the table with.
         null_selection_behavior
-            How nulls in the mask should be handled.
+            How nulls in the mask should be handled, does nothing if
+            an :class:`.Expression` is used.

Review Comment:
   Is it not possible to pass this through to the filter node?



##########
python/pyarrow/table.pxi:
##########
@@ -2882,24 +2882,27 @@ cdef class Table(_PandasConvertible):
 
         return pyarrow_wrap_table(result)
 
-    def filter(self, mask, object null_selection_behavior="drop"):
+    def filter(self, mask_or_expr, object null_selection_behavior="drop"):
         """
         Select rows from the table.
 
-        See :func:`pyarrow.compute.filter` for full usage.
+        The Table can be filtered based on a mask, which will be passed to
+        :func:`pyarrow.compute.filter` to perform the filtering, or it can
+        be filtered through a boolean :class:`.Expression`
 
         Parameters
         ----------
-        mask : Array or array-like
-            The boolean mask to filter the table with.
+        mask_or_expr : Array or array-like or .Expression

Review Comment:
   Strictly speaking, renaming the keyword can break existing code that passes `mask=` by name. We could also leave it as `mask` and only update the documentation (the expression still _represents_ a mask anyway, so I would say it is not a wrong name).


