Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/23 17:46:56 UTC

[GitHub] [iceberg] Fokko commented on a diff in pull request #6258: Python: Implement PyArrow row level filtering

Fokko commented on code in PR #6258:
URL: https://github.com/apache/iceberg/pull/6258#discussion_r1030732726


##########
python/pyiceberg/table/__init__.py:
##########
@@ -355,7 +355,23 @@ def to_arrow(self):
         if "*" not in self.selected_fields:
             columns = list(self.selected_fields)
 
-        return pq.read_table(source=locations, filesystem=fs, columns=columns)
+        pyarrow_filter = None
+        if self.row_filter is not AlwaysTrue():
+            bound_row_filter = bind(self.table.schema(), self.row_filter)
+            pyarrow_filter = expression_to_pyarrow(bound_row_filter)
+
+        from pyarrow.dataset import dataset
+
+        ds = dataset(

Review Comment:
   I had to replace the table read with a dataset here, so that we can pass a PyArrow expression in as the filter (see the sketch below).
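   For reference, a minimal sketch of what the dataset API buys us (reusing `locations` and `fs` from the diff above; the predicate is an illustrative stand-in for the converted row filter):

       import pyarrow.compute as pc
       from pyarrow.dataset import dataset

       # dataset() wraps the Parquet files lazily instead of reading them eagerly
       ds = dataset(source=locations, filesystem=fs)

       # In the PR this comes from expression_to_pyarrow(bound_row_filter);
       # here it is just a made-up predicate
       pyarrow_filter = pc.field("id") > 100

       # to_table() applies the expression while materializing the data
       table = ds.to_table(filter=pyarrow_filter)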



##########
python/pyiceberg/table/__init__.py:
##########
@@ -355,7 +355,23 @@ def to_arrow(self):
         if "*" not in self.selected_fields:
             columns = list(self.selected_fields)
 
-        return pq.read_table(source=locations, filesystem=fs, columns=columns)
+        pyarrow_filter = None
+        if self.row_filter is not AlwaysTrue():
+            bound_row_filter = bind(self.table.schema(), self.row_filter)
+            pyarrow_filter = expression_to_pyarrow(bound_row_filter)
+
+        from pyarrow.dataset import dataset
+
+        ds = dataset(
+            source=locations,
+            filesystem=fs,
+            # Optionally provide the Schema for the Dataset,
+            # in which case it will not be inferred from the source.
+            # https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
+            schema=schema_to_pyarrow(self.table.schema()),
+        )
+
+        return ds.to_table(filter=pyarrow_filter, columns=columns)

Review Comment:
   I'm not sure whether we want to return a table or a dataset here; I think the end user should be able to use both. The Dataset also has a nice method called `to_batches` to read the data in chunks: https://arrow.apache.org/docs/python/dataset.html#iterative-out-of-core-or-streaming-reads
   This seems very applicable to Iceberg (rough sketch below).
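   A rough sketch of that streaming path (again reusing `locations`, `fs`, and `columns` from the diff; `process` is a hypothetical consumer):

       import pyarrow.compute as pc
       from pyarrow.dataset import dataset

       ds = dataset(source=locations, filesystem=fs)

       # to_batches() yields RecordBatches incrementally, applying the
       # projection and filter per batch instead of materializing a full table
       for batch in ds.to_batches(columns=columns, filter=pc.field("id") > 100):
           process(batch)  # hypothetical downstream consumer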



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

