You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/02/27 08:16:07 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

jorisvandenbossche commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1117370945


##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -114,18 +116,24 @@ def num_chunks(self) -> int:
         """
         Return the number of chunks the DataFrame consists of.
         """
-        # pyarrow.Table can have columns with different number
-        # of chunks so we take the number of chunks that
-        # .to_batches() returns as it takes the min chunk size
-        # of all the columns (to_batches is a zero copy method)
-        batches = self._df.to_batches()
-        return len(batches)
+        if isinstance(self._df, pa.RecordBatch):
+            return 1
+        else:
+            # pyarrow.Table can have columns with different number
+            # of chunks so we take the number of chunks that
+            # .to_batches() returns as it takes the min chunk size
+            # of all the columns (to_batches is a zero copy method)
+            batches = self._df.to_batches()
+            return len(batches)
 
     def column_names(self) -> Iterable[str]:
         """
         Return an iterator yielding the column names.
         """
-        return self._df.column_names
+        if isinstance(self._df, pa.RecordBatch):
+            return self._df.schema.names

Review Comment:
   I think this one should work for both RecordBatch and Table (not needing the if/else conditional)



##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -154,19 +162,28 @@ def select_columns(self, indices: Sequence[int]) -> _PyArrowDataFrame:
         """
         Create a new DataFrame by selecting a subset of columns by index.
         """
-        return _PyArrowDataFrame(
-            self._df.select(list(indices)), self._nan_as_null, self._allow_copy
-        )
+        if isinstance(self._df, pa.RecordBatch):
+            columns = [self._df.column(i) for i in indices]
+            names = [self._df.schema.names[i] for i in indices]
+            return _PyArrowDataFrame(pa.record_batch(columns, names=names))
+        else:
+            return _PyArrowDataFrame(
+                self._df.select(list(indices)), self._nan_as_null, self._allow_copy

Review Comment:
   We should actually add a `select` method to RecordBatch as wel, mimicking what we have for Table.select. Could you do that here or in a separate PR?



##########
python/pyarrow/tests/interchange/test_conversion.py:
##########
@@ -274,6 +368,21 @@ def test_roundtrip_pandas_datetime(unit):
     assert expected_protocol.num_chunks() == result_protocol.num_chunks()
     assert expected_protocol.column_names() == result_protocol.column_names()
 
+    batch = table.to_batches()[0]
+    pandas_df = pandas_from_dataframe(batch)
+    result = pi.from_dataframe(pandas_df)

Review Comment:
   Do you think this repetition with RecordBatch instead of Table is needed for those tests test_conversion.py that check actual conversion of values? (for that part, I assume that there is no difference in our code?)
   
   If we keep it, another option might be to parametrize the test for table/recordbatch, something:
   
   ```
   @pytest.mark.parametrize("constructor", [pa.table, pa.recordbatch])
   def test(constructor):
       table = constructor({...})
       ...
       expected = constructor({...})
       assert result.equals(expected)
   ```
   
   or 
   
   
   ```
   @pytest.mark.parametrize("use_batch", [False, True])
   def test(use_batch):
       table = pa.table({...})
       if use_batch:
           table = table.to_batches()[0]
       ...
   ```
   



##########
python/pyarrow/table.pxi:
##########
@@ -1517,6 +1517,36 @@ cdef class RecordBatch(_PandasConvertible):
         self.sp_batch = batch
         self.batch = batch.get()
 
+    # ----------------------------------------------------------------------
+    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
+        """
+        Return the dataframe interchange object implementing the interchange protocol.
+        Parameters

Review Comment:
   You are missing some blank lines between the sections.



##########
python/pyarrow/interchange/from_dataframe.py:
##########
@@ -74,9 +74,9 @@ def from_dataframe(df: DataFrameObject, allow_copy=True) -> pa.Table:
 
     Returns
     -------
-    pa.Table
+    pa.Table or pa.RecordBatch
     """
-    if isinstance(df, pa.Table):
+    if isinstance(df, (pa.Table, pa.RecordBatch)):

Review Comment:
   Wondering, do we want to convert this RecordBatch in a Table, so we have a consistent return type? (so you know the result is _always_ a Table)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org