You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "AlenkaF (via GitHub)" <gi...@apache.org> on 2023/02/22 13:37:14 UTC

[GitHub] [arrow] AlenkaF opened a new pull request, #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

AlenkaF opened a new pull request, #34294:
URL: https://github.com/apache/arrow/pull/34294

   ### Rationale for this change
   Add the implementation of the Dataframe Interchange Protocol for `pyarrow.RecordBatch`. The protocol is already implemented for pyarrow.Table, see https://github.com/apache/arrow/pull/14804.
   
   ### Are these changes tested?
   Yes, tests are added to:
   
   - python/pyarrow/tests/interchange/test_interchange_spec.py
   - python/pyarrow/tests/interchange/test_conversion.py


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1118565888


##########
python/pyarrow/tests/interchange/test_conversion.py:
##########
@@ -274,6 +368,21 @@ def test_roundtrip_pandas_datetime(unit):
     assert expected_protocol.num_chunks() == result_protocol.num_chunks()
     assert expected_protocol.column_names() == result_protocol.column_names()
 
+    batch = table.to_batches()[0]
+    pandas_df = pandas_from_dataframe(batch)
+    result = pi.from_dataframe(pandas_df)

Review Comment:
   I have removed the repetition from all the tests in _test_conversion.py_ and updated the tests in _test_interchange_spec.py_ to reflect what was suggested here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1118411084


##########
python/pyarrow/interchange/from_dataframe.py:
##########
@@ -74,9 +74,9 @@ def from_dataframe(df: DataFrameObject, allow_copy=True) -> pa.Table:
 
     Returns
     -------
-    pa.Table
+    pa.Table or pa.RecordBatch
     """
-    if isinstance(df, pa.Table):
+    if isinstance(df, (pa.Table, pa.RecordBatch)):

Review Comment:
   Yeah, that should be the case (`from_dataframe` to return `pa.Table`) and it should therefore be the case for `RecordBatch` also. Will change 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche merged pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche merged PR #34294:
URL: https://github.com/apache/arrow/pull/34294


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1118412739


##########
python/pyarrow/tests/interchange/test_conversion.py:
##########
@@ -274,6 +368,21 @@ def test_roundtrip_pandas_datetime(unit):
     assert expected_protocol.num_chunks() == result_protocol.num_chunks()
     assert expected_protocol.column_names() == result_protocol.column_names()
 
+    batch = table.to_batches()[0]
+    pandas_df = pandas_from_dataframe(batch)
+    result = pi.from_dataframe(pandas_df)

Review Comment:
   Oh, why haven't I thought of that! :)
   
   WIll check again and remove where the repetition is actually not needed and in other cases I will parametrize the tests for table/recordbatch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] AlenkaF commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.

AlenkaF commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1118426440


##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -154,19 +162,28 @@ def select_columns(self, indices: Sequence[int]) -> _PyArrowDataFrame:
         """
         Create a new DataFrame by selecting a subset of columns by index.
         """
-        return _PyArrowDataFrame(
-            self._df.select(list(indices)), self._nan_as_null, self._allow_copy
-        )
+        if isinstance(self._df, pa.RecordBatch):
+            columns = [self._df.column(i) for i in indices]
+            names = [self._df.schema.names[i] for i in indices]
+            return _PyArrowDataFrame(pa.record_batch(columns, names=names))
+        else:
+            return _PyArrowDataFrame(
+                self._df.select(list(indices)), self._nan_as_null, self._allow_copy

Review Comment:
   Will do it in a separate PR (issue https://github.com/apache/arrow/issues/34359)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on code in PR #34294:
URL: https://github.com/apache/arrow/pull/34294#discussion_r1117370945


##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -114,18 +116,24 @@ def num_chunks(self) -> int:
         """
         Return the number of chunks the DataFrame consists of.
         """
-        # pyarrow.Table can have columns with different number
-        # of chunks so we take the number of chunks that
-        # .to_batches() returns as it takes the min chunk size
-        # of all the columns (to_batches is a zero copy method)
-        batches = self._df.to_batches()
-        return len(batches)
+        if isinstance(self._df, pa.RecordBatch):
+            return 1
+        else:
+            # pyarrow.Table can have columns with different number
+            # of chunks so we take the number of chunks that
+            # .to_batches() returns as it takes the min chunk size
+            # of all the columns (to_batches is a zero copy method)
+            batches = self._df.to_batches()
+            return len(batches)
 
     def column_names(self) -> Iterable[str]:
         """
         Return an iterator yielding the column names.
         """
-        return self._df.column_names
+        if isinstance(self._df, pa.RecordBatch):
+            return self._df.schema.names

Review Comment:
   I think this one should work for both RecordBatch and Table (not needing the if/else conditional)



##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -154,19 +162,28 @@ def select_columns(self, indices: Sequence[int]) -> _PyArrowDataFrame:
         """
         Create a new DataFrame by selecting a subset of columns by index.
         """
-        return _PyArrowDataFrame(
-            self._df.select(list(indices)), self._nan_as_null, self._allow_copy
-        )
+        if isinstance(self._df, pa.RecordBatch):
+            columns = [self._df.column(i) for i in indices]
+            names = [self._df.schema.names[i] for i in indices]
+            return _PyArrowDataFrame(pa.record_batch(columns, names=names))
+        else:
+            return _PyArrowDataFrame(
+                self._df.select(list(indices)), self._nan_as_null, self._allow_copy

Review Comment:
   We should actually add a `select` method to RecordBatch as wel, mimicking what we have for Table.select. Could you do that here or in a separate PR?



##########
python/pyarrow/tests/interchange/test_conversion.py:
##########
@@ -274,6 +368,21 @@ def test_roundtrip_pandas_datetime(unit):
     assert expected_protocol.num_chunks() == result_protocol.num_chunks()
     assert expected_protocol.column_names() == result_protocol.column_names()
 
+    batch = table.to_batches()[0]
+    pandas_df = pandas_from_dataframe(batch)
+    result = pi.from_dataframe(pandas_df)

Review Comment:
   Do you think this repetition with RecordBatch instead of Table is needed for those tests test_conversion.py that check actual conversion of values? (for that part, I assume that there is no difference in our code?)
   
   If we keep it, another option might be to parametrize the test for table/recordbatch, something:
   
   ```
   @pytest.mark.parametrize("constructor", [pa.table, pa.recordbatch])
   def test(constructor):
       table = constructor({...})
       ...
       expected = constructor({...})
       assert result.equals(expected)
   ```
   
   or 
   
   
   ```
   @pytest.mark.parametrize("use_batch", [False, True])
   def test(use_batch):
       table = pa.table({...})
       if use_batch:
           table = table.to_batches()[0]
       ...
   ```
   



##########
python/pyarrow/table.pxi:
##########
@@ -1517,6 +1517,36 @@ cdef class RecordBatch(_PandasConvertible):
         self.sp_batch = batch
         self.batch = batch.get()
 
+    # ----------------------------------------------------------------------
+    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
+        """
+        Return the dataframe interchange object implementing the interchange protocol.
+        Parameters

Review Comment:
   You are missing some blank lines between the sections.



##########
python/pyarrow/interchange/from_dataframe.py:
##########
@@ -74,9 +74,9 @@ def from_dataframe(df: DataFrameObject, allow_copy=True) -> pa.Table:
 
     Returns
     -------
-    pa.Table
+    pa.Table or pa.RecordBatch
     """
-    if isinstance(df, pa.Table):
+    if isinstance(df, (pa.Table, pa.RecordBatch)):

Review Comment:
   Wondering, do we want to convert this RecordBatch in a Table, so we have a consistent return type? (so you know the result is _always_ a Table)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #34294:
URL: https://github.com/apache/arrow/pull/34294#issuecomment-1440031660

   * Closes: #33926


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] ursabot commented on pull request #34294: GH-33926: [Python] DataFrame Interchange Protocol for pyarrow.RecordBatch

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #34294:
URL: https://github.com/apache/arrow/pull/34294#issuecomment-1449271038

   Benchmark runs are scheduled for baseline = 61c9a749741fcfad5de5d0e1e7fe1ced1330928a and contender = 6cf5e8984a48e70352e65112c3254fc03557767a. 6cf5e8984a48e70352e65112c3254fc03557767a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2b0f4c3530fc432cb4219aa07e28f8ff...2c3dada7142f4464947de18174a7fca5/)
   [Finished :arrow_down:0.46% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/ac48c710ba744321b97b7ee68ff01444...06a04a525d11461aa9e81ca28f8cff62/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/942fd3bc8c5c46ae9567c47426fa4e9d...1134b0b3418f4302a1b76d702e25e3f0/)
   [Finished :arrow_down:0.13% :arrow_up:0.13%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f89873cbe0164f5fa9d75716f85aec4a...9a11c35d71344df89f6461295424611f/)
   Buildkite builds:
   [Finished] [`6cf5e898` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2448)
   [Finished] [`6cf5e898` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2478)
   [Finished] [`6cf5e898` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2446)
   [Finished] [`6cf5e898` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2469)
   [Finished] [`61c9a749` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2447)
   [Finished] [`61c9a749` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2477)
   [Finished] [`61c9a749` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2444)
   [Finished] [`61c9a749` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2468)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org