Posted to github@arrow.apache.org by "danepitkin (via GitHub)" <gi...@apache.org> on 2023/06/16 19:41:39 UTC

[GitHub] [arrow] danepitkin opened a new pull request, #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

danepitkin opened a new pull request, #36130:
URL: https://github.com/apache/arrow/pull/36130

   
   ### Rationale for this change
   
Deduplicate Python code shared between the `Table` and `RecordBatch` subclasses.
   
   ### What changes are included in this PR?
   
   Refactoring and minor test updates.
   
   ### Are these changes tested?
   
   Yes
   
   ### Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235385450


##########
python/pyarrow/table.pxi:
##########
@@ -1513,12 +1513,18 @@ cdef class _Tabular(_PandasConvertible):
 
         return self.column(key)
 
+    def __len__(self):
+        return self.num_rows
+
     def __repr__(self):
         if not self._is_initialized():
             raise ValueError("This object's internal pointer is NULL, do not "
                              "use any methods or attributes on this object")
         return self.to_string(preview_cols=10)
 
+    def _column(self, int i):
+        raise NotImplementedError
+

Review Comment:
   I plan to leave it as-is for now, since it utilizes C objects (`self.table` or `self.batch`). I haven't found a way to consolidate C/Cython code, because that code requires the explicit object types in order to compile. If we found a way, we could consolidate a large chunk of Cython! It might not be possible, and even if it is, it might come at a performance cost.
   
   At this point, I think the best next step is to leave the C/Cython as-is and focus on implementing missing APIs that exist in one class but not the other.
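   
   For illustration, here is a minimal pure-Python sketch of that split (hypothetical, simplified code: the real implementation is Cython, `RecordBatch` overrides `_column` analogously, and the plain list storage below just stands in for the underlying C++ objects):
   
   ```
   class _Tabular:
       # Shared Python-level logic lives in the base class.
       def __len__(self):
           return self.num_rows
   
       def _column(self, i):
           # Must be overridden: the real implementations touch typed
           # C++ objects (self.table / self.batch), which only compile
           # against the explicit subclass types.
           raise NotImplementedError
   
   
   class Table(_Tabular):
       def __init__(self, columns):
           self._columns = columns  # stand-in for the C++ table object
   
       @property
       def num_rows(self):
           return len(self._columns[0]) if self._columns else 0
   
       def _column(self, i):
           return self._columns[i]  # typed C++ access in the real code
   ```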



[GitHub] [arrow] danepitkin commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235512961


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError

Review Comment:
   I'll go ahead and include this change! I'll check some basic timing just to see whether there is any major performance impact.



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235428481


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError

Review Comment:
   FWIW, looking at the current `to_pydict` implementations: while they are not exactly the same right now, I think it should be quite easy to rewrite them slightly so that a single version works for both Table and RecordBatch (in the end you iterate over the columns, get each field/column, and call `to_pylist` on it).
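   
   For illustration, a consolidated version might look roughly like this (a sketch against the public pyarrow API, not the code that was merged; both Table and RecordBatch expose `column_names` and `column(i)`, and both column types implement `to_pylist()`):
   
   ```
   def to_pydict(self):
       # Works for both Table and RecordBatch: column(i) returns a
       # ChunkedArray or an Array respectively, and both implement
       # to_pylist(), so this loop never needs the concrete type.
       return {name: self.column(i).to_pylist()
               for i, name in enumerate(self.column_names)}
   ```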



[GitHub] [arrow] AlenkaF commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1233503511


##########
python/pyarrow/table.pxi:
##########
@@ -1513,12 +1513,18 @@ cdef class _Tabular(_PandasConvertible):
 
         return self.column(key)
 
+    def __len__(self):
+        return self.num_rows
+
     def __repr__(self):
         if not self._is_initialized():
             raise ValueError("This object's internal pointer is NULL, do not "
                              "use any methods or attributes on this object")
         return self.to_string(preview_cols=10)
 
+    def _column(self, int i):
+        raise NotImplementedError
+

Review Comment:
   Do we plan to consolidate `_column` in another PR, or keep it as-is?



##########
python/pyarrow/table.pxi:
##########
@@ -1922,136 +2105,16 @@ cdef class RecordBatch(_Tabular):
         self.batch = NULL
         self._schema = None
 
-    def __init__(self):
-        raise TypeError("Do not call RecordBatch's constructor directly, use "
-                        "one of the `RecordBatch.from_*` functions instead.")
-
     cdef void init(self, const shared_ptr[CRecordBatch]& batch):
         self.sp_batch = batch
         self.batch = batch.get()
 
     def _is_initialized(self):
         return self.batch != NULL
 
-    @staticmethod
-    def from_pydict(mapping, schema=None, metadata=None):
-        """
-        Construct a RecordBatch from Arrow arrays or columns.
-
-        Parameters
-        ----------
-        mapping : dict or Mapping
-            A mapping of strings to Arrays or Python lists.
-        schema : Schema, default None
-            If not passed, will be inferred from the Mapping values.
-        metadata : dict or Mapping, default None
-            Optional metadata for the schema (if inferred).
-
-        Returns
-        -------
-        RecordBatch
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = [2, 2, 4, 4, 5, 100]
-        >>> animals = ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"]
-        >>> pydict = {'n_legs': n_legs, 'animals': animals}
-
-        Construct a RecordBatch from arrays:
-
-        >>> pa.RecordBatch.from_pydict(pydict)
-        pyarrow.RecordBatch
-        n_legs: int64
-        animals: string
-        ----
-        n_legs: [2,2,4,4,5,100]
-        animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]
-        >>> pa.RecordBatch.from_pydict(pydict).to_pandas()
-           n_legs        animals
-        0       2       Flamingo
-        1       2         Parrot
-        2       4            Dog
-        3       4          Horse
-        4       5  Brittle stars
-        5     100      Centipede
-
-        Construct a RecordBatch with schema:
-
-        >>> my_schema = pa.schema([
-        ...     pa.field('n_legs', pa.int64()),
-        ...     pa.field('animals', pa.string())],
-        ...     metadata={"n_legs": "Number of legs per animal"})
-        >>> pa.RecordBatch.from_pydict(pydict, schema=my_schema).schema
-        n_legs: int64
-        animals: string
-        -- schema metadata --
-        n_legs: 'Number of legs per animal'
-        """
-
-        return _from_pydict(cls=RecordBatch,
-                            mapping=mapping,
-                            schema=schema,
-                            metadata=metadata)
-
-    @staticmethod
-    def from_pylist(mapping, schema=None, metadata=None):
-        """
-        Construct a RecordBatch from list of rows / dictionaries.
-
-        Parameters
-        ----------
-        mapping : list of dicts of rows
-            A mapping of strings to row values.
-        schema : Schema, default None
-            If not passed, will be inferred from the first row of the
-            mapping values.
-        metadata : dict or Mapping, default None
-            Optional metadata for the schema (if inferred).
-
-        Returns
-        -------
-        RecordBatch
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> pylist = [{'n_legs': 2, 'animals': 'Flamingo'},
-        ...           {'n_legs': 4, 'animals': 'Dog'}]
-        >>> pa.RecordBatch.from_pylist(pylist)
-        pyarrow.RecordBatch
-        n_legs: int64
-        animals: string
-        ----
-        n_legs: [2,4]
-        animals: ["Flamingo","Dog"]
-
-        >>> pa.RecordBatch.from_pylist(pylist).to_pandas()
-           n_legs   animals
-        0       2  Flamingo
-        1       4       Dog
-
-        Construct a RecordBatch with metadata:
-
-        >>> my_metadata={"n_legs": "Number of legs per animal"}
-        >>> pa.RecordBatch.from_pylist(pylist, metadata=my_metadata).schema
-        n_legs: int64
-        animals: string
-        -- schema metadata --
-        n_legs: 'Number of legs per animal'
-        """

Review Comment:
   And this part of an example.



##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError
+
+    def to_pylist(self):
+        """
+        Convert the Table or RecordBatch to a list of rows / dictionaries.
+
+        Returns
+        -------
+        list
+
+        Examples
+        --------
+        Table (works similarly for RecordBatch)
+
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
+        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.to_pylist()
+        [{'n_legs': 2, 'animals': 'Flamingo'}, {'n_legs': 4, 'animals': 'Horse'}, ...

Review Comment:
   This is actually a leftover from me, but now I think it is better to use `pa.table` or `pa.record_batch` instead of constructing a table from a pandas dataframe.
   
   Can we do that here, similar to the example of `to_pylist` in the RecordBatch class?
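   
   For example, the shared docstring could construct the table directly, along these lines (a suggestion, not necessarily the exact text that was merged):
   
   ```
   >>> import pyarrow as pa
   >>> table = pa.table([[2, 4, 5, 100],
   ...                   ["Flamingo", "Horse", "Brittle stars", "Centipede"]],
   ...                  names=["n_legs", "animals"])
   >>> table.to_pylist()
   [{'n_legs': 2, 'animals': 'Flamingo'}, {'n_legs': 4, 'animals': 'Horse'}, ...
   ```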



##########
python/pyarrow/table.pxi:
##########
@@ -1922,136 +2105,16 @@ cdef class RecordBatch(_Tabular):
         self.batch = NULL
         self._schema = None
 
-    def __init__(self):
-        raise TypeError("Do not call RecordBatch's constructor directly, use "
-                        "one of the `RecordBatch.from_*` functions instead.")
-
     cdef void init(self, const shared_ptr[CRecordBatch]& batch):
         self.sp_batch = batch
         self.batch = batch.get()
 
     def _is_initialized(self):
         return self.batch != NULL
 
-    @staticmethod
-    def from_pydict(mapping, schema=None, metadata=None):
-        """
-        Construct a RecordBatch from Arrow arrays or columns.
-
-        Parameters
-        ----------
-        mapping : dict or Mapping
-            A mapping of strings to Arrays or Python lists.
-        schema : Schema, default None
-            If not passed, will be inferred from the Mapping values.
-        metadata : dict or Mapping, default None
-            Optional metadata for the schema (if inferred).
-
-        Returns
-        -------
-        RecordBatch
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = [2, 2, 4, 4, 5, 100]
-        >>> animals = ["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"]
-        >>> pydict = {'n_legs': n_legs, 'animals': animals}
-
-        Construct a RecordBatch from arrays:
-
-        >>> pa.RecordBatch.from_pydict(pydict)
-        pyarrow.RecordBatch
-        n_legs: int64
-        animals: string
-        ----
-        n_legs: [2,2,4,4,5,100]
-        animals: ["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]
-        >>> pa.RecordBatch.from_pydict(pydict).to_pandas()
-           n_legs        animals
-        0       2       Flamingo
-        1       2         Parrot
-        2       4            Dog
-        3       4          Horse
-        4       5  Brittle stars
-        5     100      Centipede
-
-        Construct a RecordBatch with schema:
-
-        >>> my_schema = pa.schema([
-        ...     pa.field('n_legs', pa.int64()),
-        ...     pa.field('animals', pa.string())],
-        ...     metadata={"n_legs": "Number of legs per animal"})
-        >>> pa.RecordBatch.from_pydict(pydict, schema=my_schema).schema
-        n_legs: int64
-        animals: string
-        -- schema metadata --
-        n_legs: 'Number of legs per animal'

Review Comment:
   Can we add this part of an example (schema) to the base class also?
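   
   Something like this could carry the schema example over to the shared docstring (a suggestion using `pa.table`; the merged text may differ):
   
   ```
   >>> import pyarrow as pa
   >>> my_schema = pa.schema([
   ...     pa.field('n_legs', pa.int64()),
   ...     pa.field('animals', pa.string())],
   ...     metadata={"n_legs": "Number of legs per animal"})
   >>> pa.table({'n_legs': [2, 4], 'animals': ["Flamingo", "Dog"]},
   ...          schema=my_schema).schema
   n_legs: int64
   animals: string
   -- schema metadata --
   n_legs: 'Number of legs per animal'
   ```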



[GitHub] [arrow] danepitkin commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235455796


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError

Review Comment:
   I agree! The implementation actually used to be identical, but it was changed for Table only, which introduced Cython: https://github.com/apache/arrow/pull/612/files#diff-cede36e8e2e0eb6e6e1ee21745db9687174527f463520c6e6d8b9e8f957bf304L739-L752
   
   I wasn't sure if this was done for performance reasons and opted to leave it as-is for now. I need to learn how to profile Cython correctly, so I figured I'd save it for a future issue. Do you have any thoughts on performance just from looking at the code?



[GitHub] [arrow] AlenkaF commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235619837


##########
python/pyarrow/table.pxi:
##########
@@ -1513,12 +1513,18 @@ cdef class _Tabular(_PandasConvertible):
 
         return self.column(key)
 
+    def __len__(self):
+        return self.num_rows
+
     def __repr__(self):
         if not self._is_initialized():
             raise ValueError("This object's internal pointer is NULL, do not "
                              "use any methods or attributes on this object")
         return self.to_string(preview_cols=10)
 
+    def _column(self, int i):
+        raise NotImplementedError
+

Review Comment:
   Thanks so much for the explanation! Helped me understand this part much better 👍 



[GitHub] [arrow] jorisvandenbossche merged pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #36130:
URL: https://github.com/apache/arrow/pull/36130


[GitHub] [arrow] github-actions[bot] commented on pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36130:
URL: https://github.com/apache/arrow/pull/36130#issuecomment-1595201531

   :warning: GitHub issue #36129 **has been automatically assigned in GitHub** to PR creator.


[GitHub] [arrow] danepitkin commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235377708


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError
+
+    def to_pylist(self):
+        """
+        Convert the Table or RecordBatch to a list of rows / dictionaries.
+
+        Returns
+        -------
+        list
+
+        Examples
+        --------
+        Table (works similarly for RecordBatch)
+
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
+        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.to_pylist()
+        [{'n_legs': 2, 'animals': 'Flamingo'}, {'n_legs': 4, 'animals': 'Horse'}, ...

Review Comment:
   Yes, great idea! I will update.



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235465394


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError

Review Comment:
   I am not sure why that original change was made. The main change is to type the intermediate column object, but we only call `to_pylist` on that...
   
   I very much doubt that this will have any performance impact: the main bottleneck here is the call to `to_pylist`, which performs the costly conversion to Python objects, so any small speedup in accessing that method on the Array/ChunkedArray should be insignificant.
   
   (If you want to test that, you don't necessarily need Cython profiling; a basic timing of the Python `to_pydict` call should be sufficient to show that it doesn't matter.)



[GitHub] [arrow] danepitkin commented on a diff in pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36130:
URL: https://github.com/apache/arrow/pull/36130#discussion_r1235952089


##########
python/pyarrow/table.pxi:
##########
@@ -1787,6 +1941,35 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().take(self, indices)
 
+    def to_pydict(self):
+        raise NotImplementedError

Review Comment:
   I ran the following code in IPython and got the results below. No performance difference showed up in my basic testing.
   
   ```
   import random
   import string
   import pyarrow as pa
   
   # Build a two-column table: one million random integers and one
   # million random 10-character strings.
   arr0 = [random.randint(0, 2_000_000) for _ in range(1_000_000)]
   arr1 = [''.join(random.choices(string.ascii_uppercase + string.digits, k=10)) for _ in range(1_000_000)]
   table = pa.table([arr0, arr1], ["c0", "c1"])
   ```
   ```
   %timeit table.to_pydict()
   ```
   
   Before
   `2.03 s ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
   After
   `1.99 s ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`



[GitHub] [arrow] danepitkin commented on pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #36130:
URL: https://github.com/apache/arrow/pull/36130#issuecomment-1603358397

   I feel pretty good about this PR now. I think it could be ready to merge! All comments have been addressed to the best of my knowledge.


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36130: GH-36129: [Python] Consolidate common APIs in Table and RecordBatch

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36130:
URL: https://github.com/apache/arrow/pull/36130#issuecomment-1616702043

   Conbench analyzed the 5 benchmark runs on commit `427f5507`.
   
   There were 4 benchmark results indicating a performance regression:
   
   - Commit Run on `arm64-m6g-linux-compute` at [2023-06-28 18:55:23Z](http://conbench.ursa.dev/compare/runs/c4ca6f428fec42be80ad8d31849f8831...949d26d6ff8e4f94a08195a509859289/)
     - [source=cpp-micro, suite=parquet-level-conversion-benchmark](http://conbench.ursa.dev/compare/benchmarks/0649c747f6327d8080004b076fa6e846...0649c82743a874838000f6557dcbf919)
     - [params=<Round, UInt8Type, RoundMode::DOWN>/size:1048576/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-round-benchmark](http://conbench.ursa.dev/compare/benchmarks/0649c7423fba77718000d6030c478fb9...0649c822ac10775d8000d243bae0e207)
   - and 2 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14719729455) has more details.

