You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "danepitkin (via GitHub)" <gi...@apache.org> on 2023/05/02 21:31:12 UTC

[GitHub] [arrow] danepitkin opened a new pull request, #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

danepitkin opened a new pull request, #35396:
URL: https://github.com/apache/arrow/pull/35396

   ### Rationale for this change
   
   Table and RecordBatch share many similar/identical APIs. Let's start consolidating them into the parent class.
   
   ### Are these changes tested?
   
   Relying on existing test cases for this refactor.
   
   ### Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1533420610

   Build error is unrelated. The error exists on main (and in other PRs).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1545394210

   Benchmark runs are scheduled for baseline = 1a038adad85be6f7e6949cc700dcb9b211feb44d and contender = e5405e773088b2eeaf9b1057862206fe527376d6. e5405e773088b2eeaf9b1057862206fe527376d6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2a1019bef15540e195ac2b45f22f71c0...585d8e77b90a4edeb744055f969608ec/)
   [Finished :arrow_down:2.21% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9648ebda147f4392be54162295f0c572...29390e3884c740239a9da827e306bd0c/)
   [Failed :arrow_down:0.25% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3ef9ee03999c4ea188de94b5488a5665...07687319c63849b4b8c611512bd25a8c/)
   [Finished :arrow_down:0.57% :arrow_up:0.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d14230f22162403ab43f96c6fc95b7fc...d1bfc79b7bfc4a67bf13da568b71ea53/)
   Buildkite builds:
   [Finished] [`e5405e77` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2853)
   [Finished] [`e5405e77` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2890)
   [Finished] [`e5405e77` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2857)
   [Finished] [`e5405e77` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2880)
   [Finished] [`1a038ada` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2852)
   [Finished] [`1a038ada` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2888)
   [Failed] [`1a038ada` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2850)
   [Finished] [`1a038ada` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2878)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF merged pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF merged PR #35396:
URL: https://github.com/apache/arrow/pull/35396


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #35396:
URL: https://github.com/apache/arrow/pull/35396#discussion_r1188315143


##########
python/pyarrow/table.pxi:
##########
@@ -1500,6 +1627,68 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().drop_null(self)
 
+    def field(self, i):
+        """
+        Select a schema field by its column name or numeric index.
+
+        Parameters
+        ----------
+        i : int or string
+            The index or name of the field to retrieve.
+
+        Returns
+        -------
+        Field
+
+        Examples
+        --------
+        Table (works similarly for RecordBatch)
+
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
+        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.field(0)
+        pyarrow.Field<n_legs: int64>
+        >>> table.field(1)
+        pyarrow.Field<animals: string>
+        """
+        return self.schema.field(i)
+
+    @property
+    def schema(self):
+        raise NotImplementedError
+
+    def sort_by(self, sorting, **kwargs):
+        """
+        Sort the Table or RecordBatch by one or multiple columns.
+
+        Parameters
+        ----------
+        sorting : str or list[tuple(name, order)]
+            Name of the column to use to sort (ascending), or
+            a list of multiple sorting conditions where
+            each entry is a tuple with column name
+            and sorting order ("ascending" or "descending")
+        **kwargs : dict, optional
+            Additional sorting options.
+            As allowed by :class:`SortOptions`
+
+        Returns
+        -------
+        Table or RecordBatch
+            A new tabular object sorted according to the sort keys.

Review Comment:
   Could we move the example from the `Table` `sort_by` to here?



##########
python/pyarrow/table.pxi:
##########
@@ -1963,106 +2113,16 @@ cdef class RecordBatch(_Tabular):
         if self._schema is None:
             self._schema = pyarrow_wrap_schema(self.batch.schema())
 
-        return self._schema
-
-    def field(self, i):
-        """
-        Select a schema field by its column name or numeric index
-
-        Parameters
-        ----------
-        i : int or string
-            The index or name of the field to retrieve
-
-        Returns
-        -------
-        pyarrow.Field
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
-        >>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
-        >>> batch = pa.RecordBatch.from_arrays([n_legs, animals],
-        ...                                     names=["n_legs", "animals"])
-        >>> batch.field(0)
-        pyarrow.Field<n_legs: int64>
-        >>> batch.field(1)
-        pyarrow.Field<animals: string>
-        """
-        return self.schema.field(i)
-
-    @property
-    def columns(self):
-        """
-        List of all columns in numerical order
-
-        Returns
-        -------
-        list of pyarrow.Array
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
-        >>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
-        >>> batch = pa.RecordBatch.from_arrays([n_legs, animals],
-        ...                                     names=["n_legs", "animals"])
-        >>> batch.columns
-        [<pyarrow.lib.Int64Array object at ...>
-        [
-          2,
-          2,
-          4,
-          4,
-          5,
-          100
-        ], <pyarrow.lib.StringArray object at ...>
-        [
-          "Flamingo",
-          "Parrot",
-          "Dog",
-          "Horse",
-          "Brittle stars",
-          "Centipede"
-        ]]
-        """
-        return [self.column(i) for i in range(self.num_columns)]
-
-    def _ensure_integer_index(self, i):
-        """
-        Ensure integer index (convert string column name to integer if needed).
-        """
-        if isinstance(i, (bytes, str)):
-            field_indices = self.schema.get_all_field_indices(i)
-
-            if len(field_indices) == 0:
-                raise KeyError(
-                    "Field \"{}\" does not exist in record batch schema"
-                    .format(i))
-            elif len(field_indices) > 1:
-                raise KeyError(
-                    "Field \"{}\" exists {} times in record batch schema"
-                    .format(i, len(field_indices)))
-            else:
-                return field_indices[0]
-        elif isinstance(i, int):
-            return i
-        else:
-            raise TypeError("Index must either be string or integer")
+        return self._schema
 
-    def column(self, i):
+    @property
+    def columns(self):

Review Comment:
   Could we consolidate `columns` also? It seems both could be using `_column` or `column`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1543579090

   Thanks! Will merge, the failure is not related and the connected tests on that build are passing 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #35396:
URL: https://github.com/apache/arrow/pull/35396#discussion_r1190306435


##########
python/pyarrow/table.pxi:
##########
@@ -1963,106 +2113,16 @@ cdef class RecordBatch(_Tabular):
         if self._schema is None:
             self._schema = pyarrow_wrap_schema(self.batch.schema())
 
-        return self._schema
-
-    def field(self, i):
-        """
-        Select a schema field by its column name or numeric index
-
-        Parameters
-        ----------
-        i : int or string
-            The index or name of the field to retrieve
-
-        Returns
-        -------
-        pyarrow.Field
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
-        >>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
-        >>> batch = pa.RecordBatch.from_arrays([n_legs, animals],
-        ...                                     names=["n_legs", "animals"])
-        >>> batch.field(0)
-        pyarrow.Field<n_legs: int64>
-        >>> batch.field(1)
-        pyarrow.Field<animals: string>
-        """
-        return self.schema.field(i)
-
-    @property
-    def columns(self):
-        """
-        List of all columns in numerical order
-
-        Returns
-        -------
-        list of pyarrow.Array
-
-        Examples
-        --------
-        >>> import pyarrow as pa
-        >>> n_legs = pa.array([2, 2, 4, 4, 5, 100])
-        >>> animals = pa.array(["Flamingo", "Parrot", "Dog", "Horse", "Brittle stars", "Centipede"])
-        >>> batch = pa.RecordBatch.from_arrays([n_legs, animals],
-        ...                                     names=["n_legs", "animals"])
-        >>> batch.columns
-        [<pyarrow.lib.Int64Array object at ...>
-        [
-          2,
-          2,
-          4,
-          4,
-          5,
-          100
-        ], <pyarrow.lib.StringArray object at ...>
-        [
-          "Flamingo",
-          "Parrot",
-          "Dog",
-          "Horse",
-          "Brittle stars",
-          "Centipede"
-        ]]
-        """
-        return [self.column(i) for i in range(self.num_columns)]
-
-    def _ensure_integer_index(self, i):
-        """
-        Ensure integer index (convert string column name to integer if needed).
-        """
-        if isinstance(i, (bytes, str)):
-            field_indices = self.schema.get_all_field_indices(i)
-
-            if len(field_indices) == 0:
-                raise KeyError(
-                    "Field \"{}\" does not exist in record batch schema"
-                    .format(i))
-            elif len(field_indices) > 1:
-                raise KeyError(
-                    "Field \"{}\" exists {} times in record batch schema"
-                    .format(i, len(field_indices)))
-            else:
-                return field_indices[0]
-        elif isinstance(i, int):
-            return i
-        else:
-            raise TypeError("Index must either be string or integer")
+        return self._schema
 
-    def column(self, i):
+    @property
+    def columns(self):

Review Comment:
   Absolutely!



##########
python/pyarrow/table.pxi:
##########
@@ -1500,6 +1627,68 @@ cdef class _Tabular(_PandasConvertible):
         """
         return _pc().drop_null(self)
 
+    def field(self, i):
+        """
+        Select a schema field by its column name or numeric index.
+
+        Parameters
+        ----------
+        i : int or string
+            The index or name of the field to retrieve.
+
+        Returns
+        -------
+        Field
+
+        Examples
+        --------
+        Table (works similarly for RecordBatch)
+
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
+        ...                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.field(0)
+        pyarrow.Field<n_legs: int64>
+        >>> table.field(1)
+        pyarrow.Field<animals: string>
+        """
+        return self.schema.field(i)
+
+    @property
+    def schema(self):
+        raise NotImplementedError
+
+    def sort_by(self, sorting, **kwargs):
+        """
+        Sort the Table or RecordBatch by one or multiple columns.
+
+        Parameters
+        ----------
+        sorting : str or list[tuple(name, order)]
+            Name of the column to use to sort (ascending), or
+            a list of multiple sorting conditions where
+            each entry is a tuple with column name
+            and sorting order ("ascending" or "descending")
+        **kwargs : dict, optional
+            Additional sorting options.
+            As allowed by :class:`SortOptions`
+
+        Returns
+        -------
+        Table or RecordBatch
+            A new tabular object sorted according to the sort keys.

Review Comment:
   Good catch! Will do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1532180789

   * Closes: #35390


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1532180823

   :warning: GitHub issue #35390 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35396: GH-35390: [Python] Consolidate some APIs in Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35396:
URL: https://github.com/apache/arrow/pull/35396#issuecomment-1542693729

   Updated the diff to address comments!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org