You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "danepitkin (via GitHub)" <gi...@apache.org> on 2023/04/07 19:55:00 UTC

[GitHub] [arrow] danepitkin opened a new pull request, #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

danepitkin opened a new pull request, #34980:
URL: https://github.com/apache/arrow/pull/34980

   ### Rationale for this change
   
   This is an incremental first step towards https://github.com/apache/arrow/issues/30559
   
   ### What changes are included in this PR?
   
   Introduce `class _Table` in `table.pxi`.
   
   ### Are these changes tested?
   
   Existing pytests will check for regression.
   
   ### Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1500594741

   :warning: GitHub issue #34979 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #34980:
URL: https://github.com/apache/arrow/pull/34980


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1165908655


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.
+        See :func:`pyarrow.compute.drop_null` for full usage.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
+        ...                   'n_legs': [2, 4, 5, 100],
+        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.drop_null()
+        pyarrow.Table
+        year: double
+        n_legs: int64
+        animals: string
+        ----
+        year: [[2022,2021]]
+        n_legs: [[4,100]]
+        animals: [["Horse","Centipede"]]
+        """
+        return _pc().drop_null(self)
+
+    def take(self, object indices):
+        """
+        Select rows from the table.

Review Comment:
   Updated!



##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.
+        See :func:`pyarrow.compute.drop_null` for full usage.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
+        ...                   'n_legs': [2, 4, 5, 100],
+        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.drop_null()
+        pyarrow.Table
+        year: double
+        n_legs: int64
+        animals: string
+        ----
+        year: [[2022,2021]]
+        n_legs: [[4,100]]
+        animals: [["Horse","Centipede"]]
+        """
+        return _pc().drop_null(self)
+
+    def take(self, object indices):
+        """
+        Select rows from the table.
+
+        See :func:`pyarrow.compute.take` for full usage.
+
+        Parameters
+        ----------
+        indices : Array or array-like
+            The indices in the table whose rows will be returned.
+
+        Returns
+        -------
+        taken : Table

Review Comment:
   Updated!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1163260107


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.

Review Comment:
   I left only the `Table` examples. I can include `RecordBatch` explicitly if preferred.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1173454171


##########
python/pyarrow/table.pxi:
##########
@@ -1696,15 +1820,10 @@ cdef class RecordBatch(_PandasConvertible):
         except TypeError:
             return NotImplemented
 
-    def to_string(self, show_metadata=False):
-        # Use less verbose schema output.
-        schema_as_string = self.schema.to_string(
-            show_field_metadata=show_metadata,
-            show_schema_metadata=show_metadata
-        )
-        return 'pyarrow.{}\n{}'.format(type(self).__name__, schema_as_string)
-
     def __repr__(self):
+        # TODO remove this and update pytests/doctests for

Review Comment:
   I think it makes sense to have a separate PR for the changes in the doctest. Am looking forward to seeing a better repr for the RecordBatch also! =)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1173545030


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,129 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Tabular(_PandasConvertible):
+    """Internal: An interface for common operations on tabular objects."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def __repr__(self):
+        if not self._is_initialized():
+            raise ValueError("This object's internal pointer is NULL, do not "

Review Comment:
   Yes, you can have a table object that no longer is being backed by Arrow data when doing `df = table.to_pandas(self_destruct=True)`. This check and error then prevents getting a segfault when just printing `table` (calling any other method on it will still segfault)
   
   This options seems to have no effect for RecordBatch (I assume this is because the RecordBatch.to_pandas method converts the batch into a Table and then calls Table.to_pandas, so even though the Table C++ object is destructed, the original RecordBatch still owns that data)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1173545863


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,129 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Tabular(_PandasConvertible):
+    """Internal: An interface for common operations on tabular objects."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def __repr__(self):
+        if not self._is_initialized():
+            raise ValueError("This object's internal pointer is NULL, do not "

Review Comment:
   So we need to keep this check, and I think it is fine that for RecordBatch is this essentially never used



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1165656173


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.

Review Comment:
   ```suggestion
           Remove missing values from a RecordBatch or Table.
   ```
   
   Do we always want to mention both like this?
   



##########
python/pyarrow/lib.pxd:
##########
@@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
     cdef getitem(self, int64_t i)
 
 
-cdef class Table(_PandasConvertible):
+cdef class _Table(_PandasConvertible):

Review Comment:
   Do we maybe want to use "Base" in the name (to make it clearer that it is a shared base class?) `_BaseTable` would be fine (or `_BaseTabular`, if that's a better word to describe the commonality between RecordBatch and Table)



##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.

Review Comment:
   Or adding a standard sentence like "The following example uses a Table, but it works the same for RecordBatch". Or would that just be unnecessary noise in most cases?



##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.

Review Comment:
   I think that's fine to just use a single example. We could always add a comment like `# or pa.RecordBatch.from_pandas(df)` above the equivalent Table line, to make it clear how to run the equivalent example with a RecordBatch instead of a Table



##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.
+        See :func:`pyarrow.compute.drop_null` for full usage.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
+        ...                   'n_legs': [2, 4, 5, 100],
+        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.drop_null()
+        pyarrow.Table
+        year: double
+        n_legs: int64
+        animals: string
+        ----
+        year: [[2022,2021]]
+        n_legs: [[4,100]]
+        animals: [["Horse","Centipede"]]
+        """
+        return _pc().drop_null(self)
+
+    def take(self, object indices):
+        """
+        Select rows from the table.
+
+        See :func:`pyarrow.compute.take` for full usage.
+
+        Parameters
+        ----------
+        indices : Array or array-like
+            The indices in the table whose rows will be returned.
+
+        Returns
+        -------
+        taken : Table

Review Comment:
   ```suggestion
           taken : RecordBatch or Table
   ```



##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.
+        See :func:`pyarrow.compute.drop_null` for full usage.
+
+        Examples
+        --------
+        >>> import pyarrow as pa
+        >>> import pandas as pd
+        >>> df = pd.DataFrame({'year': [None, 2022, 2019, 2021],
+        ...                   'n_legs': [2, 4, 5, 100],
+        ...                   'animals': ["Flamingo", "Horse", None, "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> table.drop_null()
+        pyarrow.Table
+        year: double
+        n_legs: int64
+        animals: string
+        ----
+        year: [[2022,2021]]
+        n_legs: [[4,100]]
+        animals: [["Horse","Centipede"]]
+        """
+        return _pc().drop_null(self)
+
+    def take(self, object indices):
+        """
+        Select rows from the table.

Review Comment:
   ```suggestion
           Select rows from the RecordBatch or Table.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1173289267


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,129 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Tabular(_PandasConvertible):
+    """Internal: An interface for common operations on tabular objects."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def __repr__(self):
+        if not self._is_initialized():
+            raise ValueError("This object's internal pointer is NULL, do not "

Review Comment:
   Looking back at when this was added:
   - https://github.com/apache/arrow/pull/6067
   - https://github.com/apache/arrow/pull/6067/files#diff-cede36e8e2e0eb6e6e1ee21745db9687174527f463520c6e6d8b9e8f957bf304
   
   it might make sense to keep it but note I am not familiar with the self-destruct option for table.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1166117745


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,129 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Tabular(_PandasConvertible):
+    """Internal: An interface for common operations on tabular objects."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def __repr__(self):
+        if not self._is_initialized():
+            raise ValueError("This object's internal pointer is NULL, do not "

Review Comment:
   Do we want this `ValueError`? `Table` had it, but `RecordBatch` didn't. It seems superfluous IMO. I had to add `_is_initialized()` so each subclass could implement checking its C++ object for validity.



##########
python/pyarrow/table.pxi:
##########
@@ -1696,15 +1820,10 @@ cdef class RecordBatch(_PandasConvertible):
         except TypeError:
             return NotImplemented
 
-    def to_string(self, show_metadata=False):
-        # Use less verbose schema output.
-        schema_as_string = self.schema.to_string(
-            show_field_metadata=show_metadata,
-            show_schema_metadata=show_metadata
-        )
-        return 'pyarrow.{}\n{}'.format(type(self).__name__, schema_as_string)
-
     def __repr__(self):
+        # TODO remove this and update pytests/doctests for

Review Comment:
   I will follow up with a subsequent diff. When I remove this, `RecordBatch` prints out partial tabular data like `Table`, but a bunch of doctests need to be updated so I felt its better done in a separate diff.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1526029789

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/95da08b47c884413b1029717262cf65a...e90a799eba894429b87bfa49f9ea4e17/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1500594698

   * Closes: #34979


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1500597881

   The first commit moves the `take()` method from the individual `Table` and `RecordBatch` classes to the base class `_Table`. The docstring is shared, but ends up being a bit confusing. I think the best path forward is to refactor this and create an internal method on the `_Table` class e.g. `_Table._take()` and override it in the subclasses with a subclass-specific docstring.
   
   E.g.
   ```
   cdef class Table(_Table):
       def take(self, object indices):
       """ <Class-specific docstring> """
           return self._take(indices)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1507722772

   I addressed the comments and also added methods `to_string()` and `__repr__` to the base class because I was testing out docstrings for `Table` vs `RecordBatch` for similarity. Sorry, I hope not to add too much more scope to this PR! 😬 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1165907420


##########
python/pyarrow/lib.pxd:
##########
@@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
     cdef getitem(self, int64_t i)
 
 
-cdef class Table(_PandasConvertible):
+cdef class _Table(_PandasConvertible):

Review Comment:
   How about `_Tabular`? It's an adjective so `class RecordBatch(_Tabular)` and `class Table(_Tabular)` sound very descriptive IMO. My initial thought is that adding `Base` here is redundant, but let me know if you think otherwise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1500624463

   Docstrings look good with commit 2, but we do have a wrapper to maintain now.
   
   ```
   >>> print(pa.RecordBatch.take.__doc__)
   RecordBatch.take(self, indices)
   
           Select rows from the record batch.
           ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1503286867

   > The docstring is shared, but ends up being a bit confusing.
   
   On the other hand, a significant part of the usefulness of sharing the implementation is lost if we can't share the docstring? (especially in this example where the function is only a one liner, for others that might be more worth it) 
   What did you find confusing about it? (the need to constantly say "RecordBatch or Table"?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1165908345


##########
python/pyarrow/table.pxi:
##########
@@ -1450,8 +1450,76 @@ cdef _sanitize_arrays(arrays, names, schema, metadata,
             converted_arrays.append(item)
     return converted_arrays
 
+cdef class _Table(_PandasConvertible):
+    """Internal: An interface for common table operations."""
 
-cdef class RecordBatch(_PandasConvertible):
+    def __init__(self):
+        raise TypeError("This object is not instantiable, "
+                        "use a subclass instead.")
+
+    def drop_null(self):
+        """
+        Remove missing values from a Table.

Review Comment:
   Updated. I could also change to `Remove missing values from tabular object.`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34980:
URL: https://github.com/apache/arrow/pull/34980#discussion_r1173538663


##########
python/pyarrow/lib.pxd:
##########
@@ -469,15 +469,19 @@ cdef class ChunkedArray(_PandasConvertible):
     cdef getitem(self, int64_t i)
 
 
-cdef class Table(_PandasConvertible):
+cdef class _Table(_PandasConvertible):

Review Comment:
   "tabular" sounds good!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34980: GH-34979: [Python] Create a base class for Table and RecordBatch

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34980:
URL: https://github.com/apache/arrow/pull/34980#issuecomment-1526029105

   Benchmark runs are scheduled for baseline = 07d02d6ccaf2521cba102ac910c4def97cd3610c and contender = 7bf1dec7f8bab44522fcbf84263045cc5a24e534. 7bf1dec7f8bab44522fcbf84263045cc5a24e534 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/0f293f1ea5034bf3abb74dd678a5189b...8f561bfc16e84141b91508f557644d10/)
   [Finished :arrow_down:10.69% :arrow_up:6.3%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/95da08b47c884413b1029717262cf65a...e90a799eba894429b87bfa49f9ea4e17/)
   [Failed] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1c1c63f112574ad1804302b89f102dd3...c2b74a874cab476492979286c803031f/)
   [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/300c56bb28cb4c7fb16db4d17ac53677...0769f2f369684bdf8c6eb461b6729ef5/)
   Buildkite builds:
   [Finished] [`7bf1dec7` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2791)
   [Finished] [`7bf1dec7` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2825)
   [Finished] [`7bf1dec7` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2789)
   [Finished] [`7bf1dec7` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2816)
   [Finished] [`07d02d6c` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2790)
   [Finished] [`07d02d6c` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2824)
   [Failed] [`07d02d6c` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2788)
   [Failed] [`07d02d6c` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2815)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org