You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "ianmcook (via GitHub)" <gi...@apache.org> on 2024/03/12 21:33:07 UTC

[PR] [WIP][SPARK-47365] Add toArrow() DataFrame method to PySpark [spark]

ianmcook opened a new pull request, #45481:
URL: https://github.com/apache/spark/pull/45481

   ### What changes were proposed in this pull request?
   This adds an experimental PySpark DataFrame method `_toArrow()`. This returns the contents of the DataFrame as a [PyArrow Table](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html).
   
   ### Why are the changes needed?
   In the Apache Arrow community, we hear from a lot of users who want to return the contents of a PySpark DataFrame as a PyArrow Table. Currently the only documented way to do this is to return the contents as a pandas DataFrame, then use PyArrow (`pa`) to convert that to a PyArrow Table.
   ```py
   pa.Table.from_pandas(df.toPandas())
   ```
   Going through pandas adds significant overhead which is easily avoided since internally `toPandas()` already converts the contents of Spark DataFrame to Arrow format when `spark.sql.execution.arrow.pyspark.enabled` is `true`.
   
   Currently it is also possible to use the experimental `_collect_as_arrow()` method to return the contents of a PySpark DataFrame as a list of PyArrow RecordBatches. This PR adds another experimental method `_toArrow()` which builds on that and returns the more user-friendly PyArrow Table object. It handles the case where the DataFrame has zero rows, returning a zero-row PyArrow Table with the schema included (whereas `_collect_as_arrow()` returns an empty Python list).
   
   ### Does this PR introduce _any_ user-facing change?
   It adds an experimental DataFrame method `_toArrow()` to the PySpark SQL DataFrame API. It does not introduce any other user-facing changes.
   
   ### How was this patch tested?
   Tests are TBD
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594903656


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   Either name is fine by me, but @grundprinzip  commented that he prefers `toArrowTable ` https://github.com/apache/spark/pull/45481#discussion_r1592074090



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592333598


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":

Review Comment:
   I updated the PR to make the method public and rename it `toArrowTable`.
   
   I will also push a commit here that adds some relevant content to the [Apache Arrow in PySpark](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html) page.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1576718516


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,12 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":
+        table = self._to_table()[0]
+        return table
+
+    _toArrow.__doc__ = PySparkDataFrame._toArrow.__doc__

Review Comment:
   @HyukjinKwon is this line no longer needed after https://github.com/apache/spark/pull/46129?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1595134699


##########
examples/src/main/python/sql/arrow.py:
##########
@@ -33,6 +33,22 @@
 require_minimum_pyarrow_version()
 
 
+def dataframe_to_arrow_table_example(spark: SparkSession) -> None:
+    import pyarrow as pa  # noqa: F401
+    from pyspark.sql.functions import rand
+
+    # Create a Spark DataFrame
+    df = spark.range(100).drop("id").withColumns({"0": rand(), "1": rand(), "2": rand()})
+
+    # Convert the Spark DataFrame to a PyArrow Table
+    table = df.select("*").toArrow()  # type: ignore

Review Comment:
   ```suggestion
       table = df.select("*").toArrow()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1583675813


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,12 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":
+        table = self._to_table()[0]
+        return table
+
+    _toArrow.__doc__ = PySparkDataFrame._toArrow.__doc__

Review Comment:
   I assume this line is no longer needed. This is what's causing the CI to fail. I will remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1595134699


##########
examples/src/main/python/sql/arrow.py:
##########
@@ -33,6 +33,22 @@
 require_minimum_pyarrow_version()
 
 
+def dataframe_to_arrow_table_example(spark: SparkSession) -> None:
+    import pyarrow as pa  # noqa: F401
+    from pyspark.sql.functions import rand
+
+    # Create a Spark DataFrame
+    df = spark.range(100).drop("id").withColumns({"0": rand(), "1": rand(), "2": rand()})
+
+    # Convert the Spark DataFrame to a PyArrow Table
+    table = df.select("*").toArrow()  # type: ignore

Review Comment:
   ```suggestion
       table = df.select("*").toArrow()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2100720402

   Thanks @HyukjinKwon. Anything else you need before merging this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594898205


##########
python/pyspark/sql/dataframe.py:
##########
@@ -6213,6 +6214,31 @@ def mapInArrow(
         """
         ...
 

Review Comment:
   yes please



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2101763698

   Thanks. I rebased. I also took a closer look at your changes in #46129 and made a few changes for consistency with the new structure you introduced there:
   - I added a definition for `toArrowTable` in `python/pyspark/sql/classic/dataframe.py`.
   - I added a placeholder definition for `toArrowTable` in `python/pyspark/sql/dataframe.py`
   - I moved the docstring to there from `python/pyspark/sql/pandas/conversion.py`.
   
   I also added `.. versionadded:: 4.0.0` in the docstring.
   
   Please let me know if that all looks OK.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-47365] Add toArrow() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1527183210


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -241,21 +241,81 @@ def toPandas(self) -> "PandasDataFrameLike":
         else:
             return pdf
 
-    def _collect_as_arrow(self, split_batches: bool = False) -> List["pa.RecordBatch"]:

Review Comment:
   Ok, understood. I will make some changes here to reflect that.
   
   P.S. The git diff here is confusing and makes it look like I am proposing a rename and full reimplementation of `_collect_as_arrow` which I am not :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594904286


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   Let's go with toArrow. I think this kind of naming is more common, e.g., https://datatable.readthedocs.io/en/latest/api/frame/to_arrow.html https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.to_arrow.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594902851


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   Yeah, let's probably name it `toArrow` just to match with others, e.g., https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594859522


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   > Let's also add the abstract method at the parent dataframe `pyspark.sql.datafram`
   
   I don't see where to do this. Can you clarify please? Thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592339040


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":

Review Comment:
   I expect that this will be used by some end users, not only by developers. In recent years, PyArrow has become a very widely used library (>100M downloads/month on PyPI, ~50th most popular library) and we see a lot of non-developers using it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594903734


##########
python/pyspark/sql/dataframe.py:
##########
@@ -6213,6 +6214,31 @@ def mapInArrow(
         """
         ...
 
+    def toArrowTable(self) -> "pa.Table":
+        """
+        Returns the contents of this :class:`DataFrame` as PyArrow ``pyarrow.Table``.
+
+        This is only available if PyArrow is installed and available.
+
+        .. versionadded:: 4.0.0
+
+        Notes
+        -----
+        This method should only be used if the resulting PyArrow ``pyarrow.Table`` is
+        expected to be small, as all the data is loaded into the driver's memory.

Review Comment:
   ```suggestion
           expected to be small, as all the data is loaded into the driver's memory.
   
           This API is a developer API.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2099437696

   Okay I'm fine with this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594900269


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   One last thing. Should we name it `toArrow` just to match with others like `toPandas`, `toJSON`, etc.? I don't have a strong preference though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594908506


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   Sounds good. Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594878229


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   Here https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594814378


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   Let's also add the abstract method at the parent dataframe `pyspark.sql.dataframe` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594880292


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   Thanks.  I think I added what's needed there. It's a placeholder definition, like what you have there for `toPandas`.  See my other comment below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594878401


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   I think you should probably rebase it against the master branch. That refactoring happened lately (by me .. )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594881200


##########
python/pyspark/sql/dataframe.py:
##########
@@ -6213,6 +6214,31 @@ def mapInArrow(
         """
         ...
 

Review Comment:
   Do I need ` @dispatch_df_method` here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2020635088

   cc @xinrong-meng @itholic @dongjoon-hyun


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592915751


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -225,15 +225,68 @@ def toPandas(self) -> "PandasDataFrameLike":
         else:
             return pdf
 
-    def _collect_as_arrow(self, split_batches: bool = False) -> List["pa.RecordBatch"]:
+    def toArrowTable(self) -> "pa.Table":
         """
-        Returns all records as a list of ArrowRecordBatches, pyarrow must be installed
+        Returns the contents of this :class:`DataFrame` as PyArrow ``pyarrow.Table``.
+
+        This is only available if PyArrow is installed and available.
+
+        Notes
+        -----
+        This method should only be used if the resulting PyArrow ``pyarrow.Table`` is
+        expected to be small, as all the data is loaded into the driver's memory.
+
+        Examples
+        --------
+        >>> df.toArrowTable()  # doctest: +SKIP
+        pyarrow.Table
+        age: int64
+        name: string
+        ----
+        age: [[2,5]]
+        name: [["Alice","Bob"]]
+        """
+        from pyspark.sql.dataframe import DataFrame
+
+        assert isinstance(self, DataFrame)
+
+        jconf = self.sparkSession._jconf
+
+        from pyspark.sql.pandas.types import to_arrow_schema
+        from pyspark.sql.pandas.utils import require_minimum_pyarrow_version
+
+        require_minimum_pyarrow_version()
+        to_arrow_schema(self.schema)
+
+        import pyarrow as pa
+
+        self_destruct = jconf.arrowPySparkSelfDestructEnabled()

Review Comment:
   Added in fd76fa3. Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594903656


##########
python/docs/source/reference/pyspark.sql/dataframe.rst:
##########
@@ -109,6 +109,7 @@ DataFrame
     DataFrame.tail
     DataFrame.take
     DataFrame.to
+    DataFrame.toArrowTable

Review Comment:
   Either name is fine by me, but @grundprinzip  commented that he prefers `toArrowTable` https://github.com/apache/spark/pull/45481#discussion_r1592074090
   
   PyArrow defines a couple of other DataFrame-like objects (RecordBatch, RecordBatchReader) so there is some benefit to being more specific in the naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1594814188


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def toArrowTable(self) -> "pa.Table":

Review Comment:
   Can we list this to `python/docs/source/reference/pyspark.sql/dataframe.rst`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #45481: [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark
URL: https://github.com/apache/spark/pull/45481


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [WIP][SPARK-47365] Add toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1527165494


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -241,21 +241,81 @@ def toPandas(self) -> "PandasDataFrameLike":
         else:
             return pdf
 
-    def _collect_as_arrow(self, split_batches: bool = False) -> List["pa.RecordBatch"]:

Review Comment:
   This isn't an API; I wouldn't fix it for external purposes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592194102


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":

Review Comment:
   Yeah, the current state looks a bit weird. We should either explicitly add an developer API, or just don't. My only question is that we can work around via `SparkSession.client.to_table()`, and we're going forward to Spark Connect default. Considering this is a developer API, I wonder if we should add this into DataFrame at this moment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592915751


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -225,15 +225,68 @@ def toPandas(self) -> "PandasDataFrameLike":
         else:
             return pdf
 
-    def _collect_as_arrow(self, split_batches: bool = False) -> List["pa.RecordBatch"]:
+    def toArrowTable(self) -> "pa.Table":
         """
-        Returns all records as a list of ArrowRecordBatches, pyarrow must be installed
+        Returns the contents of this :class:`DataFrame` as PyArrow ``pyarrow.Table``.
+
+        This is only available if PyArrow is installed and available.
+
+        Notes
+        -----
+        This method should only be used if the resulting PyArrow ``pyarrow.Table`` is
+        expected to be small, as all the data is loaded into the driver's memory.
+
+        Examples
+        --------
+        >>> df.toArrowTable()  # doctest: +SKIP
+        pyarrow.Table
+        age: int64
+        name: string
+        ----
+        age: [[2,5]]
+        name: [["Alice","Bob"]]
+        """
+        from pyspark.sql.dataframe import DataFrame
+
+        assert isinstance(self, DataFrame)
+
+        jconf = self.sparkSession._jconf
+
+        from pyspark.sql.pandas.types import to_arrow_schema
+        from pyspark.sql.pandas.utils import require_minimum_pyarrow_version
+
+        require_minimum_pyarrow_version()
+        to_arrow_schema(self.schema)
+
+        import pyarrow as pa
+
+        self_destruct = jconf.arrowPySparkSelfDestructEnabled()

Review Comment:
   Added in c49fe2b. Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.

grundprinzip commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592074090


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":

Review Comment:
   @HyukjinKwon I think it would be fine to make this a regular public method instead of the `_toArrow` maybe change it to `toArrowTable` to give a better indication of the return type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add _toArrow() DataFrame method to PySpark [spark]

Posted by "grundprinzip (via GitHub)" <gi...@apache.org>.

grundprinzip commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592071117


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":
+        table = self._to_table()[0]

Review Comment:
   nit: Using this notion, makes the code a bit more readable and makes sure it fails if _to_table ever changes the tuple size, and thus behavior.
   
   ```suggestion
           table, _ = self._to_table()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.

xinrong-meng commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592887464


##########
python/pyspark/sql/pandas/conversion.py:
##########
@@ -225,15 +225,68 @@ def toPandas(self) -> "PandasDataFrameLike":
         else:
             return pdf
 
-    def _collect_as_arrow(self, split_batches: bool = False) -> List["pa.RecordBatch"]:
+    def toArrowTable(self) -> "pa.Table":
         """
-        Returns all records as a list of ArrowRecordBatches, pyarrow must be installed
+        Returns the contents of this :class:`DataFrame` as PyArrow ``pyarrow.Table``.
+
+        This is only available if PyArrow is installed and available.
+
+        Notes
+        -----
+        This method should only be used if the resulting PyArrow ``pyarrow.Table`` is
+        expected to be small, as all the data is loaded into the driver's memory.
+
+        Examples
+        --------
+        >>> df.toArrowTable()  # doctest: +SKIP
+        pyarrow.Table
+        age: int64
+        name: string
+        ----
+        age: [[2,5]]
+        name: [["Alice","Bob"]]
+        """
+        from pyspark.sql.dataframe import DataFrame
+
+        assert isinstance(self, DataFrame)
+
+        jconf = self.sparkSession._jconf
+
+        from pyspark.sql.pandas.types import to_arrow_schema
+        from pyspark.sql.pandas.utils import require_minimum_pyarrow_version
+
+        require_minimum_pyarrow_version()
+        to_arrow_schema(self.schema)
+
+        import pyarrow as pa
+
+        self_destruct = jconf.arrowPySparkSelfDestructEnabled()

Review Comment:
   Shall we document under "Setting Arrow self_destruct for memory savings" of `python/docs/source/user_guide/sql/arrow_pandas.rst` that Conversion to Arrow Table will also be influenced by the config?
   Following "which can save memory when creating a Pandas DataFrame via toPandas..." for example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrowTable() DataFrame method to PySpark [spark]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.

ianmcook commented on code in PR #45481:
URL: https://github.com/apache/spark/pull/45481#discussion_r1592333598


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1775,6 +1775,10 @@ def _to_table(self) -> Tuple["pa.Table", Optional[StructType]]:
         assert table is not None
         return (table, schema)
 
+    def _toArrow(self) -> "pa.Table":

Review Comment:
   I updated the PR to make the method public and rename it `toArrowTable`.
   
   I will also push a commit here that adds some relevant content to the [Apache Arrow in PySpark](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html) page. [update: done]



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47365][PYTHON] Add toArrow() DataFrame method to PySpark [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #45481:
URL: https://github.com/apache/spark/pull/45481#issuecomment-2102195319

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org