Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2022/04/08 16:54:00 UTC
[jira] [Commented] (SPARK-38833) PySpark applyInPandas should allow to return empty DataFrame without columns
[ https://issues.apache.org/jira/browse/SPARK-38833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519712#comment-17519712 ]
Apache Spark commented on SPARK-38833:
--------------------------------------
User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/36120
> PySpark applyInPandas should allow to return empty DataFrame without columns
> ----------------------------------------------------------------------------
>
> Key: SPARK-38833
> URL: https://issues.apache.org/jira/browse/SPARK-38833
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.4.0
> Reporter: Enrico Minack
> Priority: Major
>
> Currently, returning an empty pandas DataFrame from the function passed to {{applyInPandas}} raises an error:
> {noformat}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 2 Actual: 0
> {noformat}
> Here is an example:
> {code}
> import pandas as pd
> from pyspark.sql import SparkSession
>
> # Reuse the active session (for example in the pyspark shell) or create one.
> spark = SparkSession.builder.getOrCreate()
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> def mean_func(key, pdf):
>     if key == (1,):
>         # Empty DataFrame without columns: this triggers the error above.
>         return pd.DataFrame([])
>     else:
>         return pd.DataFrame([key + (pdf.v.mean(),)])
>
> df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
> {code}
> Since the schema is defined when calling {{applyInPandas()}}, it looks redundant to define the columns when returning an empty {{pd.DataFrame}}. Returning a non-empty DataFrame does not require defining columns, so returning an empty DataFrame shouldn't require that either.
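
Until the behaviour changes, the description above implies a workaround: give the empty DataFrame the same columns as the declared schema. The following is a minimal sketch of that idea, not code from the pull request. The constant SCHEMA_COLUMNS is a hypothetical helper, and the function is exercised with plain pandas so it can be checked without a Spark session. It assumes the error comes purely from the column-count check quoted in the message.

```python
import pandas as pd

# Hypothetical helper: column names mirroring the schema "id long, v double".
SCHEMA_COLUMNS = ["id", "v"]


def mean_func(key, pdf):
    """Return one (id, mean of v) row per group, or no rows for id == 1."""
    if key == (1,):
        # Empty result that still carries the expected two columns.
        return pd.DataFrame(columns=SCHEMA_COLUMNS)
    return pd.DataFrame([key + (pdf.v.mean(),)], columns=SCHEMA_COLUMNS)


# Call the function directly on pandas groups, mimicking what
# applyInPandas would pass in for each value of id.
group_1 = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
group_2 = pd.DataFrame({"id": [2, 2, 2], "v": [3.0, 5.0, 10.0]})

empty_result = mean_func((1,), group_1)
mean_result = mean_func((2,), group_2)

print(len(empty_result.columns), len(empty_result))
print(mean_result)
```

With Spark, the same function would be passed as in the report: df.groupby('id').applyInPandas(mean_func, schema="id long, v double"). The empty result then has two columns, so it should satisfy the column-count check.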