Posted to issues@spark.apache.org by "Raman Srinivasan (Jira)" <ji...@apache.org> on 2021/02/03 19:17:00 UTC
[jira] [Updated] (SPARK-34348) applyInPandas doesn't seem to work with StructType output schema
[ https://issues.apache.org/jira/browse/SPARK-34348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raman Srinivasan updated SPARK-34348:
-------------------------------------
Description:
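{code:java}
from pyspark.sql.types import IntegerType, StructField

# Assumes an existing SparkSession named `spark`
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding the rows of one group
    pdf['count'] = pdf.shape[0]  # add the group's row count as a new column
    return pdf
{code}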
Using a DDL-formatted string for the output schema works fine:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count int").show()
+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}
But using a StructType schema (appending an integer count column) fails:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema=df.schema.add(StructField('count', IntegerType(), False))).show()
AnalysisException: Cannot resolve column name "count" among (id, v);
{code}
It appears to be looking for the new return field in the input schema?
As a workaround, is there a toDDL method I can use to get the current schema as a DDL string to which I can append the new return fields?
was:
{code:java}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    pdf['count'] = pdf.shape[0]
    return pdf
{code}
Using a DDL-formatted string for the output schema works fine:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count int").show()
+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}
But using a StructType schema (appending an integer count column) fails:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema=df.schema.add(StructField('count', IntegerType(), False))).show()
AnalysisException: Cannot resolve column name "count" among (id, v);
{code}
It appears to be looking for the new return field in the input schema?
> applyInPandas doesn't seem to work with StructType output schema
> -----------------------------------------------------------------
>
> Key: SPARK-34348
> URL: https://issues.apache.org/jira/browse/SPARK-34348
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1
> Reporter: Raman Srinivasan
> Priority: Major
>