Posted to issues@spark.apache.org by "Nasir Ali (Jira)" <ji...@apache.org> on 2019/09/21 15:18:00 UTC
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935065#comment-16935065 ]
Nasir Ali commented on SPARK-28502:
-----------------------------------
Any update?
> Error with struct conversion while using pandas_udf
> ---------------------------------------------------
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
> Reporter: Nasir Ali
> Priority: Minor
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day window) and perform some operations on the dataframe using (pandas) UDFs. I don't know if there is a better/cleaner way to do it.
> Below are the sample code that I tried and the error message I am getting.
>
> {code:java}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
>
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                                    (13.00, "2018-03-11T12:27:18+00:00"),
>                                    (25.00, "2018-03-12T11:27:18+00:00"),
>                                    (20.00, "2018-03-13T15:27:18+00:00"),
>                                    (17.00, "2018-03-14T12:27:18+00:00"),
>                                    (99.00, "2018-03-15T11:27:18+00:00"),
>                                    (156.00, "2018-03-22T11:27:18+00:00"),
>                                    (17.00, "2018-03-31T11:27:18+00:00"),
>                                    (25.00, "2018-03-15T11:27:18+00:00"),
>                                    (25.00, "2018-03-16T11:27:18+00:00")],
>                                   ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
>
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
>
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
>
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws the following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
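>
> The unsupported type is the grouping key itself: F.window("ts", "15 days") produces a struct<start, end> column, and the Arrow conversion used by GROUPED_MAP pandas UDFs apparently cannot convert struct-typed columns back to pandas. A possible workaround (a sketch I have not verified on 2.4.3; win_start and some_flat_udf are names made up for illustration) is to flatten the window into a plain timestamp key before grouping, so no struct column reaches the UDF:
> {code:java}
> from pyspark.sql.types import DoubleType
>
> # Flatten the window struct into a plain timestamp column before grouping.
> df2 = (df
>        .withColumn("win", F.window("ts", "15 days"))
>        .withColumn("win_start", F.col("win.start"))
>        .drop("win"))
>
> # The UDF returns the pandas DataFrame unchanged, so the output schema
> # lists all three columns; id holds doubles in the sample data.
> flat_schema = StructType([
>     StructField("id", DoubleType()),
>     StructField("ts", TimestampType()),
>     StructField("win_start", TimestampType())
> ])
>
> @pandas_udf(flat_schema, PandasUDFType.GROUPED_MAP)
> def some_flat_udf(pdf):
>     # some computation
>     return pdf
>
> df2.groupby("id", "win_start").apply(some_flat_udf).show()
> {code}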
>
> However, if I use a built-in agg method, it works fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}
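>
> For the built-in agg path the struct never crosses into Python, so it can simply be flattened after the aggregation. A sketch of the same query with the window bounds pulled out as plain columns:
> {code:java}
> (df.groupby("id", F.window("ts", "15 days"))
>    .agg(F.mean("id").alias("avg_id"))
>    .select("id",
>            F.col("window.start").alias("win_start"),
>            F.col("window.end").alias("win_end"),
>            "avg_id")
>    .show(truncate=False))
> {code}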