You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Jan Bols (JIRA)" <ji...@apache.org> on 2018/10/17 11:12:00 UTC
[jira] [Created] (SPARK-25756) pyspark pandas_udf does not respect
append outputMode in structured streaming
Jan Bols created SPARK-25756:
--------------------------------
Summary: pyspark pandas_udf does not respect append outputMode in structured streaming
Key: SPARK-25756
URL: https://issues.apache.org/jira/browse/SPARK-25756
Project: Spark
Issue Type: Bug
Components: PySpark, Structured Streaming
Affects Versions: 2.3.2
Reporter: Jan Bols
When using the following setup:
* structured streaming
* a watermark and groupBy followed by an apply using a pandas grouped map udf
* a sink using an append outputMode
I would expect the following:
* udf to be called for each group --> OK
* when new data arrives, the udf will be called again –> OK
* when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
* the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output
It looks like pandas udf is unusable for structured streaming this way.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org