You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/10/22 04:41:00 UTC
[jira] [Commented] (SPARK-25756) pyspark pandas_udf does not
respect append outputMode in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-25756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658567#comment-16658567 ]
Hyukjin Kwon commented on SPARK-25756:
--------------------------------------
1. Can you add a self-contained reproducer please? for instance, with {{rate}} source or {{socket}} source.
2. Does this only happen in pandas_udf (not normal python udf)?
3. Do you mind showing expected results and the current results?
> pyspark pandas_udf does not respect append outputMode in structured streaming
> -----------------------------------------------------------------------------
>
> Key: SPARK-25756
> URL: https://issues.apache.org/jira/browse/SPARK-25756
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Structured Streaming
> Affects Versions: 2.3.2
> Reporter: Jan Bols
> Priority: Major
>
> When using the following setup:
> * structured streaming
> * a watermark and groupBy followed by an apply using a pandas grouped map udf
> * a sink using an append outputMode
> I would expect the following:
> * udf to be called for each group --> OK
> * when new data arrives, the udf will be called again –> OK
> * when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
> * the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output
> It looks like pandas udf is unusable for structured streaming this way.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org