You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/10/22 04:41:00 UTC

[jira] [Commented] (SPARK-25756) pyspark pandas_udf does not respect append outputMode in structured streaming

    [ https://issues.apache.org/jira/browse/SPARK-25756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658567#comment-16658567 ] 

Hyukjin Kwon commented on SPARK-25756:
--------------------------------------

1. Can you add a self-contained reproducer please? for instance, with {{rate}} source or {{socket}} source.
2. Does this only happen in pandas_udf (not normal python udf)?
3. Do you mind showing expected results and the current results?

> pyspark pandas_udf does not respect append outputMode in structured streaming
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-25756
>                 URL: https://issues.apache.org/jira/browse/SPARK-25756
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 2.3.2
>            Reporter: Jan Bols
>            Priority: Major
>
> When using the following setup:
>  * structured streaming
>  * a watermark and groupBy followed by an apply using a pandas grouped map udf
>  * a sink using an append outputMode
> I would expect the following:
>  * udf to be called for each group --> OK
>  * when new data arrives, the udf will be called again –> OK
>  * when new data arrives for the same group, the udf will be called with the complete pandas dataframe of all received data for that group (up till the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
>  * the results are only written to the sink once the processing time is passed the watermark --> NOK: every time the udf is called, new results are being sent to the output
> It looks like pandas udf is unusable for structured streaming this way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org