Posted to issues@spark.apache.org by "Jan Bols (JIRA)" <ji...@apache.org> on 2018/10/17 11:12:00 UTC

[jira] [Created] (SPARK-25756) pyspark pandas_udf does not respect append outputMode in structured streaming

Jan Bols created SPARK-25756:
--------------------------------

             Summary: pyspark pandas_udf does not respect append outputMode in structured streaming
                 Key: SPARK-25756
                 URL: https://issues.apache.org/jira/browse/SPARK-25756
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Structured Streaming
    Affects Versions: 2.3.2
            Reporter: Jan Bols


When using the following setup:
 * structured streaming
 * a watermark and a groupBy, followed by an apply using a pandas grouped map udf (PandasUDFType.GROUPED_MAP)
 * a sink using the append outputMode
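
For reference, the setup above could look roughly like the following. This is a minimal sketch against the Spark 2.3 API, not the reporter's actual code: the rate source, the derived `key` column, the 10 second watermark, the console sink and the names `summarize` and `build_query` are all assumptions made for illustration.

```python
import pandas as pd


def summarize(pdf):
    """Grouped map body: return one row with the group key and the number of rows received."""
    return pd.DataFrame({
        "key": [pdf["key"].iloc[0]],
        "rows_seen": [len(pdf)],
    })


def build_query(spark):
    """Start a streaming query: watermark -> groupBy -> apply(pandas udf) -> append sink."""
    # Imported here so that summarize() can be tried with pandas alone.
    from pyspark.sql.functions import PandasUDFType, col, pandas_udf

    # Assumed source: the built-in rate source, which emits `timestamp` and `value` columns.
    events = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 5)
              .load()
              .withColumn("key", col("value") % 3))  # hypothetical grouping key

    summarize_udf = pandas_udf(
        summarize, "key long, rows_seen long", PandasUDFType.GROUPED_MAP)

    return (events
            .withWatermark("timestamp", "10 seconds")  # assumed watermark delay
            .groupBy("key")
            .apply(summarize_udf)
            .writeStream
            .outputMode("append")
            .format("console")  # assumed sink
            .start())
```

Logging `rows_seen` per key across triggers is one way to check whether the dataframe passed to the udf grows or shrinks for a given group.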

I would expect the following:
 * the udf is called for each group --> OK
 * when new data arrives, the udf is called again --> OK
 * when new data arrives for the same group, the udf is called with the complete pandas dataframe of all data received for that group (up to the watermark) --> NOK: within the same group, the size of the pandas dataframe can decrease between invocations
 * the results are written to the sink only once the processing time has passed the watermark --> NOK: every time the udf is called, new results are sent to the output
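
The difference between the third expectation and the observed behavior can be illustrated with pandas alone. The sketch below is not a Spark reproduction. It assumes, purely for illustration, that each udf invocation receives only the rows that arrived in the current trigger, which is one possible reading of the shrinking dataframe and is not confirmed here. The batches and the `rows_seen` helper are hypothetical.

```python
import pandas as pd


def rows_seen(pdf):
    """Stand-in for the udf body: report how many rows the invocation received."""
    return len(pdf)


# Hypothetical data for a single group "a", arriving in two triggers.
batches = [
    pd.DataFrame({"key": ["a"] * 3, "value": [1, 2, 3]}),
    pd.DataFrame({"key": ["a"], "value": [4]}),
]

# Expected in the report: every invocation sees all rows received so far for the group.
received = []
expected_sizes = []
for batch in batches:
    received.append(batch)
    expected_sizes.append(rows_seen(pd.concat(received, ignore_index=True)))

# Assumed per-trigger behavior: every invocation sees only the rows of the current batch.
per_batch_sizes = [rows_seen(batch) for batch in batches]

print(expected_sizes)   # [3, 4]
print(per_batch_sizes)  # [3, 1]
```

Under that assumption the size drops from 3 to 1 between invocations for the same group, which matches the symptom described in the third bullet.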

As a result, a pandas grouped map udf appears to be unusable in structured streaming with this setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
