Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2018/12/19 17:37:00 UTC

[jira] [Created] (SPARK-26410) Support per Pandas UDF configuration

Xiangrui Meng created SPARK-26410:
-------------------------------------

             Summary: Support per Pandas UDF configuration
                 Key: SPARK-26410
                 URL: https://issues.apache.org/jira/browse/SPARK-26410
             Project: Spark
          Issue Type: New Feature
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng


We use a "maxRecordsPerBatch" conf ("spark.sql.execution.arrow.maxRecordsPerBatch") to control the Arrow batch sizes. However, the "right" batch size usually depends on the task itself. It would be nice if users could configure the batch size when they declare the Pandas UDF.
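One way the declaration-time API could look is sketched below. This is purely an illustration of the proposal, not Spark's real API: the `batch_size` keyword and the stand-in `pandas_udf` decorator are assumptions, showing only how a per-UDF override of the global conf might be attached to the function object for the executor to read.

```python
# Hypothetical sketch only -- NOT Spark's actual pandas_udf API.
# The batch_size keyword is the proposed (assumed) per-UDF override
# of spark.sql.execution.arrow.maxRecordsPerBatch.

def pandas_udf(return_type, batch_size=None):
    def wrap(fn):
        fn.return_type = return_type
        fn.batch_size = batch_size  # per-UDF batch size, None = use global conf
        return fn
    return wrap

@pandas_udf("double", batch_size=100)
def predict1(features):
    # toy stand-in for a model scoring a batch of feature values
    return [x * 2.0 for x in features]
```

The executor could then consult `predict1.batch_size` when slicing rows into Arrow record batches, falling back to the session conf when it is None.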

This is orthogonal to SPARK-23258 (using max buffer size instead of row count).

Besides the API, we should also discuss how to merge Pandas UDFs with different configurations. For example,

{code}
df.select(predict1(col("features")), predict2(col("features")))
{code}

where predict1 requests 100 rows per batch while predict2 requests 120 rows per batch.
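One possible merge strategy (an assumption for discussion, not a decided design) is to transfer rows at the greatest common divisor of the requested sizes, so each UDF can regroup its input to its own batch size without ever splitting a transferred batch. A minimal pure-Python sketch of the regrouping step:

```python
from math import gcd

def rebatch(rows, size):
    """Group a flat row iterator into lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Hypothetical merge rule: move data at gcd(100, 120) = 20 rows,
# then each UDF reassembles 100- or 120-row batches on its side.
transfer_size = gcd(100, 120)
```

Other strategies (e.g. simply taking the minimum of the requested sizes, or splitting the projection into separate evaluation stages per UDF) trade fewer transfers against larger intermediate buffers, which is exactly the design question to settle here.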

cc: [~icexelloss] [~bryanc] [~holdenk] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
