Posted to issues@spark.apache.org by "Frederik (JIRA)" <ji...@apache.org> on 2018/10/22 14:32:00 UTC

[jira] [Created] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns

Frederik created SPARK-25801:
--------------------------------

             Summary: pandas_udf grouped_map fails with input dataframe with more than 255 columns
                 Key: SPARK-25801
                 URL: https://issues.apache.org/jira/browse/SPARK-25801
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
         Environment: python 2.7

pyspark 2.3.0
            Reporter: Frederik


Hi,

I'm using a pandas_udf to deploy a model that scores all samples in a Spark dataframe.

For this I use a UDF as follows:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# model is a fitted sklearn classifier available in the closure
@pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
def predict_scores(pdf):
    score_values = model.predict_proba(pdf)[:, 1]
    return pd.DataFrame({'scores': score_values})
So it takes a pandas DataFrame, predicts the probability of the positive class for each row with an sklearn model, and returns the result as a single column. This works fine with an arbitrary groupBy, e.g.:
sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
as long as the dataframe has fewer than 255 columns. When the input dataframe has more than 255 columns (i.e. more than 255 features in my model), I get:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
    mapper = eval(mapper_str, udfs)
  File "<string>", line 1
SyntaxError: more than 255 arguments
This seems to be related to Python's general limitation of not allowing more than 255 arguments in a function call?
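
For what it's worth, the limit can be reproduced in plain Python 2.7 without Spark (a minimal sketch; the 300-argument lambda call below is just an illustration, not Spark code). A call expression that spells out more than 255 arguments fails to compile with the same SyntaxError that the worker hits when it eval()s the generated mapper expression:

# Build a call expression with 300 literal arguments and eval() it,
# roughly what eval(mapper_str, udfs) does in pyspark/worker.py.
args = ", ".join(str(i) for i in range(300))
src = "(lambda *a: len(a))(%s)" % args
try:
    eval(src)
except SyntaxError as e:
    print(e)  # on Python 2.7: more than 255 arguments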


Is this a bug or is there a straightforward way around this problem?
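
For context, the only thing I could come up with so far is to pack the feature columns into a single array column before the groupBy/apply, so that the generated UDF call only sees a couple of columns. This is just a sketch, under the assumptions that the model only needs the raw feature matrix, that all feature columns are doubles, and that ArrayType(DoubleType) is handled by the Arrow conversion here; the names feature_cols and predict_scores_packed are made up for the example:

import numpy as np
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.functions import pandas_udf, PandasUDFType

# hypothetical: every column except the grouping key is a (double) feature
feature_cols = [c for c in sdf_to_score.columns if c != 'age']

@pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
def predict_scores_packed(pdf):
    # each cell of 'features' holds one row's feature vector; stack them back into a 2D matrix
    X = np.vstack(pdf['features'].values)
    return pd.DataFrame({'scores': model.predict_proba(X)[:, 1]})

packed = sdf_to_score.withColumn('features', sf.array(*feature_cols))
packed.select('age', 'features').groupBy(sf.col('age')).apply(predict_scores_packed)

I haven't verified that this avoids the problem in practice, so it may just run into a different limitation.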


Regards,

Frederik


