Posted to issues@spark.apache.org by "Frederik (JIRA)" <ji...@apache.org> on 2018/10/22 14:32:00 UTC
[jira] [Created] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
Frederik created SPARK-25801:
--------------------------------
Summary: pandas_udf grouped_map fails with input dataframe with more than 255 columns
Key: SPARK-25801
URL: https://issues.apache.org/jira/browse/SPARK-25801
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.0
Environment: python 2.7
pyspark 2.3.0
Reporter: Frederik
Hi,
I'm using a pandas_udf to deploy a model that predicts all samples in a Spark dataframe.
To do this, I use a UDF as follows:
@pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
def predict_scores(pdf):
    score_values = model.predict_proba(pdf)[:, 1]
    return pd.DataFrame({'scores': score_values})
So it takes a dataframe, predicts for each row the probability of being positive according to an sklearn model, and returns this as a single column. This works great on an arbitrary groupBy, e.g.:
sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
as long as the dataframe has fewer than 255 columns. When the input dataframe has more than 255 columns (and thus more than 255 features in my model), I get:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
mapper = eval(mapper_str, udfs)
File "<string>", line 1
SyntaxError: more than 255 arguments
This seems to be related to Python's general limitation of not allowing more than 255 arguments in a function call.
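The limit is reproducible outside Spark. As a minimal sketch (the function name f and the placeholder argument names a0..a299 are made up for illustration), compiling a generated call expression with 300 arguments shows the same SyntaxError on the affected interpreters:

import sys

# CPython versions before 3.7 refused to compile a call expression with
# more than 255 explicit arguments; the restriction was lifted in 3.7.
src = "f(" + ", ".join("a%d" % i for i in range(300)) + ")"
try:
    compile(src, "<string>", "eval")
    print("compiled OK (Python >= 3.7)")
except SyntaxError as exc:
    print("SyntaxError: %s" % exc.msg)

On Python 2.7 (the environment above) this prints the same "more than 255 arguments" message that appears in the Spark worker traceback.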
Is this a bug or is there a straightforward way around this problem?
Regards,
Frederik
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)