You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/04/06 23:21:00 UTC
[jira] [Commented] (SPARK-23883) Error with conversion to arrow
while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429123#comment-16429123 ]
Bryan Cutler commented on SPARK-23883:
--------------------------------------
I think the problem might be that since the {{pandas_udf}} is for a groupby-apply, you need to specify the functionType as PandasUDFType.GROUPED_MAP
for example, @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
see [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf] for usage
Can you try that and see if it works?
> Error with conversion to arrow while using pandas_udf
> -----------------------------------------------------
>
> Key: SPARK-23883
> URL: https://issues.apache.org/jira/browse/SPARK-23883
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.0
> Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
> Reporter: Omri
> Priority: Major
>
> Hi,
> I have a code that works on DataBricks but doesn't work on a local spark installation.
> This is the code I'm running:
> {code:java}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
> StructField("Distance", FloatType()),
> StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
> #Calculate distance, return scalar
> return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
> dist = haversine(oneCar.Longtitude.shift(1),
> oneCar.Latitude.shift(1),
> oneCar.loc[1:, 'Longitude'],
> oneCar.loc[1:, 'Latitude'])
> return pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception, about Arrow not able to deal with this input:
> {noformat}
> ---------------------------------------------------------------------------
> TypeError Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
> 114 try:
> --> 115 to_arrow_type(self._returnType_placeholder)
> 116 except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
> 1641 else:
> -> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
> 1643 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError Traceback (most recent call last)
> <ipython-input-35-4f2194cfb998> in <module>()
> 18 km = 6367 * c
> 19 return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
> 21 def totalDistance(oneUser):
> 22 dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
> 62 udf_obj = UserDefinedFunction(
> 63 f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
> ---> 64 return udf_obj._wrapped()
> 65
> 66
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
> 184
> 185 wrapper.func = self.func
> --> 186 wrapper.returnType = self.returnType
> 187 wrapper.evalType = self.evalType
> 188 wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
> 117 raise NotImplementedError(
> 118 "Invalid returnType with scalar Pandas UDFs: %s is "
> --> 119 "not supported" % str(self._returnType_placeholder))
> 120 elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
> 121 if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("<CarId:int,Distance:float>") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>
> As mentioned, this is working on a DataBricks instance in Azure, but not locally.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org