Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/04/06 23:21:00 UTC
[jira] [Commented] (SPARK-23883) Error with conversion to arrow while using pandas_udf

    [ https://issues.apache.org/jira/browse/SPARK-23883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429123#comment-16429123 ] 

Bryan Cutler commented on SPARK-23883:
--------------------------------------

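I think the problem is that this {{pandas_udf}} is used in a groupby-apply, so you need to set the functionType to {{PandasUDFType.GROUPED_MAP}}. Without it, the UDF defaults to a scalar Pandas UDF, which does not accept a StructType return type; that is what the NotImplementedError in the traceback is reporting.

For example: {{@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)}}

See [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf] for usage.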

Can you try that and see if it works?
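
A minimal sketch of what that could look like for the code in the description, in case it helps. Several things here are assumptions rather than anything confirmed by the report: the column names {{Longitude}} and {{Latitude}}, the {{haversine}} body (the original logic was removed from the report, so this is the standard formula using the 6367 km radius visible in the traceback), the {{long}}/{{double}} schema types chosen to match pandas defaults, and the helper names {{total_distance}} and {{distance_per_car}}. The pandas logic is kept as a plain function so it can be checked without Spark; only {{distance_per_car}} needs pyspark and pyarrow.

```python
import numpy as np
import pandas as pd


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees (vectorized)."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(x, dtype=float)) for x in (lon1, lat1, lon2, lat2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))


def total_distance(one_car):
    """Receive all rows of one CarId as a pandas DataFrame, return a one-row DataFrame."""
    # shift(1) pairs each row with the previous one; the first row has no
    # predecessor, so its distance is NaN and is skipped by nansum.
    dist = haversine(one_car["Longitude"].shift(1), one_car["Latitude"].shift(1),
                     one_car["Longitude"], one_car["Latitude"])
    # Column order is fixed explicitly so it lines up with the declared schema.
    return pd.DataFrame(
        {"CarId": [one_car["CarId"].iloc[0]], "Distance": [np.nansum(dist)]},
        columns=["CarId", "Distance"],
    )


def distance_per_car(df):
    """Apply total_distance per CarId on a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so the pandas code above can be used without pyspark.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # GROUPED_MAP is what allows a struct-like (multi-column) return schema.
    grouped_udf = pandas_udf(
        total_distance, "CarId long, Distance double", PandasUDFType.GROUPED_MAP
    )
    return df.groupBy("CarId").apply(grouped_udf)
```

With that in place, {{distancePerCar = distance_per_car(df)}} would replace the {{groupBy(...).apply(...)}} line from the description.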

> Error with conversion to arrow while using pandas_udf
> -----------------------------------------------------
>
>                 Key: SPARK-23883
>                 URL: https://issues.apache.org/jira/browse/SPARK-23883
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: Spark 2.3.0
> Python 3.5
> Java 1.8.0_161-b12
>            Reporter: Omri
>            Priority: Major
>
> Hi,
> I have code that works on Databricks but doesn't work on a local Spark installation.
> This is the code I'm running:
> {code:python}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
> import numpy as np
> from pyspark.sql.types import *
> schema = StructType([
>   StructField("Distance", FloatType()),
>   StructField("CarId", IntegerType())
> ])
> def haversine(lon1, lat1, lon2, lat2):
>     #Calculate distance, return scalar
>     return 3.5 # Removed logic to facilitate reading
> @pandas_udf(schema)
> def totalDistance(oneCar):
>     dist = haversine(oneCar.Longtitude.shift(1),
>                      oneCar.Latitude.shift(1),
>                      oneCar.loc[1:, 'Longitude'], 
>                      oneCar.loc[1:, 'Latitude'])
>     return pd.DataFrame({"CarId":oneCar['CarId'].iloc[0],"Distance":np.sum(dist)},index = [0])
> ## Calculate the overall distance made by each car
> distancePerCar= df.groupBy('CarId').apply(totalDistance)
> {code}
> I'm getting this exception about Arrow not being able to handle this input:
> {noformat}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
>     114             try:
> --> 115                 to_arrow_type(self._returnType_placeholder)
>     116             except TypeError:
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.py in to_arrow_type(dt)
>    1641     else:
> -> 1642         raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
>    1643     return arrow_type
> TypeError: Unsupported type in conversion to Arrow: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true)))
> During handling of the above exception, another exception occurred:
> NotImplementedError                       Traceback (most recent call last)
> <ipython-input-35-4f2194cfb998> in <module>()
>      18     km = 6367 * c
>      19     return km
> ---> 20 @pandas_udf("CarId: int, Distance: float")
>      21 def totalDistance(oneUser):
>      22     dist = haversine(oneUser.Longtitude.shift(1), oneUser.Latitude.shift(1),
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _create_udf(f, returnType, evalType)
>      62     udf_obj = UserDefinedFunction(
>      63         f, returnType=returnType, name=None, evalType=evalType, deterministic=True)
> ---> 64     return udf_obj._wrapped()
>      65 
>      66 
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in _wrapped(self)
>     184 
>     185         wrapper.func = self.func
> --> 186         wrapper.returnType = self.returnType
>     187         wrapper.evalType = self.evalType
>     188         wrapper.deterministic = self.deterministic
> C:\opt\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\udf.py in returnType(self)
>     117                 raise NotImplementedError(
>     118                     "Invalid returnType with scalar Pandas UDFs: %s is "
> --> 119                     "not supported" % str(self._returnType_placeholder))
>     120         elif self.evalType == PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF:
>     121             if isinstance(self._returnType_placeholder, StructType):
> NotImplementedError: Invalid returnType with scalar Pandas UDFs: StructType(List(StructField(CarId,IntegerType,true),StructField(Distance,FloatType,true))) is not supported{noformat}
> I've also tried changing the schema to
> {code:java}
> @pandas_udf("<CarId:int,Distance:float>") {code}
> and
> {code:java}
> @pandas_udf("CarId:int,Distance:float"){code}
>  
> As mentioned, this is working on a DataBricks instance in Azure, but not locally.


