Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/05/27 21:35:18 UTC

[jira] [Updated] (SPARK-7902) SQL UDF doesn't support UDT in PySpark

     [ https://issues.apache.org/jira/browse/SPARK-7902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-7902:
---------------------------------
    Description: 
We don't convert SQL internal types to Python types during Python UDF execution. This causes problems if the input arguments contain UDTs or if the return type is a UDT. Right now, the raw SQL values are passed into the Python UDF, and the return value is not converted back to its SQL type.

Here is code (from [~rams]) that reproduces this bug. (Right now it actually triggers a different bug first.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
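# (sqlContext above is the SQLContext/HiveContext that the pyspark shell creates automatically.)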
sz = udf(lambda s: s.size, IntegerType())
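# Expected once UDTs are handled: [Row(sz=2)]. Today the UDF receives the vector's
# raw SQL representation instead of a SparseVector, so this fails.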
df.select(sz(df.features).alias("sz")).collect()
{code}
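
For reference, here is a rough sketch of the kind of conversion that appears to be missing in the Python UDF path. The wrap_udf / convert_in / convert_out helpers are made-up names for illustration only, not Spark's actual code; they just apply UserDefinedType.deserialize/serialize around the user function.
{code}
# Sketch only (hypothetical helpers): deserialize UDT arguments before calling
# the user's function, and serialize a UDT return value back to its SQL form.
from pyspark.sql.types import UserDefinedType

def convert_in(value, data_type):
    # Raw SQL value -> user-facing Python object for UDT columns.
    if isinstance(data_type, UserDefinedType):
        return data_type.deserialize(value)
    return value

def convert_out(value, data_type):
    # UDF return value -> SQL representation if the declared type is a UDT.
    if isinstance(data_type, UserDefinedType):
        return data_type.serialize(value)
    return value

def wrap_udf(f, arg_types, return_type):
    # Wrap a Python UDF so its inputs and output are converted.
    def wrapped(*args):
        converted = [convert_in(a, t) for a, t in zip(args, arg_types)]
        return convert_out(f(*converted), return_type)
    return wrapped
{code}
With something like this in place, the example above would use arg_types = [VectorUDT()] (from pyspark.mllib.linalg), the lambda would see a real SparseVector, and s.size would return 2.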

  was:
We don't convert Python SQL internal types to Python types in SQL UDF execution. This causes problems if the input arguments contain UDTs or the return type is a UDT. Right now, the raw SQL types are passed into the Python UDF and the return value is not converted to Python SQL types.

This is the code to produce this bug. (Actually, it triggers another bug first right now.)
{code}
from pyspark.mllib.linalg import SparseVector
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(SparseVector(2, {0: 0.0}),)], ["features"])
sz = udf(lambda s: s.size, IntegerType())
df.select(sz(df.features).alias("sz")).collect()
{code}


> SQL UDF doesn't support UDT in PySpark
> --------------------------------------
>
>                 Key: SPARK-7902
>                 URL: https://issues.apache.org/jira/browse/SPARK-7902
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
