Posted to issues@spark.apache.org by "Max Moroz (JIRA)" <ji...@apache.org> on 2016/06/25 06:48:16 UTC

[jira] [Updated] (SPARK-16205) dict -> StructType conversion is undocumented

     [ https://issues.apache.org/jira/browse/SPARK-16205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Moroz updated SPARK-16205:
------------------------------
    Description: 
According to the docs, StructType corresponds only to the Python types list and tuple. I accidentally returned a dict from a UDF whose registered return type was StructType.

Expected behavior: either (1) an exception is raised (if the return type is strictly checked); or (2) the dict is treated as an iterable, so the struct is built from the dict's keys in arbitrary order (horribly dangerous, but I'd understand).
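
For concreteness on (2): iterating a dict yields only its keys, and on the Python versions current at the time their order was arbitrary, so matching struct fields by position would silently scramble values. A quick illustration (plain Python, not Spark code):

{code}
# Iterating a dict yields keys only; before CPython 3.7 their order was
# arbitrary, so matching struct fields by position would be unsafe.
d = {'a': 1, 'b': 2, 'c': 3}
print(list(d))  # keys only; order not guaranteed on Python < 3.7
{code}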

Actual behavior: the struct was created "properly", in the sense that the dict's keys were matched to the struct's field names and the corresponding values were used as the field values.

This is wonderful, but completely undocumented as far as I can tell.

{code}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

fields = 'abcdefgh'

def udf(type_):
  """Decorator form of F.udf with a fixed return type."""
  def to_udf(func):
    return F.udf(func, type_)
  return to_udf

# Schema with one StringType field per letter, in order a..h.
struct = T.StructType()
for c in fields:
  struct.add(c, T.StringType())

@udf(struct)
def f(row):
  # Return a dict, although the docs only mention list/tuple for StructType.
  return dict(zip(fields, fields))

# `df` was assumed to exist in the original report; any DataFrame with a
# 'value' column reproduces the behavior, e.g.:
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('x',), ('y',)], ['value'])

df.select(f('value')).show()

'''
Output is unexpectedly "meaningful": the dict's keys are matched to the
struct's field names:
+------------------+
|PythonUDF#f(value)|
+------------------+
| [a,b,c,d,e,f,g,h]|
| [a,b,c,d,e,f,g,h]|
+------------------+
'''
{code}
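
The observed behavior is consistent with the converter looking values up by field name rather than by position. A minimal sketch of that rule in plain Python (an illustration, not Spark's actual source; dict_to_struct_row is a hypothetical helper, and the missing-key behavior is an assumption):

{code}
# Hypothetical sketch of the observed rule: a dict returned from a UDF
# is mapped onto a StructType by field *name*, not by position.
def dict_to_struct_row(d, field_names):
  # Assumed: a missing key becomes None (null), as a .get(name) lookup would give.
  return tuple(d.get(name) for name in field_names)

# The schema's field order wins, regardless of the dict's key order:
print(dict_to_struct_row({'b': '2', 'a': '1'}, ['a', 'b']))  # ('1', '2')
{code}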


> dict -> StructType conversion is undocumented
> ---------------------------------------------
>
>                 Key: SPARK-16205
>                 URL: https://issues.apache.org/jira/browse/SPARK-16205
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org