Posted to issues@spark.apache.org by "Christian Zommerfelds (JIRA)" <ji...@apache.org> on 2016/06/02 14:23:59 UTC

[jira] [Updated] (SPARK-15642) Metadata gets lost when selecting a field of a StructType

     [ https://issues.apache.org/jira/browse/SPARK-15642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Zommerfelds updated SPARK-15642:
------------------------------------------
    Description: 
Hi,

When working with DataFrames, I sometimes need to write a function that produces multiple columns. Since a UDF cannot return multiple columns directly, I write one that returns a StructType and then call select() to assign its fields to separate columns. However, I noticed that the field metadata gets lost when I do that.
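For context, here is a minimal sketch of that pattern (the make_pair UDF and the input column x are hypothetical, not part of the repro that follows):

{code}
In: from pyspark.sql.functions import udf
In: from pyspark.sql.types import StructType, StructField, DoubleType

In: pair_type = StructType([StructField('a', DoubleType()),
                            StructField('b', DoubleType())])
In: make_pair = udf(lambda x: (x * 2.0, x * 3.0), pair_type)  # one UDF, two outputs

In: df = df.withColumn('pair', make_pair(df.x))
In: df = df.select(df.pair['a'], df.pair['b'])  # split the struct into columns
{code}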

Example:

{code}
In: from pyspark.sql import Row
In: from pyspark.sql.types import (StructType, StructField, ArrayType,
                                   IntegerType, DoubleType)

In: schema = StructType([StructField('foo', StructType([
    StructField('features', ArrayType(IntegerType())),
    StructField('label', DoubleType(), False,
                {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}})
]))])

In: df = sqlContext.createDataFrame([Row(foo=Row(features=[1,2], label=0.0)), Row(foo=Row(features=[3,4], label=1.0))], schema)

In: df.schema.fields[0].dataType.fields[1].metadata
Out: {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}}

In: df2 = df.select(df.foo['label'])

In: df2.schema.fields[0].metadata
Out: {}
{code}

Expected: the same metadata (ml_attr...)

My workaround is to create a new DataFrame from the underlying RDD, because as far as I know PySpark doesn't support adding metadata to a column once the DataFrame is created (should I create another issue for that?). Workaround example:

{code}
In: df3 = sqlContext.createDataFrame(df2.rdd, StructType([schema.fields[0].dataType.fields[1]]))

In: df3.schema.fields[0].metadata
Out: {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}}
{code}
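In the Scala API, Column.as(alias, metadata) can reattach metadata directly; newer PySpark releases (2.2+) expose the same thing as a metadata keyword on Column.alias(). A sketch of that form, assuming a PySpark version with that keyword:

{code}
In: meta = schema.fields[0].dataType.fields[1].metadata  # the original ml_attr dict

In: df4 = df.select(df.foo['label'].alias('label', metadata=meta))
In: df4.schema.fields[0].metadata
Out: {'ml_attr': {'type': 'nominal', 'vals': ['0.0', '1.0']}}
{code}

Unlike the RDD round trip, this stays within the DataFrame API, so the existing query plan is preserved.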

I am not sure if this affects the Scala API. (EDIT: yes it does.)

Let me know if I can provide any other information.


    Component/s:     (was: PySpark)
                 SQL

> Metadata gets lost when selecting a field of a StructType
> ---------------------------------------------------------
>
>                 Key: SPARK-15642
>                 URL: https://issues.apache.org/jira/browse/SPARK-15642
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0, 1.6.1
>            Reporter: Christian Zommerfelds
>


