Posted to issues@spark.apache.org by "Szymon Matejczyk (JIRA)" <ji...@apache.org> on 2016/04/09 19:27:25 UTC

[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

    [ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233648#comment-15233648 ] 

Szymon Matejczyk commented on SPARK-13802:
------------------------------------------

"This shows that the schema names don't have to correspond to the row's names." IIUC, this means that only the order of fields matters in a Row, so a Row should be treated like a tuple, not a dict. In that case, the Row constructor variant that takes **kwargs is pointless, because it sorts the fields by name:

```
    def __new__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if args:
            # create row class or objects
            return tuple.__new__(self, args)

        elif kwargs:
            # create row objects
            names = sorted(kwargs.keys())
            row = tuple.__new__(self, [kwargs[n] for n in names])
            row.__fields__ = names
            return row
```
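
To see the effect of the kwargs branch without a Spark installation, here is a minimal sketch that copies only that branch into a standalone tuple subclass. The class name `SketchRow` is hypothetical and exists only for this illustration; it is not part of PySpark.

```python
class SketchRow(tuple):
    """Hypothetical stand-in that mimics only the kwargs branch quoted above."""

    def __new__(cls, **kwargs):
        # Field names are sorted alphabetically, so the call order is lost.
        names = sorted(kwargs.keys())
        row = tuple.__new__(cls, [kwargs[n] for n in names])
        row.__fields__ = names
        return row


row = SketchRow(id="39", first_name="Szymon")
print(row.__fields__)  # ['first_name', 'id']
print(tuple(row))      # ('Szymon', '39')
```

The values end up in alphabetical field order, not in the order the keyword arguments were written, which is why the tuple no longer lines up with a schema that declares `id` first.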

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When a Row is constructed from kwargs, the fields in the underlying tuple are sorted by name. When the schema reads the row, it does not account for this order and matches values to schema fields by position.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}
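
The mismatch reported above can be sketched in plain Python by comparing a positional conversion with one that looks values up by field name. Both helper functions are hypothetical and written only for this illustration; they are not Spark's implementation, and the by-name variant is just one possible way to reconcile the two orders.

```python
def to_internal_by_position(values):
    # Assumption: mirrors the reported behaviour, values are taken in tuple order.
    return tuple(values)


def to_internal_by_name(field_names, values, schema_names):
    # Hypothetical alternative: map each row field name to its value,
    # then emit the values in the schema's declared order.
    lookup = dict(zip(field_names, values))
    return tuple(lookup[name] for name in schema_names)


schema_names = ["id", "first_name"]   # order declared in the StructType
row_fields = ["first_name", "id"]     # order produced by Row(**kwargs)
row_values = ("Szymon", "39")

by_position = to_internal_by_position(row_values)
by_name = to_internal_by_name(row_fields, row_values, schema_names)
print(by_position)  # ('Szymon', '39')
print(by_name)      # ('39', 'Szymon')
```

Only the by-name result places "39" under `id`, which is the output one would expect from the schema in the report.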


