Posted to issues@spark.apache.org by "Thomas Dunne (JIRA)" <ji...@apache.org> on 2016/10/14 16:45:20 UTC

[jira] [Comment Edited] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

    [ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575825#comment-15575825 ] 

Thomas Dunne edited comment on SPARK-13802 at 10/14/16 4:45 PM:
----------------------------------------------------------------

This is especially troublesome when creating a DataFrame with your own schema.

The data I am working with can contain a lot of empty fields, so schema inference may have to scan every row to determine their types. Providing our own schema should fix this, right?

Nope... Rather than matching the keys of the Row against the field names of the provided schema, Spark changes the order of one of them (the Row, whose kwargs are sorted by name) and then naively uses zip(row, schema.fields). This means that keeping the schema field order and the Row key order consistent is not enough: because Rows sort their keys, we have to manually sort the schema fields too. A sketch of that workaround is below.

This doesn't seem like consistent or desirable behavior at all.
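For reference, a minimal sketch of the workaround this currently forces on callers (assuming a SQLContext is already available, as in the example quoted below): re-sort the schema fields alphabetically so they line up with the Row's sorted kwargs before calling createDataFrame.

{code}
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType()),
    StructField("first_name", StringType())])

# Row(**kwargs) sorts its keys alphabetically, so re-order the schema fields
# the same way before the positional zip inside createDataFrame happens.
sorted_schema = StructType(sorted(schema.fields, key=lambda f: f.name))

row = Row(id="39", first_name="Szymon")

# Assumes an existing sqlContext, as in the original report below.
df = sqlContext.createDataFrame([row], sorted_schema)
df.show(1)   # first_name and id now end up in the correct columns
{code}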


> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-13802
>                 URL: https://issues.apache.org/jira/browse/SPARK-13802
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Szymon Matejczyk
>
> When the Row constructor is used with kwargs, the fields in the underlying tuple are sorted by name. When the schema reads the row, it does not use the fields in this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("id", StringType()),
>     StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}


