Posted to user@spark.apache.org by Bryan Cutler <cu...@gmail.com> on 2019/11/04 22:28:50 UTC

[DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Currently, when a PySpark Row is created with keyword arguments, the fields
are sorted alphabetically. This has confused many users because it is not
obvious (although the pydocs state it) that the fields will be sorted. Later,
when a schema is applied and the field order does not match, an error occurs.
Here are some of the JIRAs I have been tracking that relate to this issue:
SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, along with relevant
discussion of the issue [1].
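
To make the mismatch concrete, here is a minimal pure-Python sketch of the sorting described above. It does not use PySpark; `sorted_row` and `apply_schema_by_position` are hypothetical helpers that only mimic the behavior, on the assumption that values are paired with schema fields by position.

```python
# Pure-Python emulation of the behavior described above; NOT PySpark's code.
# Both helper names are hypothetical and exist only for this illustration.

def sorted_row(**kwargs):
    """Return (field_names, values) with the fields sorted alphabetically,
    mimicking how a keyword-argument Row orders its fields."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


def apply_schema_by_position(values, schema_names):
    """Pair values with schema field names purely by position."""
    return dict(zip(schema_names, values))


# The caller writes the fields in the order (name, age) ...
names, values = sorted_row(name="Alice", age=30)
print(names, values)   # ['age', 'name'] (30, 'Alice')

# ... so a schema declared as (name, age) no longer lines up positionally.
mismatched = apply_schema_by_position(values, ["name", "age"])
print(mismatched)      # {'name': 30, 'age': 'Alice'}
```

With a typed schema, a swap like this is presumably where the reported errors come from, since an integer lands in a string column and vice versa.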

The original reason for sorting fields is that kwargs in Python < 3.6 are
not guaranteed to be in the order in which they were entered [2]. Sorting
alphabetically ensures a consistent order. Matters are further complicated by
the flag __from_dict__, which allows the Row fields to be referenced by name
when the Row is made from kwargs; this flag is not serialized with the Row,
which leads to inconsistent behavior. For instance:

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1')
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2')
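
The difference between the two calls above can be sketched without Spark. The toy below assumes that a schema is matched by name while the "built from kwargs" marker is present and by position once it is gone. `SketchRow`, `serialize`, `deserialize`, and `to_schema` are hypothetical names for illustration, not PySpark internals.

```python
# Toy model of a marker that is lost on serialization; NOT PySpark's code.

class SketchRow(tuple):
    """Stand-in for a Row: a tuple of values plus field names and a marker."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)                       # alphabetical, as in the thread
        row = super().__new__(cls, (kwargs[n] for n in names))
        row.fields = names
        row.from_dict = True                         # "built from kwargs" marker
        return row


def serialize(row):
    # Only names and values travel; the marker is not included.
    return (list(row.fields), tuple(row))


def deserialize(payload):
    fields, values = payload
    row = tuple.__new__(SketchRow, values)
    row.fields = fields
    row.from_dict = False                            # marker lost in transit
    return row


def to_schema(row, schema_names):
    """Match by name when the marker is set, otherwise by position."""
    if row.from_dict:
        by_name = dict(zip(row.fields, row))
        return {n: by_name[n] for n in schema_names}
    return dict(zip(schema_names, row))


local = SketchRow(A="1", B="2")
shipped = deserialize(serialize(local))              # stands in for the RDD round trip

print(to_schema(local, ["B", "A"]))    # {'B': '2', 'A': '1'}
print(to_schema(shipped, ["B", "A"]))  # {'B': '1', 'A': '2'}
```

The same input produces two different results depending only on whether the row was serialized first, which is the inconsistency shown in the session above.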

I think the best way to fix this is to remove the sorting of fields when
constructing a Row. For users with Python 3.6+, nothing would change
because these versions of Python ensure that kwargs stay in the order
entered. For users with Python < 3.6, using kwargs would check a conf to
either raise an error or fall back to a LegacyRow that sorts the fields as
before. Since Python < 3.6 is now deprecated, this LegacyRow can be removed
when that support is dropped. There are also other ways to create Rows that
will not be affected. I have opened a JIRA [3] to capture this, but I am
wondering what others think about fixing this for Spark 3.0?
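
As a rough sketch of the proposed decision logic (not the actual patch), the field order could be chosen as below. The `legacy_sorting` parameter stands in for the conf mentioned above. Its name, the `version` parameter, and the choice of `ValueError` are assumptions made purely for illustration.

```python
import sys


def row_fields(kwargs, legacy_sorting=False, version=sys.version_info):
    """Return the field order for a Row built from keyword arguments.

    Hypothetical helper: `legacy_sorting` models the conf from the proposal,
    and `version` is injectable only so the old-Python branch can be exercised.
    """
    if version >= (3, 6):
        return list(kwargs)       # kwargs keep the order the caller wrote
    if legacy_sorting:
        return sorted(kwargs)     # LegacyRow-style alphabetical order
    raise ValueError("kwargs order is not guaranteed on Python < 3.6")


print(row_fields({"B": "2", "A": "1"}))
# ['B', 'A'] on Python 3.6+

print(row_fields({"B": "2", "A": "1"}, legacy_sorting=True, version=(3, 5)))
# ['A', 'B']
```

Keeping the fallback behind an explicit switch means users on older interpreters have to opt in to the old sorted order rather than getting it silently.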

[1] https://github.com/apache/spark/pull/20280
[2] https://www.python.org/dev/peps/pep-0468/
[3] https://issues.apache.org/jira/browse/SPARK-29748

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Posted by Bryan Cutler <cu...@gmail.com>.

Thanks all. I created a WIP PR at https://github.com/apache/spark/pull/26496;
we can discuss the details further there.

On Thu, Nov 7, 2019 at 7:01 PM Takuya UESHIN <ue...@happy-camper.st> wrote:

> +1
>
> On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp <sk...@berkeley.edu> wrote:
>
>> +1
>>
>> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>> >
>> > +1
>> >
>> > On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan <cl...@gmail.com> wrote:
>> >>
>> >> Sounds reasonable to me. We should make the behavior consistent within
>> Spark.
>> >>
>> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cu...@gmail.com> wrote:
>> >>>
>> >>> Currently, when a PySpark Row is created with keyword arguments, the
>> fields are sorted alphabetically. This has created a lot of confusion with
>> users because it is not obvious (although it is stated in the pydocs) that
>> they will be sorted alphabetically. Then later when applying a schema and
>> the field order does not match, an error will occur. Here is a list of some
>> of the JIRAs that I have been tracking all related to this issue:
>> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
>> of the issue [1].
>> >>>
>> >>> The original reason for sorting fields is because kwargs in python <
>> 3.6 are not guaranteed to be in the same order that they were entered [2].
>> Sorting alphabetically ensures a consistent order. Matters are further
>> complicated with the flag __from_dict__ that allows the Row fields to be
>> referenced by name when made by kwargs, but this flag is not serialized
>> with the Row and leads to inconsistent behavior. For instance:
>> >>>
>> >>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A
>> string").first()
>> >>> Row(B='2', A='1')
>> >>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1",
>> B="2")]), "B string, A string").first()
>> >>> Row(B='1', A='2')
>> >>>
>> >>> I think the best way to fix this is to remove the sorting of fields
>> when constructing a Row. For users with Python 3.6+, nothing would change
>> because these versions of Python ensure that the kwargs stays in the
>> ordered entered. For users with Python < 3.6, using kwargs would check a
>> conf to either raise an error or fallback to a LegacyRow that sorts the
>> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
>> can also be removed at the same time. There are also other ways to create
>> Rows that will not be affected. I have opened a JIRA [3] to capture this,
>> but I am wondering what others think about fixing this for Spark 3.0?
>> >>>
>> >>> [1] https://github.com/apache/spark/pull/20280
>> >>> [2] https://www.python.org/dev/peps/pep-0468/
>> >>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Posted by Takuya UESHIN <ue...@happy-camper.st>.

+1

On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp <sk...@berkeley.edu> wrote:

> +1
>
> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gu...@gmail.com> wrote:
> >
> > +1
> >
> > On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan <cl...@gmail.com> wrote:
> >>
> >> Sounds reasonable to me. We should make the behavior consistent within
> Spark.
> >>
> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cu...@gmail.com> wrote:
> >>>
> >>> Currently, when a PySpark Row is created with keyword arguments, the
> fields are sorted alphabetically. This has created a lot of confusion with
> users because it is not obvious (although it is stated in the pydocs) that
> they will be sorted alphabetically. Then later when applying a schema and
> the field order does not match, an error will occur. Here is a list of some
> of the JIRAs that I have been tracking all related to this issue:
> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
> of the issue [1].
> >>>
> >>> The original reason for sorting fields is because kwargs in python <
> 3.6 are not guaranteed to be in the same order that they were entered [2].
> Sorting alphabetically ensures a consistent order. Matters are further
> complicated with the flag __from_dict__ that allows the Row fields to be
> referenced by name when made by kwargs, but this flag is not serialized
> with the Row and leads to inconsistent behavior. For instance:
> >>>
> >>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A
> string").first()
> >>> Row(B='2', A='1')
> >>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1",
> B="2")]), "B string, A string").first()
> >>> Row(B='1', A='2')
> >>>
> >>> I think the best way to fix this is to remove the sorting of fields
> when constructing a Row. For users with Python 3.6+, nothing would change
> because these versions of Python ensure that the kwargs stays in the
> ordered entered. For users with Python < 3.6, using kwargs would check a
> conf to either raise an error or fallback to a LegacyRow that sorts the
> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
> can also be removed at the same time. There are also other ways to create
> Rows that will not be affected. I have opened a JIRA [3] to capture this,
> but I am wondering what others think about fixing this for Spark 3.0?
> >>>
> >>> [1] https://github.com/apache/spark/pull/20280
> >>> [2] https://www.python.org/dev/peps/pep-0468/
> >>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>

-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Posted by Shane Knapp <sk...@berkeley.edu>.

+1

On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon <gu...@gmail.com> wrote:
>
> +1
>
> On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan <cl...@gmail.com> wrote:
>>
>> Sounds reasonable to me. We should make the behavior consistent within Spark.
>>
>> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cu...@gmail.com> wrote:
>>>
>>> Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has created a lot of confusion with users because it is not obvious (although it is stated in the pydocs) that they will be sorted alphabetically. Then later when applying a schema and the field order does not match, an error will occur. Here is a list of some of the JIRAs that I have been tracking all related to this issue: SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion of the issue [1].
>>>
>>> The original reason for sorting fields is because kwargs in python < 3.6 are not guaranteed to be in the same order that they were entered [2]. Sorting alphabetically ensures a consistent order. Matters are further complicated with the flag __from_dict__ that allows the Row fields to be referenced by name when made by kwargs, but this flag is not serialized with the Row and leads to inconsistent behavior. For instance:
>>>
>>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>>> Row(B='2', A='1')
>>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
>>> Row(B='1', A='2')
>>>
>>> I think the best way to fix this is to remove the sorting of fields when constructing a Row. For users with Python 3.6+, nothing would change because these versions of Python ensure that the kwargs stays in the ordered entered. For users with Python < 3.6, using kwargs would check a conf to either raise an error or fallback to a LegacyRow that sorts the fields as before. With Python < 3.6 being deprecated now, this LegacyRow can also be removed at the same time. There are also other ways to create Rows that will not be affected. I have opened a JIRA [3] to capture this, but I am wondering what others think about fixing this for Spark 3.0?
>>>
>>> [1] https://github.com/apache/spark/pull/20280
>>> [2] https://www.python.org/dev/peps/pep-0468/
>>> [3] https://issues.apache.org/jira/browse/SPARK-29748



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Posted by Hyukjin Kwon <gu...@gmail.com>.

+1

On Wed, Nov 6, 2019 at 11:38 PM Wenchen Fan <cl...@gmail.com> wrote:

> Sounds reasonable to me. We should make the behavior consistent within
> Spark.
>
> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cu...@gmail.com> wrote:
>
>> Currently, when a PySpark Row is created with keyword arguments, the
>> fields are sorted alphabetically. This has created a lot of confusion with
>> users because it is not obvious (although it is stated in the pydocs) that
>> they will be sorted alphabetically. Then later when applying a schema and
>> the field order does not match, an error will occur. Here is a list of some
>> of the JIRAs that I have been tracking all related to this issue:
>> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
>> of the issue [1].
>>
>> The original reason for sorting fields is because kwargs in python < 3.6
>> are not guaranteed to be in the same order that they were entered [2].
>> Sorting alphabetically ensures a consistent order. Matters are further
>> complicated with the flag __from_dict__ that allows the Row fields to
>> be referenced by name when made by kwargs, but this flag is not serialized
>> with the Row and leads to inconsistent behavior. For instance:
>>
>> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
>> Row(B='2', A='1')
>> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
>> Row(B='1', A='2')
>>
>> I think the best way to fix this is to remove the sorting of fields when
>> constructing a Row. For users with Python 3.6+, nothing would change
>> because these versions of Python ensure that the kwargs stays in the
>> ordered entered. For users with Python < 3.6, using kwargs would check a
>> conf to either raise an error or fallback to a LegacyRow that sorts the
>> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
>> can also be removed at the same time. There are also other ways to create
>> Rows that will not be affected. I have opened a JIRA [3] to capture this,
>> but I am wondering what others think about fixing this for Spark 3.0?
>>
>> [1] https://github.com/apache/spark/pull/20280
>> [2] https://www.python.org/dev/peps/pep-0468/
>> [3] https://issues.apache.org/jira/browse/SPARK-29748
>>
>>

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

Posted by Wenchen Fan <cl...@gmail.com>.

Sounds reasonable to me. We should make the behavior consistent within
Spark.

On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler <cu...@gmail.com> wrote:

> Currently, when a PySpark Row is created with keyword arguments, the
> fields are sorted alphabetically. This has created a lot of confusion with
> users because it is not obvious (although it is stated in the pydocs) that
> they will be sorted alphabetically. Then later when applying a schema and
> the field order does not match, an error will occur. Here is a list of some
> of the JIRAs that I have been tracking all related to this issue:
> SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion
> of the issue [1].
>
> The original reason for sorting fields is because kwargs in python < 3.6
> are not guaranteed to be in the same order that they were entered [2].
> Sorting alphabetically ensures a consistent order. Matters are further
> complicated with the flag __from_dict__ that allows the Row fields to
> be referenced by name when made by kwargs, but this flag is not serialized
> with the Row and leads to inconsistent behavior. For instance:
>
> >>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
> Row(B='2', A='1')
> >>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
> Row(B='1', A='2')
>
> I think the best way to fix this is to remove the sorting of fields when
> constructing a Row. For users with Python 3.6+, nothing would change
> because these versions of Python ensure that the kwargs stays in the
> ordered entered. For users with Python < 3.6, using kwargs would check a
> conf to either raise an error or fallback to a LegacyRow that sorts the
> fields as before. With Python < 3.6 being deprecated now, this LegacyRow
> can also be removed at the same time. There are also other ways to create
> Rows that will not be affected. I have opened a JIRA [3] to capture this,
> but I am wondering what others think about fixing this for Spark 3.0?
>
> [1] https://github.com/apache/spark/pull/20280
> [2] https://www.python.org/dev/peps/pep-0468/
> [3] https://issues.apache.org/jira/browse/SPARK-29748
>
>
