Posted to user@spark.apache.org by "David Haglund (external)" <Da...@husqvarnagroup.com> on 2017/02/13 12:51:48 UTC

Order of rows not preserved after cache + count + coalesce

Hi,

I found something that surprised me: I expected the order of the rows to be preserved, so I suspect this might be a bug. The problem is illustrated by the Python example below:

In [1]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.count()
df.coalesce(2).rdd.glom().collect()
Out[1]:
[[Row(n=1)], [Row(n=0), Row(n=2)]]

Note how n=1 comes before n=0, above.


If I remove the cache line, I get the rows in the correct order. The same happens if I use df.rdd.count() instead of df.count(); see the examples below:

In [2]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.count()
df.coalesce(2).rdd.glom().collect()
Out[2]:
[[Row(n=0)], [Row(n=1), Row(n=2)]]

In [3]:
df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
df.cache()
df.rdd.count()
df.coalesce(2).rdd.glom().collect()
Out[3]:
[[Row(n=0)], [Row(n=1), Row(n=2)]]


I use Spark 2.1.0 with PySpark.

Regards,
/David

The information in this email may be confidential and/or legally privileged. It has been sent for the sole use of the intended recipient(s). If you are not an intended recipient, you are strictly prohibited from reading, disclosing, distributing, copying or using this email or any of its contents, in any way whatsoever. If you have received this email in error, please contact the sender by reply email and destroy all copies of the original message. Please also be advised that emails are not a secure form for communication, and may contain errors.

Re: Order of rows not preserved after cache + count + coalesce

Posted by Jon Gregg <co...@gmail.com>.

Spark has a zipWithIndex function for RDDs
(http://stackoverflow.com/a/26081548) that attaches an index to each element
right after you create an RDD, and I believe it preserves order.  Then you
can sort by the index after the cache step.

I haven't tried this with a DataFrame, but this answer seems promising:
http://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex



On Mon, Feb 13, 2017 at 8:34 AM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> RDDs and DataFrames do not guarantee any specific ordering of data. They
> are like tables in a SQL database. The only way to get a guaranteed
> ordering of rows is to explicitly specify an orderBy() clause in your
> statement. Any ordering you see otherwise is incidental.

Re: Order of rows not preserved after cache + count + coalesce

Posted by Nicholas Chammas <ni...@gmail.com>.

RDDs and DataFrames do not guarantee any specific ordering of data. They
are like tables in a SQL database. The only way to get a guaranteed
ordering of rows is to explicitly specify an orderBy() clause in your
statement. Any ordering you see otherwise is incidental.
