Posted to user@spark.apache.org by Colin Alstad <co...@pokitdok.com> on 2015/10/28 16:38:52 UTC
Inconsistent Persistence of DataFrames in Spark 1.5
We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some
inconsistent behavior in persisting DataFrames.
df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
> 161,100,982
df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
> 67,498,706
join_df = df1.join(df2, "id")
join_df.count()
> 160,608,147
join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
> 67,698,892
join_df.write.json("join.json")
join_json = sqlContext.read.json("join.json")
join_json.count()
> 67,695,663
The first major issue is that the count of the join DataFrame (about 160.6
million) is roughly 2.4 times the count of the persisted join DataFrame
(about 67.7 million). Secondly, persisting the same DataFrame in two
different formats yields different counts.
Does anyone have any idea on what could be going on here?
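For readers checking the numbers above: the row count of an inner join is determined entirely by how often each key occurs on each side, so writing the result out and reading it back should not change it. A minimal pure-Python sketch of that arithmetic follows; the function name and the toy key lists are illustrative and not part of the original post.

```python
from collections import Counter


def expected_inner_join_count(left_keys, right_keys):
    """Return the number of rows an inner join on these keys would produce."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    # Counter returns 0 for keys that are missing on the right side.
    return sum(n * right[key] for key, n in left.items())


# Toy example (hypothetical data): key 1 appears twice on the left,
# key 2 appears twice on the right, keys 3 and 4 have no match.
rows = expected_inner_join_count([1, 1, 2, 3], [1, 2, 2, 4])
```

On real data the same per-key counts could be obtained with a groupBy on the join column of each DataFrame, which would show whether duplicate keys account for the size of the join.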
--
Colin Alstad
Data Scientist
colin.alstad@pokitdok.com
<http://www.pokitdok.com/>
Re: Inconsistent Persistence of DataFrames in Spark 1.5
Posted by Michael Armbrust <mi...@databricks.com>.
There were several bugs in Spark 1.5.0, and we strongly recommend that you
upgrade to 1.5.1. If the issue persists, it would be helpful to see the
output of calling explain() on the DataFrame.
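A rough sketch of how the requested information could be gathered in one place, assuming the PySpark DataFrame API used earlier in the thread (explain, count, write.parquet, read.parquet). The helper name and the output path argument are hypothetical, and the write fails under the default save mode if the path already exists.

```python
def compare_join_counts(sql_context, join_df, path):
    """Print the query plan, persist join_df as Parquet, and compare counts.

    sql_context is the SQLContext from the thread; path is a hypothetical
    output location that must not exist yet.
    """
    # explain(True) prints the logical and physical plans to stdout.
    join_df.explain(True)
    before = join_df.count()
    join_df.write.parquet(path)
    after = sql_context.read.parquet(path).count()
    # The two counts are expected to match; a mismatch reproduces the report.
    return before, after, before == after
```

Running this once on 1.5.0 and once after upgrading would show whether the mismatch is specific to that release.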
On Wed, Oct 28, 2015 at 4:46 PM, <Sa...@wellsfargo.com> wrote:
> Hi, just a couple cents.
>
> Are your joining columns (the id field) StringType? I recently reported
> a bug that gives inconsistent results when filtering String fields in
> group operations.
>
> Saif
RE: Inconsistent Persistence of DataFrames in Spark 1.5
Posted by Sa...@wellsfargo.com.
Hi, just a couple cents.
Are your joining columns (the id field) StringType? I recently reported a bug that gives inconsistent results when filtering String fields in group operations.
Saif
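One quick way to answer the question above is to look up the join column in each DataFrame's dtypes, which PySpark exposes as a list of (column name, type string) pairs. This is only a sketch; the helper name is made up for illustration.

```python
def join_key_types(left_df, right_df, key):
    """Return the type strings of the join column on both sides.

    Returns None for a side that has no column with that name.
    """
    left_type = dict(left_df.dtypes).get(key)
    right_type = dict(right_df.dtypes).get(key)
    return left_type, right_type
```

For the DataFrames in this thread it would be called as join_key_types(df1, df2, "id"); a result of "string" on either side would mean the join key is a StringType column.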