Posted to user@spark.apache.org by mhornbech <mo...@datasolvr.com> on 2016/08/19 23:16:37 UTC

Spark 2.0 regression when querying very wide data frames

Hi

We currently have some workloads in Spark 1.6.2 with queries operating on a
data frame with 1500+ columns (17000 rows). This has never been quite
stable, and some queries, such as "select *", would yield empty result sets,
but queries restricted to specific columns have mostly worked. Needless to
say, 1500+ columns isn't "desirable", but that's what the client's data
looks like, and our preference has been to load it and normalize it through
Spark.

We have been waiting to see how this would work with Spark 2.0, and
unfortunately the problem has gotten worse. Almost all queries on this large
data frame that worked before will now return data frames with only null
values.

Is this a known issue with Spark? If yes, does anyone know why it has been
left untouched / made worse in Spark 2.0? If data frames with many columns
are a limitation that goes deep into Spark, I would prefer hard errors rather
than queries that run with meaningless results. The problem is easy to
reproduce, but I am not familiar enough with debugging the Spark source code
to find the root cause.
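
Until the root cause is known, one way to get a hard error instead of a silently meaningless result is to check the collected rows before using them. The following is only a minimal sketch, not a Spark feature: `assert_not_all_null` and `AllNullResultError` are hypothetical names, and the sketch assumes the result has already been collected into a list of row tuples (for example with `df.collect()`).

```python
from typing import Iterable, Sequence


class AllNullResultError(RuntimeError):
    """Raised when a non-empty result contains nothing but null values."""


def assert_not_all_null(rows: Iterable[Sequence[object]]) -> int:
    """Return the number of rows, or raise if every value in every row is None.

    An empty result returns 0 instead of raising, because an empty result
    can be legitimate; tighten this if it is not in your workload.
    """
    count = 0
    saw_value = False
    for row in rows:
        count += 1
        if any(value is not None for value in row):
            saw_value = True
    if count > 0 and not saw_value:
        raise AllNullResultError(f"{count} row(s) returned, all values are null")
    return count
```

A query whose columns really are all null would also trip this check, so it probably only makes sense for queries that are known to return data.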

Hope some of you can enlighten me :-)




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-regression-when-querying-very-wide-data-frames-tp27567.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Spark 2.0 regression when querying very wide data frames

Posted by Sean Owen <so...@cloudera.com>.
Yes, have a look through JIRA in cases like this.
https://issues.apache.org/jira/browse/SPARK-16664

On Sat, Aug 20, 2016 at 1:57 AM, mhornbech <mo...@datasolvr.com> wrote:
> I did some extra digging. Running the query "select column1 from myTable" I
> can reproduce the problem on a frame with a single row - it occurs exactly
> when the frame has more than 200 columns, which smells a bit like a
> hardcoded limit.
>
> Interestingly the problem disappears when replacing the query with "select
> column1 from myTable limit N" where N is arbitrary. However it appears again
> when running "select * from myTable limit N" with sufficiently many columns
> (haven't determined the exact threshold here).


Re: Spark 2.0 regression when querying very wide data frames

Posted by ponkin <al...@ya.ru>.
I generated a CSV file with 300 columns, and it seems to work fine with Spark
DataFrames (Spark 2.0).
I think you need to post your issue to the spark-cassandra-connector community
(https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user)
if you are using it.
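
For anyone who wants to repeat this kind of check, a wide CSV file can be generated with the Python standard library alone. This is a sketch, not the script actually used for the test above: the helper names, the column names (`column1` to `columnN`) and the cell values are assumptions chosen for illustration.

```python
import csv
import io


def wide_csv_text(num_columns: int = 300, num_rows: int = 1) -> str:
    """Return CSV text with a header row and num_rows rows of non-null integers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f"column{i}" for i in range(1, num_columns + 1)])
    for row_index in range(num_rows):
        # Every cell is a distinct integer, so a null in a query result
        # cannot have come from the input file.
        start = row_index * num_columns
        writer.writerow(range(start, start + num_columns))
    return buffer.getvalue()


def write_wide_csv(path: str, num_columns: int = 300, num_rows: int = 1) -> None:
    """Write the generated CSV text to path."""
    with open(path, "w", newline="") as handle:
        handle.write(wide_csv_text(num_columns, num_rows))
```

After `write_wide_csv("wide.csv")`, the file can be loaded in Spark 2.0 with `spark.read.option("header", "true").csv("wide.csv")` and queried as usual.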





Re: Spark 2.0 regression when querying very wide data frames

Posted by mhornbech <mo...@datasolvr.com>.
I don't think that's the issue. It sounds very much like this:
https://issues.apache.org/jira/browse/SPARK-16664

Morten

> On 20 Aug 2016, at 21:24, ponkin [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> Did you try loading a wide file, for example CSV or Parquet? Maybe the problem is in spark-cassandra-connector, not Spark itself? Are you using spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?

Re: Spark 2.0 regression when querying very wide data frames

Posted by ponkin <al...@ya.ru>.
Did you try loading a wide file, for example CSV or Parquet? Maybe the
problem is in spark-cassandra-connector, not Spark itself? Are you using
spark-cassandra-connector (https://github.com/datastax/spark-cassandra-connector)?





Re: Spark 2.0 regression when querying very wide data frames

Posted by mhornbech <mo...@datasolvr.com>.
Cassandra. 

Morten

> On 20 Aug 2016, at 13:53, ponkin [via Apache Spark User List] <ml...@n3.nabble.com> wrote:
> 
> Hi, 
> What kind of datasource do you have? CSV, Avro, Parquet? 

Re: Spark 2.0 regression when querying very wide data frames

Posted by ponkin <al...@ya.ru>.
Hi,
What kind of datasource do you have? CSV, Avro, Parquet?





Re: Spark 2.0 regression when querying very wide data frames

Posted by mhornbech <mo...@datasolvr.com>.
I did some extra digging. Running the query "select column1 from myTable", I
can reproduce the problem on a frame with a single row. It occurs exactly
when the frame has more than 200 columns, which smells a bit like a
hardcoded limit.

Interestingly, the problem disappears when replacing the query with "select
column1 from myTable limit N", where N is arbitrary. However, it appears again
when running "select * from myTable limit N" with sufficiently many columns
(I haven't determined the exact threshold here).
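
The single-row check described above could be sketched in PySpark roughly as follows. This is an illustration under assumptions, not the code that was actually run: the frame is built in memory rather than read from Cassandra, so it may well not hit the same code path, and the helper names, the integer cell values and the widths 200 and 201 are all chosen for illustration.

```python
def column_names(num_columns: int) -> list:
    """Return ['column1', ..., 'columnN']."""
    return [f"column{i}" for i in range(1, num_columns + 1)]


def single_row(num_columns: int) -> tuple:
    """Return one row of non-null integers, one value per column."""
    return tuple(range(1, num_columns + 1))


def null_count_for_width(spark, num_columns: int) -> int:
    """Build a one-row frame of the given width and count nulls in column1."""
    df = spark.createDataFrame([single_row(num_columns)], column_names(num_columns))
    df.createOrReplaceTempView("myTable")
    rows = spark.sql("select column1 from myTable").collect()
    return sum(1 for row in rows if row[0] is None)


def main() -> None:
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("wide-frame-check")
        .getOrCreate()
    )
    try:
        # Compare a width at the suspected limit with one just above it.
        for width in (200, 201):
            print(width, null_count_for_width(spark, width))
    finally:
        spark.stop()
```

Calling `main()` prints, for each width, how many of the returned `column1` values are null. Since the input row contains no nulls, any count above zero would point at the query path rather than the data.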


