Posted to issues@spark.apache.org by "Michael Nguyen (JIRA)" <ji...@apache.org> on 2016/03/15 17:52:33 UTC

[jira] [Commented] (SPARK-13804) Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows

    [ https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195625#comment-15195625 ] 

Michael Nguyen commented on SPARK-13804:
----------------------------------------

I posted this issue to user@ at 

http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-count-Major-Divergent-Non-Linear-Performance-Slowdown-when-data-set-increases-from-4-millis-td26493.html

However, it has not been accepted by the mailing list yet. What needs to be done for it to be accepted, and what is the typical turnaround time for postings to be accepted?

> Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13804
>                 URL: https://issues.apache.org/jira/browse/SPARK-13804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: - 3 nodes Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 with c3.4xlarge
> - Each node has 16 cores and 30GB of RAM
>            Reporter: Michael Nguyen
>
> Spark SQL is used to load CSV files via com.databricks.spark.csv and then run dataFrame.count(); a minimal sketch of this workload appears after this description.
> In the same environment, with plenty of CPU and RAM, Spark SQL takes
> - 18.25 seconds to load a table with 4 million rows, vs.
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by 4 times, the time it takes Spark SQL to run dataFrame.count() increases by roughly 19 times (346.624 / 18.25 ≈ 19). The performance of dataFrame.count() thus degrades drastically and non-linearly.
> 1. Why is Spark SQL's performance not proportional to the number of rows while there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)?
> 2. What can be done to fix this performance issue?
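
For reference, below is a minimal sketch of the workload described above, assuming Spark 1.6 with the spark-csv package. The application name, file path, and reader options (header, inferSchema) are placeholders, since the report does not say which options were used.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("CountBenchmark") // hypothetical app name
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the CSV files through the spark-csv data source, as in the report.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // assumption: files have a header row
  .option("inferSchema", "true") // assumption: schema inference is enabled
  .load("/data/table_16m.csv")   // hypothetical path

// Time the count the same way the reported numbers were presumably taken.
val t0 = System.nanoTime()
val n = df.count()
println(s"count = $n took ${(System.nanoTime() - t0) / 1e9} seconds")

One detail worth noting: with spark-csv, setting inferSchema to true triggers an extra full scan of the input to infer column types, so supplying an explicit schema via .schema(...) removes one pass over the data. Whether that accounts for the non-linear slowdown reported here is not established by the report.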



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org