Posted to issues@spark.apache.org by "Michael Nguyen (JIRA)" <ji...@apache.org> on 2016/03/11 02:29:40 UTC

[jira] [Comment Edited] (SPARK-13804) Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows

    [ https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189827#comment-15189827 ] 

Michael Nguyen edited comment on SPARK-13804 at 3/11/16 1:28 AM:
-----------------------------------------------------------------

I tracked this issue further, and it is tied to the increase in the size of the data source, so I have updated this issue to reflect that.


was (Author: michaelmnguyen):
HiveThriftServer2 is part of the org.apache.spark.sql.hive.thriftserver package, so this is an issue with Spark SQL. The root cause could also lie in how dynamicDataFrame.registerTempTable interacts with hiveContext.dropTempTable for the same table, possibly only with large tables (7+ million rows). Further analysis is needed to determine the root cause.
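
For context, here is a minimal sketch of the register/drop cycle described above (Spark 1.6 Scala API; the app name, DataFrame, file path, and table name are placeholders, not taken from this report):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("temp-table-cycle"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical reproduction of the suspected pattern: register a
    // temporary table, query it, then drop it.
    val dynamicDataFrame = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/path/to/data.csv")   // hypothetical path
    dynamicDataFrame.registerTempTable("dynamic_table")
    val total = hiveContext.sql("SELECT COUNT(*) FROM dynamic_table")
      .collect()(0).getLong(0)
    hiveContext.dropTempTable("dynamic_table")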

> Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13804
>                 URL: https://issues.apache.org/jira/browse/SPARK-13804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: - 3-node Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 c3.4xlarge instance
> - Each node has 16 cores and 30 GB of RAM
>            Reporter: Michael Nguyen
>
> Spark SQL is used to load CSV files via com.databricks.spark.csv and then run dataFrame.count().
> In the same environment, with plenty of CPU and RAM, Spark SQL takes
> - 18.25 seconds to load a table with 4 million rows vs.
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by 4 times, the time it takes Spark SQL to run dataFrame.count() increases by 19.22 times. The performance of dataFrame.count() therefore degrades drastically and non-linearly.
> 1. Why is Spark SQL's performance not proportional to the number of rows when there is plenty of CPU and RAM (it uses only 10 GB of the 30 GB of RAM)?
> 2. What can be done to fix this performance issue?
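
For reference, a minimal sketch of the load-and-count workflow the report describes (Spark 1.6 Scala API with the spark-csv package; the app name, file path, and reader options are assumptions, not taken from the report):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-count-benchmark"))
    val sqlContext = new SQLContext(sc)

    // Load the CSV through the spark-csv data source. With inferSchema
    // enabled, spark-csv makes an extra full pass over the file to infer
    // column types; the report does not say whether it was enabled.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/table.csv")   // hypothetical path

    val start = System.nanoTime()
    val rows = df.count()
    println(s"count() returned $rows rows in ${(System.nanoTime() - start) / 1e9} s")

One detail worth checking under these assumptions: unless df.cache() is called (and materialized by a prior action), each action re-reads and re-parses the CSV from disk, so the count() timings measure a full scan rather than an in-memory count.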



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org