Posted to issues@spark.apache.org by "Michael Nguyen (JIRA)" <ji...@apache.org> on 2016/03/11 02:27:45 UTC

[jira] [Updated] (SPARK-13804) Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows

     [ https://issues.apache.org/jira/browse/SPARK-13804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Nguyen updated SPARK-13804:
-----------------------------------
    Environment: 
- 3-node Spark cluster: 1 master node and 2 slave nodes
- Each node is an EC2 c3.4xlarge instance
- Each node has 16 cores and 30GB of RAM
    Description: 
Spark SQL is used to load CSV files via com.databricks.spark.csv and then run dataFrame.count().
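
A minimal sketch of that load-and-count path, assuming Spark 1.6 in the Scala shell (where a SparkContext named sc already exists) and the spark-csv package on the classpath; the input path, options, and variable names below are illustrative assumptions, not details from the report:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext sc and com.databricks:spark-csv on the
    // classpath (e.g. started with --packages com.databricks:spark-csv_2.10:1.4.0).
    val sqlContext = new SQLContext(sc)

    // Hypothetical input path and options; the report does not specify them.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // treat the first line as a header row
      .option("inferSchema", "true")   // schema inference adds an extra pass over the file
      .load("/data/events.csv")

    // The call whose runtime grows non-linearly between 4 million and 16 million rows.
    val rowCount = df.count()
    println("row count: " + rowCount)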

In the same environment with plenty of CPU and RAM, Spark SQL takes 
- 18.25 seconds to load a table with 4 million rows, vs
- 346.624 seconds (5.77 minutes) to load a table with 16 million rows.

Even though the number of rows increases by a factor of 4, the time it takes Spark SQL to run dataFrame.count() increases by a factor of 19.22, so the performance of dataFrame.count() degrades far worse than linearly.

1. Why is Spark SQL's performance not proportional to the number of rows when there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)?

2. What can be done to fix this performance issue?

  was:
1. HiveThriftServer2 was started with startWithContext (see the sketch after these steps).

2. Multiple temp tables were loaded and registered via registerTempTable.

3. HiveThriftServer2 was accessed via JDBC to query those tables.

4. Some temp tables were dropped via hiveContext.dropTempTable(registerTableName) and reloaded to refresh their data. These tables contain 1 to 7 million rows.

5. The same queries run in step 3 were re-run over the existing JDBC connection. This time HiveThriftServer2 received those queries, but at times it hung and did not return the results. CPU utilization on both the Spark driver and the child nodes was around 1%, and 10GB of RAM out of 30GB was used on the driver and 3GB out of 30GB on the child nodes, so there was no resource starvation.

6. After waiting about 5 minutes and rerunning the same queries from step 5, HiveThriftServer2 returned the results of those queries fine.

This issue occurs intermittently when steps 1-5 are repeated, so it may take several attempts to reproduce it.
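
A minimal sketch of the setup in steps 1, 2, and 4, assuming Spark 1.6 with a SparkContext named sc already available; the table name, CSV path, and use of spark-csv for loading are illustrative assumptions, not details from the original report:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)

    // Step 1: expose this HiveContext over JDBC (HiveServer2 protocol).
    HiveThriftServer2.startWithContext(hiveContext)

    // Step 2: load a DataFrame and register it as a temp table visible to JDBC clients.
    val events = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/events.csv")           // hypothetical path
    events.registerTempTable("events")    // hypothetical table name

    // Step 4: drop the temp table and re-register a freshly loaded DataFrame
    // under the same name to refresh its data.
    hiveContext.dropTempTable("events")
    val refreshed = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/events.csv")
    refreshed.registerTempTable("events")

JDBC clients connected to the Thrift server (steps 3 and 5) see whatever DataFrame is currently registered under a given name, so the drop-and-re-register pair above corresponds to the refresh described in step 4.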


        Summary: Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows (was: org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 hangs intermittently)

> Spark SQL's DataFrame.count() Major Divergent (Non-Linear) Performance Slowdown going from 4 million rows to 16+ million rows
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13804
>                 URL: https://issues.apache.org/jira/browse/SPARK-13804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: - 3-node Spark cluster: 1 master node and 2 slave nodes
> - Each node is an EC2 c3.4xlarge instance
> - Each node has 16 cores and 30GB of RAM
>            Reporter: Michael Nguyen
>
> Spark SQL is used to load CSV files via com.databricks.spark.csv and then run dataFrame.count().
> In the same environment with plenty of CPU and RAM, Spark SQL takes
> - 18.25 seconds to load a table with 4 million rows, vs
> - 346.624 seconds (5.77 minutes) to load a table with 16 million rows.
> Even though the number of rows increases by a factor of 4, the time it takes Spark SQL to run dataFrame.count() increases by a factor of 19.22, so the performance of dataFrame.count() degrades far worse than linearly.
> 1. Why is Spark SQL's performance not proportional to the number of rows when there is plenty of CPU and RAM (it uses only 10GB out of 30GB of RAM)?
> 2. What can be done to fix this performance issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org