Posted to issues@spark.apache.org by "Chris Sanjiv Xavier (JIRA)" <ji...@apache.org> on 2016/09/01 21:20:20 UTC

[jira] [Created] (SPARK-17366) Temp tables cached in spark - Joins performance

Chris Sanjiv Xavier created SPARK-17366:
-------------------------------------------

             Summary: Temp tables cached in spark - Joins performance
                 Key: SPARK-17366
                 URL: https://issues.apache.org/jira/browse/SPARK-17366
             Project: Spark
          Issue Type: Brainstorming
          Components: SQL
         Environment: Amazon S3
            Reporter: Chris Sanjiv Xavier


Hi,

We have a use case in which Spark runs on an Amazon EC2 instance. We pull data from an S3 bucket, load it into DataFrames, and then cache the tables.

We see serious performance problems when we join two of the cached tables. The join runs very slowly.

Example of the issue:

Table A in memory: 1000 MB
Table B in memory: 1000 MB

We query the data through the SQL interface in a Zeppelin notebook on Amazon.

SELECT * FROM tableA A INNER JOIN tableB B ON A.column1 = B.column1 WHERE B.column2 = 'SPARK';

The above query returns results extremely slowly.

This is a Spark cluster with 6 nodes and close to 250 GB of memory in total.
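
For reference, the setup described above can be sketched roughly as follows in PySpark. This is only a minimal sketch, not the actual job: the S3 paths, the Parquet file format, and the names tableA, tableB, column1 and column2 are placeholders assumed for illustration, and createOrReplaceTempView assumes Spark 2.x.

```python
# Hypothetical S3 locations; the real bucket and file format are not stated.
TABLE_PATHS = {
    "tableA": "s3a://example-bucket/tableA/",
    "tableB": "s3a://example-bucket/tableB/",
}

# The join from the report, with placeholder table and column names.
JOIN_QUERY = """
SELECT *
FROM tableA A
INNER JOIN tableB B ON A.column1 = B.column1
WHERE B.column2 = 'SPARK'
"""


def load_and_cache(spark, paths=TABLE_PATHS):
    """Read each path into a DataFrame, cache it, and register a temp view."""
    for name, path in paths.items():
        df = spark.read.parquet(path)  # assumption: the data is Parquet
        df.cache()   # lazy: only marks the DataFrame for caching
        df.count()   # an action, so the cache is actually populated
        df.createOrReplaceTempView(name)


def run_join(spark):
    """Run the join and print its physical plan to show the join strategy."""
    result = spark.sql(JOIN_QUERY)
    result.explain()
    return result


# Usage (requires a Spark 2.x installation):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("join-repro").getOrCreate()
#   load_and_cache(spark)
#   run_join(spark).show(20)
```

The explain() output shows which join strategy Spark chose and whether the plan reads from the in-memory cache, which may help narrow down where the time goes.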



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)