Posted to issues@spark.apache.org by "Chris Sanjiv Xavier (JIRA)" <ji...@apache.org> on 2016/09/01 21:20:20 UTC
[jira] [Created] (SPARK-17366) Temp tables cached in spark - Joins performance
Chris Sanjiv Xavier created SPARK-17366:
-------------------------------------------
Summary: Temp tables cached in spark - Joins performance
Key: SPARK-17366
URL: https://issues.apache.org/jira/browse/SPARK-17366
Project: Spark
Issue Type: Brainstorming
Components: SQL
Environment: Amazon S3
Reporter: Chris Sanjiv Xavier
Hi,
I have a use case in which Spark runs on an Amazon EC2 instance. We pull data from an S3 bucket into DataFrames and then cache the resulting tables.
We face significant performance issues when we join two of the cached tables: the join runs very slowly.
Example of the issue:
Table A in memory: 1000 MB
Table B in memory: 1000 MB
We query the data through the SQL interface in a Zeppelin notebook on Amazon.
SELECT * FROM A INNER JOIN B ON A.column1 = B.column1 WHERE B.column2 = 'SPARK';
The above query returns results extremely slowly.
This is a Spark cluster with 6 nodes and close to 250 GB of memory in total.
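To make the shape of the query concrete, here is a minimal sketch that runs the same inner join and filter on a tiny dataset. It uses Python's built-in sqlite3 module purely as a stand-in for the Spark SQL interface. The table names (table_a, table_b), the payload column, and all rows are hypothetical and are not taken from the real data. It only illustrates what the query returns and says nothing about how Spark executes or optimizes the join.

```python
import sqlite3

# In-memory database standing in for the two cached tables.
# Table names, the "payload" column, and all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (column1 INTEGER, payload TEXT)")
conn.execute("CREATE TABLE table_b (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(1, "a1"), (2, "a2"), (3, "a3")],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [(1, "SPARK"), (2, "HADOOP"), (4, "SPARK")],
)

# Same shape as the query in the report: inner join on column1,
# then keep only the rows where B.column2 = 'SPARK'.
query = """
SELECT *
FROM table_a AS a
INNER JOIN table_b AS b ON a.column1 = b.column1
WHERE b.column2 = 'SPARK'
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Only the key present in both tables whose column2 value is 'SPARK' survives, so the filter on B discards most of B's rows before or after the join, depending on the engine's plan.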