Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/09/01 21:37:20 UTC

[jira] [Commented] (SPARK-17366) Temp tables cached in spark - Joins performance

    [ https://issues.apache.org/jira/browse/SPARK-17366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456694#comment-15456694 ] 

Herman van Hovell commented on SPARK-17366:
-------------------------------------------

[~chris_sanjiv] Is this a question or a bug report?

There are too many unknowns to be able to help you. What version of Spark are you using? What does your plan look like? What do you mean by too slow? Is the data you are joining skewed?
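
To make the skew question concrete, here is a minimal sketch in plain Python (not Spark code) of what a skewed join key looks like. The helper name key_skew and the sample data are hypothetical and only for illustration; on a real table the same numbers would come from a group-by count on the join column.

```python
from collections import Counter


def key_skew(keys, top_n=3):
    """Summarise how unevenly rows are spread across join-key values.

    Returns a pair:
      - the top_n most frequent keys, each with its share of all rows
      - the ratio of the largest key's row count to the mean rows per key
        (1.0 means perfectly even; larger values mean more skew)
    """
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    mean_rows_per_key = total / len(counts)
    top = [(key, count / total) for key, count in counts.most_common(top_n)]
    ratio = counts.most_common(1)[0][1] / mean_rows_per_key
    return top, ratio


# Hypothetical join-key column in which a single value dominates.
sample_keys = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
top, ratio = key_skew(sample_keys)
print(top)
print(ratio)
```

If one key holds most of the rows, as "a" does here, the task that handles that key in a join gets most of the work, which is one common reason a join over otherwise small cached tables can be slow.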

> Temp tables cached in spark - Joins performance
> -----------------------------------------------
>
>                 Key: SPARK-17366
>                 URL: https://issues.apache.org/jira/browse/SPARK-17366
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: SQL
>         Environment: Amazon S3
>            Reporter: Chris Sanjiv Xavier
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Hi,
> I have a use case in which Spark runs on an EC2 instance from Amazon. We are pulling data from an S3 bucket. We pull the data into DataFrames and then cache the tables.
> We face a lot of performance issues when we try to join the two tables that have been cached. It runs really slowly.
> Example of the issue:
> Table A in memory: 1000 MB
> Table B in memory: 1000 MB
> Pulling data using the SQL interface in a Zeppelin notebook on Amazon.
> SELECT * FROM A INNER JOIN B ON A.column1 = B.column1 WHERE B.column2 = 'SPARK';
> The above query returns results extremely slowly.
> This is a Spark cluster with 6 nodes holding close to 250 GB of memory in total.


