You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "irfan (JIRA)" <ji...@apache.org> on 2015/01/28 11:46:34 UTC

[jira] [Reopened] (SPARK-5452) We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue

     [ https://issues.apache.org/jira/browse/SPARK-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

irfan reopened SPARK-5452:
--------------------------

please provide some inputs.
this looks like performance issue considering the configuration and environment.

> We are migrating Tera Data SQL to Spark SQL. Query is taking long time. Please have a look on this issue
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5452
>                 URL: https://issues.apache.org/jira/browse/SPARK-5452
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Shell
>    Affects Versions: 1.2.0
>            Reporter: irfan
>              Labels: SparkSql
>
> Hi Team,
> we are migrating TeraData SQL to Spark SQL because of complexity we have spilted into below 4 sub-quries
> and we are running through  hive context
> ====================================================
> val HIVETMP1 = hc.sql("SELECT PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID,LMS_ACCOUNT_ID AS LMS_ACCOUNT_ID FROM VW_PARTY_ACCOUNT WHERE  PARTY_ACCOUNT_TYPE_CODE IN('04') AND  LMS_ACCOUNT_ID  IS NOT NULL")
> HIVETMP1.registerTempTable("VW_HIVETMP1")
> val HIVETMP2 = hc.sql("SELECT PACCNT.LMS_ACCOUNT_ID AS  LMS_ACCOUNT_ID, 'NULL' AS  RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  MOST_RECENT_SPEND_LA ,STXN.PARTY_ACCOUNT_ID AS  MAX_SPEND_12WKS_LA ,STXN.MAX_SPEND_12WKS_LADATE  AS MAX_SPEND_12WKS_LADATE FROM VW_HIVETMP1 AS PACCNT  INNER JOIN (SELECT STXTMP.PARTY_ACCOUNT_ID AS PARTY_ACCOUNT_ID, SUM(CASE WHEN (CAST(STXTMP.TRANSACTION_DATE AS DATE ) > DATE_SUB(CAST(CONCAT(SUBSTRING(SYSTMP.OPTION_VAL,1,4),'-',SUBSTRING(SYSTMP.OPTION_VAL,5,2),'-',SUBSTRING(SYSTMP.OPTION_VAL,7,2)) AS DATE),84)) THEN STXTMP.TRANSACTION_VALUE ELSE 0.00 END) AS MAX_SPEND_12WKS_LADATE FROM VW_SHOPPING_TRANSACTION_TABLE AS STXTMP INNER JOIN SYSTEM_OPTION_TABLE AS SYSTMP ON STXTMP.FLAG == SYSTMP.FLAG AND  SYSTMP.OPTION_NAME = 'RID' AND STXTMP.PARTY_ACCOUNT_TYPE_CODE IN('04') GROUP BY STXTMP.PARTY_ACCOUNT_ID) AS STXN ON PACCNT.PARTY_ACCOUNT_ID = STXN.PARTY_ACCOUNT_ID WHERE  STXN.MAX_SPEND_12WKS_LADATE IS NOT NULL")
> HIVETMP2.registerTempTable("VW_HIVETMP2")
> val HIVETMP3 = hc.sql("SELECT LMS_ACCOUNT_ID,MAX(MAX_SPEND_12WKS_LA) AS MAX_SPEND_12WKS_LA, 1 AS RANK FROM VW_HIVETMP2 GROUP BY LMS_ACCOUNT_ID")
> HIVETMP3.registerTempTable("VW_HIVETMP3")
> val HIVETMP4 = hc.sql(" SELECT PACCNT.LMS_ACCOUNT_ID,'NULL' AS  RANDOM_PARTY_ACCOUNT_ID ,'NULL' AS  MOST_RECENT_SPEND_LA,STXN.MAX_SPEND_12WKS_LA AS MAX_SPEND_12WKS_LA,1 AS RANK1 FROM VW_HIVETMP2 AS PACCNT INNER JOIN VW_HIVETMP3 AS STXN ON PACCNT.LMS_ACCOUNT_ID = STXN.LMS_ACCOUNT_ID AND PACCNT.MAX_SPEND_12WKS_LA = STXN.MAX_SPEND_12WKS_LA")
> HIVETMP4.registerTempTable("WT03_ACCOUNT_BHVR3")
> HIVETMP4.saveAsTextFile("hdfs:/file/")
> ==========================
> This query has two Group By clauses which are running on huge files(19.5GB). And the query took 40min to get the final result. Is there any changes required in run time environment or Configuration Setting in Spark which can improve the query performance.
> below are our Environment and configuration details:
> Environment  details:
> No of nodes:4
> capacity on each node:62 GB RAM on each node.
> Storage capacity     :9TB on each node
> total cores          :48  
> Spark Configuration:
>  
> .set("spark.default.parallelism","64")
> .set("spark.driver.maxResultSize","2G")
> .set("spark.driver.memory","10g")
> .set("spark.rdd.compress","true")
> .set("spark.shuffle.spill.compress","true")
> .set("spark.shuffle.compress","true")
> .set("spark.shuffle.consolidateFiles","true/false")
> .set("spark.shuffle.spill","true/false") 
>  
> Data file size :
> SHOPPING_TRANSACTION 19.5GB
> PARTY_ACCOUNT        1.4GB
> SYSTEM_OPTIONS       11.6K
> please help us to resolve above issue.
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org