Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/12/29 10:59:49 UTC

[jira] [Resolved] (SPARK-12553) join is absolutely slow

     [ https://issues.apache.org/jira/browse/SPARK-12553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-12553.
-------------------------------
    Resolution: Invalid

Please put questions to user@. Without knowing what the data is like or how you are running this, it's not possible to evaluate whether something is "too slow" or not.
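
It is also worth noting that the 2.3 s measured around hc.sql() below almost certainly reflects only logical-plan construction: DataFrame operations in Spark are lazy, and the joins do not actually execute until an action such as save() runs. A minimal plain-Python sketch of the same lazy-vs-eager distinction (the generator and timing variables are illustrative stand-ins, not Spark APIs):

```python
import time

def expensive_rows(n):
    """Simulate a costly scan: each row takes real work to produce."""
    for i in range(n):
        sum(range(1000))  # stand-in for I/O or a shuffle
        yield i

# "Building the query" is lazy: creating the pipeline is near-instant.
start = time.time()
pipeline = (r * 2 for r in expensive_rows(10_000))
build_time = time.time() - start

# "Saving" forces evaluation: only now does the real work happen.
start = time.time()
result = list(pipeline)
run_time = time.time() - start

assert build_time < run_time  # construction is cheap; execution is not
```

By the same logic, timing the save() call rather than the hc.sql() call is what would reveal where the actual join work is spent.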

> join is absolutely slow
> -----------------------
>
>                 Key: SPARK-12553
>                 URL: https://issues.apache.org/jira/browse/SPARK-12553
>             Project: Spark
>          Issue Type: Bug
>         Environment: cloudera cdh 5 
> centos 6 
>            Reporter: malouke
>
> Hello,
> I have 7 tables to join with a left join. I did this:
> start = time.time()
> df_test = hc.sql("select * from rapexp201412 left join CLIENT1412 on rapNUMCNT = CLINMCLI \
>                left join SRN1412 on SRNSIRET = CLISIRET \
>                left join bodacc2014 on SRNSIREN = bodSORCS \
>                left join sinagr2014 on rapNUMCNT = sinagNUMCNT \
>                left join sinfix2014 on rapNUMCNT = sinfiNUMCNT \
>                left join sinimag2014 on rapNUMCNT = sinimNUMCNT \
>                left join up2014 on rapNUMCNT = up2NUMCNT \
>                left join upagr2014 on rapNUMCNT = upaNUMCNT \
>                left join aeveh on rapNUMCNT = aevNUMCNT \
>                left join premiums on rapNUMCNT = priNUMCNT")
> time.time() - start
> takes: 2.289154052734375 s
> after I do:
> df_test.save("/group/afra_churn_auto/raw/IARD_ENTREPRISE/data/dfc2_join/", source='parquet', mode='overwrite', \
>       partitionBy="date_part")
> Here is the configuration of my PySpark session:
> sc._conf.set(u'spark.dynamicAllocation.enabled', u'false') \
> .set(u'spark.eventLog.enabled', u'true') \
> .set(u'spark.shuffle.service.enabled', u'false') \
> .set(u'spark.yarn.historyServer.address', u'http://prssncdhna02.bigplay.bigdata:18088') \
> .set(u'spark.driver.port', u'54330') \
> .set(u'spark.eventLog.dir', u'hdfs://bigplay-nameservice/user/spark/applicationHistory') \
> .set(u'spark.blockManager.port', u'54332') \
> .set(u'spark.yarn.jar', u'local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar') \
> .set(u'spark.dynamicAllocation.executorIdleTimeout', u'60') \
> .set(u'spark.serializer', u'org.apache.spark.serializer.KryoSerializer') \
> .set(u'spark.authenticate', u'false') \
> .set(u'spark.serializer.objectStreamReset', u'100') \
> .set(u'spark.submit.deployMode', u'client') \
> .set(u'spark.executor.memory', u'4g') \
> .set(u'spark.master', u'yarn-client') \
> .set(u'spark.driver.memory', u'10g') \
> .set(u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
> .set(u'spark.dynamicAllocation.schedulerBacklogTimeout', u'1') \
> .set(u'spark.executor.instances', u'8') \
> .set(u'spark.shuffle.service.port', u'7337') \
> .set(u'spark.fileserver.port', u'54331') \
> .set(u'spark.app.name', u'PySparkShell') \
> .set(u'spark.yarn.config.gatewayPath', u'/opt/cloudera/parcels') \
> .set(u'spark.rdd.compress', u'True') \
> .set(u'spark.yarn.config.replacementPath', u'{{HADOOP_COMMON_HOME}}/../../..') \
> .set(u'spark.yarn.isPython', u'true') \
> .set(u'spark.dynamicAllocation.minExecutors', u'0') \
> .set(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
> .set(u'spark.ui.proxyBase', u'/proxy/application_1450819756020_0615') \
> .set(u'spark.yarn.am.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native') \
> .set(u'hadoop.major.version', u'yarn') \
> .set(u'spark.version', u'1.5.2')
> I do not understand why the join does not work.
> Thank you in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org