Posted to issues@spark.apache.org by "Ajay Cherukuri (JIRA)" <ji...@apache.org> on 2017/06/06 17:47:18 UTC

[jira] [Commented] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

    [ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16039330#comment-16039330 ] 

Ajay Cherukuri commented on SPARK-18372:
----------------------------------------

I have this issue in Spark 2.0.2
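
For reference, on 2.0.2 the same steps go through the SparkSession entry point instead of HiveContext; a minimal sketch of the equivalent reproduction (assuming the same table names as in the report below, with hivesampletablecopypy already created as described there):

import org.apache.spark.sql.SparkSession

// In spark-shell the session "spark" already exists (with Hive support when available);
// getOrCreate() just returns it, so the builder call only matters in a standalone app.
val spark = SparkSession.builder().appName("hive-staging-repro").enableHiveSupport().getOrCreate()

// Same write/read sequence as in the report; each insert leaves a .hive-staging_* directory
// behind under the table location.
spark.table("hivesampletable").write.insertInto("hivesampletablecopypy")
spark.sql("SELECT clientid, querytime FROM hivesampletablecopypy LIMIT 10").show()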

> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn 
>            Reporter: Mingjie Tang
>            Assignee: Mingjie Tang
>             Fix For: 1.6.4
>
>         Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell 
> 2. Run the following Scala code via spark-shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders never get cleaned up and keep accumulating.
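> The leftover directories can be confirmed from spark-shell with the Hadoop FileSystem API; a minimal sketch, assuming sc is in scope and the table location shown above:
> import org.apache.hadoop.fs.{FileSystem, Path}
> // Table location as observed above; adjust to the actual warehouse root.
> val tableDir = new Path("hive/warehouse/hivesampletablecopypy")
> val fs = FileSystem.get(sc.hadoopConfiguration)
> // All of the leftover directories start with ".hive-staging".
> val staging = fs.listStatus(tableDir).filter(_.getPath.getName.startsWith(".hive-staging"))
> staging.foreach(s => println(s"${s.getPath}  modified=${s.getModificationTime}"))
> // Once no job is writing to the table, they can be deleted recursively:
> // staging.foreach(s => fs.delete(s.getPath, true))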
> =====
> With the customer, we have tried setting "hive.exec.stagingdir=/tmp/hive" in hive-site.xml; it didn't make any difference.
> The .hive-staging folders are created under the <TableName> folder (hive/warehouse/hivesampletablecopypy/).
> We have also tried adding the following property to hive-site.xml and restarting the components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created under the hive/warehouse/<tablename> folder.
> Moreover, note that if we run the same query in pure Hive via the Hive CLI on the same cluster, we don't see this behavior,
> so it doesn't appear to be a Hive issue in this case; this is Spark behavior.
> I checked in Ambari; spark.yarn.preserve.staging.files is already set to false in the Spark configuration.
> The issue happens via spark-submit as well; the customer used the following command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
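> The contents of test-hive-staging-cleanup.py are not shown here; a minimal Scala equivalent of such a job, packaged into a jar and launched the same way with spark-submit, might look like this (sketch; the object name is hypothetical):
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.hive.HiveContext
> object HiveStagingRepro {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("hive-staging-repro"))
>     val sqlContext = new HiveContext(sc)
>     // Each insert into the Hive table goes through a .hive-staging_* directory under the
>     // table location; those directories are what keep accumulating after the job finishes.
>     sqlContext.table("hivesampletable").write.insertInto("hivesampletablecopypy")
>     sqlContext.sql("SELECT clientid, querytime FROM hivesampletablecopypy LIMIT 10").show()
>     sc.stop()
>   }
> }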



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org