Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/11/09 00:52:58 UTC

[jira] [Assigned] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

     [ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18372:
------------------------------------

    Assignee: Apache Spark

> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn 
>            Reporter: mingjie tang
>            Assignee: Apache Spark
>             Fix For: 2.0.1
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these directories never get cleaned up and keep accumulating.
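> To confirm the accumulation, the leftover directories can be listed from the spark-shell with the Hadoop FileSystem API. This is only an illustrative sketch, not part of the original report; the table path below is an assumption and will differ per cluster (WASB in our case):
>
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> val tableDir = new Path("/hive/warehouse/hivesampletablecopypy")   // adjust to the actual table location
> val fs = FileSystem.get(sc.hadoopConfiguration)                    // sc is the spark-shell SparkContext
> // Every insert leaves another .hive-staging entry behind, so this list keeps growing.
> val staging = fs.listStatus(tableDir).filter(_.getPath.getName.startsWith(".hive-staging"))
> staging.foreach(s => println(s.getPath))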
> =====
> With the customer, we have tried setting "SET hive.exec.stagingdir=/tmp/hive;" in hive-site.xml, but it did not make any difference.
> .hive-staging folders are created under the <TableName> folder - hive/warehouse/hivesampletablecopypy/
> We have tried adding this property to hive-site.xml and restarting the components:
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
> Moreover, if we run the same query in pure Hive via the Hive CLI on the same cluster, we do not see this behavior,
> so it does not appear to be a Hive issue in this case; it is Spark-specific behavior.
> I checked in Ambari; spark.yarn.preserve.staging.files is already set to false in the Spark configuration.
> The issue happens via spark-submit as well; the customer used the following command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
> Solution: 
> This bug was reported by customers.
> The root cause is that org.apache.spark.sql.hive.execution.InsertIntoHiveTable calls Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, this staging directory would be removed when the Hive session expires; however, Spark never notifies Hive to remove the staging files.
> Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable that creates the .hive-staging directory itself, so that after the Spark session expires the .hive-staging directory is removed.
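> To make the intent concrete, here is a minimal Scala sketch of that approach, assuming a helper method added inside InsertIntoHiveTable; the method name getStagingDir and its signature are illustrative only and are not taken from the actual pull request:
>
> import java.io.IOException
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> // Create the .hive-staging directory from Spark itself and make sure it is removed
> // when the session/JVM goes away, instead of relying on Hive's session cleanup,
> // which Spark never triggers.
> def getStagingDir(inputPath: Path, hadoopConf: Configuration): Path = {
>   val fs = inputPath.getFileSystem(hadoopConf)
>   val stagingDir = new Path(inputPath, ".hive-staging_spark_" + System.currentTimeMillis())
>   if (!fs.mkdirs(stagingDir)) {
>     throw new IOException("Cannot create staging directory: " + stagingDir)
>   }
>   fs.deleteOnExit(stagingDir)      // deleted when the FileSystem is closed
>   sys.addShutdownHook {            // fallback: delete on JVM shutdown
>     if (fs.exists(stagingDir)) fs.delete(stagingDir, true)
>   }
>   stagingDir
> }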
> This update has been tested against Spark 1.5.2 and Spark 1.6.3, and the pull request is:
> For testing, I manually checked that no .hive-staging files remain under the table's directory after the spark-shell is closed. Meanwhile, please advise how to write an automated test case, since the test cannot easily obtain the directory for the related tables. One possible shape for such a check is sketched below.
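> This is only a sketch, assuming the test can resolve the table location from the Hive metastore (which is exactly the part that is unclear); assertNoStagingLeftovers is a hypothetical helper, not existing test code:
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
>
> // After the insert has run and the session is closed, the table directory
> // should contain no .hive-staging entries.
> def assertNoStagingLeftovers(tableLocation: String, hadoopConf: Configuration): Unit = {
>   val dir = new Path(tableLocation)
>   val fs = dir.getFileSystem(hadoopConf)
>   val leftovers = fs.listStatus(dir).filter(_.getPath.getName.startsWith(".hive-staging"))
>   assert(leftovers.isEmpty, "staging directories not cleaned up: " + leftovers.map(_.getPath).mkString(", "))
> }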



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org