Posted to issues@spark.apache.org by "mingjie tang (JIRA)" <ji...@apache.org> on 2016/11/09 00:55:59 UTC

[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

     [ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mingjie tang updated SPARK-18372:
---------------------------------
    Description: 
Steps to reproduce:
================
1. Launch spark-shell 
2. Run the following Scala code via spark-shell:
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
scala> hivesampletablecopypydfdf.show
3. In HDFS (in our case, WASB), we can see the following folders:
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
The issue is that these staging folders never get cleaned up and keep accumulating.
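For illustration, the leftover directories can be listed from the same spark-shell session with the Hadoop FileSystem API (assuming the table directory from the repro above; adjust the path to your warehouse location):
scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val tableDir = new Path("hive/warehouse/hivesampletablecopypy")
scala> val fs = tableDir.getFileSystem(sc.hadoopConfiguration)
scala> fs.globStatus(new Path(tableDir, ".hive-staging*")).foreach(s => println(s.getPath))
Each run of the insert adds another .hive-staging_* entry to this listing.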
=====
With the customer, we tried setting hive.exec.stagingdir=/tmp/hive in hive-site.xml; it did not make any difference.
The .hive-staging folders are created under the <TableName> folder, i.e. hive/warehouse/hivesampletablecopypy/.
We also tried adding the following property to hive-site.xml and restarting the components:
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>
Even with this property in place, a new .hive-staging folder was still created under the hive/warehouse/<tablename> folder.
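As a sanity check (a suggested diagnostic, not something the customer ran), the running HiveContext can be asked which staging/scratch directories it actually resolved; if these still show the defaults, the hive-site.xml override is not being picked up by Spark at all:
scala> sqlContext.sql("SET hive.exec.stagingdir").collect().foreach(println)
scala> sqlContext.sql("SET hive.exec.scratchdir").collect().foreach(println)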
Moreover, if we run the same Hive query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior,
so this does not appear to be a Hive issue; it is Spark behavior.
I checked in Ambari: spark.yarn.preserve.staging.files is already set to false in the Spark configuration.
The issue happens via spark-submit as well; the customer used the following command to reproduce it:
spark-submit test-hive-staging-cleanup.py


  was:
Steps to reproduce:
================
1. Launch spark-shell 
2. Run the following Scala code via spark-shell:
scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
scala> import org.apache.spark.sql.DataFrameWriter 
scala> val dfw : DataFrameWriter = hivesampletabledf.write 
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
scala> dfw.insertInto("hivesampletablecopypy") 
scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
scala> hivesampletablecopypydfdf.show
3. In HDFS (in our case, WASB), we can see the following folders:
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000 
hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
The issue is that these staging folders never get cleaned up and keep accumulating.
=====
With the customer, we tried setting hive.exec.stagingdir=/tmp/hive in hive-site.xml; it did not make any difference.
The .hive-staging folders are created under the <TableName> folder, i.e. hive/warehouse/hivesampletablecopypy/.
We also tried adding the following property to hive-site.xml and restarting the components:
<property>
  <name>hive.exec.stagingdir</name>
  <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
</property>
Even with this property in place, a new .hive-staging folder was still created under the hive/warehouse/<tablename> folder.
Moreover, if we run the same Hive query in pure Hive via the Hive CLI on the same Spark cluster, we do not see this behavior,
so this does not appear to be a Hive issue; it is Spark behavior.
I checked in Ambari: spark.yarn.preserve.staging.files is already set to false in the Spark configuration.
The issue happens via spark-submit as well; the customer used the following command to reproduce it:
spark-submit test-hive-staging-cleanup.py

Solution: 
This bug was reported by customers.
The cause is that org.apache.spark.sql.hive.InsertIntoHiveTable calls the Hive classes (org.apache.hadoop.hive.*) to create the staging directory. By default, on the Hive side, the staging files are removed once the Hive session expires; however, Spark never notifies Hive to remove them.
Thus, following the Spark 2.0.x code, I wrote a function inside InsertIntoHiveTable that creates the .hive-staging directory itself, so that the directory is removed once the Spark session ends.
This update has been tested on Spark 1.5.2 and Spark 1.6.3, and the pull request is: 
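The patch itself is not reproduced here; the following is only a sketch of the idea (illustrative names, assuming the Hadoop FileSystem API): create the staging directory from the Spark side, register it for deletion when the session's JVM exits, and delete it explicitly once the insert has committed its output.

import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StagingDirSketch {
  // Create a hidden staging dir next to the table, following Hive's naming convention.
  def createStagingDir(tableDir: String, hadoopConf: Configuration): Path = {
    val stagingDir = new Path(tableDir, ".hive-staging_spark_" + System.currentTimeMillis())
    val fs = stagingDir.getFileSystem(hadoopConf)
    if (!fs.mkdirs(stagingDir)) {
      throw new IOException("Cannot create staging directory: " + stagingDir)
    }
    // Ask the file system to remove the directory when this JVM (the Spark
    // session) exits, in case the explicit cleanup below never runs.
    fs.deleteOnExit(stagingDir)
    stagingDir
  }

  // Explicit cleanup once the insert has moved its output into the table.
  def deleteStagingDir(stagingDir: Path, hadoopConf: Configuration): Unit = {
    val fs = stagingDir.getFileSystem(hadoopConf)
    if (fs.exists(stagingDir)) {
      fs.delete(stagingDir, true) // recursive delete of the whole staging tree
    }
  }
}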

For testing, I manually checked that no .hive-staging files remain under the table's directory after the spark-shell closes. Meanwhile, please advise how to write a proper test case, because the test cannot easily obtain the directory of the related tables.
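One possible shape for such a test, assuming the test can learn the table's location (which is exactly the open question above), is to run the insert and then assert that the table directory contains no .hive-staging entries:

import org.apache.hadoop.fs.{FileSystem, Path}

// "tableLocation" is a placeholder; it would have to come from the metastore
// (e.g. DESCRIBE EXTENDED) or from the test suite's own warehouse directory.
def assertNoStagingLeftovers(tableLocation: String, fs: FileSystem): Unit = {
  val leftovers = fs.listStatus(new Path(tableLocation))
    .map(_.getPath.getName)
    .filter(_.startsWith(".hive-staging"))
  assert(leftovers.isEmpty, "staging dirs not cleaned up: " + leftovers.mkString(", "))
}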


> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn 
>            Reporter: mingjie tang
>             Fix For: 2.0.1
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org