Posted to issues@spark.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2018/10/19 10:49:00 UTC

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

    [ https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656631#comment-16656631 ] 

Steve Loughran edited comment on SPARK-21725 at 10/19/18 10:48 AM:
-------------------------------------------------------------------

bq. can we fix it on the Hadoop side?


The only way to handle close() when more than one client holds the same FS instance would be to move to reference-counted filesystems everywhere. Otherwise:

* Applications which know they get a unique FS instance need to call close() on it. This matters especially for those connectors (object stores, etc.) which create thread pools, HTTP connection pools and the like.
* Applications which don't set up a unique FS instance must not call close().

Reference-counted FS clients would be the cleanest way to do this, but I suspect it is too late to retrofit them now.
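To make the reference-counting idea concrete, here is a minimal sketch in plain Java (this is NOT Hadoop code; the class and method names are hypothetical). close()/release() only destroys the shared client once the last holder has let go, which is what would make close() safe for cached filesystems:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a reference-counted client cache.
 * acquire() hands out a shared client per key and bumps its count;
 * release() only evicts (i.e. really closes) the client when the
 * count drops to zero.
 */
class RefCountedCache {
    private static final class Entry {
        final String client;   // stands in for a real FS client object
        int refs;
        Entry(String client) { this.client = client; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private int clientsCreated = 0;

    synchronized String acquire(String key) {
        Entry e = cache.get(key);
        if (e == null) {
            e = new Entry("client-" + (++clientsCreated));
            cache.put(key, e);
        }
        e.refs++;
        return e.client;
    }

    synchronized void release(String key) {
        Entry e = cache.get(key);
        if (e != null && --e.refs == 0) {
            cache.remove(key);   // a real implementation would close the client here
        }
    }

    synchronized int liveClients() { return cache.size(); }
}
```

With this scheme, a second session releasing "its" filesystem would not tear down the instance the first session is still using.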

see: HADOOP-10792, HADOOP-4655, etc.

The general assumption is: if you want to manage the lifespan of your FS instance, create a unique one yourself with {{FileSystem.newInstance()}}. That method has been there since Hadoop 0.21, so there's no reason not to adopt it.
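The cached-vs-unique contrast can be modelled in a few lines of plain Java (again NOT the real Hadoop classes, just a toy with the same shape): get() hands every caller the same shared instance, so one caller's close() poisons everyone else with "Filesystem closed", while newInstance() is private to its creator and safe to close:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the FileSystem cache semantics. get(uri) returns a
 * shared, cached instance; newInstance(uri) returns a fresh instance
 * the caller owns. Closing the shared instance breaks every other
 * holder of it; closing a newInstance() result affects nobody else.
 */
class ToyFileSystem {
    private static final Map<String, ToyFileSystem> CACHE = new HashMap<>();
    private boolean closed = false;

    static synchronized ToyFileSystem get(String uri) {
        return CACHE.computeIfAbsent(uri, u -> new ToyFileSystem());
    }

    static ToyFileSystem newInstance(String uri) {
        return new ToyFileSystem();   // never cached; caller manages lifespan
    }

    void read() throws IOException {
        if (closed) throw new IOException("Filesystem closed");
    }

    void close() { closed = true; }
}
```

This is exactly the failure mode in the stack trace below: one Hive session closes the cached FS, and the next operation on the same cached instance fails with {{java.io.IOException: Filesystem closed}}.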



> spark thriftserver insert overwrite table partition select 
> -----------------------------------------------------------
>
>                 Key: SPARK-21725
>                 URL: https://issues.apache.org/jira/browse/SPARK-21725
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>         Environment: centos 6.7 spark 2.1  jdk8
>            Reporter: xinzhang
>            Priority: Major
>              Labels: spark-sql
>
> Use the thriftserver to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -------------------------------------------------------------------------------------
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ......
> ......
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053512282-2/-ext-10000/part-00000 to destination hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-00000
>         at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
>         at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
>         at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
>         at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
>         ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> ....
> -------------------------------------------------------------------------------------
> The doc describing Parquet tables is here: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
> I am confused: the problem appears with a partitioned table, but everything is fine with an unpartitioned table. Does that mean Spark is not using its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org