Posted to issues@spark.apache.org by "CHC (Jira)" <ji...@apache.org> on 2022/04/05 06:10:00 UTC
[jira] [Commented] (SPARK-31675) Fail to insert data to a table with remote location which causes by hive encryption check
[ https://issues.apache.org/jira/browse/SPARK-31675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517237#comment-17517237 ]
CHC commented on SPARK-31675:
-----------------------------
I hit the same problem. The following SQL reproduces it:
{code:sql}
CREATE TABLE `spark3_snap`( `id` string) PARTITIONED BY (`dt` string)
STORED AS ORC LOCATION 'hdfs://path/to/spark3_snap';
-- The file system of the partition location is different from the file system of the table location,
-- one is S3A, the other is HDFS
alter table tmp.spark3_snap add partition (dt='2020-09-10')
LOCATION 's3a://path/to/spark3_snap/dt=2020-09-10';
insert overwrite table tmp.spark3_snap partition(dt)
select '10' id, '2020-09-09' dt
union
select '20' id, '2020-09-10' dt
;
{code}
The insert fails with the following exception:
{code:none}
java.lang.IllegalArgumentException: Wrong FS: s3a://path/to/spark3_snap/dt=2020-09-10, expected: hdfs://cluster1
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:666)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:214)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:816)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:812)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:823)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$6(HadoopMapReduceCommitProtocol.scala:194)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.$anonfun$commitJob$6$adapted(HadoopMapReduceCommitProtocol.scala:194)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:141)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:194)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:240)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:605)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:240)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:187)
at ......
{code}
I will submit a PR later so that `HadoopMapReduceCommitProtocol` renames and deletes files correctly when the paths are on different file systems.
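To make the failure mode concrete, here is a minimal, self-contained sketch of the idea. It does not use Hadoop or Spark code: `StubFileSystem`, `FileSystemRegistry`, and `PerPathFsSketch` are hypothetical stand-ins, the path check is a simplified model of what `FileSystem.checkPath` enforces, and the `hdfs://cluster1/warehouse/...` path is an assumed example. It only illustrates why deleting every path through one file system instance fails, and how resolving the file system per path avoids that. It is not the actual patch.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for a file system bound to one URI (not Hadoop code). */
final class StubFileSystem {
    private final URI root;
    final List<URI> deleted = new ArrayList<>();

    StubFileSystem(String root) { this.root = URI.create(root); }

    /** Simplified check: reject paths whose scheme or authority differs from ours. */
    void checkPath(URI path) {
        boolean sameScheme = root.getScheme().equalsIgnoreCase(path.getScheme());
        boolean sameAuthority = Objects.equals(root.getAuthority(), path.getAuthority());
        if (!sameScheme || !sameAuthority) {
            throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + root);
        }
    }

    /** Records the path as deleted after validating that it belongs to this file system. */
    void delete(URI path) {
        checkPath(path);
        deleted.add(path);
    }
}

/** Hypothetical registry that maps "scheme://authority" to a file system. */
final class FileSystemRegistry {
    private final Map<String, StubFileSystem> byRoot = new HashMap<>();

    StubFileSystem register(String root) {
        StubFileSystem fs = new StubFileSystem(root);
        URI uri = URI.create(root);
        byRoot.put(uri.getScheme() + "://" + uri.getAuthority(), fs);
        return fs;
    }

    /** Looks up the file system that owns the given path. */
    StubFileSystem forPath(URI path) {
        String key = path.getScheme() + "://" + path.getAuthority();
        StubFileSystem fs = byRoot.get(key);
        if (fs == null) {
            throw new IllegalArgumentException("No file system registered for " + key);
        }
        return fs;
    }
}

final class PerPathFsSketch {
    /** Problematic pattern: every path goes through one fixed file system. */
    static void deleteAllWith(StubFileSystem fs, List<URI> paths) {
        for (URI p : paths) {
            fs.delete(p);
        }
    }

    /** Sketch of the fix: resolve the owning file system for each path first. */
    static void deleteAllResolved(FileSystemRegistry registry, List<URI> paths) {
        for (URI p : paths) {
            registry.forPath(p).delete(p);
        }
    }

    public static void main(String[] args) {
        FileSystemRegistry registry = new FileSystemRegistry();
        StubFileSystem hdfs = registry.register("hdfs://cluster1");
        registry.register("s3a://path");
        List<URI> partitions = Arrays.asList(
            URI.create("hdfs://cluster1/warehouse/spark3_snap/dt=2020-09-09"), // assumed path
            URI.create("s3a://path/to/spark3_snap/dt=2020-09-10"));
        try {
            deleteAllWith(hdfs, partitions);
        } catch (IllegalArgumentException e) {
            System.out.println("single FS failed: " + e.getMessage());
        }
        deleteAllResolved(registry, partitions);
        System.out.println("per-path resolution succeeded");
    }
}
```

In real Hadoop code the per-path lookup would presumably be `path.getFileSystem(conf)`, the same call the Hive snippet quoted below uses for `srcf` and `destf`.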
> Fail to insert data to a table with remote location which causes by hive encryption check
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-31675
> URL: https://issues.apache.org/jira/browse/SPARK-31675
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 3.0.0, 3.1.0
> Reporter: Kent Yao
> Priority: Major
>
> Before the fix for HIVE-14380 (https://issues.apache.org/jira/browse/HIVE-14380) in Hive 2.2.0, Hive performs an encryption check on the source and destination paths when moving files from the staging directory to the final table directory:
> {code:java}
> // Some comments here
> if (!isSrcLocal) {
>   // For NOT local src file, rename the file
>   if (hdfsEncryptionShim != null && (hdfsEncryptionShim.isPathEncrypted(srcf) || hdfsEncryptionShim.isPathEncrypted(destf))
>       && !hdfsEncryptionShim.arePathsOnSameEncryptionZone(srcf, destf))
>   {
>     LOG.info("Copying source " + srcf + " to " + destf + " because HDFS encryption zones are different.");
>     success = FileUtils.copy(srcf.getFileSystem(conf), srcf, destf.getFileSystem(conf), destf,
>         true, // delete source
>         replace, // overwrite destination
>         conf);
>   } else {
> {code}
> The hdfsEncryptionShim instance holds a global FileSystem instance that belongs to the default file system, so the check fails for any path that belongs to a remote file system.
> For example, consider the following table:
> {code:sql}
> key int NULL
> # Detailed Table Information
> Database bdms_hzyaoqin_test_2
> Table abc
> Owner bdms_hzyaoqin
> Created Time Mon May 11 15:14:15 CST 2020
> Last Access Thu Jan 01 08:00:00 CST 1970
> Created By Spark 2.4.3
> Type MANAGED
> Provider hive
> Table Properties [transient_lastDdlTime=1589181255]
> Location hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat org.apache.hadoop.mapred.TextInputFormat
> OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Storage Properties [serialization.format=1]
> Partition Provider Catalog
> Time taken: 0.224 seconds, Fetched 18 row(s)
> {code}
> The table abc is located on the remote HDFS 'hdfs://cluster2'. When we run the command below via a Spark SQL job whose default file system is 'hdfs://cluster1', it fails with a "Wrong FS" error:
> {code:sql}
> insert into bdms_hzyaoqin_test_2.abc values(1);
> {code}
> {code:java}
> Error in query: java.lang.IllegalArgumentException: Wrong FS: hdfs://cluster2/user/warehouse/bdms_hzyaoqin_test.db/abc/.hive-staging_hive_2020-05-11_17-10-27_123_6306294638950056285-1/-ext-10000/part-00000-badf2a31-ab36-4b60-82a1-0848774e4af5-c000, expected: hdfs://cluster1
> {code}