Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/12/10 13:18:00 UTC
[jira] [Assigned] (SPARK-32208) Spark SQL throw Illegal character exception when load certain abnormal path of HDFS

     [ https://issues.apache.org/jira/browse/SPARK-32208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32208:
------------------------------------

    Assignee:     (was: Apache Spark)

> Spark SQL throw Illegal character exception when load certain abnormal path of HDFS
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-32208
>                 URL: https://issues.apache.org/jira/browse/SPARK-32208
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3, 3.2.0
>            Reporter: chenliang
>            Priority: Major
>
> In the distributed HDFS storage system, spaces and other special characters are allowed in a path:
> {code:java}
> hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
> {code}
> When we load data using any of
> {code:java}
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
> org.apache.spark.sql.hive.orc.OrcFileFormat
> {code}
> an exception may be thrown, as shown below:
> {code:java}
> Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
> at java.net.URI$Parser.fail(URI.java:2848)
> at java.net.URI$Parser.checkChars(URI.java:3021)
> at java.net.URI$Parser.parseHierarchical(URI.java:3105)
> at java.net.URI$Parser.parse(URI.java:3053)
> at java.net.URI.<init>(URI.java:588)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
> at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:252)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:250)
> at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:256)
> ... 10 more
> {code}
> Hadoop's Path class provides several constructors for building a path:
> [https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java|https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java#L132]
> We could fall back to constructing the path from a String rather than from a URI.
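
The difference between the two approaches can be illustrated with java.net.URI alone. The following is a minimal sketch, not the actual Spark or Hadoop code: the class name, the method names, and the shortened sample path are all hypothetical. It assumes that a String-based path is split into scheme, authority, and path before the URI is built, so that the multi-argument URI constructor can quote illegal characters such as the space instead of rejecting them.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathFromStringSketch {
    // Hypothetical sample, shortened from the path in the report.
    static final String RAW =
        "hdfs://ns1/tmp2/test_table=2020-06-17 18%3A00%3A00/part-00000";

    // The single-argument constructor parses the string strictly and
    // rejects the unescaped space with a URISyntaxException.
    static boolean parsesAsUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // The multi-argument constructor quotes illegal characters in the
    // path component (and always quotes '%'), so the literal directory
    // name survives a round trip through getPath().
    static URI fromComponents(String scheme, String authority, String path) {
        try {
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("strict parse ok: " + parsesAsUri(RAW));
        URI u = fromComponents("hdfs", "ns1",
            "/tmp2/test_table=2020-06-17 18%3A00%3A00/part-00000");
        System.out.println("raw path:     " + u.getRawPath());
        System.out.println("decoded path: " + u.getPath());
    }
}
```

In this sketch the decoded path returned by getPath() matches the on-disk name, including the space and the literal %3A sequences, which is the behaviour the String-based fallback is meant to preserve.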