Posted to issues@spark.apache.org by "chenliang (Jira)" <ji...@apache.org> on 2020/12/10 12:43:00 UTC

[jira] [Updated] (SPARK-32208) SparkSQL throw Illegal character exception when load certain abnormal path of HDFS

     [ https://issues.apache.org/jira/browse/SPARK-32208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenliang updated SPARK-32208:
------------------------------
    Affects Version/s: 3.2.0
          Description: 
In HDFS, spaces and other special characters are allowed in a path:
{code:java}
hdfs://ns1/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000

{code}
When we load data using
{code:java}
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.hive.orc.OrcFileFormat {code}
, an exception like the following may be thrown:
{code:java}
Caused by: java.net.URISyntaxException: Illegal character in path at index 136: hdfs://DClusterNmg4/tmp2/hive-staging/hadoop_hive_2020-07-06_17-31-29_139_7042265710400397740-1/-ext-10000/test_table=2020-06-17 18%3A00%3A00/part-00000-84396c4e-ba05-4936-afc7-db46c4251bfa.c000
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:252)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:250)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:256)
... 10 more

{code}
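
The failure can likely be reproduced without Spark or Hadoop, because the top frames of the trace are all in java.net.URI. The sketch below is only an illustration under that assumption: the class name UriParseDemo, the helper illegalCharIndex and the shortened path are hypothetical and not taken from the Spark code base. It passes a path string containing an unescaped space to the single-argument URI constructor and reports the index at which parsing fails.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriParseDemo {
    // Returns the index of the first character that java.net.URI rejects,
    // or -1 if the whole string parses as a URI.
    static int illegalCharIndex(String raw) {
        try {
            new URI(raw); // strict parse: unescaped spaces are not allowed
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened path with a space inside a partition directory name.
        String raw = "hdfs://ns1/tmp/test_table=2020-06-17 18%3A00%3A00/part-00000";
        int index = illegalCharIndex(raw);
        System.out.println("rejected at index " + index
                + ", space is at index " + raw.indexOf(' '));
    }
}
```

The "%3A" sequences are valid escapes and pass the strict parser; it is the raw space that is rejected.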
Hadoop provides several constructors for building a Path:

[https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java|https://github.com/apache/hadoop/blob/master/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Path.java#L132]

We could fall back to constructing the path from a String rather than from a URI.
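
To illustrate why building from a String helps, here is a minimal sketch that uses only java.net.URI. It is not the proposed Spark patch and does not use the Hadoop Path class; the class PathFallbackDemo and the helper fromPathString are hypothetical names. It assumes the input has the form scheme://authority/path, splits it by hand, and uses the multi-argument URI constructor, which quotes illegal characters (including the space and the percent sign) instead of rejecting them. As far as I can tell, the String-based Path constructor in Hadoop assembles its URI from components in a similar way.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathFallbackDemo {
    // Hypothetical helper: treats the input as an unescaped "scheme://authority/path"
    // string and builds the URI from its components rather than parsing it strictly.
    static URI fromPathString(String raw) {
        int schemeEnd = raw.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("expected scheme://authority/path: " + raw);
        }
        int pathStart = raw.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            throw new IllegalArgumentException("missing path component: " + raw);
        }
        String scheme = raw.substring(0, schemeEnd);
        String authority = raw.substring(schemeEnd + 3, pathStart);
        String path = raw.substring(pathStart);
        try {
            // The multi-argument constructor quotes illegal characters and always quotes '%'.
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, shortened path with a space inside a partition directory name.
        URI uri = fromPathString("hdfs://ns1/tmp/test_table=2020-06-17 18%3A00%3A00/part-00000");
        System.out.println(uri.getRawPath()); // escaped form stored in the URI
        System.out.println(uri.getPath());    // decoded form, equal to the original path text
    }
}
```

One consequence of this design is that the input is never interpreted as already escaped: the literal "%3A" in the directory name survives a round trip through getPath(), which matches how the directory is actually named on HDFS.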

 

 

 

> SparkSQL throw  Illegal character exception when load certain abnormal path of HDFS 
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-32208
>                 URL: https://issues.apache.org/jira/browse/SPARK-32208
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3, 3.2.0
>            Reporter: chenliang
>            Priority: Major
>


