Posted to issues@spark.apache.org by "Jorge Machado (Jira)" <ji...@apache.org> on 2020/01/27 10:56:00 UTC

[jira] [Commented] (SPARK-30647) When creating a custom datasource, FileNotFoundException happens

    [ https://issues.apache.org/jira/browse/SPARK-30647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024242#comment-17024242 ] 

Jorge Machado commented on SPARK-30647:
---------------------------------------

I found a way to work around this: I just replace "%20" with " " like this:
{code:java}
(file: PartitionedFile) => {
  val origin = file.filePath.replace("%20", " ")
  // ... open `origin` instead of file.filePath ...
}{code}
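
A more robust variant is to decode the whole URI rather than replacing "%20" alone, since other characters (a literal "%", "#", non-ASCII characters) are percent-encoded as well. A sketch following the pattern the built-in formats (e.g. OrcFileFormat in 2.3.x) use; the surrounding lambda is the same as above:
{code:java}
import java.net.URI

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.PartitionedFile

(file: PartitionedFile) => {
  // file.filePath is URI-encoded; round-tripping it through java.net.URI
  // decodes every escape sequence, and Hadoop's Path accepts a URI directly.
  val origin = new Path(new URI(file.filePath))
  // ... open `origin` instead of the raw string ...
}{code}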

> When creating a custom datasource, FileNotFoundException happens
> -----------------------------------------------------------------
>
>                 Key: SPARK-30647
>                 URL: https://issues.apache.org/jira/browse/SPARK-30647
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: Jorge Machado
>            Priority: Major
>
> Hello, I'm creating a datasource based on FileFormat and DataSourceRegister.
> When I pass a path or a file name that contains a white space, it seems to fail with the error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 213, localhost, executor driver): java.io.FileNotFoundException: File file:somePath/0019_leftImg8%20bit.png does not exist
> It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>   at org.apache.spark.sql.execution.columnar.CachedRDDBuilder$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:125)
>   at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>   at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> {code}
> I'm happy to fix this if someone tells me where I need to look.
> I think the problem is in org.apache.spark.rdd.InputFileBlockHolder:
> {code:java}
> inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
> {code}
>  
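
Regarding the InputFileBlockHolder line quoted above: FileScanRDD hands the URI-encoded string straight through, so a custom reader has to decode it itself. Below is a minimal sketch of a reader function such as a custom FileFormat.buildReader could return against the 2.3.x API; the object and readFunction names are my own, and the record parsing is a placeholder:
{code:java}
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

// Hypothetical helper: only the Path/URI handling is the point here,
// everything format-specific is left as a placeholder.
object CustomReaderSketch {
  def readFunction(hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] =
    (file: PartitionedFile) => {
      // Decode "%20" and friends before touching the filesystem.
      val path = new Path(new URI(file.filePath))
      val fs = path.getFileSystem(hadoopConf)
      val stream = fs.open(path)
      // ... parse `stream` into InternalRow values, closing it when done ...
      Iterator.empty
    }
}{code}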



