Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 03:59:34 UTC

[jira] [Updated] (SPARK-21659) FileStreamSink checks for _spark_metadata even if path has globs

     [ https://issues.apache.org/jira/browse/SPARK-21659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-21659:
---------------------------------
    Labels: bulk-closed  (was: )

> FileStreamSink checks for _spark_metadata even if path has globs
> ----------------------------------------------------------------
>
>                 Key: SPARK-21659
>                 URL: https://issues.apache.org/jira/browse/SPARK-21659
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.2.0
>            Reporter: peay
>            Priority: Minor
>              Labels: bulk-closed
>
> I am using the GCS connector for Hadoop, and reading a DataFrame using {{context.read.format("parquet").load("...")}}.
> When my URI has glob patterns of the form
> {code}
> gs://uri/{a,b,c}
> {code}
> or of the form shown in the trace below, Spark incorrectly treats it as a single file path and produces this rather verbose exception:
> {code}
> java.net.URISyntaxException: Illegal character in path at index xx: gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata
> 	at java.net.URI$Parser.fail(URI.java:2848)
> 	at java.net.URI$Parser.checkChars(URI.java:3021)
> 	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
> 	at java.net.URI$Parser.parse(URI.java:3053)
> 	at java.net.URI.<init>(URI.java:588)
> 	at com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93)
> 	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171)
> 	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421)
> 	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
> 	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
> 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> I am not sure whether the GCS connector deviates from the HCFS standard here, but this makes logs really hard to read for jobs that load many files like this.
> https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39 already has an explicit {{case Seq(singlePath) =>}}, but the name is misleading because {{singlePath}} can contain wildcards. The match could additionally check for non-escaped glob characters, such as
> {code}
> {, }, ?, *
> {code}
> and fall through to the multiple-paths case when any of them are present, where the metadata lookup is skipped.
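The check proposed above can be sketched as follows. This is only an illustration in Python, not the Spark code: the names `has_unescaped_glob` and `should_check_metadata` are hypothetical, the backslash is assumed to be the glob escape character, and the character set is limited to the four characters listed in the report (Hadoop globs recognize others, such as `[` and `]`).

```python
# Glob metacharacters named in the report. This set is an assumption for the
# sketch; Hadoop glob syntax has further metacharacters (e.g. "[" and "]").
GLOB_CHARS = frozenset("{}?*")


def has_unescaped_glob(path, glob_chars=GLOB_CHARS):
    """Return True if path contains a glob character not escaped by a backslash."""
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\":
            # A backslash escapes the next character, so skip both.
            i += 2
            continue
        if ch in glob_chars:
            return True
        i += 1
    return False


def should_check_metadata(paths):
    """Hypothetical decision: probe for _spark_metadata only when there is
    exactly one path and it contains no unescaped glob characters."""
    return len(paths) == 1 and not has_unescaped_glob(paths[0])
```

With this rule, a path such as `gs://uri/{a,b,c}` would be handled like the multiple-paths case, so `_spark_metadata` would never be appended to a path that still contains glob syntax.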
