Posted to issues@spark.apache.org by "peay (JIRA)" <ji...@apache.org> on 2017/08/07 23:25:00 UTC

[jira] [Created] (SPARK-21659) FileStreamSink checks for _spark_metadata even if path has globs

peay created SPARK-21659:
----------------------------

             Summary: FileStreamSink checks for _spark_metadata even if path has globs
                 Key: SPARK-21659
                 URL: https://issues.apache.org/jira/browse/SPARK-21659
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, SQL
    Affects Versions: 2.2.0
            Reporter: peay
            Priority: Minor


I am using the GCS connector for Hadoop and reading a DataFrame using {{context.read.format("parquet").load("...")}}.

When my URI has glob patterns of the form
{code}
gs://uri/{a,b,c}
{code}
or as in the stack trace below, Spark incorrectly assumes that the path refers to a single file and produces this rather verbose exception:

{code}
java.net.URISyntaxException: Illegal character in path at index xx: gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31,2-01,2-02,2-03,2-04}*/_spark_metadata
	at java.net.URI$Parser.fail(URI.java:2848)
	at java.net.URI$Parser.checkChars(URI.java:3021)
	at java.net.URI$Parser.parseHierarchical(URI.java:3105)
	at java.net.URI$Parser.parse(URI.java:3053)
	at java.net.URI.<init>(URI.java:588)
	at com.google.cloud.hadoop.gcsio.LegacyPathCodec.getPath(LegacyPathCodec.java:93)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:171)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:320)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
{code}
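For reference, here is a minimal reproduction sketch (the bucket and path are hypothetical placeholders, and a {{spark}} session is assumed to be in scope):

{code}
// Hypothetical reproduction: loading a glob path routes through
// DataSource.resolveRelation -> FileStreamSink.hasMetadata, which treats
// the glob string as a single literal directory and probes
// <path>/_spark_metadata, triggering the URISyntaxException above on GCS.
val df = spark.read
  .format("parquet")
  .load("gs://bucket-name/path/to/date=2017-0{1-29,1-30,1-31}*")
{code}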

I am not sure whether the GCS connector deviates from HCFS semantics here, but this makes logs very hard to read for jobs that load many globbed paths like this.

https://github.com/apache/spark/blob/3ac60930865209bf804ec6506d9d8b0ddd613157/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L39 already has an explicit {{case Seq(singlePath) =>}}, but the name is misleading because {{singlePath}} can contain wildcards. In addition, the check could look for non-escaped glob characters, such as

{code}
{, }, ?, *
{code}

and fall through to the multiple-paths case when any are present, since the metadata lookup is already skipped there.
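
A minimal sketch of such a check (this is an illustration, not the actual Spark code; the helper name is hypothetical):

{code}
// Hypothetical helper: returns true when the path contains a glob
// metacharacter that is not preceded by a backslash escape.
def hasUnescapedGlob(path: String): Boolean = {
  val globChars = Set('{', '}', '?', '*')
  var escaped = false
  path.exists { c =>
    val isGlob = !escaped && globChars(c)
    escaped = !escaped && c == '\\'
    isGlob
  }
}

// FileStreamSink.hasMetadata could then probe _spark_metadata only for
// genuinely literal single paths, e.g.:
//   case Seq(singlePath) if !hasUnescapedGlob(singlePath) => ...
{code}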


