Posted to issues@spark.apache.org by "Keiichi Hirobe (JIRA)" <ji...@apache.org> on 2018/12/11 13:28:00 UTC

[jira] [Created] (SPARK-26339) Behavior of reading files that start with underscore is confusing

Keiichi Hirobe created SPARK-26339:
--------------------------------------

             Summary: Behavior of reading files that start with underscore is confusing
                 Key: SPARK-26339
                 URL: https://issues.apache.org/jira/browse/SPARK-26339
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Keiichi Hirobe


The behavior of reading files that start with an underscore is as follows:
 # spark.read with no schema throws an exception with a confusing message.
 # spark.read with a user-specified schema succeeds, but the resulting content is empty.

The example files are as follows.
 The same behavior occurred when I read JSON files (see the JSON sketch after the Scala transcript below).
{code:bash}
$ cat test.csv
test1,10
test2,20
$ cp test.csv _test.csv
$ ./bin/spark-shell  --master local[2]
{code}
Spark shell snippet for reproduction:
{code:java}
scala> val df=spark.read.csv("test.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]

scala> df.show()
+-----+---+
|  _c0|_c1|
+-----+---+
|test1| 10|
|test2| 20|
+-----+---+

scala> val df = spark.read.schema("test STRING, number INT").csv("test.csv")
df: org.apache.spark.sql.DataFrame = [test: string, number: int]
scala> df.show()
+-----+------+
| test|number|
+-----+------+
|test1|    10|
|test2|    20|
+-----+------+

scala> val df=spark.read.csv("_test.csv")
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$13(DataSource.scala:185)
  at scala.Option.getOrElse(Option.scala:138)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:185)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:231)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:219)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:625)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:478)
  ... 49 elided

scala> val df=spark.read.schema("test STRING, number INT").csv("_test.csv")
df: org.apache.spark.sql.DataFrame = [test: string, number: int]

scala> df.show()
+----+------+
|test|number|
+----+------+
+----+------+
{code}
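For reference, the JSON case mentioned above behaves the same way. The following is a sketch rather than a verbatim transcript; it assumes a _test.json file containing the same two records, one JSON object per line:
{code:java}
scala> val df = spark.read.json("_test.json")
// fails with the analogous message:
// org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;

scala> val df = spark.read.schema("test STRING, number INT").json("_test.json")
scala> df.show()
// prints only the header row; the DataFrame is empty, as in the CSV case
{code}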
After reading some of the source code, I noticed that Spark cannot read files whose names start with an underscore. (I could not find any documentation about this file name limitation.)
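For reference, the filtering I found lives in org.apache.spark.sql.execution.datasources.InMemoryFileIndex (method shouldFilterOut). A simplified paraphrase of the rule, not a verbatim copy of the source:
{code:java}
// Simplified paraphrase of InMemoryFileIndex.shouldFilterOut (not verbatim):
// file names starting with "_" or "." are treated as hidden and skipped,
// except Parquet summary files such as _metadata and _common_metadata.
def shouldFilterOut(pathName: String): Boolean = {
  val exclude = pathName.startsWith("_") || pathName.startsWith(".")
  val include = pathName.startsWith("_common_metadata") || pathName.startsWith("_metadata")
  exclude && !include
}
{code}
This is why _test.csv is silently dropped from the list of input files before any reading happens.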

I think the behavior above is problematic, especially in the user-specified schema case.

I suggest throwing an exception with the message "Path does not exist" in both cases.
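A rough sketch of the kind of check I have in mind; the helper name and placement are hypothetical, not actual Spark internals (in Spark itself this would throw org.apache.spark.sql.AnalysisException from inside the sql package, the same exception a genuinely missing path produces):
{code:java}
// Hypothetical sketch, not actual Spark code: the user-supplied paths matched
// files, but none remain after the hidden-file filter runs, so fail fast with
// the same message a missing path produces, instead of failing schema
// inference (case 1) or silently returning an empty DataFrame (case 2).
def verifyPathsAfterFiltering(requested: Seq[String], remaining: Seq[String]): Unit = {
  if (requested.nonEmpty && remaining.isEmpty) {
    // placeholder exception type to keep this sketch self-contained
    throw new RuntimeException("Path does not exist: " + requested.mkString(", "))
  }
}
{code}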
