Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/03/16 13:24:41 UTC

[jira] [Resolved] (SPARK-19381) spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames beginning with '_'

     [ https://issues.apache.org/jira/browse/SPARK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-19381.
----------------------------------
    Resolution: Cannot Reproduce

{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
>>> df.write.parquet("debug.parquet")
>>> df.write.parquet("_debug.parquet")
>>> df = spark.read.parquet("debug.parquet")
>>> df = spark.read.parquet("_debug.parquet")
{code}

This seems fixed in the current master. I am resolving this because it cannot be reproduced as reported against the current master. It would be nice if someone could identify the JIRA that fixed it and backport the change if applicable.
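
If it helps anyone still on an affected version: my understanding is that the file-source listing treats any path whose name starts with '_' or '.' as a hidden/metadata file (like _SUCCESS), so a directory named "_debug.parquet" ends up contributing no readable data files and schema inference fails with the unrelated-looking message above. Below is a minimal PySpark sketch of a workaround, assuming a local filesystem and illustrative path names only:

{code}
import os

# Write to a directory whose name does not start with an underscore.
df = spark.createDataFrame([(i, i ** 2) for i in range(1, 6)], ["single", "double"])
df.write.mode("overwrite").parquet("debug.parquet")

# If the data already sits under an underscore-prefixed directory, rename it
# before reading (for HDFS/S3, use the corresponding FileSystem API instead).
if os.path.isdir("_debug.parquet"):
    os.rename("_debug.parquet", "debug_renamed.parquet")

df2 = spark.read.parquet("debug_renamed.parquet")
{code}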

> spark 2.1.0 raises unrelated (unhelpful) error for parquet filenames beginning with '_'
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-19381
>                 URL: https://issues.apache.org/jira/browse/SPARK-19381
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Paul Pearce
>            Priority: Minor
>
> Under Spark 2.1.0, if you attempt to read a parquet file whose filename begins with '_', the error returned is
> "Unable to infer schema for Parquet. It must be specified manually."
> The bug is not the inability to read the file, but rather that the error message is unrelated to the actual problem. Below is the generation of the parquet files under Spark 2.0.0 and the attempted reading of them under Spark 2.1.0.
> Generation:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
>       /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import Row
> >>> df = spark.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))
> >>> df.write.parquet("debug.parquet")
> >>> df.write.parquet("_debug.parquet")
> {code}
> Reading:
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.6 (default, Oct 26 2016 20:30:19)
> SparkSession available as 'spark'.
> >>> df = spark.read.parquet("debug.parquet")
> >>> df = spark.read.parquet("_debug.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/readwriter.py", line 274, in parquet
>     return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
>   File "/opt/apache/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql/utils.py", line 69, in deco
>     raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
> I only realized the source of the problem when reading https://issues.apache.org/jira/browse/SPARK-16975, which describes a similar problem, but with column names.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org