Posted to issues@spark.apache.org by "Stuart Reynolds (JIRA)" <ji...@apache.org> on 2017/07/12 20:01:00 UTC
[jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
[ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stuart Reynolds updated SPARK-21392:
------------------------------------
Description:
The following boring code works
{code:python}
response = "mi_or_chd_5"
colname = "f_1000"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite")
col = sqlc.sql("""select eid,{colname} as {colname}
from baseline_denull
where {colname} IS NOT NULL""".format(colname=colname))
col.write.parquet(colname, mode="overwrite")
>>> print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
>>> print col.schema
StructType(List(StructField(eid,IntegerType,true),StructField(f_1000,DoubleType,true)))
{code}
But then,
{code:python}
outcome2 = sqlc.read.parquet(response) # fail
col2 = sqlc.read.parquet(colname) # fail
{code}
fails with:
{code}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}
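For what it's worth, this error commonly appears when the read path contains no data files at all (for example, only a _SUCCESS marker, or the read path not matching where the write actually landed). A quick sanity check before calling read.parquet, assuming local-filesystem paths; the `has_parquet_parts` helper is mine, not part of the report:

```python
import os

def has_parquet_parts(path):
    """Return True if the directory contains at least one part file
    that Spark could read a Parquet schema from."""
    if not os.path.isdir(path):
        return False
    return any(name.startswith("part-") and not name.endswith(".crc")
               for name in os.listdir(path))

# e.g. check has_parquet_parts("mi_or_chd_5") before sqlc.read.parquet(response)
```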
in
{code}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}
The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
Seems related to SPARK-16975 (https://issues.apache.org/jira/browse/SPARK-16975), which claims the problem was fixed in 2.0.1 and 2.1.0. (The current bug is against 2.1.1.)
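On the "self-describing" point: Parquet stores its schema in the file footer, which is bracketed by the 4-byte magic `PAR1` at both ends of the file. A rough plain-Python check (no Spark required) that a part file at least has a footer to read from; the `looks_like_parquet` helper name is mine:

```python
import os

def looks_like_parquet(filename):
    """Check for the 4-byte magic 'PAR1' at both ends of the file;
    the trailing magic brackets the footer where Parquet stores its schema."""
    # Smallest possible file: leading magic + footer length + trailing magic.
    if os.path.getsize(filename) < 12:
        return False
    with open(filename, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

If the part files pass this check, the schema was written; the failure would then be in how Spark resolves the path, not in the files themselves.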
> Unable to infer schema when loading Parquet file
> ------------------------------------------------
>
> Key: SPARK-21392
> URL: https://issues.apache.org/jira/browse/SPARK-21392
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.1
> Environment: Spark 2.1.1. python 2.7.6
> Reporter: Stuart Reynolds
> Labels: parquet, pyspark
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org