Posted to issues@spark.apache.org by "Rafal Wojdyla (Jira)" <ji...@apache.org> on 2021/05/12 13:56:00 UTC
[jira] [Created] (SPARK-35386) parquet read with schema should fail on non-existing columns
Rafal Wojdyla created SPARK-35386:
-------------------------------------
Summary: parquet read with schema should fail on non-existing columns
Key: SPARK-35386
URL: https://issues.apache.org/jira/browse/SPARK-35386
Project: Spark
Issue Type: Bug
Components: Input/Output, PySpark
Affects Versions: 3.0.1
Reporter: Rafal Wojdyla
When a read schema is specified, as a user I would prefer that Spark fail on missing columns.
{code:python}
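from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark: SparkSession = ...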
spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]
# let's specify a custom read_schema, with **non nullable** col3 (which is not present):
read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))
df.count()
# 0
{code}
Is this a feature or a bug? In this case there is just a single Parquet file. I have also tried {{option("mergeSchema", "true")}}, which does not help.
A similar read pattern would fail in pandas (and likely Dask).
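Until the behaviour is settled, one possible user-side guard is to compare the column names of the requested schema against the columns actually present in the file before reading with that schema. The following is only a minimal sketch: {{missing_columns}} and {{check_read_schema}} are hypothetical helper names, not part of Spark, and the comparison is case-sensitive and limited to top-level columns, which is a simplification.

```python
from typing import Iterable, List


def missing_columns(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested column names that are absent from `available`, in order."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


def check_read_schema(requested: Iterable[str], available: Iterable[str]) -> None:
    """Raise ValueError if any requested column is not present in the file."""
    missing = missing_columns(requested, available)
    if missing:
        raise ValueError(f"columns not found in Parquet file: {missing}")


# Hypothetical usage with the names from the snippet above (requires a live session):
#   file_cols = spark.read.parquet("/tmp/data.snappy.parquet").columns
#   check_read_schema(read_schema.fieldNames(), file_cols)   # raises for col3
#   df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
```

This costs an extra schema inference on the file, and it does not address the nullability change shown above. It only turns the silent all-missing read into an explicit error.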