Posted to issues@spark.apache.org by "Rafal Wojdyla (Jira)" <ji...@apache.org> on 2021/05/12 13:56:00 UTC

[jira] [Created] (SPARK-35386) parquet read with schema should fail on non-existing columns

Rafal Wojdyla created SPARK-35386:
-------------------------------------

             Summary: parquet read with schema should fail on non-existing columns
                 Key: SPARK-35386
                 URL: https://issues.apache.org/jira/browse/SPARK-35386
             Project: Spark
          Issue Type: Bug
          Components: Input/Output, PySpark
    Affects Versions: 3.0.1
            Reporter: Rafal Wojdyla


When a read schema is specified, as a user I would prefer/expect Spark to fail on columns that do not exist in the file.

{code:python}
from pyspark.sql.types import StructType, StructField, DoubleType

spark: SparkSession = ...

spark.read.parquet("/tmp/data.snappy.parquet")
# inferred schema, includes 3 columns: col1, col2, new_col
# DataFrame[col1: bigint, col2: bigint, new_col: bigint]

# let's specify a custom read_schema, with a **non-nullable** col3 (which is not present in the file):
read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])

df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")

df.schema
# we get a DataFrame with **nullable** col3:
# StructType(List(StructField(col3,DoubleType,true)))

df.count()
# 0
{code}

Is this intended behavior or a bug? In this case there is just a single parquet file; I have also tried {{option("mergeSchema", "true")}}, which does not help.
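
For completeness, a possible client-side workaround (not a Spark feature) is to infer the file schema first and fail explicitly when a requested column is missing. A minimal sketch, assuming the single-file path from above:

{code:python}
from pyspark.sql.types import StructType, StructField, DoubleType

path = "/tmp/data.snappy.parquet"
read_schema = StructType([StructField("col3", DoubleType(), False)])

# infer the actual columns present in the file
actual_cols = set(spark.read.parquet(path).columns)

# fail loudly instead of silently getting an empty, all-nullable result
missing = [f.name for f in read_schema.fields if f.name not in actual_cols]
if missing:
    raise ValueError(f"Columns not present in parquet file: {missing}")

df = spark.read.schema(read_schema).parquet(path)
{code}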

A similar read pattern would fail in pandas (and likely in dask); see the sketch below.
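
A rough pandas equivalent (assuming the pyarrow engine; the exact exception type may vary by engine/version) raises instead of silently returning an empty result:

{code:python}
import pandas as pd

try:
    # selecting a column that does not exist in the file
    pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
except Exception as e:  # exact exception depends on the engine/version
    print(f"read failed as expected: {e}")
{code}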


