Posted to issues@spark.apache.org by "Rafal Wojdyla (Jira)" <ji...@apache.org> on 2021/05/17 02:00:08 UTC

[jira] [Comment Edited] (SPARK-35386) parquet read with schema should fail on non-existing columns

    [ https://issues.apache.org/jira/browse/SPARK-35386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345821#comment-17345821 ] 

Rafal Wojdyla edited comment on SPARK-35386 at 5/17/21, 2:00 AM:
-----------------------------------------------------------------

{quote}
IIRC, it's not documented. We might think about documenting this behaviour.
{quote}

I don't really have the context for this, and I also consider the current behaviour to be a bug.

{quote}
You could clip the schema from the file and user specified schema, and compare by recursively traversing.
{quote}

I mean, I certainly could; I'm just thinking about what the best user experience for this would be. Assuming a post-read schema comparison:

{code:python}
my_read_schema = ...
df = spark.read.schema(my_read_schema).parquet(...)
# custom code to compare df.schema to my_read_schema, it would:
# ignore metadata, and fail on nullable -> required "promotion"
# but for example allow required -> nullable promotion
{code}
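
For illustration, a rough sketch of what that comparison could look like (a hypothetical {{compare_schemas}} helper; it ignores metadata and nested structs, and only handles top-level fields):

{code:python}
from pyspark.sql.types import StructType

def compare_schemas(actual: StructType, expected: StructType) -> None:
    # rough sketch: ignore metadata, allow required -> nullable promotion,
    # fail when a column declared as required is missing or came back nullable
    actual_fields = {f.name: f for f in actual.fields}
    for field in expected.fields:
        found = actual_fields.get(field.name)
        if found is None:
            raise ValueError(f"missing column: {field.name}")
        if not field.nullable and found.nullable:
            raise ValueError(f"column {field.name} declared required, read back as nullable")

compare_schemas(df.schema, my_read_schema)
{code}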

This is obviously not a great user experience, and likely won't handle all corner cases.

I think as a user I would prefer if:
 * if the user-specified schema has a {{nullable=False}}/{{required}} column that doesn't exist in the data, the read should fail
 * if the user-specified schema has a {{nullable=True}} column that doesn't exist in the data, the read should create a new "empty" column

This allows for schema evolution, where you can promote {{required}} to {{nullable}}. It also makes the read operation more deterministic and avoids extra, complicated schema comparison code.
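
To make the proposal concrete, a rough sketch of the intended behaviour, expressed as a pre-read check against the parquet footer via pyarrow (a hypothetical {{validate_read_schema}} helper, not an existing Spark API; handles a single file only):

{code:python}
import pyarrow.parquet as pq
from pyspark.sql.types import StructType

def validate_read_schema(path: str, read_schema: StructType) -> None:
    # columns actually present in the parquet file footer
    file_columns = set(pq.read_schema(path).names)
    for field in read_schema.fields:
        if field.name in file_columns:
            continue
        if not field.nullable:
            # required column missing from the data -> fail before reading
            raise ValueError(f"required column {field.name} not found in {path}")
        # nullable column missing from the data -> fine, read back as an "empty" column
{code}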

[~hyukjin.kwon] wdyt?



> parquet read with schema should fail on non-existing columns
> ------------------------------------------------------------
>
>                 Key: SPARK-35386
>                 URL: https://issues.apache.org/jira/browse/SPARK-35386
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> When a read schema is specified, as a user I would prefer that Spark fail on missing columns.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DoubleType, StructField, StructType
> spark: SparkSession = ...
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema, includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
> # let's specify a custom read_schema, with **non nullable** col3 (which is not present):
> read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file; I have also tried {{option("mergeSchema", "true")}}, which doesn't help.
> A similar read pattern would fail in pandas (and likely Dask).
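> For comparison, a quick pandas sketch (assuming the pyarrow engine; the exact exception depends on the engine and version):
> {code:python}
> import pandas as pd
> # the closest analogue: selecting a column that doesn't exist in the file
> # raises instead of silently returning an empty column
> pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
> {code}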



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org