Posted to dev@spark.apache.org by Jason White <ja...@shopify.com> on 2017/03/20 21:30:10 UTC

Why are DataFrames always read with nullable=True?

If I create a DataFrame in Spark with non-nullable columns, and then save
it to disk as a Parquet file, the columns are properly marked as
non-nullable; I confirmed this using parquet-tools. Then, when loading it
back, Spark forces nullable back to True.
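
For concreteness, here is a minimal sketch of what I'm doing (the path and
column names are illustrative, not my actual job):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nullable-demo")
  .getOrCreate()

// Schema with both columns explicitly marked non-nullable.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = false)))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))), schema)
df.printSchema()   // both fields: nullable = false

df.write.mode("overwrite").parquet("/tmp/nullable-demo")

// parquet-tools shows the columns as "required" in the file footer,
// but the reloaded schema reports nullable = true for every column.
spark.read.parquet("/tmp/nullable-demo").printSchema()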

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` part, Spark behaves exactly as I'd like by
default, reading the data using either the schema in the Parquet file or
the one provided by me.
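
For context, `asNullable` is an internal method on Spark's DataType; the
linked line effectively relaxes whatever schema was inferred or supplied so
that every field becomes nullable. A rough user-level sketch of that
transformation (top level only; the real method also recurses into nested
struct/array/map types) would be something like:

import org.apache.spark.sql.types.{StructField, StructType}

// Rough equivalent of what the linked `.asNullable` call does to a schema.
def relaxToNullable(schema: StructType): StructType =
  StructType(schema.fields.map(f => f.copy(nullable = true)))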

This particular line of code goes back a year now, and I've seen a variety
of discussions about this issue, in particular with Michael here:
https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those
seemed to be discussing writing, not reading, though, and writing is already
supported now.

Is this functionality still desirable? Is it potentially not applicable for
all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to
pass an option to the DataFrameReader to disable this functionality?
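
To make that last question concrete, I'm imagining something like the
following at the call site, given an existing SparkSession `spark`; the
option name is invented purely for illustration and does not exist today:

// Hypothetical opt-out; "respectNullability" is not a real Spark option.
val df = spark.read
  .option("respectNullability", "true")
  .parquet("/path/to/data")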





Re: Why are DataFrames always read with nullable=True?

Posted by Marek Novotny <mn...@gmail.com>.
Hi,
I would like to ask whether there is still a plan to solve this problem
with nullability when reading data from Parquet files.

I've noticed that the related JIRA ticket SPARK-19950
(https://issues.apache.org/jira/browse/SPARK-19950) is still in progress,
and that PR #17293 (https://github.com/apache/spark/pull/17293) was closed
without being merged half a year ago.

I'm more than keen to help find a solution to the problem, but I'm missing
the broader context. Are there any blockers? I got the impression from the
PR comments and related links that it shouldn't be difficult to fix.

Thanks,
Marek Novotny





Re: Why are DataFrames always read with nullable=True?

Posted by Jason White <ja...@shopify.com>.
Thanks for pointing to those JIRA tickets; I hadn't seen them. It's
encouraging that they are recent. I hope we can find a solution there.





Re: Why are DataFrames always read with nullable=True?

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

Have you checked the related JIRA, e.g.
https://issues.apache.org/jira/browse/SPARK-19950?
If you have any asks or requests, it would be better to raise them there.

Thanks!

// maropu




-- 
---
Takeshi Yamamuro

Re: Why are DataFrames always read with nullable=True?

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Hi,
Regarding the read path for nullability, the idea under consideration is to
add a data cleaning step, as Xiao said at
https://www.mail-archive.com/user@spark.apache.org/msg39233.html.

Here is a PR, https://github.com/apache/spark/pull/17293, that adds the
data cleaning step and throws an exception if a null exists in a
non-nullable column. Any comments are appreciated.
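
For readers of this thread, a user-level approximation of such a cleaning
step looks roughly like the following (this is only a sketch of the idea,
not the actual code in the PR):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: fail fast if any column declared non-nullable contains nulls.
def assertNoNullsInNonNullableColumns(df: DataFrame): Unit = {
  df.schema.fields.filterNot(_.nullable).foreach { field =>
    val nullCount = df.filter(col(field.name).isNull).count()
    if (nullCount > 0) {
      throw new IllegalStateException(
        s"Column '${field.name}' is declared non-nullable but contains " +
          s"$nullCount null value(s)")
    }
  }
}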

Kazuaki Ishizaki


