Posted to dev@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2018/07/16 10:43:06 UTC

JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?

Hi,

I think there is an inconsistency in how DataFrameReader.jdbc deals with a
user-defined schema: it asserts that no user-specified schema has been set
[1][2], yet it allows setting one through the customSchema option [3]. Why is
that so? Was this simply overlooked, or is there a reason for it?
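
Here is a quick sketch of what I mean (assuming a reachable PostgreSQL
database at a made-up URL and a hypothetical "people" table; the exception
message is quoted from memory):

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val url = "jdbc:postgresql://localhost/demo" // hypothetical
val props = new Properties()

// 1) A user-specified schema is rejected by assertNoSpecifiedSchema:
val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
spark.read.schema(schema).jdbc(url, "people", props)
// => org.apache.spark.sql.AnalysisException: User specified schema not supported with `jdbc`

// 2) ...yet the customSchema option is accepted and overrides the types of
//    the listed columns:
spark.read
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .jdbc(url, "people", props)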

I think assertNoSpecifiedSchema should be removed from DataFrameReader.jdbc
and support for DataFrameReader.schema should be added for jdbc (with the
customSchema option marked as deprecated and removed in 2.4 or 3.0).

Should I file an issue in Spark JIRA and do the changes? WDYT?

[1]
https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
[2]
https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
[3]
https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Joseph,

Thanks for your explanation. It makes a lot of sense, and I found more detail in
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases.

With that, and after reviewing the code, the customSchema option simply
overrides the data types of the fields in a relation's schema [1][2]. I think
the option's name should include the word "override" to convey its exact
meaning, shouldn't it?
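
In other words, something along these lines (just a sketch; the URL, table
and inferred types are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val inferred = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")
  .option("dbtable", "people")
  .load()
inferred.printSchema()
// e.g. id: decimal(38,18), name: string -- whatever the dialect infers

val overridden = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/demo")
  .option("dbtable", "people")
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()
overridden.printSchema()
// id: decimal(38,0), name: string -- only the data types change, never the columns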

With that said, I think the description of the customSchema option may be
slightly incorrect. For example, it says:

"The custom schema to use for reading data from JDBC connectors"

Although the option is indeed used for reading, it merely overrides the data
types, and the fields it lists do not have to match the table's columns at
all; unmatched entries make no difference. Is that correct?

The word "type" only appears in the following sentence:

"You can also specify partial fields, and the others use the default type
mapping."

But that raises another question: what is "the default type mapping"? That was
one of my questions when I first found the option.
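
As far as I can tell from the code, the default mapping is the
JDBC-type-to-Catalyst-type mapping in JdbcUtils.getCatalystType, with the
registered JdbcDialect getting the first say. A rough sketch of how to poke at
it (developer API, names from the 2.3.x sources; the output is illustrative only):

import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types.MetadataBuilder

// The dialect registered for the connection URL may override the mapping...
val dialect = JdbcDialects.get("jdbc:postgresql://localhost/demo")

// ...and returns None to fall back to the built-in default in JdbcUtils.getCatalystType.
val mapped = dialect.getCatalystType(Types.NUMERIC, "numeric", 38, new MetadataBuilder())
println(mapped) // Some(dataType) if the dialect overrides it, None for the default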

What do you think about the following description of the customSchema
option? You're welcome to make further changes if needed.

====
customSchema - Specifies custom data types for the read schema (used at load
time).

customSchema is a comma-separated list of field definitions with column names
and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0),
name STRING.

customSchema defines the data types of the listed columns, overriding the data
types inferred from the table schema, and follows this pattern:

colTypeList
    : colType (',' colType)*
    ;

colType
    : identifier dataType (COMMENT STRING)?
    ;

dataType
    : complex=ARRAY '<' dataType '>'                            #complexDataType
    | complex=MAP '<' dataType ',' dataType '>'                 #complexDataType
    | complex=STRUCT ('<' complexColTypeList? '>' | NEQ)        #complexDataType
    | identifier ('(' INTEGER_VALUE (',' INTEGER_VALUE)* ')')?  #primitiveDataType
    ;
====
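
FWIW the value follows the same DDL schema format that StructType.fromDDL
accepts (available since 2.3 if I remember correctly), so a quick sanity check
of a candidate value could look like this (output roughly):

import org.apache.spark.sql.types.StructType

val customSchema = StructType.fromDDL("id DECIMAL(38, 0), name STRING, tags ARRAY<STRING>")
customSchema.printTreeString()
// root
//  |-- id: decimal(38,0) (nullable = true)
//  |-- name: string (nullable = true)
//  |-- tags: array (nullable = true)
//  |    |-- element: string (containsNull = true)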

Should I file a JIRA task for this?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala?utf8=%E2%9C%93#L116-L118
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L785-L788

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jul 16, 2018 at 4:27 PM, Joseph Torres <joseph.torres@databricks.com> wrote:

> I guess the question is partly about the semantics of
> DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will
> definitely have exactly this schema", that doesn't quite match the behavior
> of the customSchema option. If it's only meant to be an arbitrary schema
> input which the source can interpret however it wants, it'd be fine.
>
> The second semantic is IMO more useful, so I'm in favor here.
>
> On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I think there is an inconsistency in how DataFrameReader.jdbc deals with a
>> user-defined schema: it asserts that no user-specified schema has been set
>> [1][2], yet it allows setting one through the customSchema option [3]. Why is
>> that so? Was this simply overlooked, or is there a reason for it?
>>
>> I think assertNoSpecifiedSchema should be removed from
>> DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc should
>> be added (with the customSchema option marked as deprecated to be removed
>> in 2.4 or 3.0).
>>
>> Should I file an issue in Spark JIRA and do the changes? WDYT?
>>
>> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
>> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
>> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>>
>
>

Re: JDBC Data Source and customSchema option but DataFrameReader.assertNoSpecifiedSchema?

Posted by Joseph Torres <jo...@databricks.com>.
I guess the question is partly about the semantics of
DataFrameReader.schema. If it's supposed to mean "the loaded dataframe will
definitely have exactly this schema", that doesn't quite match the behavior
of the customSchema option. If it's only meant to be an arbitrary schema
input which the source can interpret however it wants, it'd be fine.

The second semantic is IMO more useful, so I'm in favor here.
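
To make the distinction concrete, a rough sketch (the file-source half is
today's behavior; the jdbc half is what the proposal would allow, and with
2.3.x it still throws):

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val hints = StructType(Seq(
  StructField("id", DecimalType(38, 0)),
  StructField("name", StringType)))

// Semantic 1 (file sources): the loaded DataFrame has exactly this schema;
// nothing is inferred from the data.
val fromCsv = spark.read.schema(hints).option("header", "true").csv("/tmp/people.csv") // hypothetical path

// Semantic 2 (the proposal for jdbc): the schema is just an input the source
// may interpret as it sees fit, e.g. as per-column type overrides like
// customSchema does today.
val fromJdbc = spark.read
  .schema(hints)
  .jdbc("jdbc:postgresql://localhost/demo", "people", new Properties())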

On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I think there is an inconsistency in how DataFrameReader.jdbc deals with a
> user-defined schema: it asserts that no user-specified schema has been set
> [1][2], yet it allows setting one through the customSchema option [3]. Why is
> that so? Was this simply overlooked, or is there a reason for it?
>
> I think assertNoSpecifiedSchema should be removed from
> DataFrameReader.jdbc and support for DataFrameReader.schema for jdbc should
> be added (with the customSchema option marked as deprecated to be removed
> in 2.4 or 3.0).
>
> Should I file an issue in Spark JIRA and do the changes? WDYT?
>
> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>