Posted to user@spark.apache.org by "Young, Matthew T" <ma...@intel.com> on 2015/07/17 19:15:05 UTC

Spark and SQL Server

Hello,

I am testing Spark's interoperability with SQL Server over JDBC, using Microsoft's 4.2 JDBC driver. Reading from the database works fine, but I have run into a couple of issues writing back. In Scala 2.10 I can write back to the database except for a couple of column types.


1.      When I read a DataFrame from a table that contains a datetime column, it comes in as a java.sql.Timestamp object in the DataFrame. That is fine for Spark's purposes, but when I write it back with df.write.jdbc(…) the write fails because Spark tries to create the column as TIMESTAMP, which is not a date/time type in T-SQL. I believe it should be writing DATETIME (or DATETIME2), but I'm not sure how to tell Spark that.



2.      A related problem occurs when I try to write a java.lang.Boolean to the database: the write fails because Spark tries to specify a width for the BIT type, which is illegal in SQL Server (error message: Cannot specify a column width on data type bit). Do I need to edit the Spark source to fix this behavior, or is there a configuration option somewhere that I am not aware of?


When I attempt to write back to SQL Server from an IPython notebook, py4j seems unable to convert a Python dict into a Java HashMap, which is necessary for passing the connection properties. I've documented the details of this problem, with code examples, here: <http://stackoverflow.com/questions/31417653/java-util-hashmap-missing-in-pyspark-session>. Any advice would be appreciated.
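
For reference, here is roughly the shape of the failing call. The server, table name, and credentials below are placeholders, and sqlContext is the one provided by the PySpark shell:

# Placeholder connection details -- not the real server or credentials.
jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"
props = {"user": "spark_user", "password": "placeholder"}

# A trivial DataFrame just to exercise the write path.
df = sqlContext.createDataFrame([("a", 1)], ["name", "value"])

# Handing over the connection properties is where things go wrong for me
# in PySpark; they never seem to make it across to the JVM side.
df.write.jdbc(jdbc_url, "dbo.py_test", props)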

Thank you for your time,

-- Matthew Young

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


RE: Spark and SQL Server

Posted by "Young, Matthew T" <ma...@intel.com>.
When attempting to write a DataFrame that contains java.sql.Timestamp or java.lang.Boolean values to SQL Server, I get errors about the generated SQL being invalid. Specifically, java.sql.Timestamp values are written out as the TIMESTAMP type, which is not appropriate for date/time storage in T-SQL; they should be DATETIME or DATETIME2.

java.lang.Boolean fails because Spark tries to specify a width for the BIT field, which SQL Server doesn't allow.

However, I can write strings/varchars and ints without any issues.
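
In case it is useful for the archive, here is a minimal PySpark sketch of what I am seeing. The server, table name, and credentials are placeholders, sqlContext is the one from the shell, and as far as I can tell the same type mapping applies when the write is driven from Scala:

from datetime import datetime
from pyspark.sql import Row

# Placeholder connection details -- not the real server or credentials.
jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"
props = {"user": "spark_user", "password": "placeholder"}

# A tiny DataFrame with the two problem types: TimestampType and BooleanType.
df = sqlContext.createDataFrame(
    [Row(event_time=datetime(2015, 7, 17, 10, 0, 0), is_active=True)])

# Strings and ints round-trip fine, but this write fails when Spark creates
# the table: the TimestampType column comes out as TIMESTAMP and the
# BooleanType column as BIT with a width, neither of which SQL Server accepts.
df.write.jdbc(url=jdbc_url, table="dbo.type_test", mode="overwrite",
              properties=props)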


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark and SQL Server

Posted by Davies Liu <da...@databricks.com>.
Sorry for the confusion. What are the other issues?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


RE: Spark and SQL Server

Posted by "Young, Matthew T" <ma...@intel.com>.
Thanks Davies, that resolves the issue with Python.

I was using the Java/Scala DataFrameWriter documentation <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrameWriter.html> and assuming that it was the same for PySpark <http://spark.apache.org/docs/1.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter>. I will keep this distinction in mind going forward.

I guess we have to wait for Microsoft to release an SQL Server connector for Spark to resolve the other issues.

Cheers,

-- Matthew Young

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark and SQL Server

Posted by Davies Liu <da...@databricks.com>.
I think there is a mistake in how you call jdbc(); the signature is:

jdbc(self, url, table, mode, properties)

You had passed properties as the third parameter, which is where mode belongs.
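
In other words, something like this (URL, table, and credentials here are only placeholders):

jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"
props = {"user": "spark_user", "password": "placeholder"}
df = sqlContext.createDataFrame([("a", 1)], ["name", "value"])  # stand-in for your DataFrame

# Not this -- the properties dict ends up in the mode slot:
# df.write.jdbc(jdbc_url, "dbo.py_test", props)

# This -- mode third, properties fourth (or pass both by keyword):
df.write.jdbc(jdbc_url, "dbo.py_test", "append", props)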

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org