Posted to issues@spark.apache.org by "Santiago M. Mola (JIRA)" <ji...@apache.org> on 2015/06/15 15:00:02 UTC

[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names

    [ https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14585883#comment-14585883 ] 

Santiago M. Mola commented on SPARK-6666:
-----------------------------------------

I opened SPARK-8377 to track the general case, since I have this problem with other data sources, not just JDBC.

> org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
> ---------------------------------------------------------------------
>
>                 Key: SPARK-6666
>                 URL: https://issues.apache.org/jira/browse/SPARK-6666
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment:  
>            Reporter: John Ferguson
>            Priority: Critical
>
> Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, JDBCRDD appears to read the names into the schema correctly, but it does not quote them in the SQL it generates when they are not compliant:
> org.apache.spark.sql.jdbc.JDBCRDD
> ....
> private val columnList: String = {
>   val sb = new StringBuilder()
>   columns.foreach(x => sb.append(",").append(x))
>   if (sb.length == 0) "1" else sb.substring(1)
> }
> If you see value in this, I would take a shot at adding the quoting (escaping) of column names here (a sketch follows the quoted description below). If quoting is not added, some databases, PostgreSQL for example, will simply fold all unquoted names to lowercase when parsing the query. As the TL;DR below shows, that means the results won't match the schema I am given.
> TL;DR:
> -------- 
> I am able to connect to a Postgres database in the shell (with driver referenced):
>    val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")
> In fact when I run:
>    jdbcDf.registerTempTable("sp500")
>    val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500")
> and
>    val avgEPSProg = jdbcDf.agg(avg(jdbcDf.col("Earnings/Share")))
> the values come back as expected. However, if I try:
>    jdbcDf.show
> Or if I try:
>    
>    val all = sqlContext.sql("SELECT * FROM sp500")
>    all.show
> I get errors about column names not being found; the error message even shows the column names all lowercased. For now I will change my schema to be more restrictive. Right now it is, per a Stack Overflow poster, not ANSI compliant: it relies on double-quoted ("") identifiers, which PostgreSQL, MySQL and SQL Server all accept. BTW, our users are giving us tables like this because the various tools they already use support non-compliant names; in fact, this is mild compared to what we've had to support. (The lowercase folding is illustrated after the quoted description below.)
> Currently the schema in question uses mixed-case, quoted names with special characters and spaces:
> CREATE TABLE sp500
> (
>   "Symbol" text,
>   "Name" text,
>   "Sector" text,
>   "Price" double precision,
>   "Dividend Yield" double precision,
>   "Price/Earnings" double precision,
>   "Earnings/Share" double precision,
>   "Book Value" double precision,
>   "52 week low" double precision,
>   "52 week high" double precision,
>   "Market Cap" double precision,
>   "EBITDA" double precision,
>   "Price/Sales" double precision,
>   "Price/Book" double precision,
>   "SEC Filings" text
> )
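
For the JDBC case, here is a minimal sketch of the quoting proposed in the description, assuming a hypothetical quoteIdentifier helper that wraps each name in double quotes (PostgreSQL's identifier quote character). This is a sketch against the snippet quoted above, not Spark's actual implementation:

    // Hypothetical helper: quote an identifier for PostgreSQL. Embedded
    // double quotes are escaped by doubling them, per the SQL standard.
    def quoteIdentifier(name: String): String =
      "\"" + name.replace("\"", "\"\"") + "\""

    // The columnList from JDBCRDD, rewritten to quote each column name so
    // that mixed-case names like "Earnings/Share" survive the round trip.
    private val columnList: String = {
      val sb = new StringBuilder()
      columns.foreach(x => sb.append(",").append(quoteIdentifier(x)))
      if (sb.length == 0) "1" else sb.substring(1)
    }

Note that MySQL quotes identifiers with backticks by default, so a real fix would presumably dispatch on the JDBC dialect rather than hard-coding double quotes.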
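
To illustrate the lowercase folding against the sp500 schema above (a hypothetical demonstration, reusing the quoteIdentifier helper from the sketch):

    val columns = Array("Symbol", "Earnings/Share", "52 week low")
    // Unquoted: PostgreSQL folds the names to lowercase, and
    // "Earnings/Share" and "52 week low" are not even valid unquoted
    // identifiers, so this query cannot match the declared schema.
    println("SELECT " + columns.mkString(",") + " FROM sp500")
    // Quoted: the names match the CREATE TABLE declaration exactly.
    println("SELECT " + columns.map(quoteIdentifier).mkString(",") + " FROM sp500")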


