Posted to dev@spark.apache.org by Richard Hillegas <rh...@us.ibm.com> on 2015/09/22 19:53:23 UTC

column identifiers in Spark SQL


I am puzzled by the behavior of column identifiers in Spark SQL. I don't
find any guidance in the "Spark SQL and DataFrame Guide" at
http://spark.apache.org/docs/latest/sql-programming-guide.html. I am seeing
odd behavior related to case-sensitivity and to delimited (quoted)
identifiers.

Consider the following declaration of a table in the Derby relational
database, whose dialect hews closely to the SQL Standard:

   create table app.t( a int, "b" int, "c""d" int );
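
You can double-check the normalized forms that Derby stores by querying
its column catalogs. Here is a minimal sketch, run from the same shell;
it assumes the Derby embedded driver is already on the classpath:

  import java.sql.DriverManager
  val conn = DriverManager.getConnection(
    "jdbc:derby:/Users/rhillegas/derby/databases/derby1")
  // SYS.SYSCOLUMNS holds one row per column; join to SYS.SYSTABLES by id
  val rs = conn.createStatement.executeQuery(
    """select c.columnname from sys.syscolumns c
       join sys.systables t on c.referenceid = t.tableid
       where t.tablename = 'T'""")
  while (rs.next()) println(rs.getString(1))   // prints A, b and c"d
  conn.close()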

Now let's load that table into Spark like this:

  import org.apache.spark.sql._
  import org.apache.spark.sql.types._

  val df = sqlContext.read.format("jdbc").options(
    Map("url" -> "jdbc:derby:/Users/rhillegas/derby/databases/derby1",
    "dbtable" -> "app.t")).load()
  df.registerTempTable("test_data")
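
(For what it's worth, printing the schema shows how the JDBC reader
surfaced those names, a quick sanity check:)

  // the column names exactly as Spark sees them
  df.printSchema()
  df.schema.fieldNames.foreach(println)   // prints: A, b, c"d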

The following query runs fine because the column name matches the
normalized form in which it is stored in the metadata catalogs of the
relational database:

  // normalized column names are recognized
  sqlContext.sql(s"""select A from test_data""").show

But the following query fails during name resolution. This puzzles me
because non-delimited identifiers are case-insensitive in the ANSI/ISO
Standard. They are also supposed to be case-insensitive in HiveQL, at least
according to section 2.3.1 of the QuotedIdentifier.html webpage attached to
https://issues.apache.org/jira/browse/HIVE-6013:

  // ...unnormalized column names raise this error:
  //   org.apache.spark.sql.AnalysisException: cannot resolve 'a' given
  //   input columns A, b, c"d;
  sqlContext.sql("""select a from test_data""").show

Delimited (quoted) identifiers are treated as string literals. Again,
non-Standard behavior:

  // this returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

Embedded quotes in delimited identifiers won't even parse:

  // embedded quotes raise this error:
  //   java.lang.RuntimeException: [1.11] failure: ``union'' expected
  //   but "d" found
  sqlContext.sql("""select "c""d" from test_data""").show

This behavior is non-Standard and it strikes me as hard to describe to
users concisely. Would the community support an effort to bring the
handling of column identifiers into closer conformance with the Standard?
Would backward compatibility concerns even allow us to do that?

Thanks,
-Rick

Re: column identifiers in Spark SQL

Posted by Richard Hillegas <rh...@us.ibm.com>.
Thanks for that additional tip, Michael. Backticks fix the problem query in
which an identifier was transformed into a string literal. So this works
now...

  // now correctly resolves the unnormalized column id
  sqlContext.sql("""select `b` from test_data""").show

Any suggestion about how to escape an embedded double quote?

  // java.sql.SQLSyntaxErrorException: Syntax error: Encountered "\"" at
  //   line 1, column 12.
  sqlContext.sql("""select `c"d` from test_data""").show

  // org.apache.spark.sql.AnalysisException: cannot resolve 'c\"d' given
  //   input columns A, b, c"d; line 1 pos 7
  sqlContext.sql("""select `c\"d` from test_data""").show

Thanks,
-Rick

Re: column identifiers in Spark SQL

Posted by Michael Armbrust <mi...@databricks.com>.
HiveQL uses `backticks` for quoted identifiers.
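
For example (and, if I remember the Hive grammar correctly, a literal
backtick inside a quoted identifier is escaped by doubling it, not with a
backslash):

  sqlContext.sql("""select `b` from test_data""").show
  // a column literally named c`d would be written as `c``d`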

Re: column identifiers in Spark SQL

Posted by Richard Hillegas <rh...@us.ibm.com>.
Thanks for that tip, Michael. I think that my sqlContext was a raw
SQLContext originally. I have rebuilt Spark like so...

  sbt/sbt -Phive assembly/assembly
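
(A quick way to verify which flavor of context the shell handed you:)

  // the runtime class of the shell's sqlContext
  println(sqlContext.getClass.getName)
  // expect org.apache.spark.sql.hive.HiveContext in a -Phive build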

Now I see that my sqlContext is a HiveContext. That fixes one of the
queries. Now unnormalized column names work:

  // ...unnormalized column names work now
  sqlContext.sql("""select a from test_data""").show

However, quoted identifiers are still treated as string literals:

  // this still returns rows consisting of the string literal "b"
  sqlContext.sql("""select "b" from test_data""").show

And embedded quotes inside quoted identifiers are swallowed up:

  // this now returns rows consisting of the string literal "cd"
  sqlContext.sql("""select "c""d" from test_data""").show

Thanks,
-Rick

Re: column identifiers in Spark SQL

Posted by Michael Armbrust <mi...@databricks.com>.
Are you using a SQLContext or a HiveContext?  The programming guide
suggests the latter, as the former is really only there because some
applications may have conflicts with Hive dependencies.  SQLContext is
case-sensitive by default, whereas the HiveContext is not.  The parser in
HiveContext is also a lot better.
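
If you would rather not depend on the shell default, you can construct
one explicitly. A minimal sketch, assuming a Hive-enabled build:

  import org.apache.spark.sql.hive.HiveContext
  val hiveCtx = new HiveContext(sc)   // sc is the shell's SparkContext
  // temp tables are scoped to the context that registered them, so
  // reload the JDBC table through hiveCtx before querying it there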
