Posted to user@spark.apache.org by "Budde, Adam" <bu...@amazon.com> on 2014/07/31 20:16:11 UTC

Inconsistent Spark SQL behavior when column names contain dots

I’m working with a dataset where each row is stored as a single-line flat JSON object. I want to leverage Spark SQL to run relational queries on this data. Many of the object keys in this dataset have dots in them, e.g.:

{ "key.number1": "value1", "key.number2": "value2" … }

I can successfully load the data as an RDD in Spark and construct a Spark SQL table using the jsonRDD function. If I print the schema of the table, I see that Spark SQL infers the full object key, dot included, as the column name:

> sqlTable.printSchema()
root
|-- key.number1: StringType
|-- key.number2: StringType
…
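
For completeness, here's roughly how I'm building this table in the spark-shell (sketch only; sc is the shell's SparkContext and the inline JSON stands in for my real data):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Each element of the RDD is one flat JSON object on a single line.
val jsonLines = sc.parallelize(Array("""{"key.number1": "value1", "key.number2": "value2"}"""))
val sqlTable = sqlContext.jsonRDD(jsonLines)   // schema is inferred, dots and all
sqlTable.registerAsTable("sql_table")
sqlTable.printSchema()   // produces the schema shown above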

However, when I try to use one of these column names in a query, it seems that the Spark SQL parser always assumes I’m trying to reference a nested attribute. The same thing happens when using HiveQL. If there’s a way to escape the dot in the column name, I haven’t found it:

> sqlContext.sql("SELECT key.number1 FROM sql_table LIMIT 1").first
== Query Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'key.number1, tree
Project ['key.number1]
…
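
As far as I can tell, the parser expects something like this nested layout instead (hypothetical sketch, not my actual data):

// With genuinely nested data, "number1" really is a field of "key",
// so a dotted reference like key.number1 is presumably the intended usage.
val nestedLines = sc.parallelize(Array("""{"key": {"number1": "value1"}}"""))
val nestedTable = sqlContext.jsonRDD(nestedLines)
nestedTable.registerAsTable("nested_table")
sqlContext.sql("SELECT key.number1 FROM nested_table LIMIT 1").first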

This is not a critical issue by any means; it’s simple enough to use map() to transform the dots to underscores after loading the JSON data as text. I just wanted to reach out to the community for guidance on whether this issue warrants a bug report. To me, the behavior seems inconsistent: you can create a table whose column names contain dots, but AFAICT you cannot reference such columns in a query.
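
Here's roughly what I mean by the map() workaround (sketch only, reusing the jsonLines RDD from above; the blanket replace assumes dots never appear inside values, which happens to hold for my data):

// Rewrite the raw JSON text before schema inference so the inferred
// column names use underscores instead of dots.
val cleanedLines = jsonLines.map(_.replace(".", "_"))
val cleanedTable = sqlContext.jsonRDD(cleanedLines)
cleanedTable.registerAsTable("sql_table")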

Also, I’d greatly appreciate it if anybody has any pointers as to where in the source I should be looking if I wanted to patch this issue in my local branch. I’ve taken a glance at some of the Spark SQL Catalyst code but I’m afraid I’m too much of a Scala novice to make much headway here.

For reference, I’m using Spark 1.0.1. Thanks for your input.

Adam

Re: Inconsistent Spark SQL behavior when column names contain dots

Posted by Yin Huai <yh...@databricks.com>.
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it.


Re: Inconsistent Spark SQL behavior when column names contain dots

Posted by "Budde, Adam" <bu...@amazon.com>.
I still see the same “Unresolved attributes” error when using hql + backticks.

Here’s a code snippet that replicates this behavior:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two": "value2"}"""))
val sampleTable = hiveContext.jsonRDD(sampleRDD)
sampleTable.registerAsTable("sample_table")
hiveContext.hql("SELECT `key.one` FROM sample_table")


Re: Inconsistent Spark SQL behavior when column names contain dots

Posted by Michael Armbrust <mi...@databricks.com>.
Ideally you'd use backticks to reference columns that contain weird
characters.  I don't believe this works in the sql parser, but I'm curious if
using the hql parser in HiveContext would work for you?
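
Something along these lines is what I have in mind (untested sketch, assuming the table from your mail is registered as sql_table):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The backticks are meant to quote the identifier so the dot is not
// parsed as nested-field access.
hiveContext.hql("SELECT `key.number1` FROM sql_table LIMIT 1")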

If you wanted to add support for this in the sql parser, I'd check out
SqlParser.scala.  Though it is likely we will abandon that code in the
next release for something more complete.

