You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Kohki Nishio <ta...@gmail.com> on 2023/01/27 23:34:50 UTC

Spark SQL question

this SQL works

select 1 as *`data.group`* from tbl group by *data.group*


Since there's no such field as *data,* I thought the SQL has to look like
this

select 1 as *`data.group`* from tbl group by `*data.group`*


 But that gives and error (cannot resolve '`data.group`') ... I'm no expert
in SQL, but feel like it's a strange behavior... does anybody have a good
explanation for it ?

Thanks

-- 
Kohki Nishio

SQL GROUP BY alias with dots, was: Spark SQL question

Posted by Enrico Minack <in...@enrico.minack.dev>.
Hi,

you are right, that is an interesting question.

Looks like GROUP BY is doing something funny / magic here (spark-shell 
3.3.1 and 3.5.0-SNAPSHOT):

With an alias, it behaves as you have pointed out:

spark.range(3).createTempView("ids_without_dots")
spark.sql("SELECT * FROM ids_without_dots").show()

// works
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
an.id").show()
// fails
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
`an.id`").show()


Without an alias, it behaves as expected, which is the opposite of above 
(a column with a dot exists, no alias used in SELECT):

spark.range(3).select($"id".as("an.id")).createTempView("ids_with_dots")
spark.sql("SELECT `an.id` FROM ids_with_dots").show()

// works
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY `an.id`").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY an.id").show()


With a struct column, it also behaves as expected:

spark.range(3).select(struct($"id").as("an")).createTempView("ids_with_struct")
spark.sql("SELECT an.id FROM ids_with_struct").show()

// works
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY an.id").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY an.id").show()
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY `an.id`").show()
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show()


This does not feel very consistent.

Enrico



Am 28.01.23 um 00:34 schrieb Kohki Nishio:
> this SQL works
>
>     select 1 as *`data.group`* from tbl group by *data.group*
>
>
> Since there's no such field as *data,* I thought the SQL has to look 
> like this
>
>     select 1 as *`data.group`* from tbl group by `*data.group`*
>
>
>  But that gives and error (cannot resolve '`data.group`') ... I'm no 
> expert in SQL, but feel like it's a strange behavior... does anybody 
> have a good explanation for it ?
>
> Thanks
>
> -- 
> Kohki Nishio


Re: Spark SQL question

Posted by Bjørn Jørgensen <bj...@gmail.com>.
Hi Mich.
This is a Spark user group mailing list where people can ask *any*
questions about spark.
You know SQL and streaming, but I don't think it's necessary to start a
replay with "*LOL*" to the question that's being asked.
No questions are to stupid to be asked.


lør. 28. jan. 2023 kl. 09:22 skrev Mich Talebzadeh <
mich.talebzadeh@gmail.com>:

> LOL
>
> First one
>
> spark-sql> select 1 as `data.group` from abc group by data.group;
> 1
> Time taken: 0.198 seconds, Fetched 1 row(s)
>
> means that are assigning alias data.group to select and you are using that
> alias -> data.group in your group by statement
>
>
> This is equivalent to
>
>
> spark-sql> select 1 as `data.group` from abc group by 1;
>
> 1
>
> With regard to your second sql
>
>
> select 1 as *`data.group`* from tbl group by `*data.group`;*
>
>
> *will throw an error *
>
>
> *spark-sql> select 1 as `data.group` from abc group by `data.group`;*
>
> *Error in query: cannot resolve '`data.group`' given input columns:
> [spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
> line 1 pos 43;*
>
> *'Aggregate ['`data.group`], [1 AS data.group#225]*
>
> *+- SubqueryAlias spark_catalog.elayer.abc*
>
> *   +- HiveTableRelation [`elayer`.`abc`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
> [keyword#226, occurence#227L], Partition Cols: []]*
>
> `data.group` with quotes is neither the name of the column or its alias
>
>
> *HTH*
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 27 Jan 2023 at 23:36, Kohki Nishio <ta...@gmail.com> wrote:
>
>> this SQL works
>>
>> select 1 as *`data.group`* from tbl group by *data.group*
>>
>>
>> Since there's no such field as *data,* I thought the SQL has to look
>> like this
>>
>> select 1 as *`data.group`* from tbl group by `*data.group`*
>>
>>
>>  But that gives and error (cannot resolve '`data.group`') ... I'm no
>> expert in SQL, but feel like it's a strange behavior... does anybody have a
>> good explanation for it ?
>>
>> Thanks
>>
>> --
>> Kohki Nishio
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297

Re: Spark SQL question

Posted by Mich Talebzadeh <mi...@gmail.com>.
LOL

First one

spark-sql> select 1 as `data.group` from abc group by data.group;
1
Time taken: 0.198 seconds, Fetched 1 row(s)

means that are assigning alias data.group to select and you are using that
alias -> data.group in your group by statement


This is equivalent to


spark-sql> select 1 as `data.group` from abc group by 1;

1

With regard to your second sql


select 1 as *`data.group`* from tbl group by `*data.group`;*


*will throw an error *


*spark-sql> select 1 as `data.group` from abc group by `data.group`;*

*Error in query: cannot resolve '`data.group`' given input columns:
[spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
line 1 pos 43;*

*'Aggregate ['`data.group`], [1 AS data.group#225]*

*+- SubqueryAlias spark_catalog.elayer.abc*

*   +- HiveTableRelation [`elayer`.`abc`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
[keyword#226, occurence#227L], Partition Cols: []]*

`data.group` with quotes is neither the name of the column or its alias


*HTH*



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 27 Jan 2023 at 23:36, Kohki Nishio <ta...@gmail.com> wrote:

> this SQL works
>
> select 1 as *`data.group`* from tbl group by *data.group*
>
>
> Since there's no such field as *data,* I thought the SQL has to look like
> this
>
> select 1 as *`data.group`* from tbl group by `*data.group`*
>
>
>  But that gives and error (cannot resolve '`data.group`') ... I'm no
> expert in SQL, but feel like it's a strange behavior... does anybody have a
> good explanation for it ?
>
> Thanks
>
> --
> Kohki Nishio
>