Posted to issues@spark.apache.org by "chenliang (Jira)" <ji...@apache.org> on 2020/12/10 02:33:00 UTC

[jira] [Updated] (SPARK-33721) Support to use Hive build-in functions by configuration

     [ https://issues.apache.org/jira/browse/SPARK-33721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenliang updated SPARK-33721:
------------------------------
    Description: 
The Hive and Spark SQL engines differ in the behavior of many built-in functions. Some of the differences are shown below:
||Built-in function||SQL||Result of Hive SQL||Result of Spark SQL||
|unix_timestamp|{{select unix_timestamp(concat('2020-06-01', ' 24:00:00'));}}|1591027200|NULL|
|to_date|{{select to_date('0000-00-00');}}|0002-11-30|NULL|
|datediff|{{select datediff(CURRENT_DATE, '0000-00-00');}}|737986|NULL|
|collect_set|{{select c1, c2, concat_ws('##', collect_set(c3)) c3_set from bigdata_offline.test_collect_set group by c1, c2;}}
 {{bigdata_offline.test_collect_set}} contains data:
 {{(1, 1, '1'), (1, 1, '2'), (1, 1, '3'), (1, 1, '4'), (1, 1, '5')}}|{{c1  c2  c3_set}}
 {{1   1   2##3##4##5##1}}|{{c1  c2  c3_set}}
 {{1   1   3##1##2##5##4}}|
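
The unix_timestamp row is consistent with one engine rolling the out-of-range hour 24 over into the next day while the other rejects the string. The Python sketch below only illustrates that distinction and is not either engine's implementation. The function names are hypothetical, and the UTC+8 session time zone is an assumption, chosen because it reproduces the reported Hive value; the issue does not state the time zone.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the session time zone was UTC+8; the issue does not say.
ASSUMED_TZ = timezone(timedelta(hours=8))


def strict_unix_timestamp(text):
    """Reject out-of-range fields and return None (analogous to NULL)."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return int(parsed.replace(tzinfo=ASSUMED_TZ).timestamp())


def lenient_unix_timestamp(text):
    """Let an out-of-range time of day roll over into the following day."""
    date_part, time_part = text.split(" ")
    year, month, day = (int(p) for p in date_part.split("-"))
    hours, minutes, seconds = (int(p) for p in time_part.split(":"))
    midnight = datetime(year, month, day, tzinfo=ASSUMED_TZ)
    moment = midnight + timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return int(moment.timestamp())


value = "2020-06-01" + " 24:00:00"  # the string that concat() builds in the query
strict_result = strict_unix_timestamp(value)
lenient_result = lenient_unix_timestamp(value)
print(strict_result, lenient_result)
```

Under these assumptions the strict parser yields no value, matching the NULL in the Spark SQL column, and the lenient parser yields 1591027200, matching the Hive column.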

There is no consensus on which engine is more accurate. Users would prefer to choose the behavior that matches their production environment.
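
For collect_set in particular, both reported outputs contain the same five elements and differ only in order, which suggests that neither engine guarantees an element order. The Python sketch below (hypothetical helper name, data copied from the table) shows how sorting the collected set before joining makes the result deterministic. In SQL, wrapping the call as {{sort_array(collect_set(c3))}} may achieve the same effect, although that has not been verified against the versions in this issue.

```python
# Rows of bigdata_offline.test_collect_set as listed in the issue: (c1, c2, c3).
rows = [(1, 1, "1"), (1, 1, "2"), (1, 1, "3"), (1, 1, "4"), (1, 1, "5")]


def concat_collect_set(rows, sort_values):
    """Group by (c1, c2), collect distinct c3 values, and join them with '##'."""
    groups = {}
    for c1, c2, c3 in rows:
        groups.setdefault((c1, c2), set()).add(c3)
    result = {}
    for key, values in groups.items():
        # A set has no defined order; sorting makes the joined string stable.
        ordered = sorted(values) if sort_values else list(values)
        result[key] = "##".join(ordered)
    return result


sorted_result = concat_collect_set(rows, sort_values=True)
unsorted_result = concat_collect_set(rows, sort_values=False)
print(sorted_result[(1, 1)])
```

The unsorted variant always contains the same elements, but its order depends on the set implementation, which mirrors the difference between the two engines' outputs.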

I think we should improve this.

The Hive version is 1.2.1.

 


> Support to use Hive build-in functions by configuration
> -------------------------------------------------------
>
>                 Key: SPARK-33721
>                 URL: https://issues.apache.org/jira/browse/SPARK-33721
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3, 3.2.0
>            Reporter: chenliang
>            Priority: Major
>


