You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/10/12 03:49:00 UTC

[jira] [Commented] (SPARK-40741) spark项目bin/beeline对于distribute by sort by语句支持不好，输出结果错误

    [ https://issues.apache.org/jira/browse/SPARK-40741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17616166#comment-17616166 ] 

Hyukjin Kwon commented on SPARK-40741:
--------------------------------------

[~lkqqingcao] would be great to file an issue in English because there are many maintainers who don't speak other languages.

> spark项目bin/beeline对于distribute by sort by语句支持不好，输出结果错误
> ------------------------------------------------------
>
>                 Key: SPARK-40741
>                 URL: https://issues.apache.org/jira/browse/SPARK-40741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>         Environment: spark 3.1
> hive 3.0
>            Reporter: kaiqingli
>            Priority: Major
>
> sql中使用distribute by ... sort by ...时，通过spark/bin/beeline执行的结果错误，使用hive/beeline输出结果正确，具体场景为，先基于posexplode拆分array数据，然后基于拆分的下标进行sort by,之后再collect list，结果与原始的array结果不一致，sql如下：
> select id,
> samplingtimesec,
> array_data = new_array_data flag,
> array_data,
> new_array_data
> from (
> select id,
> samplingtimesec,
> array_data,
> concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_array_data
> from (
> select id, samplingtimesec, array_data, cell_index, cell_voltage
> from (
> select id,
> samplingtimesec,
> array_data,--格式[1,2,3,4,5]
> row_number() over (partition by id,samplingtimesec order by samplingtimesec) r --去重
> from table
> WHERE dt = '20221007'
> and samplingtimesec <= 1665079200000
> ) tmp
> lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
> where r = 1
> distribute by id
> , samplingtimesec sort by cell_index
> ) tmp
> group by id, samplingtimesec, array_data
> ) tmp
> where array_data != new_array_data;
> 以上sql，对于hive/beeline输出结果为0条；
> 对于spark/beeline输出结果不为0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org