Posted to issues@spark.apache.org by "kaiqingli (Jira)" <ji...@apache.org> on 2022/10/11 02:28:00 UTC
[jira] [Created] (SPARK-40741) Spark's bin/beeline does not handle distribute by ... sort by statements well and returns wrong results
kaiqingli created SPARK-40741:
---------------------------------
Summary: Spark's bin/beeline does not handle distribute by ... sort by statements well and returns wrong results
Key: SPARK-40741
URL: https://issues.apache.org/jira/browse/SPARK-40741
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.0
Environment: spark 3.1
hive 3.0
Reporter: kaiqingli
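When a query uses distribute by ... sort by ..., running it through spark/bin/beeline returns wrong results, while running it through hive/beeline returns correct results. The scenario is: split the array data with posexplode, sort by the resulting element index, then apply collect_list. The rebuilt array does not match the original array. The SQL is as follows: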
select id,
       samplingtimesec,
       array_data = new_array_data flag,
       array_data,
       new_array_data
from (
    select id,
           samplingtimesec,
           array_data,
           concat('[', concat_ws(',', collect_list(cell_voltage)), ']') new_array_data
    from (
        select id, samplingtimesec, array_data, cell_index, cell_voltage
        from (
            select id,
                   samplingtimesec,
                   array_data, -- format: [1,2,3,4,5]
                   row_number() over (partition by id,samplingtimesec order by samplingtimesec) r -- deduplicate
            from table
            WHERE dt = '20221007'
              and samplingtimesec <= 1665079200000
        ) tmp
        lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
        where r = 1
        distribute by id, samplingtimesec
        sort by cell_index
    ) tmp
    group by id, samplingtimesec, array_data
) tmp
where array_data != new_array_data;
For the SQL above, hive/beeline returns 0 rows;
spark/beeline returns a non-zero number of rows.
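
A possible order-independent variant, offered only as a sketch and not as a confirmed explanation or fix for this issue: Spark's documentation describes collect_list as non-deterministic, because the order of the collected values depends on row order, which may change after a shuffle. Instead of relying on sort by, the sketch below collects (cell_index, cell_voltage) pairs and restores the order with sort_array before joining. The inline src row is made-up sample data standing in for the real table, and the transform lambda is Spark SQL syntax, so this form is not expected to run unchanged on Hive.

```sql
-- Sketch only: hypothetical one-row input standing in for the real table.
with src as (
    select *
    from values (1, 1665079200000, '[3.1,3.2,3.3]') as t(id, samplingtimesec, array_data)
)
select id,
       samplingtimesec,
       array_data,
       -- Collect (index, value) structs, sort them by index, then keep only the values.
       concat('[',
              concat_ws(',',
                        transform(
                            sort_array(collect_list(struct(cell_index, cell_voltage))),
                            x -> x.cell_voltage)),
              ']') new_array_data
from src
lateral view posexplode(split(replace(replace(array_data, '[', ''), ']', ''), ',')) v0 as cell_index, cell_voltage
group by id, samplingtimesec, array_data;
```

Because sort_array compares structs field by field, the integer cell_index from posexplode determines the order, so the rebuilt string does not depend on how rows arrive at the aggregation.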