Posted to issues@arrow.apache.org by "Andy Grove (Jira)" <ji...@apache.org> on 2020/08/14 13:22:00 UTC
[jira] [Created] (ARROW-9741) [Rust] [DataFusion] Incorrect count in TPC-H query 1 result set
Andy Grove created ARROW-9741:
---------------------------------
Summary: [Rust] [DataFusion] Incorrect count in TPC-H query 1 result set
Key: ARROW-9741
URL: https://issues.apache.org/jira/browse/ARROW-9741
Project: Apache Arrow
Issue Type: Bug
Components: Rust, Rust - DataFusion
Reporter: Andy Grove
Assignee: Andy Grove
Fix For: 2.0.0
I am testing with the 100 GB (scale factor 100) data set in Parquet format. The results match overall between Spark and DataFusion, except for one of the counts: Spark reports 291241911 and DataFusion reports 300058170, a difference of 8816259.
DataFusion query:
{code:java}
"select
    l_returnflag,
    l_linestatus,
    sum(l_quantity),
    sum(l_extendedprice),
    sum(l_extendedprice * (1 - l_discount)),
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
    avg(l_quantity),
    avg(l_extendedprice),
    avg(l_discount),
    count(*)
from
    lineitem
where
    l_shipdate <= '1998-12-01'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus"
{code}
DataFusion output:
{code:java}
+--------------+--------------+-----------------+----------------------+--------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+----------------------+-----------------+
| l_returnflag | l_linestatus | sum(l_quantity) | sum(l_extendedprice) | sum(l_extendedprice Multiply CAST(Int64(1) as Float64) Minus l_discount) | sum(l_extendedprice Multiply CAST(Int64(1) as Float64) Minus l_discount Multiply CAST(Int64(1) as Float64) Plus l_tax) | avg(l_quantity) | avg(l_extendedprice) | avg(l_discount) | count(UInt8(1)) |
+--------------+--------------+-----------------+----------------------+--------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+----------------------+-----------------+
| A | F | 3775127758 | 5660776097194.464 | 5377736398183.935 | 5592847429515.93 | 25.49937060623502 | 38236.11838745711 | 0.05000224145223291 | 148047881 |
| N | F | 98553062 | 147771098385.98004 | 140384965965.03473 | 145999793032.77594 | 25.501475096542002 | 38237.03209968505 | 0.0499850931498342 | 3864590 |
| N | O | 7651423419 | 11473321691083.244 | 10899667121317.215 | 11335664103186.313 | 25.499799813085986 | 38236.99077003657 | 0.04999757591275955 | 300058170 |
| R | F | 3775724970 | 5661603032745.35 | 5378513563915.415 | 5593662252666.921 | 25.500067651651772 | 38236.70005754084 | 0.050001305269911714 | 148067261 |
+--------------+--------------+-----------------+----------------------+--------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------+--------------------+----------------------+----------------------+-----------------+
{code}
Spark query:
{code:java}
| select
|     l_returnflag,
|     l_linestatus,
|     sum(l_quantity) as sum_qty,
|     sum(l_extendedprice) as sum_base_price,
|     sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
|     sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
|     avg(l_quantity) as avg_qty,
|     avg(l_extendedprice) as avg_price,
|     avg(l_discount) as avg_disc,
|     count(*) as count_order
| from
|     lineitem
| where
|     l_shipdate < '1998-09-01'
| group by
|     l_returnflag,
|     l_linestatus
| order by
|     l_returnflag,
|     l_linestatus
{code}
Spark output:
{code:java}
+------------+------------+-------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+-----------+
|l_returnflag|l_linestatus| sum_qty| sum_base_price| sum_disc_price| sum_charge| avg_qty| avg_price| avg_disc|count_order|
+------------+------------+-------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+-----------+
| A| F|3.775127758E9|5.660776097194467E12|5.377736398183933E12|5.592847429515929E12|25.499370423275426| 38236.11698430501| 0.05000224353093977| 148047881|
| N| F| 9.8553062E7|1.477710983859800...|1.403849659650348E11|1.459997930327758...|25.501556956882876|38237.199388804525|0.049985284338051286| 3864590|
| N| O|7.426674812E9|1.113628734444901...|1.057947943676070...|1.100266737949706...| 25.50002088126664|38237.241701277715| 0.04999786229238074| 291241911|
| R| F| 3.77572497E9|5.661603032745349E12|5.378513563915412E12|5.593662252666918E12| 25.50006628406532| 38236.69725845302|0.050001304339664904| 148067261|
+------------+------------+-------------+--------------------+--------------------+--------------------+------------------+------------------+--------------------+-----------+
{code}
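As a quick cross-check of the two result tables, the following Python sketch compares sum(l_quantity) and count(*) per group. The values are copied by hand from the outputs above, with Spark's scientific notation expanded to integers. The dictionary and function names are illustrative only and are not part of DataFusion or Spark.

```python
# Per-group values copied from the two result tables in this report.
# Keys are (l_returnflag, l_linestatus); values are (sum(l_quantity), count(*)).
datafusion = {
    ("A", "F"): (3775127758, 148047881),
    ("N", "F"): (98553062, 3864590),
    ("N", "O"): (7651423419, 300058170),
    ("R", "F"): (3775724970, 148067261),
}
spark = {
    ("A", "F"): (3775127758, 148047881),   # 3.775127758E9
    ("N", "F"): (98553062, 3864590),       # 9.8553062E7
    ("N", "O"): (7426674812, 291241911),   # 7.426674812E9
    ("R", "F"): (3775724970, 148067261),   # 3.77572497E9
}


def diff_groups(left, right):
    """Return {group: (qty_delta, count_delta)} for groups whose values differ."""
    out = {}
    for key in sorted(left):
        if left[key] != right[key]:
            out[key] = tuple(l - r for l, r in zip(left[key], right[key]))
    return out


# Hypothetical helper usage: report every group where the engines disagree.
mismatches = diff_groups(datafusion, spark)
for group, (qty_delta, count_delta) in mismatches.items():
    print(group, "sum(l_quantity) delta:", qty_delta, "count delta:", count_delta)
```

With the values as posted, only the (N, O) group differs, and it differs in sum(l_quantity) as well as in the count. The two queries as posted also use different l_shipdate predicates (<= '1998-12-01' for DataFusion, < '1998-09-01' for Spark), so that may be worth ruling out before looking for a bug in the aggregation itself.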