Posted to github@arrow.apache.org by "osawyerr (via GitHub)" <gi...@apache.org> on 2023/06/28 23:56:23 UTC

[GitHub] [arrow-datafusion] osawyerr opened a new issue, #6794: Incorrect results returned for TPC-H Query 8

osawyerr opened a new issue, #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794

   ### Describe the bug
   
   DataFusion returns incorrect results when running TPC-H Query 8 against Parquet files.
   
   ### To Reproduce
   
   1. Generate TPC-H parquet files for scale factor 10
   2. Open datafusion-cli and create external tables pointing to the files
   ```sql
   create external table lineitem stored as parquet location '/path/to/lineitem/lineitem_1687987398_default/';
   create external table customer stored as parquet location '/path/to/customer/customer_1687987384_default/';
   create external table nation stored as parquet location '/path/to/nation/nation_1687988005_default/';
   create external table orders stored as parquet location '/path/to/orders/orders_1687988005_default/';
   create external table region stored as parquet location '/path/to/region/region_1687988223_default/';
   create external table supplier stored as parquet location '/path/to/supplier/supplier_1687988223_default/';
   create external table part stored as parquet location '/path/to/part/part_1687988133_default/';
   ```
   3. Run TPC-H Query 8
   ```sql
   select
     o_year, sum(case when nation = 'BRAZIL' then volume else 0 end) / sum(volume) as mkt_share
   from
     (
       select
         extract(year from o_orderdate) as o_year,
         l_extendedprice * (1 - l_discount) as volume,
         n2.n_name as nation
       from part, supplier, lineitem, orders, customer, nation n1, nation n2, region
       where
         p_partkey = l_partkey
         and s_suppkey = l_suppkey
         and l_orderkey = o_orderkey
         and o_custkey = c_custkey
         and c_nationkey = n1.n_nationkey
         and n1.n_regionkey = r_regionkey
         and r_name = 'AMERICA'
         and s_nationkey = n2.n_nationkey
         and o_orderdate between date '1995-01-01' and date '1996-12-31'
         and p_type = 'ECONOMY ANODIZED STEEL'
     ) as all_nations
   group by o_year
   order by o_year;
   ```
   4. Incorrect results displayed below
   ```
   +--------+-------------------------------------------+
   | o_year | mkt_share                                 |
   +--------+-------------------------------------------+
   | 1995.0 | -0.00000000000011380044067220119495060732 |
   | 1996.0 | 0.00000000000019588288500717383285218261  |
   +--------+-------------------------------------------+
   2 rows in set. Query took 1.131 seconds.
   ```
   
   
   ### Expected behavior
   
   The correct results should be:
   1. From Postgres:
   ```
    o_year |       mkt_share        
   --------+------------------------
      1995 | 0.03882014251433219622
      1996 | 0.03948968749183991638
   (2 rows)
   
   ```
   2. From DuckDB (with the same Parquet files):
   ```
   ┌────────┬─────────────────────┐
   │ o_year │      mkt_share      │
   │ int64  │       double        │
   ├────────┼─────────────────────┤
   │   1995 │  0.0388201425143322 │
   │   1996 │ 0.03948968749183992 │
   └────────┴─────────────────────┘
   Run Time (s): real 1.328 user 8.095730 sys 0.358842
   ```
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1670280139

   Opened https://github.com/apache/arrow-datafusion/pull/7233 to verify it.




[GitHub] [arrow-datafusion] andygrove closed issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove closed issue #6794: Incorrect results returned for TPC-H Query 8
URL: https://github.com/apache/arrow-datafusion/issues/6794




[GitHub] [arrow-datafusion] viirya commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1618842382

   That is why the internal scaling in the division kernel should use fixed-point computation, which allows precision loss instead of overflow.
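
A minimal Python sketch of that idea, under stated assumptions: the scaled numerator is computed in a wider integer (Python's unbounded `int` stands in for a wider intermediate type), and if the quotient does not fit in 38 decimal digits, low-order digits are dropped and the result scale shrinks accordingly. The names `trunc_div` and `div_with_precision_loss` and the inputs are hypothetical; this is not the arrow-rs kernel.

```python
# Illustrative sketch only; not the arrow-rs implementation.
MAX_PRECISION = 38  # maximum number of decimal digits in a Decimal128


def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def div_with_precision_loss(num: int, den: int, extra_scale: int):
    """Return (unscaled quotient, number of fractional digits dropped)."""
    # The wide intermediate cannot overflow because Python ints are unbounded.
    wide = num * 10**extra_scale
    q = trunc_div(wide, den)
    dropped = 0
    # Trade fractional digits for range until the quotient fits.
    while abs(q) >= 10**MAX_PRECISION:
        q = trunc_div(q, 10)
        dropped += 1
    return q, dropped


q, dropped = div_with_precision_loss(10**20, 3, 30)
print(len(str(q)), dropped)  # 38 12
```

The caller would subtract `dropped` from the intended result scale, so the value stays approximately right rather than wrapping into a meaningless number.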




[GitHub] [arrow-datafusion] viirya commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1613874240

   I'm going to add a scalar version of the fixed-point decimal multiplication kernel upstream. We can use it to fix this once it is available.




[GitHub] [arrow-datafusion] alamb commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1671414033

   🎉 




[GitHub] [arrow-datafusion] osawyerr commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "osawyerr (via GitHub)" <gi...@apache.org>.
osawyerr commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1691489025

   Hi @alamb I think this may need to be reopened. Version 30 is generating incorrect results.
   
   ```
    o_year | mkt_share  
   --------+------------
      1995 | 0.04222322
      1996 | 0.04280077
   ```




[GitHub] [arrow-datafusion] viirya commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1613784849

   Currently you can only get a meaningful result by adding a cast on the division; see `benchmarks/queries/q8.sql`.




[GitHub] [arrow-datafusion] mingmwang commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1612325504

   Might be related to `Decimal` division.




[GitHub] [arrow-datafusion] mingmwang commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1612323629

   I will take a look.




[GitHub] [arrow-datafusion] osawyerr commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "osawyerr (via GitHub)" <gi...@apache.org>.
osawyerr commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1670300042

   Sure. Just ran it. It's working well now with the latest build on the main branch. Nice speed boost as well.
   ```
   +--------+------------+
   | o_year | mkt_share  |
   +--------+------------+
   | 1995.0 | 0.03882014 |
   | 1996.0 | 0.03948968 |
   +--------+------------+
   2 rows in set. Query took 0.985 seconds.
   ```
   The ``o_year`` comes back as a decimal though. But I think that's related to the ``extract`` function in ``extract(year from o_orderdate) as o_year``.
   
   




[GitHub] [arrow-datafusion] viirya commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1613782990

   That is because an overflow happened in the decimal divide kernel; see the previous comment at https://github.com/apache/arrow-datafusion/pull/5675/files#r1152896889. I should provide a scalar version of the fixed-point decimal multiplication kernel to fix it, but I haven't found time to work on it.
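
To illustrate the kind of overflow described here, a minimal Python sketch follows. It is not the actual arrow-rs kernel. It assumes a Decimal128 value is stored as an unscaled signed 128-bit integer, that the numerator is multiplied by a power of ten before the integer division, and that overflow wraps around. The names `wrap_i128`, `trunc_div`, and `naive_decimal_div` and the operand values are hypothetical, chosen only so the ratio resembles the expected 1995 result.

```python
# Illustrative sketch only; not the arrow-rs implementation.
I128_BITS = 128


def wrap_i128(x: int) -> int:
    """Wrap a Python int to a signed 128-bit value (two's complement)."""
    half = 1 << (I128_BITS - 1)
    return (x + half) % (1 << I128_BITS) - half


def trunc_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def naive_decimal_div(num: int, den: int, extra_scale: int) -> int:
    """Scale the numerator by 10**extra_scale in wrapping i128, then divide."""
    scaled = wrap_i128(num * 10**extra_scale)
    return trunc_div(scaled, den)


# Hypothetical unscaled operands, both at scale 4.
num = 388_201_425_143
den = 10_000_000_000_000

small = naive_decimal_div(num, den, 8)   # scaled numerator fits in 128 bits
large = naive_decimal_div(num, den, 30)  # scaled numerator exceeds 128 bits
exact = trunc_div(num * 10**30, den)     # what unbounded arithmetic gives

print(small)           # 3882014, i.e. 0.03882014 at scale 8
print(large == exact)  # False: the wrapped numerator yields a wrong quotient
```

With a modest extra scale the quotient is correct; once the scaled numerator no longer fits in 128 bits, the result bears no relation to the true ratio, which is consistent with the tiny and even negative values reported in the issue.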




[GitHub] [arrow-datafusion] alamb commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1670264898

   Can someone test this again now that https://github.com/apache/arrow-datafusion/pull/6832 has been merged?




[GitHub] [arrow-datafusion] tustvold commented on issue #6794: Incorrect results returned for TPC-H Query 8

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #6794:
URL: https://github.com/apache/arrow-datafusion/issues/6794#issuecomment-1618266914

   I think this is actually a bug in the way that decimal type coercion is currently performed, which causes computations to overflow when they shouldn't - https://github.com/apache/arrow-datafusion/issues/6828
   
   This is something I hope to fix upstream as part of https://github.com/apache/arrow-rs/issues/3999
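
As a rough illustration of how a coercion rule can ask for more digits than Decimal128 can hold, the sketch below applies the division rule used by several SQL engines (Hive and Spark among them): scale = max(6, s1 + p2 + 1) and precision = p1 - s1 + s2 + scale. Whether DataFusion used exactly this rule at the time is an assumption, and the operand type decimal(38, 4) and the name `div_result_type` are hypothetical.

```python
# Illustrative sketch only; the rule is the Hive/Spark-style one, assumed here.
MAX_PRECISION = 38  # maximum number of decimal digits in a Decimal128


def div_result_type(p1: int, s1: int, p2: int, s2: int):
    """Return the (precision, scale) the rule requests for p1,s1 / p2,s2."""
    scale = max(6, s1 + p2 + 1)
    precision = p1 - s1 + s2 + scale
    return precision, scale


precision, scale = div_result_type(38, 4, 38, 4)
print(precision, scale)            # 81 43
print(precision > MAX_PRECISION)   # True: the request cannot be represented
```

When the requested type exceeds 38 digits, an engine has to cap precision and scale somehow, and the way it does so determines whether the computation loses a few fractional digits or overflows.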

