You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@calcite.apache.org by "itxiangkui (Jira)" <ji...@apache.org> on 2022/06/13 10:38:00 UTC
[jira] [Created] (CALCITE-5190) TPC-H testing of the Calcite framework

itxiangkui created CALCITE-5190:
-----------------------------------

             Summary: TPC-H testing of the Calcite framework
                 Key: CALCITE-5190
                 URL: https://issues.apache.org/jira/browse/CALCITE-5190
             Project: Calcite
          Issue Type: Wish
            Reporter: itxiangkui


What our project is doing:
1. Designed a time series database, made full use of the enumerable adapter, and made a horizontal comparison of benchmarks with other popular time databases
2. Design catalog->database->table, and use the adapter function of jdbc to connect multiple databases such as Mysql/Es in the backend to solve the join problem of heterogeneous data (for example: select * from catalog1.database .table1 ,catalog2,database.table2 .. where t1.c1='xx' )

But when we tested it, we found that, as far as Jdbc-Adapter is concerned, Calcite will disassemble the logical SQL into very fragmented query plan SQL, and hand it over to the underlying database instance to execute the specific physical SQL. Most of the SQL has no filter conditions to push down and the physical SQL looks very clumsy.

So we did a version of the TPC-H test (an industry standard in the database field), taking SQL-19 as an example:
{code:sql}
SELECT sum(l_extendedprice* (1 - l_discount)) AS revenue
FROM lineitem,
     part
WHERE (p_partkey = l_partkey
       AND p_brand = 'Brand#52'
       AND p_container in ('SM CASE',
                           'SM BOX',
                           'SM PACK',
                           'SM PKG')
       AND l_quantity >= 4
       AND l_quantity <= 4 + 10
       AND p_size BETWEEN 1 AND 5
       AND l_shipmode in ('AIR',
                          'AIR REG')
       AND l_shipinstruct = 'DELIVER IN PERSON')
  OR (p_partkey = l_partkey
      AND p_brand = 'Brand#11'
      AND p_container in ('MED BAG',
                          'MED BOX',
                          'MED PKG',
                          'MED PACK')
      AND l_quantity >= 18
      AND l_quantity <= 18 + 10
      AND p_size BETWEEN 1 AND 10
      AND l_shipmode in ('AIR',
                         'AIR REG')
      AND l_shipinstruct = 'DELIVER IN PERSON')
  OR (p_partkey = l_partkey
      AND p_brand = 'Brand#51'
      AND p_container in ('LG CASE',
                          'LG BOX',
                          'LG PACK',
                          'LG PKG')
      AND l_quantity >= 29
      AND l_quantity <= 29 + 10
      AND p_size BETWEEN 1 AND 15
      AND l_shipmode in ('AIR',
                         'AIR REG')
      AND l_shipinstruct = 'DELIVER IN PERSON');
{code}
The physical execution plan is:
{code:sql}
[SELECT *
FROM `tpch`.`lineitem`
LIMIT 3]

[SELECT `L_PARTKEY`, `L_QUANTITY`, `L_EXTENDEDPRICE`, `L_DISCOUNT`, `L_SHIPINSTRUCT`, `L_SHIPMODE`
FROM `tpch`.`lineitem`]

[SELECT `P_PARTKEY`, `P_BRAND`, `P_SIZE`, `P_CONTAINER`
FROM `tpch`.`part`]
{code}
this looks incredible...

 

My question is: Has Calcite ever tested something like TPC-H.


In theory, for a specific database instance, the query speed after using calcite will not drop much performance compared to the original database, so many users are happy to use Calcite to solve the problem of data islands. …



--
This message was sent by Atlassian Jira
(v8.20.7#820007)