Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/03/02 00:22:00 UTC

[jira] [Commented] (IMPALA-4530) Sort node after exchange should start sorting after first RowBatch is received

    [ https://issues.apache.org/jira/browse/IMPALA-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695382#comment-17695382 ] 

ASF subversion and git services commented on IMPALA-4530:
---------------------------------------------------------

Commit 939a6ae14e5416845a43d3d41d01c43e8123a0b9 in impala's branch refs/heads/master from noemi
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=939a6ae14 ]

IMPALA-11477: Adding Codegen to sorted-run-merger

SortedRunMerger is used to merge multiple, already sorted runs.
It is used for external merge in the sorter (SortNode), and in
KRPC data stream receiver (ExchangeNode).

SortedRunMerger builds and maintains a min heap of the sorted input
runs. Rewrote SortedRunMerger::Heapify from recursive to iterative
and moved it to a new source file: sorted-run-merger-ir.cc.
Added a static Codegen() to SortedRunMerger, which is called from the
corresponding ExecNodes: SortNode and ExchangeNode.
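
For readers unfamiliar with the technique, the following is a minimal,
standalone sketch of a min-heap merge driven by an iterative (non-recursive)
sift-down. It is not the code from this commit: RunCursor, HeapifyIterative
and MergeSortedRuns are hypothetical names, the runs hold plain ints instead
of tuple rows, and the comparator is an ordinary functor standing in for
TupleRowComparator.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one sorted input run plus a read cursor.
struct RunCursor {
  const std::vector<int>* run;
  std::size_t pos;
  int Current() const { return (*run)[pos]; }
  bool Exhausted() const { return pos >= run->size(); }
};

// Iterative sift-down: restores the min-heap property for the subtree
// rooted at 'idx' by walking down the tree in a loop instead of recursing.
template <typename Less>
void HeapifyIterative(std::vector<RunCursor>& heap, std::size_t idx, Less less) {
  assert(idx < heap.size());
  const std::size_t n = heap.size();
  while (true) {
    const std::size_t left = 2 * idx + 1;
    const std::size_t right = left + 1;
    std::size_t smallest = idx;
    if (left < n && less(heap[left].Current(), heap[smallest].Current())) {
      smallest = left;
    }
    if (right < n && less(heap[right].Current(), heap[smallest].Current())) {
      smallest = right;
    }
    if (smallest == idx) return;  // Heap property holds; done.
    std::swap(heap[idx], heap[smallest]);
    idx = smallest;  // Continue from the child we swapped with.
  }
}

// Merges already-sorted runs into one sorted output using a min heap
// keyed on each run's current element.
template <typename Less>
std::vector<int> MergeSortedRuns(const std::vector<std::vector<int>>& runs,
                                 Less less) {
  std::vector<RunCursor> heap;
  for (const auto& run : runs) {
    if (!run.empty()) heap.push_back(RunCursor{&run, 0});
  }
  // Build the heap bottom-up, starting from the last internal node.
  for (std::size_t i = heap.size() / 2; i-- > 0;) {
    HeapifyIterative(heap, i, less);
  }
  std::vector<int> out;
  while (!heap.empty()) {
    out.push_back(heap[0].Current());
    ++heap[0].pos;
    if (heap[0].Exhausted()) {
      // Drop the finished run by moving the last heap entry to the root.
      heap[0] = heap.back();
      heap.pop_back();
    }
    if (!heap.empty()) HeapifyIterative(heap, 0, less);
  }
  return out;
}
```

Passing the comparator as a template parameter lets the compiler inline it
into the sift-down loop. That is only loosely analogous to what codegen does
for the real comparator, and is shown here purely for illustration.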

This change lets the merger use the codegened version of
TupleRowComparator instead of the interpreted one, which can speed up
merging, especially for complex comparison expressions.
This change also serves as a base for further codegen-related
optimizations in the merger.

Testing:
 - ran existing E2E sort tests (test-sort.py)
 - manual testing: ran queries that instantiate sort nodes and
   merging exchange nodes
Benchmarking:
 - did not cause a regression on the TPCH query set
 - made merge-intensive queries and IMPALA-4530 (in-memory merge of
   quicksorted small runs) faster

Change-Id: Ic35c7460bdbd54b8ec5872a83680e2f41ceae9fd
Reviewed-on: http://gerrit.cloudera.org:8080/18824
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Sort node after exchange should start sorting after first RowBatch is received
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-4530
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4530
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.8.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Noemi Pap-Takacs
>            Priority: Minor
>
> The sort node after an exchange doesn't start sorting until all data is received, which adds a lot of latency to the query. 
> It is not clear whether this optimization would still make sense for a scan followed by a sort that runs in the same thread. 
> Query
> {code}
> insert into tpcds_1000_parquet.store_sales_insert  partition(ss_sold_date_sk, ss_quantity)  /*+ clustered*/
> select
> ss_sold_time_sk,
>   ss_item_sk ,
>   ss_customer_sk,
>   ss_cdemo_sk,
>   ss_hdemo_sk,
>   ss_addr_sk,
>   ss_store_sk,
>   ss_promo_sk,
>   ss_ticket_number ,
>   ss_wholesale_cost ,
>   ss_list_price ,
>   ss_sales_price ,
>   ss_ext_discount_amt ,
>   ss_ext_sales_price ,
>   ss_ext_wholesale_cost ,
>   ss_ext_list_price ,
>   ss_ext_tax ,
>   ss_coupon_amt ,
>   ss_net_paid ,
>   ss_net_paid_inc_tax ,
>   ss_net_profit,
>   ss_sold_date_sk  , ss_quantity
> from   store_sales
> {code}
> Plan
> {code}
> WRITE TO HDFS [tpcds_1000_parquet.store_sales_insert, OVERWRITE=false, PARTITION-KEYS=(ss_sold_date_sk,ss_quantity)]
> |  partitions=180576
> |  hosts=15 per-host-mem=17.88GB
> |
> 02:SORT
> |  order by: ss_sold_date_sk DESC NULLS LAST, ss_quantity DESC NULLS LAST
> |  hosts=15 per-host-mem=1.45GB
> |  tuple-ids=1 row-size=100B cardinality=2879987999
> |
> 01:EXCHANGE [HASH(ss_sold_date_sk,ss_quantity)]
> |  hosts=15 per-host-mem=0B
> |  tuple-ids=0 row-size=100B cardinality=2879987999
> |
> 00:SCAN HDFS [tpcds_1000_parquet.store_sales, RANDOM]
>    partitions=1824/1824 files=1824 size=189.24GB
>    table stats: 2879987999 rows total
>    column stats: all
>    hosts=15 per-host-mem=88.00MB
>    tuple-ids=0 row-size=100B cardinality=2879987999
> {code}
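
The behaviour requested in the issue above (start sorting as soon as the
first batch arrives, then merge the small sorted runs in memory) could be
sketched roughly as follows. This is an illustrative toy, not Impala's
Sorter: IncrementalSorter, AddBatch and Finish are hypothetical names, rows
are plain ints, and std::priority_queue stands in for the merger's heap.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: sorts each batch on arrival and merges at the end.
class IncrementalSorter {
 public:
  // Called once per incoming batch. The batch is sorted immediately, so
  // sorting work overlaps with waiting for the remaining input.
  void AddBatch(std::vector<int> batch) {
    if (batch.empty()) return;
    std::sort(batch.begin(), batch.end());
    runs_.push_back(std::move(batch));
  }

  // Called after the last batch: k-way merges the small sorted runs.
  std::vector<int> Finish() const {
    using Entry = std::pair<int, std::size_t>;  // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(runs_.size(), 0);
    for (std::size_t r = 0; r < runs_.size(); ++r) {
      heap.emplace(runs_[r][0], r);  // Runs are never empty (see AddBatch).
    }
    std::vector<int> out;
    while (!heap.empty()) {
      const Entry top = heap.top();
      heap.pop();
      out.push_back(top.first);
      const std::size_t r = top.second;
      if (++pos[r] < runs_[r].size()) heap.emplace(runs_[r][pos[r]], r);
    }
    return out;
  }

 private:
  std::vector<std::vector<int>> runs_;
};
```

With this shape, only the final merge remains once the last batch has been
received, which is the latency saving the issue asks for.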


