You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/06/27 09:21:00 UTC

[jira] [Commented] (KUDU-3455) Improve space complexity about prune hash partitions for in-list predicate

    [ https://issues.apache.org/jira/browse/KUDU-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737548#comment-17737548 ] 

ASF subversion and git services commented on KUDU-3455:
-------------------------------------------------------

Commit 5d6774b1022f26d33abd6cc9fb99507490849428 in kudu's branch refs/heads/master from duyuqi
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=5d6774b10 ]

[cpp-client] KUDU-3455 Reduce space complexity and speed up hash partition pruning for in-list predicate

This patch comes from a committed patch, its committed id is 'b69dbeb'. As that
patch said, logic of pruning hash partitions for in-list predicate in Kudu cpp
client has also a high space complexity and slow. Old algorithm must keep
all intermedium objects because they are incomplete until they are
completed and can be computed hash.

This patch fixes the problems and provides a recursive algorithm the
same as the one used by java client.

Compared with java client, the cpp client is less likely to cause the
OOM condition because it does not keep too many intermediate results.
This optimization has good benefits too. The benefits are related to
the in-list length and the number of primary keys, the performance
would be better if in-list length is longer. For example,
PartitionPrunerTest::TestMultiColumnInListHashPruningManyValues,
using 10 key columns and kMaxInListLength=50, old algorithm memory cost
may reach 600MB, while new algorithm's memory cost can be ignored
(it only need one objects and a few stacks for contexts). At the same
time, new algorithm has a good speedup, some effect as below:

combination_count: 5554006920000, old cost: 428238us, new cost: 713us, speedup: 600.6x
combination_count: 89083783664568, old cost: 2764924us, new cost: 1145us, speedup: 2414.7x
combination_count: 27194091724800, old cost: 1610475us, new cost: 1151us, speedup: 1399.2x
combination_count: 7116622216704, old cost: 34544289us, new cost: 375us, speedup: 92118.1x
combination_count: 37570734489600, old cost: 1733205us, new cost: 901us, speedup: 1923.6x

Change-Id: Ie4bea5c10b4ac2c62b85625fe9d2a33ceb4fb2e9
Reviewed-on: http://gerrit.cloudera.org:8080/19794
Reviewed-by: Yingchun Lai <la...@apache.org>
Tested-by: Yingchun Lai <la...@apache.org>
Reviewed-by: Yuqi Du <sh...@gmail.com>


> Improve space complexity about prune hash partitions for in-list predicate
> --------------------------------------------------------------------------
>
>                 Key: KUDU-3455
>                 URL: https://issues.apache.org/jira/browse/KUDU-3455
>             Project: Kudu
>          Issue Type: Task
>            Reporter: Yuqi Du
>            Assignee: Yuqi Du
>            Priority: Major
>         Attachments: image-2023-03-06-17-23-35-119.png, image-2023-03-11-16-57-16-589.png
>
>
> My partner(Chenbo Lu) has countered an oom problem when in his application which uses kudu java client. And he collects some information and do a lot of analytics for this problem, I shared his work for this issue.
> Application program was killed by OS very frequently because of oom.  When java heap memory 8GB(inner heap 5.5GB available), more than 10000 rows  in-list predicate would not work(oom happens). The kudu table in his case exists about 1500 columns.  His scan requests like '{*}select * from profile_wos where id in (...){*}'.
>  
> The problem only happened when KuduScanPredicate is In-List predicate, other predicates have no problem.
> He found the memory consumption is positive correlation to count of (ids * count of columns). In fact, I think it's also a very important key factor that the count of every in-list columns' values.
>  
> When using kudu api to build a scanner, the memory will reach a very high watermark and multi-thread will make the problem worse. A picture can explain this and prove in-list predicate consumes very high memory.
>  
> !image-2023-03-11-16-57-16-589.png!
>  
>  
>  
> Reduce space complexity about prune hash partitions for in-list predicate
>     Pruning hash partitions for in-list predicate at java-client, the logic
>     codes has a high space complexity, and it may cause java-client out
>     of memory.  And at the same time, PartialRow has many deep copy, it may be slow.
>  
> !image-2023-03-06-17-23-35-119.png!
>  
>  
> So, we need to fix the problem to improve the space complexity and speed optimization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)