Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/04/28 02:26:00 UTC

[jira] [Commented] (KUDU-3455) Improve space complexity about prune hash partitions for in-list predicate

    [ https://issues.apache.org/jira/browse/KUDU-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717428#comment-17717428 ] 

ASF subversion and git services commented on KUDU-3455:
-------------------------------------------------------

Commit b69dbeb6c64d04a32ff0e9f7d59bed1fa8165124 in kudu's branch refs/heads/master from duyuqi
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=b69dbeb6c ]

[java] KUDU-3455 Reduce space complexity of hash partition pruning for in-list predicate

The logic for pruning hash partitions for in-list predicates in the Kudu Java
client has a high space complexity, which may cause the Java client to run out
of memory. It also makes many deep copies of 'PartialRow' objects, which makes
the current algorithm slow.

This patch fixes both problems with a recursive algorithm that uses a
DFS-like approach to enumerate all combinations of values across the
in-list columns and tries to release PartialRow objects as soon as possible.
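
As a rough illustration of the DFS idea described above (not the code from
this patch), the sketch below walks every combination of in-list values while
reusing one mutable row buffer, so extra memory grows with the number of
in-list columns rather than with the number of combinations. The class name
HashPruneSketch, the methods reachableBuckets and bucketOf, and the use of
int values with Arrays.hashCode are all hypothetical stand-ins; the real
client works on PartialRow objects and Kudu's own hash of the encoded columns.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class HashPruneSketch {
  // Hypothetical stand-in for hashing the hash-partition columns of a row.
  // This is NOT Kudu's real hash function.
  static int bucketOf(int[] row, int numBuckets) {
    return Math.floorMod(Arrays.hashCode(row), numBuckets);
  }

  // Returns the set of hash buckets that at least one combination of
  // in-list values maps to; all other buckets could be pruned.
  static BitSet reachableBuckets(List<List<Integer>> inLists, int numBuckets) {
    BitSet buckets = new BitSet(numBuckets);
    int[] row = new int[inLists.size()]; // one buffer, reused for every combination
    dfs(inLists, 0, row, numBuckets, buckets);
    return buckets;
  }

  // Depth-first walk: fix a value for column 'col', then recurse into the
  // next column. No list of all combinations is ever materialised.
  private static void dfs(List<List<Integer>> inLists, int col, int[] row,
                          int numBuckets, BitSet buckets) {
    if (col == inLists.size()) {
      buckets.set(bucketOf(row, numBuckets));
      return;
    }
    for (int value : inLists.get(col)) {
      row[col] = value; // overwrite in place instead of deep-copying the row
      dfs(inLists, col + 1, row, numBuckets, buckets);
    }
  }

  public static void main(String[] args) {
    // Two in-list columns: 3 * 2 = 6 combinations, hashed into 4 buckets.
    BitSet buckets = reachableBuckets(
        List.of(List.of(1, 2, 3), List.of(10, 20)), 4);
    System.out.println("Buckets that must be scanned: " + buckets);
  }
}
```

In this sketch the recursion depth equals the number of in-list columns, and
the only per-combination work is overwriting one slot of the shared buffer,
which is the kind of saving the patch attributes to avoiding PartialRow copies.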

The new algorithm is also much faster because it avoids many expensive
copies of 'PartialRow' objects. A performance test case shows that the new
algorithm is around 100x faster than the old one in cases where the latter
does not run out of memory.

Following a reminder from Yifan Zhang, the same memory problem was found in
the cpp-client during code review. I'll study it later and fix it in another patch.

Change-Id: Icd6f213cb705e1b2a001562cc7cebe4164281723
Reviewed-on: http://gerrit.cloudera.org:8080/19568
Reviewed-by: Yuqi Du <sh...@gmail.com>
Tested-by: Kudu Jenkins
Reviewed-by: Yifan Zhang <ch...@163.com>
Reviewed-by: Yingchun Lai <la...@apache.org>


> Improve space complexity about prune hash partitions for in-list predicate
> --------------------------------------------------------------------------
>
>                 Key: KUDU-3455
>                 URL: https://issues.apache.org/jira/browse/KUDU-3455
>             Project: Kudu
>          Issue Type: Task
>            Reporter: shenxingwuying
>            Assignee: shenxingwuying
>            Priority: Major
>         Attachments: image-2023-03-06-17-23-35-119.png, image-2023-03-11-16-57-16-589.png
>
>
> My partner (Chenbo Lu) encountered an OOM problem in his application, which uses the Kudu Java client. He collected some information and did a lot of analysis of the problem; I am sharing his work in this issue.
> The application was killed by the OS very frequently because of OOM. With an 8GB Java heap (5.5GB of inner heap available), an in-list predicate with more than 10000 ids would not work (OOM happens). The Kudu table in his case has about 1500 columns. His scan requests look like '{*}select * from profile_wos where id in (...){*}'.
>  
> The problem only happens when the KuduScanPredicate is an in-list predicate; other predicates have no problem.
> He found that memory consumption is positively correlated with (count of ids * count of columns). I think the number of values in each in-list column is also a very important factor.
>  
> When using the Kudu API to build a scanner, memory reaches a very high watermark, and multiple threads make the problem worse. The picture below illustrates this and shows that the in-list predicate consumes a very large amount of memory.
>  
> !image-2023-03-11-16-57-16-589.png!
>  
>  
>  
> Reduce space complexity about prune hash partitions for in-list predicate
>     The logic for pruning hash partitions for in-list predicates in the
>     java-client has a high space complexity, and it may cause the java-client
>     to run out of memory.  It also makes many deep copies of PartialRow, which may be slow.
>  
> !image-2023-03-06-17-23-35-119.png!
>  
>  
> So we need to fix the problem, improving both space complexity and speed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)