Posted to issues@kudu.apache.org by "Xu Yao (JIRA)" <ji...@apache.org> on 2019/08/06 05:06:00 UTC
[jira] [Created] (KUDU-2917) Split a tablet into primary key ranges by number of row
Xu Yao created KUDU-2917:
----------------------------
Summary: Split a tablet into primary key ranges by number of row
Key: KUDU-2917
URL: https://issues.apache.org/jira/browse/KUDU-2917
Project: Kudu
Issue Type: Improvement
Reporter: Xu Yao
Assignee: Xu Yao
Since we implemented [KUDU-2437|https://issues.apache.org/jira/browse/KUDU-2437] and [KUDU-2670|https://issues.apache.org/jira/browse/KUDU-2670], a Spark job can read the data inside a tablet in parallel. However, in actual use we found that splitting key ranges by size can produce long-tail Spark tasks: some tasks read noticeably more data than others even though the data size of each KeyRange is basically the same.
I think this issue is caused by column-wise encoding and compression: ranges with the same size on disk can hold very different numbers of rows. So splitting the primary key range by number of rows instead might be a good choice.
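
To illustrate the suspected effect, here is a minimal sketch in Python. It is not Kudu code: the per-row sizes, the targets, and the helper names split_by_size and split_by_rows are all hypothetical, chosen only to show how ranges of equal on-disk size can hold very different row counts when some rows compress much better than others.

```python
from typing import List, Tuple


def split_by_size(row_sizes: List[int], target_bytes: int) -> List[Tuple[int, int]]:
    """Cut half-open [start, end) row ranges once each holds >= target_bytes."""
    ranges = []
    start = 0
    acc = 0
    for i, size in enumerate(row_sizes):
        acc += size
        if acc >= target_bytes:
            ranges.append((start, i + 1))
            start = i + 1
            acc = 0
    if start < len(row_sizes):
        # Leftover rows that did not reach the target form a final range.
        ranges.append((start, len(row_sizes)))
    return ranges


def split_by_rows(num_rows: int, target_rows: int) -> List[Tuple[int, int]]:
    """Cut half-open [start, end) row ranges of at most target_rows rows each."""
    return [(s, min(s + target_rows, num_rows)) for s in range(0, num_rows, target_rows)]


# Hypothetical tablet: 900 rows that compress well (1 byte each on disk)
# followed by 100 rows that compress poorly (9 bytes each on disk).
row_sizes = [1] * 900 + [9] * 100

size_ranges = split_by_size(row_sizes, target_bytes=900)
row_ranges = split_by_rows(len(row_sizes), target_rows=500)

# Rows each task would have to scan under each strategy.
size_counts = [end - start for start, end in size_ranges]
row_counts = [end - start for start, end in row_ranges]

print("rows per range, split by size:", size_counts)
print("rows per range, split by rows:", row_counts)
```

In this made-up example both size-based ranges cover 900 bytes, but one holds nine times as many rows as the other, while the row-based split gives every task the same number of rows.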
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)