Posted to issues@kudu.apache.org by "yangz (JIRA)" <ji...@apache.org> on 2019/01/24 03:44:00 UTC
[jira] [Commented] (KUDU-2437) Split a tablet into primary key ranges by size
[ https://issues.apache.org/jira/browse/KUDU-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750687#comment-16750687 ]
yangz commented on KUDU-2437:
-----------------------------
Hi [~granthenke], we have opened a new issue, KUDU-2670, to do this.
> Split a tablet into primary key ranges by size
> ----------------------------------------------
>
> Key: KUDU-2437
> URL: https://issues.apache.org/jira/browse/KUDU-2437
> Project: Kudu
> Issue Type: Improvement
> Components: client, tablet
> Reporter: Xu Yao
> Assignee: Xu Yao
> Priority: Major
> Fix For: 1.8.0
>
>
> When reading data from a Kudu table using Spark, if a tablet holds a large amount of data, reading it takes a long time. The reason is that KuduRDD generates one scan token per tablet, so a single Spark task has to process all the data in that tablet.
> We think the TabletServer should provide an RPC interface that splits a tablet into multiple primary key ranges by size. The kudu-client can then choose whether to perform a parallel scan, depending on the case.
> RPC interface:
> {code:java}
> // A split key range request. Splits a tablet into key ranges; the request
> // doesn't change the layout of the tablet.
> message SplitKeyRangeRequestPB {
>   required bytes tablet_id = 1;
>   // Encoded primary key to begin scanning at (inclusive).
>   optional bytes start_primary_key = 2 [(kudu.REDACT) = true];
>   // Encoded primary key to stop scanning at (exclusive).
>   optional bytes stop_primary_key = 3 [(kudu.REDACT) = true];
>   // Number of bytes to try to return in each chunk. This is a hint.
>   // The tablet server may return chunks larger or smaller than this value.
>   optional uint64 target_chunk_size_bytes = 4;
>   // The columns to consider when chunking.
>   // If specified, then the size estimate used for 'target_chunk_size_bytes'
>   // should only include these columns. This can be used if a query will
>   // only scan a certain subset of the columns.
>   repeated ColumnSchemaPB columns = 5;
> }
>
> // The primary key range of a Kudu tablet.
> message KeyRangePB {
>   // Encoded primary key to begin scanning at (inclusive).
>   optional bytes start_primary_key = 1 [(kudu.REDACT) = true];
>   // Encoded primary key to stop scanning at (exclusive).
>   optional bytes stop_primary_key = 2 [(kudu.REDACT) = true];
>   // Number of bytes in chunk.
>   required uint64 size_bytes_estimates = 3;
> }
>
> message SplitKeyRangeResponsePB {
>   // The error, if an error occurred with this request.
>   optional TabletServerErrorPB error = 1;
>   repeated KeyRangePB ranges = 2;
> }
> {code}
>
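To illustrate the kind of size-based splitting the quoted proposal describes, here is a minimal Python sketch. It is not Kudu's implementation. It assumes the server already has a sorted list of (first_key, size_bytes) pairs for the storage blocks that fall inside the requested range; that input, the function name split_key_range, and the KeyRange class are all hypothetical. Only the field names are taken from the messages above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyRange:
    """Illustrative stand-in for KeyRangePB (not generated protobuf code)."""
    start_primary_key: Optional[bytes]  # inclusive; None means unbounded
    stop_primary_key: Optional[bytes]   # exclusive; None means unbounded
    size_bytes_estimates: int


def split_key_range(
    blocks: List[Tuple[bytes, int]],
    target_chunk_size_bytes: int,
    start_primary_key: Optional[bytes] = None,
    stop_primary_key: Optional[bytes] = None,
) -> List[KeyRange]:
    """Greedily group sorted blocks into key ranges of roughly the target size.

    `blocks` is a hypothetical input: (first_key, size_bytes) pairs sorted by
    key and already restricted to [start_primary_key, stop_primary_key).
    As in the proposal, the target size is only a hint: chunks may end up
    larger or smaller than target_chunk_size_bytes.
    """
    ranges: List[KeyRange] = []
    chunk_start = start_primary_key
    chunk_size = 0
    for first_key, size_bytes in blocks:
        # Close the current chunk at this block boundary once it has
        # reached the target size.
        if chunk_size > 0 and chunk_size >= target_chunk_size_bytes:
            ranges.append(KeyRange(chunk_start, first_key, chunk_size))
            chunk_start = first_key
            chunk_size = 0
        chunk_size += size_bytes
    # Emit whatever is left as the final chunk, ending at the requested stop key.
    if chunk_size > 0:
        ranges.append(KeyRange(chunk_start, stop_primary_key, chunk_size))
    return ranges
```

For example, with blocks starting at keys a, c, f, k and p of sizes 40, 40, 40, 40 and 10 bytes and a 64-byte target, this returns three ranges: [unbounded, f) at 80 bytes, [f, p) at 80 bytes, and [p, unbounded) at 10 bytes. A client could then hand each range to a separate Spark task instead of scanning the whole tablet in one task.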