Posted to issues@kudu.apache.org by "yangz (JIRA)" <ji...@apache.org> on 2019/02/01 04:01:00 UTC

[jira] [Commented] (KUDU-2670) Splitting more tasks for spark job, and add more concurrent for scan operation

    [ https://issues.apache.org/jira/browse/KUDU-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757942#comment-16757942 ] 

yangz commented on KUDU-2670:
-----------------------------

[~granthenke], I have rebased onto master now: https://gerrit.cloudera.org/#/c/12323/

> Splitting more tasks for spark job, and add more concurrent for scan operation
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-2670
>                 URL: https://issues.apache.org/jira/browse/KUDU-2670
>             Project: Kudu
>          Issue Type: Improvement
>          Components: java, spark
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>              Labels: backup, performance
>
> Refer to KUDU-2437 (Split a tablet into primary key ranges by size).
> We need a Java client implementation that supports splitting a tablet scan operation.
> We suggest two new implementations for the Java client:
>  # A ConcurrentKuduScanner that lets several scanners read data at the same time. This is useful in one case in particular: we scan for only one row, but the predicate doesn't contain the primary key. In this case we send a lot of scanner requests but only one row is returned, and sending that many scanner requests one by one is slow, so we need a concurrent way. In our tests of this case on a 10G tablet, it saves a lot of time on one machine.
>  # A way to split a Spark job into more tasks. To do so, we need to get scanner tokens in two steps: first we ask the tserver for the key ranges, then we use those ranges to get more scanner tokens. In our usage a tablet is 10G, but we split it so that each task processes only 1G of data, which gives us better performance.
> All of these features have run well for us for half a year. We hope they will be useful for the community.
>  
>  
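The two ideas in the description (cut one large tablet into smaller ranges, then scan those ranges concurrently) can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it does not use the Kudu client API, and the names SplitScanSketch, Chunk, split and scanConcurrently are hypothetical stand-ins, not classes from the patch. A real implementation would ask the tserver for primary key ranges instead of computing byte offsets locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: plain-Java stand-ins, not the Kudu client API.
final class SplitScanSketch {

    /** Stand-in for one range inside a tablet: [start, end) in bytes. */
    static final class Chunk {
        final long start;
        final long end;

        Chunk(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /** Step 1: cut a tablet of tabletBytes into chunks of at most chunkBytes each. */
    static List<Chunk> split(long tabletBytes, long chunkBytes) {
        if (tabletBytes < 0 || chunkBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long start = 0; start < tabletBytes; start += chunkBytes) {
            chunks.add(new Chunk(start, Math.min(start + chunkBytes, tabletBytes)));
        }
        return chunks;
    }

    /**
     * Step 2: run scanChunk on every chunk using a fixed thread pool and
     * return the per-chunk results in the same order as the input chunks.
     */
    static <T> List<T> scanConcurrently(List<Chunk> chunks,
                                        Function<Chunk, T> scanChunk,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Chunk chunk : chunks) {
                Callable<T> task = () -> scanChunk.apply(chunk);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that chunk is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while scanning", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("chunk scan failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With the sizes mentioned in the description, split(10 GiB, 1 GiB) yields ten chunks, and each chunk could back one Spark task or one concurrent scanner. The scanChunk function is where an actual scanner call would go.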



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)