Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/08/13 05:07:22 UTC

[jira] [Commented] (KUDU-1454) Spark and MR jobs running without scan locality

    [ https://issues.apache.org/jira/browse/KUDU-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419810#comment-15419810 ] 

Todd Lipcon commented on KUDU-1454:
-----------------------------------

I tried to address this tonight with https://gist.github.com/5c572b1890d73aa5a40fc2fc0d5e7c16
but ran into the issue that ScanTokenPBs don't "remember" which replica was the leader at the time they were constructed. Since we use the scan token builder in Spark, there's no way to tell Spark which node to try to schedule on.

So I think we need either a way to preserve that information (and only report the leader as 'local'), or support in the Java client for scanning at the closest replica. [~danburkert] any thoughts?
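
To make the first option concrete, here is a rough sketch of "only report the leader as local". It is not Kudu client code: Replica, Role, and preferredLocations are hypothetical names made up for illustration, and falling back to all replicas when no leader is known is my own assumption rather than anything decided on this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; these are not Kudu client classes.
class LeaderOnlyLocality {
    enum Role { LEADER, FOLLOWER }

    static final class Replica {
        final String host;
        final Role role;

        Replica(String host, Role role) {
            this.host = host;
            this.role = role;
        }
    }

    // Returns the hosts to advertise as split locations for one tablet.
    // Only the leader is advertised, because the client scans the leader.
    static List<String> preferredLocations(List<Replica> replicas) {
        List<String> leaders = new ArrayList<>();
        List<String> all = new ArrayList<>();
        for (Replica r : replicas) {
            all.add(r.host);
            if (r.role == Role.LEADER) {
                leaders.add(r.host);
            }
        }
        // Assumption: if no leader is known, advertise every replica
        // rather than giving the scheduler no hint at all.
        return leaders.isEmpty() ? all : leaders;
    }
}
```

With something like this, a task would be scheduled next to the replica that is actually scanned, as long as leadership has not moved between building the split and running the task.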

> Spark and MR jobs running without scan locality
> -----------------------------------------------
>
>                 Key: KUDU-1454
>                 URL: https://issues.apache.org/jira/browse/KUDU-1454
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, perf, spark
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Zain Maqsood
>            Priority: Critical
>
> Spark (and, according to [~danburkert], MR also now) adds all of the locations of a tablet as split locations. This makes sense except that the Java client currently always scans the leader replica. So in many cases we schedule a task which is "local" to a follower, and then it ends up having to do a remote scan.
> This makes Spark queries take about twice as long on tables with replicas compared to unreplicated tables, and I think is a regression on the MR side.


