Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/06/11 18:54:14 UTC

[jira] Commented: (CASSANDRA-981) Dynamic endpoint snitch

    [ https://issues.apache.org/jira/browse/CASSANDRA-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877855#action_12877855 ] 

Jonathan Ellis commented on CASSANDRA-981:
------------------------------------------

(1)

combine

+        windows.putIfAbsent(host, new AdaptiveLatencyTracker(WINDOW_SIZE));
+        AdaptiveLatencyTracker tracker = windows.get(host);

to

+        AdaptiveLatencyTracker tracker = windows.putIfAbsent(host, new AdaptiveLatencyTracker(WINDOW_SIZE));

Even better: get first, and only call putIfAbsent if the get returns null; that avoids creating a new ALT object on every call.  (Keep in mind putIfAbsent returns the previous mapping, or null if there wasn't one, so you still need to check its return value.)
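
something like this (just a sketch; how the sample actually gets recorded afterwards is whatever the tracker already does):

    AdaptiveLatencyTracker tracker = windows.get(host);
    if (tracker == null)
    {
        AdaptiveLatencyTracker created = new AdaptiveLatencyTracker(WINDOW_SIZE);
        // putIfAbsent returns the previous mapping, or null if our new tracker won the race
        tracker = windows.putIfAbsent(host, created);
        if (tracker == null)
            tracker = created;
    }
    // then record the sample on tracker as before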

(2)

receiveTiming isn't threadsafe.  use AtomicInteger?
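
a sketch, assuming the non-threadsafe part is just a plain int being bumped on each call (the field name here is made up):

    // was something like: private int updates;
    private final AtomicInteger updates = new AtomicInteger(); // java.util.concurrent.atomic

    // inside receiveTiming:
    updates.incrementAndGet(); // atomic read-modify-write, so concurrent callers can't lose increments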

(3)

AdaptiveLatencyTracker doesn't look threadsafe either.  definitely LBD isn't.  I think just using a threadsafe queue like CLQ would work?  (My fault for naming ASD a Deque, when it only really needs a Queue)
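
roughly what I have in mind (a sketch only, not the patch's class; trimming via size() is approximate since CLQ's size() is O(n) and racy, but that's fine for stats):

    import java.util.concurrent.ConcurrentLinkedQueue;

    class LatencyWindow
    {
        private final ConcurrentLinkedQueue<Double> samples = new ConcurrentLinkedQueue<Double>();
        private final int maxSize;

        LatencyWindow(int maxSize)
        {
            this.maxSize = maxSize;
        }

        void add(double latency)
        {
            samples.offer(latency);
            while (samples.size() > maxSize) // evict oldest samples to keep the window bounded
                samples.poll();
        }

        double mean()
        {
            double sum = 0;
            int count = 0;
            for (double sample : samples) // CLQ's iterator is weakly consistent, good enough for stats
            {
                sum += sample;
                count++;
            }
            return count == 0 ? 0 : sum / count;
        }
    }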

(4) 

do we need unregister()?  if not, let's drop that API

(5)

sortByProximity needs to return zero if both scores are null.  even better, take a non-dynamic snitch and use the static topology when there is no score info yet (this would save us from sending data requests to another data center after every clear of the stats).  So, rather than using DES directly in the config, maybe having a boolean for whether to wrap your regular snitch with the dynamic one is the way to go.
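
the comparison half of that might look something like this (just a sketch; I'm representing the wrapped snitch's ordering as a plain Comparator rather than pinning down the interface):

    // prefer dynamic scores, but fall back to the static snitch when we have no data
    int compareByScore(InetAddress a1, InetAddress a2,
                       Map<InetAddress, Double> scores,
                       Comparator<InetAddress> staticOrder)
    {
        Double s1 = scores.get(a1);
        Double s2 = scores.get(a2);
        if (s1 == null || s2 == null)
            return staticOrder.compare(a1, a2); // no score info yet: trust the static topology
        return s1.compareTo(s2); // lower score == closer
    }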

what if reset cleared scores instead of latencies?  this would result in a more gradual aging out of both slow and fast latencies as new ones were pushed in, which would make it more tolerant of brief hiccups where a mostly fast node had a couple of slow responses.  Feels more like how phi was meant to work, to me.

(6)

+        if (address != FBUtilities.getLocalAddress()) // we only know about ourself
+            return addresses;

let's change this to an assert
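
i.e. something like (using .equals rather than reference comparison):

    assert address.equals(FBUtilities.getLocalAddress()) : address; // we only know about ourself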

(7)

deque.offer is more idiomatic than try/catch in Java
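
i.e., assuming the window is a bounded deque where add() throws IllegalStateException when full but offer() just returns false:

    // instead of wrapping deque.add(latency) in try/catch
    while (!deque.offer(latency))
        deque.poll(); // window is full: evict the oldest sample and retry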

(8)

let's use a single timer for both update and reset
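
e.g. (sketch; the task bodies and intervals are placeholders for whatever the patch does now):

    Timer timer = new Timer("DynamicEndpointSnitch", true); // one daemon timer thread for both tasks

    timer.schedule(new TimerTask()
    {
        public void run()
        {
            updateScores();
        }
    }, UPDATE_INTERVAL_MS, UPDATE_INTERVAL_MS);

    timer.schedule(new TimerTask()
    {
        public void run()
        {
            reset();
        }
    }, RESET_INTERVAL_MS, RESET_INTERVAL_MS);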

(9)

it's not completely clear to me that the phi code, which was designed to answer "how long is too long to wait for updates that are supposed to arrive at a regular interval," applies well to latency information that arrives in bursts up to our max per interval.  can you add some tests showing that it does the right thing, given several mixes of latencies?
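
something in this spirit would do (rough sketch; add()/score() are placeholders for whatever the tracker actually exposes):

    @Test
    public void testSteadyFastHostScoresBetterThanSteadySlowHost()
    {
        AdaptiveLatencyTracker fast = new AdaptiveLatencyTracker(100);
        AdaptiveLatencyTracker slow = new AdaptiveLatencyTracker(100);
        for (int i = 0; i < 100; i++)
        {
            fast.add(1.0);  // ~1ms responses
            slow.add(50.0); // ~50ms responses
        }
        assert fast.score() < slow.score();
    }

    @Test
    public void testBriefHiccupDoesNotDominate()
    {
        AdaptiveLatencyTracker mostlyFast = new AdaptiveLatencyTracker(100);
        for (int i = 0; i < 95; i++)
            mostlyFast.add(1.0);
        for (int i = 0; i < 5; i++)
            mostlyFast.add(100.0); // a short burst of slow responses

        AdaptiveLatencyTracker steadySlow = new AdaptiveLatencyTracker(100);
        for (int i = 0; i < 100; i++)
            steadySlow.add(20.0);

        assert mostlyFast.score() < steadySlow.score(); // the hiccup shouldn't outweigh steady behavior
    }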

> Dynamic endpoint snitch
> -----------------------
>
>                 Key: CASSANDRA-981
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-981
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Brandon Williams
>             Fix For: 0.7
>
>         Attachments: 981.txt
>
>
> An endpoint snitch that automatically and dynamically infers "distance" to other machines without having to explicitly configure rack and datacenter positions solves two problems:
> The killer feature here is adapting to things like compaction or a failing-but-not-yet-dead disk.  This is important, since when we are doing reads we pick the "closest" replica for actually reading data from (and only read md5s from other replicas).  This means that if the closest replica by network topology is temporarily slow due to compaction (for instance), we'll have to block for its reply even if we get the other replies much much faster.
> Not having to manually re-sync your configuration with your network topology when changes (adding machines) are made is a nice bonus.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.