Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2014/02/24 15:47:19 UTC

[jira] [Commented] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

    [ https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910360#comment-13910360 ] 

Jonathan Ellis commented on CASSANDRA-6747:
-------------------------------------------

MS is primarily designed around the needs of mutations and reads, where it's probably not worth distinguishing between failure and timeout since (a) they should both be rare and (b) when the replica does fail completely it turns into a timeout anyway.

But for repair specifically, where Prepare can take arbitrarily long (so it's difficult to just pick a timeout and assume "if we haven't heard back, it must have failed"), I agree we should make a bigger effort to notify peers of failures.

> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Priority: Minor
>              Labels: Core
>
> While going through the MessagingService code, I discovered that we don't handle callbacks on failure very well. If a verb handler on the remote machine throws an exception, it goes straight to the uncaught exception handler. The machine that sent the message keeps waiting and eventually times out. On timeout, it does some things that are hard-coded in MS, such as writing hints and adding to latency. There is no way in IAsyncCallback to specify what to do on timeouts or on failures.
> Here are some examples I found where it would help to enhance this system to also propagate failures back, so that IAsyncCallback would have methods like onFailure.
> 1) From ActiveRepairService.prepareForRepair
>    IAsyncCallback callback = new IAsyncCallback()
>        {
>            @Override
>            public void response(MessageIn msg)
>            {
>                prepareLatch.countDown();
>            }
>            @Override
>            public boolean isLatencyForSnitch()
>            {
>                return false;
>            }
>        };
>        List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
>        for (ColumnFamilyStore cfs : columnFamilyStores)
>            cfIds.add(cfs.metadata.cfId);
>        for(InetAddress neighbour : endpoints)
>        {
>            PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>            MessageOut<RepairMessage> msg = message.createMessage();
>            MessagingService.instance().sendRR(msg, neighbour, callback);
>        }
>        try
>        {
>            prepareLatch.await(1, TimeUnit.HOURS);
>        }
>        catch (InterruptedException e)
>        {
>            parentRepairSessions.remove(parentRepairSession);
>            throw new RuntimeException("Did not get replies from all endpoints.", e);
>        }
> 2) During snapshot phase in repair, if SnapshotVerbHandler throws an exception, we will wait forever. 
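
For illustration only, here is a minimal sketch of what a failure-aware callback could look like for the prepare case above. It is not taken from the Cassandra code base: FailureAwareCallback, PrepareTracker, Outcome and the onFailure(String) signature are all hypothetical names, and peers are represented as plain strings rather than InetAddress. The idea is that a failure reported by a peer releases the waiting thread immediately, so the coordinator can tell "failed" apart from "timed out" instead of sitting on the latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class PrepareFailureSketch
{
    /** Possible results of waiting for all peers (hypothetical). */
    enum Outcome { SUCCESS, FAILED, TIMED_OUT, INTERRUPTED }

    /** Hypothetical callback shape; NOT the real IAsyncCallback. */
    interface FailureAwareCallback<T>
    {
        void response(T msg);
        void onFailure(String from);
    }

    /** Counts replies and wakes the waiter early when any peer reports a failure. */
    static class PrepareTracker implements FailureAwareCallback<String>
    {
        private final CountDownLatch latch;
        private final AtomicBoolean failed = new AtomicBoolean(false);

        PrepareTracker(int expectedReplies)
        {
            latch = new CountDownLatch(expectedReplies);
        }

        @Override
        public void response(String msg)
        {
            latch.countDown();
        }

        @Override
        public void onFailure(String from)
        {
            failed.set(true);
            // Drain the latch so a thread blocked in await() returns right away.
            while (latch.getCount() > 0)
                latch.countDown();
        }

        /** Waits for all replies, a reported failure, or the timeout, whichever comes first. */
        Outcome await(long timeout, TimeUnit unit)
        {
            try
            {
                boolean released = latch.await(timeout, unit);
                if (failed.get())
                    return Outcome.FAILED;
                return released ? Outcome.SUCCESS : Outcome.TIMED_OUT;
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return Outcome.INTERRUPTED;
            }
        }
    }
}
```

With something along these lines, the caller would check the returned Outcome rather than ignoring the result of the latch wait, and could remove the parent repair session on FAILED or TIMED_OUT as well as on interruption. How the failure would actually travel back over the wire is left open here.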


