Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2017/05/30 19:51:04 UTC

[jira] [Commented] (SOLR-10745) Reliably create nodeAdded / nodeLost events

    [ https://issues.apache.org/jira/browse/SOLR-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030023#comment-16030023 ] 

Shalin Shekhar Mangar commented on SOLR-10745:
----------------------------------------------

Thanks Andrzej. I made a pass through the code on the jira/SOLR-10745 branch.

A few comments:
# Should we write nodeLost / nodeAdded events even when there are no corresponding (active) triggers? It seems wasteful, and worse, the data will keep growing with no one to delete it.
# I agree with your choice of persistent znodes for nodeLost events, and of ephemeral znodes for nodeAdded: if the node goes away, the znode does too, and we obviously never want to fire a nodeAdded trigger if the node itself is no more. I can't think of any cons to using ephemeral here except that it is inconsistent with how we handle nodeLost events.
# While processing these events, i.e. before adding them to the tracking map, we must check the actual state of the node at that time, e.g. if a node came back, we don't want to add it to the NodeLostTrigger's tracking map (a sketch follows this list).
# Perhaps add some error handling which ensures that we mark the node as live even if the multi op fails? I don't think it can fail, but I want to be sure that we fail to start Solr if we cannot create the live node.
# TriggerIntegrationTest can use SolrZkClient.clean(), which does the same thing as deleteChildrenRecursively.
# nodeNameVsTimeAdded is now a ConcurrentHashMap but it is never accessed concurrently?
# I'd prefer that retrieving the marker paths be done once during startup in ScheduledTrigger.run(); doing it on every trigger run is redundant (see the second sketch below).
# Minor nit: in testNodesEventRegistration, the code comment says "we want both triggers to fire" but the latch is initialized to 3.
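
A minimal sketch of the check from point 3 above, using the raw ZooKeeper API (the method shape and map name here are illustrative, not the actual NodeLostTrigger code):

{code:java}
import java.util.Map;
import java.util.Set;
import org.apache.zookeeper.ZooKeeper;

// Before tracking a nodeLost marker, re-check the live nodes: a node
// that already came back must not enter the tracking map at all.
class NodeLostMarkerCheck {
  void trackLostNodes(ZooKeeper zk, Set<String> liveNodes,
                      Map<String, Long> nodeNameVsTimeRemoved) throws Exception {
    for (String nodeName : zk.getChildren("/autoscaling/nodeLost", false)) {
      if (liveNodes.contains(nodeName)) {
        // the node returned in the meantime: drop the stale marker instead
        zk.delete("/autoscaling/nodeLost/" + nodeName, -1);
        continue;
      }
      nodeNameVsTimeRemoved.putIfAbsent(nodeName, System.nanoTime());
    }
  }
}
{code}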

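And for point 7, a sketch of reading the markers only once per trigger lifetime (class and method names are hypothetical; the real logic would live in ScheduledTrigger.run()):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

// Read marker znodes once, on the first run after the trigger starts,
// instead of re-listing them on every scheduled execution.
abstract class MarkerAwareTrigger implements Runnable {
  private final AtomicBoolean markersRead = new AtomicBoolean(false);

  @Override
  public void run() {
    if (markersRead.compareAndSet(false, true)) {
      readMarkerPaths(); // one-time scan of /autoscaling/nodeAdded and nodeLost
    }
    doRun(); // normal per-run work: diff live nodes, fire events, etc.
  }

  abstract void readMarkerPaths(); // hypothetical hook
  abstract void doRun();           // hypothetical hook
}
{code}
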
> Reliably create nodeAdded / nodeLost events
> -------------------------------------------
>
>                 Key: SOLR-10745
>                 URL: https://issues.apache.org/jira/browse/SOLR-10745
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>              Labels: autoscaling
>             Fix For: master (7.0)
>
>
> When the Overseer node goes down, then depending on the current phase of trigger execution, a {{nodeLost}} event may not have been generated. Similarly, when a new node is added and the Overseer goes down before the trigger saves a checkpoint (and before it produces a {{nodeAdded}} event), this event may never be generated.
> The proposed solution is to modify how nodeLost / nodeAdded information is recorded in the cluster:
> * new nodes should do a ZK multi-write both to {{/live_nodes}} and to a predefined location, e.g. {{/autoscaling/nodeAdded/<nodeName>}}. On the first execution of Trigger.run in the new Overseer leader, it would check this location for new znodes, each of which indicates that a node has been added; the trigger would then generate a new event and remove the znode that corresponds to it.
> * node lost events should also be recorded at a predefined location, e.g. {{/autoscaling/nodeLost/<nodeName>}}. Writing this znode would be attempted simultaneously by a few randomly selected nodes to make sure at least one of them succeeds. On the first run of the new trigger instance (in the new Overseer leader), event generation would follow the sequence described above.
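
A minimal sketch of the proposed multi-write (raw ZooKeeper API; Solr itself would go through SolrZkClient, and the ephemeral mode for the marker follows comment 2 above):

{code:java}
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Create the live node and the nodeAdded marker in one atomic ZK
// transaction: either both znodes exist afterwards, or neither does.
class LiveNodeRegistration {
  void registerLiveNode(ZooKeeper zk, String nodeName) throws Exception {
    zk.multi(Arrays.asList(
        // ephemeral, as /live_nodes entries are: gone when the session ends
        Op.create("/live_nodes/" + nodeName, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL),
        // also ephemeral, per comment 2: no nodeAdded event is wanted
        // for a node that has already gone away again
        Op.create("/autoscaling/nodeAdded/" + nodeName, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL)));
  }
}
{code}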



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org