Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2014/01/10 06:35:52 UTC

[jira] [Updated] (SOLR-5474) Have a new mode for SolrJ to not watch any ZKNode

     [ https://issues.apache.org/jira/browse/SOLR-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated SOLR-5474:
---------------------------------

    Attachment: SOLR-5474.patch

Here's a new patch that takes a slightly different approach than the previous one. The previous patch did the following:
 
1 - Client application creates a request targeted to external collection "foo"
 
2 - CloudSolrServer (on the client side) doesn't know about "foo", so it fetches a one-time snapshot of foo's state from ZK using lazy loading. It caches that state and keeps track of the state version, e.g. 1
 
3 - CloudSolrServer sends the request to one of the nodes servicing “foo” based on the state information it retrieved from ZK. If the request is an update, it goes to the leader; if it is a query, it goes to one of the replicas via LBSolrServer. Every request carries the _stateVer_ parameter, e.g. _stateVer_=foo:1

4 - The server side compares the _stateVer_ it receives in the request with its own _stateVer_ for foo and generates an INVALID_STATE error if they don't match. Each replica on the server side does have a watcher for foo’s state.
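The server-side check in step 4 can be sketched roughly as follows. This is a minimal simulation of the idea, not the actual Solr code; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the server-side _stateVer_ comparison described
// above: parse "collection:version" from the request and compare it with
// the version this node knows via its watcher.
public class StateVerCheck {
    // Current state version per collection, as seen by this node's watchers.
    private final Map<String, Integer> serverVersions = new ConcurrentHashMap<>();

    // Called when the watcher fires with a new state version.
    public void updateVersion(String collection, int version) {
        serverVersions.put(collection, version);
    }

    // Parses a _stateVer_ value like "foo:1"; a mismatch with the server's
    // version is what would produce the INVALID_STATE error.
    public boolean isStateCurrent(String stateVerParam) {
        String[] parts = stateVerParam.split(":");
        String collection = parts[0];
        int clientVersion = Integer.parseInt(parts[1]);
        Integer serverVersion = serverVersions.get(collection);
        return serverVersion != null && serverVersion == clientVersion;
    }
}
```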
 
There are some subtle issues with this:
 
1) If a new replica is added (or recovers) in "foo", then the state of "foo" on the server changes and the request fails with an INVALID_STATE error, even though it probably shouldn't fail; however, that is currently the only way to tell the client its state is stale.
 
There is retry logic in the client, and the retry may work, but it might not: nothing prevents the state from changing again between the client receiving the INVALID_STATE response, re-fetching state from ZK, and re-issuing the request. Also, failing a request when a positive state change occurs (e.g. adding a replica) just to invalidate the cache seems problematic to me. In other words, the state of a collection has changed, but in a positive way that shouldn't lead to a request failing. Of course, with the correct number of retries the request will likely work in the end, but one can envision a number of network round-trips between the client and server just to respond to potentially benign state changes.
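The retry behavior described above could be sketched like this. This is an illustrative simulation, not the actual SolrJ code; all names here (StateSource, Sender, etc.) are invented for the sketch:

```java
// Illustrative sketch of the client retry loop: on INVALID_STATE, re-fetch
// the collection's state version from ZK and retry. Note that nothing stops
// the state from changing again between the re-fetch and the retry, so a
// bounded number of attempts is still no guarantee of success.
public class RetryingClient {
    // Thrown by the (simulated) server when the client's state is stale.
    static class InvalidStateException extends RuntimeException {}

    interface StateSource { int fetchVersion(String collection); }
    interface Sender { String send(String collection, int stateVer); }

    private final StateSource zk;
    private final Sender sender;
    private int cachedVersion;

    RetryingClient(StateSource zk, Sender sender, String collection) {
        this.zk = zk;
        this.sender = sender;
        this.cachedVersion = zk.fetchVersion(collection); // initial lazy load
    }

    String request(String collection, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return sender.send(collection, cachedVersion);
            } catch (InvalidStateException e) {
                if (attempt >= maxRetries) throw e; // give up
                cachedVersion = zk.fetchVersion(collection); // refresh, retry
            }
        }
    }
}
```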
 
2) Since the client-side is not "watching" any znodes, it runs the risk of trying to send a request to a node that is no longer live. Currently, the CloudSolrServer consults /live_nodes to make sure a node is "live" before it attempts to send a request to it. Without watchers, the client side has no way of knowing a node isn't "live" until an error occurs. So now it has to wait for some time for the request to timeout and then refresh /live_nodes from ZooKeeper.
 
3) Aliases – what happens if a collection is added to an alias? Without watchers, the client won’t know the alias changed. I’m sure we could implement a similar _stateVer_ solution for aliases but that seems less elegant than just watching the znode.
 
4) Queries that span multiple collections – I think problems #1 and #2 above just get worse when a query spans multiple collections.
 
So based on my discussions with Noble, the new patch takes the following approach:

1) No more LazyCloudSolrServer; just adding support for external collections in CloudSolrServer

2) Still watch shared znodes, such as /aliases and /live_nodes

3) State for external collections loaded on demand and cached

As it stands now, the CloudSolrServer does not watch external collections when running on the client side, the idea being that there may be too many external collections to set up watchers for. Thus, state is requested on demand and cached. This opens the door for the cached state to go stale, leading to an INVALID_STATE error.
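The load-on-demand caching could be sketched as below. This is a simplified stand-in for what the patch does, assuming a bounded LRU cache and a loader that fetches state from ZK; the real CloudSolrServer/ZkStateReader code is more involved:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of on-demand, bounded caching of external collection state.
// The loader stands in for the one-time ZK fetch of a collection's state.
public class CollectionStateCache {
    private final Function<String, String> loader; // fetches state from ZK
    private final LinkedHashMap<String, String> cache;

    public CollectionStateCache(final int maxEntries, Function<String, String> loader) {
        this.loader = loader;
        // access-order map: evicts the least recently used entry when full
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Loads the state on first access, serves it from the cache afterwards.
    public synchronized String getState(String collection) {
        return cache.computeIfAbsent(collection, loader);
    }

    // Called when the server answers INVALID_STATE: drop the stale entry
    // so the next request re-fetches fresh state from ZK.
    public synchronized void invalidate(String collection) {
        cache.remove(collection);
    }
}
```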

However, this requires a new public method on ZkStateReader (currently named refreshAllCollectionsOnClusterStateChange), which refreshes the internal allCollections set containing the names of both internal collections (those in /clusterstate.json) and external ones. While this approach works, it amounts to an external object telling an internal object to fix itself, which is somewhat anti-OO. One improvement would be to dynamically update allCollections when a new external collection is discovered. Please advise.
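The "dynamic update" alternative mentioned above might take a shape like this. A purely hypothetical sketch (the class and method names are invented, not ZkStateReader's actual API): the reader adds to its own set the moment it first loads an external collection's state, so no external caller has to tell it to refresh:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the state reader maintains allCollections itself,
// updating it as external collections are discovered, instead of exposing
// a public refresh method for callers to invoke.
public class AllCollectionsTracker {
    private final Set<String> allCollections = ConcurrentHashMap.newKeySet();

    // Called when /clusterstate.json is (re)read.
    public void onInternalStateLoaded(Set<String> clusterStateCollections) {
        allCollections.addAll(clusterStateCollections);
    }

    // Called from the lazy-load path when an external collection's
    // state is fetched for the first time.
    public void onExternalCollectionDiscovered(String name) {
        allCollections.add(name);
    }

    public boolean knows(String name) {
        return allCollections.contains(name);
    }
}
```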

External collections are watched on the server side only; those watchers are set up by the ZkController. Client-side uses of CloudSolrServer will therefore not have watchers set up for external collections.

The remaining issue with this patch is how to handle requests that span multiple external collections, as the _stateVer_ parameter only supports a single collection at this time. A simple comma-delimited list of collection:ver pairs could be passed, and the server could check each one. However, the test case for multiple collections is not passing and is currently commented out. The next patch will address that issue.
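The comma-delimited scheme suggested above might be parsed on the server side along these lines. The format ("coll1:3,coll2:7") is the proposal from this comment, not a final implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of parsing a multi-collection _stateVer_ value such as
// "coll1:3,coll2:7" into collection -> version pairs, per the proposal.
public class MultiStateVer {
    public static Map<String, Integer> parse(String stateVer) {
        Map<String, Integer> versions = new LinkedHashMap<>();
        for (String pair : stateVer.split(",")) {
            int idx = pair.lastIndexOf(':');
            if (idx <= 0) throw new IllegalArgumentException("bad pair: " + pair);
            versions.put(pair.substring(0, idx),
                         Integer.parseInt(pair.substring(idx + 1)));
        }
        return versions;
    }
}
```

The server could then compare each entry against its own version for that collection and fail fast on the first mismatch.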


> Have a new mode for SolrJ to not watch any ZKNode
> -------------------------------------------------
>
>                 Key: SOLR-5474
>                 URL: https://issues.apache.org/jira/browse/SOLR-5474
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud
>            Reporter: Noble Paul
>         Attachments: SOLR-5474.patch, SOLR-5474.patch
>
>
> In this mode SolrJ would not watch any ZK node.
> It fetches the state on demand and caches the most recently used n collections in memory.
> SolrJ would not listen to any ZK node. When a request comes in for a collection ‘xcoll’,
> it would first check if such a collection exists.
> If yes, it first looks up the details in the local cache for that collection.
> If not found in the cache, it fetches the node /collections/xcoll/state.json and caches the information.
> Any query/update will be sent with an extra query param specifying the collection name, shard name, Role (Leader/Replica), and range (example \_target_=xcoll:shard1:L:80000000-b332ffff). A node would throw an error (INVALID_NODE) if it does not serve the collection/shard/Role/range combo.
> If SolrJ gets an INVALID_NODE error, it would invalidate the cache and fetch fresh state information for that collection (and cache it again).
> If there is a connection timeout, SolrJ assumes the node is down, re-fetches the state for the collection, and tries again.
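The _target_ parameter sketched in the quoted description could be parsed along these lines. This is my reading of the example value (collection:shard:role:range), not confirmed code:

```java
// Parses a _target_ value like "xcoll:shard1:L:80000000-b332ffff" into its
// collection / shard / role / hash-range parts, per the example above.
public class TargetParam {
    final String collection, shard, role, range;

    TargetParam(String raw) {
        String[] parts = raw.split(":", 4);
        if (parts.length != 4) throw new IllegalArgumentException("bad _target_: " + raw);
        collection = parts[0];
        shard = parts[1];
        role = parts[2];   // "L" for leader, per the example
        range = parts[3];  // hex hash range, e.g. "80000000-b332ffff"
    }
}
```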



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
