Posted to commits@lucene.apache.org by "Cassandra Targett (Confluence)" <co...@apache.org> on 2013/09/11 22:51:00 UTC

[CONF] Apache Solr Reference Guide > Shards and Indexing Data in SolrCloud

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Shards and Indexing Data in SolrCloud (https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud)


Edited by Cassandra Targett:
---------------------------------------------------------------------
When your data is too large for one node, you can break it up and store it in sections by creating one or more *shards*. Each shard is a portion of the logical index, or core, and the set of all nodes containing that section of the index.

A shard is a way of splitting a core across a number of "servers", or nodes. For example, you might have one shard for the data of each state, or for different categories that are likely to be searched independently but are often combined.

Before SolrCloud, Solr supported Distributed Search, which allowed a single query to be executed across multiple shards; the query was run against the entire Solr index, so no documents were missed from the search results. Splitting a core across shards is therefore not exclusively a SolrCloud concept. There were, however, several problems with the distributed approach that SolrCloud needed to improve:

# Splitting a core into shards was a somewhat manual process.
# There was no support for distributed indexing: you needed to explicitly send documents to a specific shard, because Solr couldn't figure out on its own which shard to send a document to.
# There was no load balancing or failover: if you received a high volume of queries, you needed to figure out where to send them, and if one shard died, its data was simply gone.

SolrCloud fixes all those problems. It supports distributing both the indexing process and queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can have multiple replicas for additional robustness.

Unlike Solr 3.x, SolrCloud has no masters or slaves. Instead, there are leaders and replicas. Leaders are elected automatically, initially on a first-come-first-served basis, and then based on the ZooKeeper procedure described at [http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection].

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.

When a document is sent to a machine for indexing, the system first determines whether the machine is a replica or a leader.
* If the machine is a replica, the document is forwarded to the leader for processing.
* If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for that shard, and forwards the index notation to itself and any replicas.
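In practice this means you can send an update to any node in the cluster and SolrCloud will route it for you. A minimal sketch (the host {{localhost:8983}} and collection name {{collection1}} are example values, not requirements):

{code}
# Send a document to any SolrCloud node; the receiving node forwards it
# to the correct shard leader automatically.
curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"12345","name":"example document"}]'
{code}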

h2. Document Routing

Solr 4.1 added the ability to co-locate documents to improve query performance. 

First, if you specify {{numShards}} when you create a collection, you will use the "compositeId" router by default, which allows you to send documents with a prefix in the document ID. The prefix is used to calculate the hash Solr uses to determine which shard the document is sent to for indexing. The prefix can be anything you'd like (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document ID field: "IBM!12345". The exclamation mark ('!') is critical here, as it defines the shard to direct the document to.
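As a sketch (the collection name {{customers}}, host, and shard count are hypothetical), creating a collection with {{numShards}} and indexing a document with a routing prefix might look like this:

{code}
# Create a collection with numShards, which selects the "compositeId" router
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=customers&numShards=2'

# Index a document whose ID carries the "IBM!" routing prefix
curl 'http://localhost:8983/solr/customers/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"IBM!12345","name":"IBM document"}]'
{code}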

Then, at query time, you include the prefix(es) in your query with the {{shard.keys}} parameter (e.g., {{q=solr&shard.keys=IBM!}}). In some situations this may improve query performance, because it avoids the network latency of querying all the shards.
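Continuing the hypothetical example above, a query limited to the shard(s) holding the "IBM!" documents might look like:

{code}
# Restrict the query to the shard(s) that hold documents with the "IBM!" prefix
curl 'http://localhost:8983/solr/customers/select?q=solr&shard.keys=IBM!'
{code}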

If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

If you did not create the collection with the {{numShards}} parameter, you will be using the "implicit" router by default. In this case, you could use the {{\_shard_}} parameter or a field to name a specific shard.
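As a rough sketch only (the collection name {{logs}} and shard name {{shard1}} are hypothetical, and the exact parameter usage may vary by version), naming a target shard on an update with the {{\_shard_}} parameter might look like:

{code}
# With the "implicit" router, name the target shard explicitly on the update
curl 'http://localhost:8983/solr/logs/update?commit=true&_shard_=shard1' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"67890","name":"explicitly routed document"}]'
{code}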

h2. Shard Splitting

Until Solr 4.3, you had to decide on the number of shards when you created a collection in SolrCloud, and you could not change it later. It can be difficult to know in advance how many shards you need, particularly when organizational requirements can change at a moment's notice, and the cost of choosing wrong can be high: you may have to create new cores and re-index all of your data.

The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready.
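For example, a split request for a hypothetical collection and shard might look like this:

{code}
# Split shard1 of the "customers" collection into two new sub-shards;
# the original shard is left in place until you delete it
curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=customers&shard=shard1'
{code}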

More details on how to use shard splitting are in the section on the [Collections API].


