Posted to solr-user@lucene.apache.org by John Doe <ma...@gmail.com> on 2016/07/06 21:32:45 UTC

Shard vs Replica

Hey,

I asked the same question on the freenode channel and people answered me,
but I still have doubts. Because I have never worked with this kind of
data store technology before, I find it hard to understand what exactly
a replica and a shard are in Solr. I believe that once I understand what
exactly these two are, I will be able to see the difference.

According to the English dictionary, a replica is an exact copy of
something, which sounds right to me, but what is a shard here, and how
does it fit into this context? Can someone explain this briefly and give
some examples?

Thank you in advance

Re: Shard vs Replica

Posted by Susheel Kumar <su...@gmail.com>.

To understand shard & replica, let's first understand what sharding is and
why it is needed.

Sharding - Assume your index grows so large that it doesn't fit on a single
machine (e.g. your index size is 80GB and your machine has 64GB of RAM, in
which case the index won't fit into memory). To get better performance, you
either increase your RAM, or you add another machine with a similar
configuration, divide (i.e. partition) your index into two, and have each
40GB half fit on one of the two machines. So your complete index = Index1 +
Index2 (each 40GB). Index1, Index2, etc. are called shards. Depending on how
big your index is and on your machines' resources, you may plan to have N
shards. To do a complete search, the search has to be performed on all the
shards.
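
As a rough illustration of the idea above, here is a minimal Python sketch.
It is not how Solr routes documents internally: the CRC32 hash, the
function names and the substring matching are all assumptions made for the
example. It only shows that each document lands in exactly one shard and
that a complete search has to visit every shard.

```python
import zlib
from typing import Dict, List

NUM_SHARDS = 2  # hypothetical: the index is split into two parts


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard for a document from a stable hash of its id (illustrative only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(docs: Dict[str, str], num_shards: int = NUM_SHARDS) -> List[Dict[str, str]]:
    """Partition the full document set so that every document lives in exactly one shard."""
    shards: List[Dict[str, str]] = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search(shards: List[Dict[str, str]], term: str) -> List[str]:
    """A complete search queries every shard and merges the per-shard hits."""
    hits: List[str] = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if term in text.lower())
    return sorted(hits)
```

Skipping any one shard in `search` would silently drop the documents that
were routed to it, which is why the query has to fan out to all of them.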

Hope that clarifies why sharding is needed and what a shard is.

Replication - Assume one of the above machines goes down. You won't be able
to search the complete index, since half of the data is not available. To
avoid this single point of failure, you can create a replica (which is a
copy of a shard) either on the existing machines or on new machines,
depending on requirements:

Machine 1 = Shard1 +  Copy of Shard2 (Shard2_Replica1)
Machine 2 = Shard2 +  Copy of Shard1 (Shard1_Replica1)

So you create replicas to avoid a single point of failure and also to serve
more queries per second (when the replicas are created on additional
machines).
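
The two-machine layout above can be sketched in Python as follows. This is
a toy model under stated assumptions: the round-robin placement rule, the
machine names and both function names are made up for the example and do
not reflect how Solr assigns replicas to nodes.

```python
from typing import Dict, Iterable, List, Set


def place_replicas(num_shards: int, machines: List[str]) -> Dict[str, List[str]]:
    """Put each shard on one machine and a copy of it on the next machine.

    Assumes at least two machines; with only one, the copy would land on the
    same machine and would not protect against that machine failing.
    """
    placement: Dict[str, List[str]] = {m: [] for m in machines}
    for i in range(num_shards):
        first = machines[i % len(machines)]
        second = machines[(i + 1) % len(machines)]
        placement[first].append(f"shard{i + 1}")
        placement[second].append(f"shard{i + 1}_replica1")
    return placement


def searchable_shards(placement: Dict[str, List[str]], down: Iterable[str]) -> Set[str]:
    """Return the shards that still have at least one copy on a live machine."""
    offline = set(down)
    alive: Set[str] = set()
    for machine, copies in placement.items():
        if machine in offline:
            continue
        for copy in copies:
            alive.add(copy.split("_")[0])  # "shard1_replica1" -> "shard1"
    return alive
```

With two shards on `["machine1", "machine2"]`, taking either machine down
still leaves a copy of both shards on the other one, so the complete index
stays searchable.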

Hope that clarifies.

Thanks,
Susheel

Re: Shard vs Replica

Posted by Anshum Gupta <an...@anshumgupta.net>.

A collection in SolrCloud is a logical entity that encapsulates documents
that conform to a shared schema. As a distributed system, the data needs to
be split, so the collection is logically split into 'Shards'.
Shard(s):
 * don't represent a physical index.
 * are logical entities

Replica:
 * is a physical manifestation of a shard
 * is an actual Lucene index
 * therefore, can independently serve requests and accept document updates
 * unlike the dictionary meaning, is not a 'replica' of anything but is
just a physical manifestation (I'm repeating this, I know)

Moving on, for each shard, there are a few things that need a single
controlling point e.g. versioning the incoming documents and maintaining
optimistic concurrency. One of the replicas for each shard is given those
responsibilities and is called the 'leader'.
The leader changes via leader election. I'm not going to go into the
details of leader election and when it happens here.

All other non-leader replicas (we at times refer to them as followers)
receive updates from the leader, who versions the documents.
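
To make the leader's role a bit more concrete, here is a toy Python model.
It is a sketch under stated assumptions, not Solr's implementation: the
class names, the simple integer counter used for versions, and the
`expected_version` parameter are all invented for the example. It only
shows the shape of the idea: one replica per shard assigns versions,
rejects stale updates, and forwards accepted updates to the others.

```python
from typing import Dict, List, Optional, Tuple


class Replica:
    """A physical copy of a shard; stores doc_id -> (version, fields)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.docs: Dict[str, Tuple[int, dict]] = {}

    def apply(self, doc_id: str, version: int, fields: dict) -> None:
        # Followers do not pick versions; they store what the leader decided.
        self.docs[doc_id] = (version, fields)


class Leader(Replica):
    """The replica that versions incoming documents for its shard."""

    def __init__(self, name: str, followers: List[Replica]) -> None:
        super().__init__(name)
        self.followers = followers
        self._next_version = 1  # hypothetical counter, for illustration only

    def update(self, doc_id: str, fields: dict,
               expected_version: Optional[int] = None) -> int:
        # Optimistic concurrency: if the caller names a version, it must
        # match the version currently stored, otherwise the update is stale.
        if expected_version is not None:
            current = self.docs.get(doc_id)
            if current is None or current[0] != expected_version:
                raise ValueError("version conflict for " + doc_id)
        version = self._next_version
        self._next_version += 1
        self.apply(doc_id, version, fields)
        for follower in self.followers:
            follower.apply(doc_id, version, fields)
        return version
```

Because only the leader hands out versions, two clients updating the same
document cannot both succeed against the same expected version.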

To sum it up, if you are a Java developer, here is an analogy: collections
and shards are classes, but replicas are objects.

Imagine a 'wikipedia' collection. It may have 10 shards that split all of
Wikipedia into 10 parts for the sake of manageability.
Depending upon our traffic, we may choose the number of replicas (called
the replication factor) for each shard.

*NOTE*: a replication factor of 1 means there is 1 replica for each shard,
i.e. there is ONE physical index for each shard definition. In such a case,
this replica would also be the leader.

If the replication factor was 2, there would be 2 physical index copies of
each shard and one of the 2 would be assigned the role of a leader.
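
The arithmetic can be sketched in a few lines of Python. The function name
and the replica naming scheme are assumptions for illustration, and the
leader is simply taken to be the first replica here, whereas in SolrCloud
it is chosen by leader election.

```python
from typing import Dict


def collection_layout(collection: str, num_shards: int,
                      replication_factor: int) -> Dict[str, dict]:
    """List the physical indexes (replicas) behind each logical shard."""
    layout: Dict[str, dict] = {}
    for s in range(1, num_shards + 1):
        shard = f"shard{s}"
        replicas = [
            f"{collection}_{shard}_replica{r}"  # illustrative naming only
            for r in range(1, replication_factor + 1)
        ]
        # Placeholder choice: a real cluster elects the leader.
        layout[shard] = {"leader": replicas[0], "replicas": replicas}
    return layout


def physical_index_count(layout: Dict[str, dict]) -> int:
    """Total physical indexes = number of shards x replication factor."""
    return sum(len(shard["replicas"]) for shard in layout.values())
```

For the 'wikipedia' example, 10 shards with a replication factor of 2 give
20 physical indexes, 10 of which act as leaders. In SolrCloud these two
numbers correspond to the numShards and replicationFactor parameters of the
Collections API CREATE action.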

Hope this helps.

-- 
Anshum Gupta