Posted to solr-user@lucene.apache.org by Amit Jha <sh...@gmail.com> on 2015/06/05 18:42:28 UTC

Real Time indexing and Scalability

Hi,

In my use case, I am adding documents to Solr from a Spring application using spring-data-solr. This works well with a single Solr instance, but in the current setup that instance is a single point of failure. We decided to use Solr replication because we also need centralized search, so we set up two instances, both in repeater mode. The problem with that setup was that sometimes data did not get indexed. We then moved to SolrCloud with 3 ZooKeeper nodes, 2 shards, and 2 replicas, but we still sometimes find that documents are not getting indexed.

I would like to know the best way to build a highly available setup.

Rgds
AJ

Re: Real Time indexing and Scalability

Posted by Amit Jha <sh...@gmail.com>.

Thanks everyone. I got the answer.

Rgds
AJ

Re: Real Time indexing and Scalability

Posted by Erick Erickson <er...@gmail.com>.

bq: if 2 servers are master that means writing can be done on both.

If there's a single piece of documentation that supports this contention,
we'll correct it immediately. But it's simply not true.

As Shawn says, the entire design behind master/slave
architecture is that there is exactly one (and only one) master that
_ever_ gets documents indexed to it. Repeaters were introduced
as a way to "fan out" the replication process, particularly across data
centers that had "expensive" pipes connecting them. You could have
the repeater in DC2 relay the index from the master in DC1 to all slaves in
DC2. In that kind of setup, you then replicate the index
across the expensive pipe once rather than once for each slave in
DC2.

But even in this situation you are only ever indexing to the master
on DC1.
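
To put a number on the fan-out argument, here is a small Python sketch. It is a toy model, not Solr code: the topology dictionaries, the node names, and the cross_dc_copies helper are all made up for illustration, and three DC2 slaves is an arbitrary choice. It only counts how many replication links cross between data centers.

```python
# Toy model: each node maps to (data center, upstream it replicates from).
# The master has no upstream. All names here are hypothetical.
def cross_dc_copies(topology):
    """Count replication links whose two ends sit in different data centers."""
    count = 0
    for dc, upstream in topology.values():
        if upstream is not None and topology[upstream][0] != dc:
            count += 1
    return count

# Every DC2 slave pulls straight from the DC1 master.
direct = {
    "master": ("DC1", None),
    "slave1": ("DC2", "master"),
    "slave2": ("DC2", "master"),
    "slave3": ("DC2", "master"),
}

# One repeater in DC2 pulls from the master; the DC2 slaves pull from it.
via_repeater = {
    "master": ("DC1", None),
    "repeater": ("DC2", "master"),
    "slave1": ("DC2", "repeater"),
    "slave2": ("DC2", "repeater"),
    "slave3": ("DC2", "repeater"),
}

print(cross_dc_copies(direct))        # 3
print(cross_dc_copies(via_repeater))  # 1
```

In both layouts the only node that ever receives new documents is the master; the repeater just changes who pulls from whom.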

Best,
Erick

Re: Real Time indexing and Scalability

Posted by Shawn Heisey <ap...@elyograg.org>.

On 6/5/2015 2:20 PM, Amit Jha wrote:
> Thanks, Shawn, for the reminder about CloudSolrServer. Yes, I have moved to SolrCloud.
>
> I agree that a repeater is a slave that acts as a master for other slaves. But it is still a master, and logically it has to do what a master is supposed to do.
>
> If 2 servers are masters, that means writing can be done on both. If I set up replication between 2 servers and configure both as repeaters, then each can act as master and slave for the other. Therefore writing can be done on both.

Don't try to set up two servers as master and slave to each other.  You
run the risk of data loss or even Lucene index corruption.  That is
**NOT** a supported configuration.  There is no way to do master/master
replication without SolrCloud.  SolrCloud can do it because it is a true
cluster, there *is* no master.
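
A rough way to see the data-loss risk is the Python sketch below. It is a deliberately crude model and not a description of Solr's replication handler: the Node class, the version counter, and the "replace the whole index if the other side looks newer" rule are assumptions chosen only to show how a write made on one side can disappear.

```python
class Node:
    """Crude stand-in for a Solr core that is both master and slave."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.version = 0  # bumped on every local write

    def index(self, doc_id, body):
        self.docs[doc_id] = body
        self.version += 1

    def pull_from(self, other):
        # Assumed rule: if the other side looks newer, replace the whole
        # local index with a copy of it. Local-only documents vanish.
        if other.version > self.version:
            self.docs = dict(other.docs)
            self.version = other.version


a, b = Node("a"), Node("b")
a.index("1", "written to a")
b.index("2", "written to b")
b.index("3", "written to b")

a.pull_from(b)  # a now mirrors b and has lost doc "1"
b.pull_from(a)  # versions match, so nothing comes back

print(sorted(a.docs), sorted(b.docs))  # ['2', '3'] ['2', '3']
```

Document "1" was accepted by node a and is now on neither node, which is the kind of "sometimes data is not indexed" symptom you would see from the application side.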

Thanks,
Shawn

Re: Real Time indexing and Scalability

Posted by Amit Jha <sh...@gmail.com>.

Thanks, Shawn, for the reminder about CloudSolrServer. Yes, I have moved to SolrCloud.

I agree that a repeater is a slave that acts as a master for other slaves. But it is still a master, and logically it has to do what a master is supposed to do.

If 2 servers are masters, that means writing can be done on both. If I set up replication between 2 servers and configure both as repeaters, then each can act as master and slave for the other. Therefore writing can be done on both.


Rgds
AJ

Re: Real Time indexing and Scalability

Posted by Shawn Heisey <ap...@elyograg.org>.

On 6/5/2015 1:38 PM, Amit Jha wrote:
> Thanks, Erick. What if a document is committed to the master? Then the document should be visible from the master. Is that correct?
>
> I was using replication in repeater mode because LBHttpSolrServer can send a write request to any of the Solr servers, and that server should index the document because it is a master. We have a polling interval of 2 seconds; after the polling interval, the slave can pull the data. It is worth mentioning that the application issues the commit command.
>
> If a document is committed to the master and a search request comes to the same master, then the document should be retrieved, irrespective of replication, because the master doesn't know who the slaves are. Is that right?
>
> In repeater mode, documents can be indexed on both Solr instances. Is that understanding correct?
>
> Also, why do you say that the commit is inappropriate?

If you are not using SolrCloud, then you must index to the master
*ONLY*.  A repeater does not enable two-way replication.  A repeater is
a slave that is also a master for additional slaves.  Master-slave
replication is *only* one-way - from the master to slaves, and if any of
those slaves are repeaters, from there to additional slaves.

SolrCloud is probably a far better choice for your setup, especially if
you are using the SolrJ client.  You mentioned LBHttpSolrServer, which
is why I am thinking you're using SolrJ.

With a proper configuration on your collection, SolrCloud lets you index
to any machine in the cloud and the data will end up exactly where it
needs to go.  If you use CloudSolrServer/CloudSolrClient and a very
recent Solr/SolrJ version, the data will be sent directly to the correct
instance for best performance.
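
As a rough picture of "index to any machine and the data ends up where it needs to go", here is a toy Python sketch. Nothing in it is Solr's API: ToyCloud, shard_for, and the CRC32-based hash are stand-ins invented for illustration (Solr's real document router works differently), and the two shard names simply mirror the 2-shard setup described in this thread.

```python
import zlib

SHARDS = ["shard1", "shard2"]


def shard_for(doc_id):
    # Stand-in hash (CRC32 modulo shard count). Assumption for illustration
    # only; this is not how Solr actually routes documents.
    return SHARDS[zlib.crc32(doc_id.encode("utf-8")) % len(SHARDS)]


class ToyCloud:
    def __init__(self):
        self.docs = {shard: {} for shard in SHARDS}
        self.forwards = 0  # hops from the receiving shard to the owner

    def index_via(self, receiving_shard, doc_id, body):
        owner = shard_for(doc_id)
        if owner != receiving_shard:
            self.forwards += 1  # the cluster forwards the document
        self.docs[owner][doc_id] = body
        return owner


cloud = ToyCloud()

# The same document sent through two different entry points lands on the
# same owning shard; one of the two requests needs a forward.
first = cloud.index_via("shard1", "doc-42", "v1")
second = cloud.index_via("shard2", "doc-42", "v2")

# A cluster-aware client works out the owner first and skips the extra hop.
cloud.index_via(shard_for("doc-7"), "doc-7", "v1")
```

The last call is the idea behind sending data "directly to the correct instance": the result is the same either way, the cluster-aware path just avoids the forward.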

Thanks,
Shawn

Re: Real Time indexing and Scalability

Posted by Amit Jha <sh...@gmail.com>.

Thanks, Erick. What if a document is committed to the master? Then the document should be visible from the master. Is that correct?

I was using replication in repeater mode because LBHttpSolrServer can send a write request to any of the Solr servers, and that server should index the document because it is a master. We have a polling interval of 2 seconds; after the polling interval, the slave can pull the data. It is worth mentioning that the application issues the commit command.

If a document is committed to the master and a search request comes to the same master, then the document should be retrieved, irrespective of replication, because the master doesn't know who the slaves are. Is that right?

In repeater mode, documents can be indexed on both Solr instances. Is that understanding correct?

Also, why do you say that the commit is inappropriate?

Rgds
AJ

Re: Real Time indexing and Scalability

Posted by Erick Erickson <er...@gmail.com>.

You have to provide a _lot_ more details. You say:
"The problem... some data was not get indexed... still sometime we
found that documents are not getting indexed".

Neither of these should be happening, so I suspect
1> your expectations aren't correct. For instance, in the
master/slave setup you won't see docs on the slave until after the
polling interval has expired and the index is replicated.
2> In SolrCloud you aren't committing appropriately.
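
Both points come down to when a document becomes visible. The Python sketch below is a toy timeline, not Solr code: ToyMaster, ToySlave, and the tick method are hypothetical, and the 2-second interval is just the polling interval mentioned elsewhere in this thread.

```python
class ToyMaster:
    def __init__(self):
        self.pending = []      # added but not yet committed
        self.searchable = []   # visible to queries on the master

    def add(self, doc_id):
        self.pending.append(doc_id)

    def commit(self):
        self.searchable.extend(self.pending)
        self.pending.clear()


class ToySlave:
    def __init__(self, master, poll_interval):
        self.master = master
        self.poll_interval = poll_interval
        self.last_poll = 0.0
        self.searchable = []

    def tick(self, now):
        # Copy the master's committed state only when a poll is due.
        if now - self.last_poll >= self.poll_interval:
            self.searchable = list(self.master.searchable)
            self.last_poll = now


master = ToyMaster()
slave = ToySlave(master, poll_interval=2.0)

master.add("doc1")
visible_before_commit = "doc1" in master.searchable   # False
master.commit()
visible_after_commit = "doc1" in master.searchable    # True

slave.tick(now=1.0)
on_slave_early = "doc1" in slave.searchable           # False, poll not due
slave.tick(now=2.0)
on_slave_after_poll = "doc1" in slave.searchable      # True
```

A search that hits the slave between the commit and the next poll will not find the document, even though nothing has gone wrong.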

You might review: http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

Re: Real Time indexing and Scalability

Posted by Amit Jha <sh...@gmail.com>.

I want real-time indexing and real-time search.

Rgds
AJ
