Posted to solr-user@lucene.apache.org by Peter Karich <pe...@yahoo.de> on 2010/11/15 21:06:11 UTC

Possibilities of (near) real time search with solr

Hi,

I wanted to provide my indexed docs (tweets) relatively fast: 1 to 10
sec, or even 30 sec, would be ok.

At the moment I am using the read-only core scenario described here
(point 5)*
with a commit frequency of 180 seconds, which was fine until a few days ago.
(I am using solr1.4.1)
Now a commit takes too long (40-80s) and is too CPU-heavy
because the index has grown too large (>7GB).
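
(For reference, a 180-second commit interval like mine is typically set via
autoCommit in solrconfig.xml, roughly like this; a sketch, not necessarily
my exact config:)

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit at most every 180 seconds (value is in milliseconds) -->
    <maxTime>180000</maxTime>
  </autoCommit>
</updateHandler>
```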

I thought about some possible solutions:
1. using solr NRT patches**
2. using shards (+ multicore), where I feed into a relatively small core
and merge them later (every hour or so) to reduce the number of cores
3. It would also be nice if someone could explain whether there are
benefits to using solr4.0, and what they are ...

The problem with 1. is that I haven't found a guide on how to apply all the
patches. Or is NRT not possible with solr at the moment? Does anybody
have a link for me?

Then I looked into solution 2. It seems to me that the CPU and
administration overhead of sharding can be quite high. Any hints (I am
using SolrJ)? E.g. I would need to include the date facet patch.

Or how would you solve this?

Regards,
Peter.

*
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3CAANLkTinCgekJLbxe_BSaAhLCt_hLr_KwUxM5ZxOvt_GJ@mail.gmail.com%3E

**
https://issues.apache.org/jira/browse/SOLR-1606


-- 
http://jetwick.com twitter search prototype

Re: Possibilities of (near) real time search with solr

Posted by Peter Sturge <pe...@gmail.com>.
> no, I only thought you used one day :-)
> so you don't, or do you have 31 shards?
>

No, we use 1 shard per month - e.g. 7 shards will hold 7 months of data.
It can be set to 1 day, but you would need to have a huge amount of
data in a single day to warrant doing that.
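
(A monthly boundary like ours maps a document's timestamp to a shard name
with plain date arithmetic; a hypothetical naming sketch, not our actual
code, and the "tweets-" prefix is made up:)

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Maps a document timestamp to a monthly shard/core name, e.g. "tweets-2010-11". */
class ShardNamer {
    static String shardFor(long epochMillis) {
        // One shard per UTC month; all docs from the same month land together.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "tweets-" + fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(shardFor(System.currentTimeMillis()));
    }
}
```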



On Thu, Nov 18, 2010 at 8:20 PM, Peter Karich <pe...@yahoo.de> wrote:
>
>
>>  Does yours need to be once a day?
>
> no, I only thought you used one day :-)
> so you don't, or do you have 31 shards?
>
>
>>  having a look at Solr Cloud or Katta - could be useful
>>  here in dynamically allocating shards.
>
> ah, thx! I will take a look at it (after trying solr4)!
>
> Regards,
> Peter.
>
>
>>> Maybe I didn't fully understand what you explained: but doesn't this mean
>>> that you'll have one index per day?
>>> Or are you overwriting, via replication, every shard, and the number of
>>> shards is fixed?
>>> And why are you replicating from the local replica to the next shard?
>>> (why not directly from the active core to the next shard?)
>>
>> Yes, you can have one index per day (for us, our boundary is typically
>> 1 month, so it's less of an issue).
>> The 'oldest' replica in the round robin is overwritten, yes. We use
>> fixed shard numbers, but you don't have to.
>> Does yours need to be once a day?
>> We used our own round robin code because it was pre-Solr Cloud...
>> I'm not too familiar with them, but I believe it's certainly worth
>> having a look at Solr Cloud or Katta - could be useful here in
>> dynamically allocating shards.
>>
>> Peter
>>
>>
>>
>> On Thu, Nov 18, 2010 at 5:41 PM, Peter Karich<pe...@yahoo.de>  wrote:
>>>
>>>  Hi Peter!
>>>
>>>> * I believe the NRT patches are included in the 4.x trunk. I don't
>>>> think there's any support as yet in 3x (uses features in Lucene 3.0).
>>>
>>> I'll investigate how much effort it is to update to solr4
>>>
>>>> * For merging, I'm talking about commits/writes. If you merge while
>>>> commits are going on, things can get a bit messy (maybe on source
>>>> cores this is ok, but I have a feeling it's not).
>>>
>>> ok
>>>
>>>> * For moving data to an 'offline' read-only core, this is the
>>>> trickiest bit.
>>>> We do this today by using a round-robin chain of remote shards and 2
>>>> local cores. At the boundary time (e.g. 1 day), the 'active' core is
>>>> replicated locally, then this local replica is replicated to the next
>>>> shard in the chain. Once everything is complete, the local replica is
>>>> discarded, and the 'active' core is cleaned, being careful not to
>>>> delete any new data since the replicated commit point.
>>>
>>> Maybe I didn't fully understand what you explained: but doesn't this mean
>>> that you'll have one index per day?
>>> Or are you overwriting, via replication, every shard, and the number of
>>> shards is fixed?
>>> And why are you replicating from the local replica to the next shard?
>>> (why not directly from the active core to the next shard?)
>>>
>>> Regards,
>>> Peter.
>>>
>
>
> --
> http://jetwick.com twitter search prototype
>
>

Re: Possibilities of (near) real time search with solr

Posted by Peter Karich <pe...@yahoo.de>.

>  Does yours need to be once a day?

no, I only thought you used one day :-)
so you don't, or do you have 31 shards?


>  having a look at Solr Cloud or Katta - could be useful
>  here in dynamically allocating shards.

ah, thx! I will take a look at it (after trying solr4)!

Regards,
Peter.


>> Maybe I didn't fully understand what you explained: but doesn't this mean
>> that you'll have one index per day?
>> Or are you overwriting, via replication, every shard, and the number of
>> shards is fixed?
>> And why are you replicating from the local replica to the next shard?
>> (why not directly from the active core to the next shard?)
> Yes, you can have one index per day (for us, our boundary is typically
> 1 month, so it's less of an issue).
> The 'oldest' replica in the round robin is overwritten, yes. We use
> fixed shard numbers, but you don't have to.
> Does yours need to be once a day?
> We used our own round robin code because it was pre-Solr Cloud...
> I'm not too familiar with them, but I believe it's certainly worth
> having a look at Solr Cloud or Katta - could be useful here in
> dynamically allocating shards.
>
> Peter
>
>
>
> On Thu, Nov 18, 2010 at 5:41 PM, Peter Karich<pe...@yahoo.de>  wrote:
>>   Hi Peter!
>>
>>> * I believe the NRT patches are included in the 4.x trunk. I don't
>>> think there's any support as yet in 3x (uses features in Lucene 3.0).
>> I'll investigate how much effort it is to update to solr4
>>
>>> * For merging, I'm talking about commits/writes. If you merge while
>>> commits are going on, things can get a bit messy (maybe on source
>>> cores this is ok, but I have a feeling it's not).
>> ok
>>
>>> * For moving data to an 'offline' read-only core, this is the
>>> trickiest bit.
>>> We do this today by using a round-robin chain of remote shards and 2
>>> local cores. At the boundary time (e.g. 1 day), the 'active' core is
>>> replicated locally, then this local replica is replicated to the next
>>> shard in the chain. Once everything is complete, the local replica is
>>> discarded, and the 'active' core is cleaned, being careful not to
>>> delete any new data since the replicated commit point.
>> Maybe I didn't fully understand what you explained: but doesn't this mean
>> that you'll have one index per day?
>> Or are you overwriting, via replication, every shard, and the number of
>> shards is fixed?
>> And why are you replicating from the local replica to the next shard?
>> (why not directly from the active core to the next shard?)
>>
>> Regards,
>> Peter.
>>


-- 
http://jetwick.com twitter search prototype


Re: Possibilities of (near) real time search with solr

Posted by Peter Sturge <pe...@gmail.com>.
> Maybe I didn't fully understand what you explained: but doesn't this mean
> that you'll have one index per day?
> Or are you overwriting, via replication, every shard, and the number of
> shards is fixed?
> And why are you replicating from the local replica to the next shard?
> (why not directly from the active core to the next shard?)

Yes, you can have one index per day (for us, our boundary is typically
1 month, so it's less of an issue).
The 'oldest' replica in the round robin is overwritten, yes. We use
fixed shard numbers, but you don't have to.
Does yours need to be once a day?
We used our own round robin code because it was pre-Solr Cloud...
I'm not too familiar with them, but I believe it's certainly worth
having a look at Solr Cloud or Katta - could be useful here in
dynamically allocating shards.

Peter



On Thu, Nov 18, 2010 at 5:41 PM, Peter Karich <pe...@yahoo.de> wrote:
>  Hi Peter!
>
>> * I believe the NRT patches are included in the 4.x trunk. I don't
>> think there's any support as yet in 3x (uses features in Lucene 3.0).
>
> I'll investigate how much effort it is to update to solr4
>
>> * For merging, I'm talking about commits/writes. If you merge while
>> commits are going on, things can get a bit messy (maybe on source
>> cores this is ok, but I have a feeling it's not).
>
> ok
>
>> * For moving data to an 'offline' read-only core, this is the
>> trickiest bit.
>> We do this today by using a round-robin chain of remote shards and 2
>> local cores. At the boundary time (e.g. 1 day), the 'active' core is
>> replicated locally, then this local replica is replicated to the next
>> shard in the chain. Once everything is complete, the local replica is
>> discarded, and the 'active' core is cleaned, being careful not to
>> delete any new data since the replicated commit point.
>
> Maybe I didn't fully understand what you explained: but doesn't this mean
> that you'll have one index per day?
> Or are you overwriting, via replication, every shard, and the number of
> shards is fixed?
> And why are you replicating from the local replica to the next shard?
> (why not directly from the active core to the next shard?)
>
> Regards,
> Peter.
>

Re: Possibilities of (near) real time search with solr

Posted by Peter Karich <pe...@yahoo.de>.
  Hi Peter!

> * I believe the NRT patches are included in the 4.x trunk. I don't
> think there's any support as yet in 3x (uses features in Lucene 3.0).

I'll investigate how much effort it is to update to solr4

> * For merging, I'm talking about commits/writes. If you merge while
> commits are going on, things can get a bit messy (maybe on source
> cores this is ok, but I have a feeling it's not).

ok

> * For moving data to an 'offline' read-only core, this is the trickiest bit.
> We do this today by using a round-robin chain of remote shards and 2
> local cores. At the boundary time (e.g. 1 day), the 'active' core is
> replicated locally, then this local replica is replicated to the next
> shard in the chain. Once everything is complete, the local replica is
> discarded, and the 'active' core is cleaned, being careful not to
> delete any new data since the replicated commit point.

Maybe I didn't fully understand what you explained: but doesn't this
mean that you'll have one index per day?
Or are you overwriting, via replication, every shard, and the number of
shards is fixed?
And why are you replicating from the local replica to the next shard?
(why not directly from the active core to the next shard?)

Regards,
Peter.

Re: Possibilities of (near) real time search with solr

Posted by Peter Sturge <pe...@gmail.com>.
* I believe the NRT patches are included in the 4.x trunk. I don't
think there's any support as yet in 3x (uses features in Lucene 3.0).

* For merging, I'm talking about commits/writes. If you merge while
commits are going on, things can get a bit messy (maybe on source
cores this is ok, but I have a feeling it's not).

* For moving data to an 'offline' read-only core, this is the trickiest bit.
We do this today by using a round-robin chain of remote shards and 2
local cores. At the boundary time (e.g. 1 day), the 'active' core is
replicated locally, then this local replica is replicated to the next
shard in the chain. Once everything is complete, the local replica is
discarded, and the 'active' core is cleaned, being careful not to
delete any new data since the replicated commit point.

It's not the easiest thing to implement, but boy it scales forever!
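
(A minimal sketch of the round-robin bookkeeping described above: fixed
shard slots, with the oldest slot overwritten at each boundary. The class
and method names are illustrative, not our actual code.)

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Fixed-size round-robin chain of shard names; newest data sits at the head. */
class ShardChain {
    private final Deque<String> chain = new ArrayDeque<String>();
    private final int maxShards;

    ShardChain(int maxShards) {
        this.maxShards = maxShards;
    }

    /**
     * At each boundary time, the active core's snapshot becomes the newest
     * shard; once the chain is full, the oldest shard is evicted (i.e. its
     * slot gets overwritten by the next replication).
     * Returns the evicted shard name, or null while the chain is filling up.
     */
    String rotate(String snapshotName) {
        String evicted = null;
        if (chain.size() == maxShards) {
            evicted = chain.removeLast(); // oldest replica in the round robin
        }
        chain.addFirst(snapshotName);     // newest snapshot at the head
        return evicted;
    }

    int size() {
        return chain.size();
    }
}
```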

Peter

Re: Possibilities of (near) real time search with solr

Posted by Peter Karich <pe...@yahoo.de>.
  Hi Peter,

thanks for your response. I will dig into the sharding stuff asap :-)

>  This may have changed recently, but the NRT stuff - e.g. per-segment
>  commits etc. is for the latest Solr 4 trunk only.

Do I need to turn something 'on'?
Or do you know whether the NRT patches are documented somewhere?

>  Be careful about merging, as all involved cores will pause
>  for the merging period.

Really all involved cores? Not only the target core?

>  The trickiest bit about the above is defining when data is
>  deemed to be 'old'

In my case this should be simple: just tweets that are one day old?

>  and then moving that data in an efficient manner to a read-only shard.

How do you do this? Query the old core and index the new one with the
queried data? Or from the DB?

Regards,
Peter (K).


> Hi Peter,
>
> First off, many thanks for putting together the NRT Wiki page!
>
> This may have changed recently, but the NRT stuff - e.g. per-segment
> commits etc. is for the latest Solr 4 trunk only.
> If your setup uses the 3x Solr code branch, then there's a bit of work
> to do to move to the new version.
> Some of this is due to the new 3.x Lucene, which has a lot of cool new
> stuff in it, but also deprecates a lot of old stuff,
> so existing SolrJ clients and custom server-side code/configuration
> will need to take this into account.
> We've not had the time to do this, so that's about as far as I can go
> on that one for now.
>
> We have had some very good success with distributed/shard searching -
> i.e. 'new' data arrives in a relatively small index, and so can remain
> fast, whilst distributed shards hold 'older' data and so can keep
> their caches warm (i.e. very few/no commits). This works particularly
> well for summary data (facets, filter queries, etc. that sit in
> caches).
> Be careful about merging, as all involved cores will pause for the
> merging period. Really needs to be done out-of-hours, or better still,
> offline (i.e. replicate the cores, then merge, then bring them live).
> The trickiest bit about the above is defining when data is deemed to
> be 'old' and then moving that data in an efficient manner to a
> read-only shard. Using SolrJ can help in this regard as it can offload
> some of the administration from the server(s).
>
> Thanks,
> Peter
>
>
> On Mon, Nov 15, 2010 at 8:06 PM, Peter Karich<pe...@yahoo.de>  wrote:
>> Hi,
>>
>> I wanted to provide my indexed docs (tweets) relatively fast: 1 to 10 sec,
>> or even 30 sec, would be ok.
>>
>> At the moment I am using the read-only core scenario described here (point
>> 5)*
>> with a commit frequency of 180 seconds, which was fine until a few days ago.
>> (I am using solr1.4.1)
>> Now a commit takes too long (40-80s) and is too CPU-heavy because
>> the index has grown too large (>7GB).
>>
>> I thought about some possible solutions:
>> 1. using solr NRT patches**
>> 2. using shards (+ multicore), where I feed into a relatively small core and
>> merge them later (every hour or so) to reduce the number of cores
>> 3. It would also be nice if someone could explain whether there are
>> benefits to using solr4.0, and what they are ...
>>
>> The problem with 1. is that I haven't found a guide on how to apply all the
>> patches. Or is NRT not possible with solr at the moment? Does anybody have a
>> link for me?
>>
>> Then I looked into solution 2. It seems to me that the CPU and
>> administration overhead of sharding can be quite high. Any hints (I am using
>> SolrJ)? E.g. I would need to include the date facet patch.
>>
>> Or how would you solve this?
>>
>> Regards,
>> Peter.
>>
>> *
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3CAANLkTinCgekJLbxe_BSaAhLCt_hLr_KwUxM5ZxOvt_GJ@mail.gmail.com%3E
>>
>> **
>> https://issues.apache.org/jira/browse/SOLR-1606


-- 
http://jetwick.com twitter search prototype


Re: Possibilities of (near) real time search with solr

Posted by Peter Sturge <pe...@gmail.com>.
Hi Peter,

First off, many thanks for putting together the NRT Wiki page!

This may have changed recently, but the NRT stuff - e.g. per-segment
commits etc. is for the latest Solr 4 trunk only.
If your setup uses the 3x Solr code branch, then there's a bit of work
to do to move to the new version.
Some of this is due to the new 3.x Lucene, which has a lot of cool new
stuff in it, but also deprecates a lot of old stuff,
so existing SolrJ clients and custom server-side code/configuration
will need to take this into account.
We've not had the time to do this, so that's about as far as I can go
on that one for now.

We have had some very good success with distributed/shard searching -
i.e. 'new' data arrives in a relatively small index, and so can remain
fast, whilst distributed shards hold 'older' data and so can keep
their caches warm (i.e. very few/no commits). This works particularly
well for summary data (facets, filter queries, etc. that sit in
caches).
Be careful about merging, as all involved cores will pause for the
merging period. Really needs to be done out-of-hours, or better still,
offline (i.e. replicate the cores, then merge, then bring them live).
The trickiest bit about the above is defining when data is deemed to
be 'old' and then moving that data in an efficient manner to a
read-only shard. Using SolrJ can help in this regard as it can offload
some of the administration from the server(s).

Thanks,
Peter


On Mon, Nov 15, 2010 at 8:06 PM, Peter Karich <pe...@yahoo.de> wrote:
> Hi,
>
> I wanted to provide my indexed docs (tweets) relatively fast: 1 to 10 sec,
> or even 30 sec, would be ok.
>
> At the moment I am using the read-only core scenario described here (point
> 5)*
> with a commit frequency of 180 seconds, which was fine until a few days ago.
> (I am using solr1.4.1)
> Now a commit takes too long (40-80s) and is too CPU-heavy because
> the index has grown too large (>7GB).
>
> I thought about some possible solutions:
> 1. using solr NRT patches**
> 2. using shards (+ multicore), where I feed into a relatively small core and
> merge them later (every hour or so) to reduce the number of cores
> 3. It would also be nice if someone could explain whether there are
> benefits to using solr4.0, and what they are ...
>
> The problem with 1. is that I haven't found a guide on how to apply all the
> patches. Or is NRT not possible with solr at the moment? Does anybody have a
> link for me?
>
> Then I looked into solution 2. It seems to me that the CPU and
> administration overhead of sharding can be quite high. Any hints (I am using
> SolrJ)? E.g. I would need to include the date facet patch.
>
> Or how would you solve this?
>
> Regards,
> Peter.
>
> *
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3CAANLkTinCgekJLbxe_BSaAhLCt_hLr_KwUxM5ZxOvt_GJ@mail.gmail.com%3E
>
> **
> https://issues.apache.org/jira/browse/SOLR-1606
>
>
> --
> http://jetwick.com twitter search prototype
>