Posted to solr-user@lucene.apache.org by Saikat Kanjilal <sx...@hotmail.com> on 2013/05/24 16:25:39 UTC

Keeping a rolling window of indexes around solr

Hello Solr community folks,
I am investigating how to roll and manage indexes inside our Solr configuration. To date I've come up with an architecture that separates a set of masters, which are focused on writes and get replicated periodically, from a set of slave shards strictly focused on reads. Additionally, for each master index the design contains partial purges, which are performed on each of the slave shards as well as the master to keep the data current. However, the architecture seems a bit more complex than I'd like, with a lot of moving pieces. I was wondering if anyone has ever designed an architecture around a "conveyor belt" or rolling window of indexes covering n days of data, and whether there are best practices around this. One thing I was considering was whether to keep a conveyor belt list of the slave shards and rotate them as needed, dropping the master periodically and making its backup temporarily the master.


Anyway, I would love to hear thoughts and similar use cases from the community.

Regards

RE: Keeping a rolling window of indexes around solr

Posted by Saikat Kanjilal <sx...@hotmail.com>.

I would like to see something similar to this in the Solr world, or I could gladly help create it:

https://github.com/karussell/elasticsearch-rollindex


We are evaluating both Elasticsearch and our current Solr architecture, and need to manage write-heavy use cases within a rolling window.

> Date: Fri, 24 May 2013 09:07:38 -0600
> From: elyograg@elyograg.org
> To: solr-user@lucene.apache.org
> Subject: Re: Keeping a rolling window of indexes around solr
> 
> On 5/24/2013 8:56 AM, Shawn Heisey wrote:
> > On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
> >> Anyways would love to hear thoughts and usecases that are similar from the community.
> > 
> > Your use-case sounds a lot like what loggly was doing back in 2010.
> > 
> > http://loggly.com/videos/lucene-revolution-2010/
> 
> While I was writing that, I accidentally pressed the key combination
> that told my mail client to send the message before I was done.
> 
> Loggly created a new shard every five minutes, and merged older shards
> to longer time intervals.  I personally don't need this capability, but
> it is a useful pattern.  I was wondering recently whether a custom
> document router could be built for SolrCloud that automatically manages
> time-divided shards - creating, merging, and if you're not keeping the
> data forever, deleting.
> 
> Thanks,
> Shawn
>

Re: Keeping a rolling window of indexes around solr

Posted by Shawn Heisey <el...@elyograg.org>.

On 5/24/2013 8:56 AM, Shawn Heisey wrote:
> On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
>> Anyways would love to hear thoughts and usecases that are similar from the community.
> 
> Your use-case sounds a lot like what loggly was doing back in 2010.
> 
> http://loggly.com/videos/lucene-revolution-2010/

While I was writing that, I accidentally pressed the key combination
that told my mail client to send the message before I was done.

Loggly created a new shard every five minutes, and merged older shards
to longer time intervals.  I personally don't need this capability, but
it is a useful pattern.  I was wondering recently whether a custom
document router could be built for SolrCloud that automatically manages
time-divided shards - creating, merging, and if you're not keeping the
data forever, deleting.
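
A minimal sketch of the time-divided idea, assuming hypothetical tiers: only the five-minute interval comes from the description above, while the one-hour and one-day thresholds and the function name are made up for illustration. It maps a document timestamp to the start of the shard interval that would hold it, with wider intervals for older data.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: (maximum document age, shard interval width).
# Only the five-minute width is taken from the thread; the rest are assumptions.
TIERS = [
    (timedelta(hours=1), timedelta(minutes=5)),
    (timedelta(days=1), timedelta(hours=1)),
]
FALLBACK_WIDTH = timedelta(days=1)  # anything older lands in daily shards
EPOCH = datetime(1970, 1, 1)


def bucket_start(ts: datetime, now: datetime) -> datetime:
    """Return the start of the time interval (shard) that should hold ts."""
    age = now - ts
    width = FALLBACK_WIDTH
    for max_age, tier_width in TIERS:
        if age <= max_age:
            width = tier_width
            break
    # Floor the timestamp to a multiple of the chosen width.
    return EPOCH + ((ts - EPOCH) // width) * width


now = datetime(2013, 5, 24, 9, 7)
print(bucket_start(datetime(2013, 5, 24, 9, 3, 30), now))  # recent: 5-minute shard
print(bucket_start(datetime(2013, 5, 24, 3, 42), now))     # hours old: hourly shard
print(bucket_start(datetime(2013, 5, 20, 15, 0), now))     # days old: daily shard
```

A merge job would then combine all shards whose interval falls into the same wider bucket once they age past a tier boundary, and a cleanup job would drop buckets older than the retention window.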

Thanks,
Shawn

Re: Keeping a rolling window of indexes around solr

Posted by Shawn Heisey <so...@elyograg.org>.

On 5/24/2013 8:25 AM, Saikat Kanjilal wrote:
> Anyways would love to hear thoughts and usecases that are similar from the community.

Your use-case sounds a lot like what loggly was doing back in 2010.

http://loggly.com/videos/lucene-revolution-2010/

Re: Keeping a rolling window of indexes around solr

Posted by Erick Erickson <er...@gmail.com>.

I suspect you're worrying about something you don't need to. At 1 insert every
30 seconds, and assuming 30,000,000 records will fit on a machine (I've seen
this), you're talking 900,000,000 seconds' worth of data on a single box!
Or roughly 10,000 days' worth of data. Test, of course, YMMV.
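
As a quick check of that estimate, here is a small sketch using only the figures from this thread (one insert every 30 seconds, and 30,000,000 documents as the assumed capacity of one machine). The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_full(max_docs: int, seconds_per_insert: int) -> float:
    """Days of continuous inserts before one box holds max_docs documents."""
    return max_docs * seconds_per_insert / SECONDS_PER_DAY


days = days_until_full(max_docs=30_000_000, seconds_per_insert=30)
print(f"{days:,.0f} days, about {days / 365:.1f} years")
# prints: 10,417 days, about 28.5 years
```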

Or I'm misunderstanding what "1 log insert" means; I guess it could be a full
log file....

But do the simple thing first: just let Solr do what it does by default and
periodically do a delete by query on documents you want to roll off the end.
Especially since you say that queries happen every few days. The tricks for
utilizing "hot shards" are probably not very useful for you with that low a
query rate.
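
A minimal sketch of such a periodic purge, assuming a date field called "timestamp" and a core at http://localhost:8983/solr/logs (both hypothetical, as are the function name and the 30-day window). It only builds the delete-by-query request against Solr's JSON update handler and does not send it.

```python
import json
import urllib.request


def build_purge_request(update_url: str, date_field: str, keep_days: int) -> urllib.request.Request:
    """Build (but do not send) a delete-by-query for documents older than keep_days."""
    # Solr date math: everything up to keep_days before the current time.
    query = f"{date_field}:[* TO NOW-{keep_days}DAYS]"
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    return urllib.request.Request(
        f"{update_url}?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL and field name; adjust to the real schema.
req = build_purge_request("http://localhost:8983/solr/logs/update", "timestamp", 30)
print(req.full_url)
print(req.data.decode("utf-8"))
# To actually run the purge: urllib.request.urlopen(req)
```

Run from cron or any scheduler, this keeps the index at roughly the last N days without any shard rotation.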

Test, of course
Best
Erick

On Tue, May 28, 2013 at 8:42 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> Volume of data:
> 1 log insert every 30 seconds, queries done sporadically asynchronously every so often at a much lower frequency every few days
>
> Also the majority of the requests are indeed going to be within a slice of time (typically hours or at most a few days)
>
> Type of queries:
> Keyword or term search
> Search by guid (or id as known in the solr world)
> Reserved or percolation queries to be executed when new data becomes available
> Search by dates as mentioned above
>
> Regards
>
>
> Sent from my iPhone
>
> On May 28, 2013, at 4:25 PM, Chris Hostetter <ho...@fucit.org> wrote:
>
>>
>> : This is kind of the approach used by elastic search , if I'm not using
>> : solrcloud will I be able to use shard aliasing, also with this approach
>> : how would replication work, is it even needed?
>>
>> you haven't said much about the volume of data you expect to deal with,
>> nor have you really explained what types of queries you intend to do --
>> ie: you said you were interested in a "rolling window of indexes
>> around n days of data" but you never clarified why you think a
>> rolling window of indexes would be useful to you or how exactly you would
>> use it.
>>
>> The primary advantage of sharding by date is if you know that a large
>> percentage of your queries are only going to be within a small range of
>> time, and therefore you can optimize those requests to only hit the shards
>> necessary to satisfy that small window of time.
>>
>> if the majority of requests are going to be across your entire "n days" of
>> data, then date based sharding doesn't really help you -- you can just use
>> arbitrary (randomized) sharding using periodic deleteByQuery commands to
>> purge anything older than N days.  Query the whole collection by default,
>> and add a filter query if/when you want to restrict your search to only a
>> narrow date range of documents.
>>
>> this is the same general approach you would use on a non-distributed /
>> non-SolrCloud setup if you just had a single collection on a single master
>> replicated to some number of slaves for horizontal scaling.
>>
>>
>> -Hoss
>>

Re: Keeping a rolling window of indexes around solr

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Volume of data:
1 log insert every 30 seconds; queries are run sporadically and asynchronously, at a much lower frequency (every few days)

Also, the majority of the requests are indeed going to be within a slice of time (typically hours or at most a few days)

Type of queries:
Keyword or term search
Search by guid (or id as known in the Solr world)
Reserved or percolation queries to be executed when new data becomes available
Search by dates as mentioned above

Regards


Sent from my iPhone

On May 28, 2013, at 4:25 PM, Chris Hostetter <ho...@fucit.org> wrote:

> 
> : This is kind of the approach used by elastic search , if I'm not using 
> : solrcloud will I be able to use shard aliasing, also with this approach 
> : how would replication work, is it even needed?
> 
> you haven't said much about the volume of data you expect to deal with,
> nor have you really explained what types of queries you intend to do -- 
> ie: you said you were interested in a "rolling window of indexes
> around n days of data" but you never clarified why you think a 
> rolling window of indexes would be useful to you or how exactly you would 
> use it.
> 
> The primary advantage of sharding by date is if you know that a large 
> percentage of your queries are only going to be within a small range of 
> time, and therefore you can optimize those requests to only hit the shards 
> necessary to satisfy that small window of time.
> 
> if the majority of requests are going to be across your entire "n days" of 
> data, then date based sharding doesn't really help you -- you can just use 
> arbitrary (randomized) sharding using periodic deleteByQuery commands to 
> purge anything older than N days.  Query the whole collection by default,
> and add a filter query if/when you want to restrict your search to only a 
> narrow date range of documents.
> 
> this is the same general approach you would use on a non-distributed / 
> non-SolrCloud setup if you just had a single collection on a single master 
> replicated to some number of slaves for horizontal scaling.
> 
> 
> -Hoss
>

Re: Keeping a rolling window of indexes around solr

Posted by Chris Hostetter <ho...@fucit.org>.

: This is kind of the approach used by elastic search , if I'm not using 
: solrcloud will I be able to use shard aliasing, also with this approach 
: how would replication work, is it even needed?

you haven't said much about the volume of data you expect to deal with,
nor have you really explained what types of queries you intend to do -- 
ie: you said you were interested in a "rolling window of indexes
around n days of data" but you never clarified why you think a 
rolling window of indexes would be useful to you or how exactly you would 
use it.

The primary advantage of sharding by date is if you know that a large 
percentage of your queries are only going to be within a small range of 
time, and therefore you can optimize those requests to only hit the shards 
necessary to satisfy that small window of time.
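
A rough sketch of that optimization, assuming one core per day named like "logs_20130528" (the naming scheme, host, and function name are hypothetical). It builds a value for Solr's "shards" request parameter so that a distributed query only touches the days it needs.

```python
from datetime import date, timedelta


def shards_for_range(start: date, end: date, base_url: str, prefix: str = "logs_") -> str:
    """Return a `shards` parameter value covering start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    n_days = (end - start).days + 1
    # One hypothetical core per day, e.g. logs_20130527.
    names = [f"{prefix}{(start + timedelta(days=i)):%Y%m%d}" for i in range(n_days)]
    return ",".join(f"{base_url}/{name}" for name in names)


print(shards_for_range(date(2013, 5, 27), date(2013, 5, 28), "localhost:8983/solr"))
```

The returned string would be passed as shards=... on the select request, so a query over the last two days never touches older cores.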

if the majority of requests are going to be across your entire "n days" of 
data, then date based sharding doesn't really help you -- you can just use 
arbitrary (randomized) sharding using periodic deleteByQuery commands to 
purge anything older than N days.  Query the whole collection by default,
and add a filter query if/when you want to restrict your search to only a 
narrow date range of documents.
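
The filter-query part can be sketched as follows, assuming a date field called "timestamp" (hypothetical, as is the function name). The code builds the query string for a select request and adds an fq clause only when a narrow window is wanted.

```python
from typing import Optional
from urllib.parse import urlencode


def select_query(keywords: str, date_field: str, last_hours: Optional[int] = None) -> str:
    """Build a /select query string, optionally limited to recent documents."""
    params = [("q", keywords)]
    if last_hours is not None:
        # fq narrows the result set without changing how q is scored.
        params.append(("fq", f"{date_field}:[NOW-{last_hours}HOURS TO NOW]"))
    return urlencode(params)


print(select_query("error", "timestamp"))                # whole collection
print(select_query("error", "timestamp", last_hours=6))  # last six hours only
```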

this is the same general approach you would use on a non-distributed / 
non-SolrCloud setup if you just had a single collection on a single master 
replicated to some number of slaves for horizontal scaling.


-Hoss

Re: Keeping a rolling window of indexes around solr

Posted by Saikat Kanjilal <sx...@hotmail.com>.

This is kind of the approach used by Elasticsearch. If I'm not using SolrCloud, will I be able to use shard aliasing? Also, with this approach how would replication work, and is it even needed?

Sent from my iPhone

On May 24, 2013, at 12:00 PM, Alexandre Rafalovitch <ar...@gmail.com> wrote:

> Would collection aliasing help here? From Solr 4.2 release notes:
> Collection Aliasing. Got time based data? Want to re-index in a
> temporary collection and then swap it into production? Done. Stay
> tuned for Shard Aliasing.
> 
> Regards,
>  Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> 
> 
> On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>> Hello Solr community folks,
>> I am doing some investigative work around how to roll and manage indexes inside our solr configuration, to date I've come up with an architecture that separates a set of masters that are focused on writes and get replicated periodically and a set of slave shards strictly focused on reads, additionally for each master index the design contains partial purges which get performed on each of the slave shards as well as the master to keep the data current.   However the architecture seems a bit more complex than I'd like with a lot of moving pieces.  I was wondering if anyone has ever handled/designed an architecture around a "conveyor belt" or rolling window of indexes around n days of data and if there are best practices around this.  One thing I was thinking about was whether to keep a conveyor belt list of the slave shards and rotate them as needed and drop the master periodically and make its backup temporarily the master.
>> 
>> 
>> Anyways would love to hear thoughts and usecases that are similar from the community.
>> 
>> Regards
>

Re: Keeping a rolling window of indexes around solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

Would collection aliasing help here? From Solr 4.2 release notes:
Collection Aliasing. Got time based data? Want to re-index in a
temporary collection and then swap it into production? Done. Stay
tuned for Shard Aliasing.
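
A sketch of how a rolling window could sit on top of collection aliasing, assuming one collection per day named like "logs_20130528" and an alias called "logs_recent" (both hypothetical, as is the function name). It builds the Collections API CREATEALIAS URL that points the alias at the n most recent daily collections; creating the new daily collection and deleting expired ones would be separate calls.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def rolling_alias_url(solr_base: str, alias: str, today: date, n_days: int,
                      prefix: str = "logs_") -> str:
    """URL that (re)points alias at the n_days most recent daily collections."""
    collections = [f"{prefix}{(today - timedelta(days=i)):%Y%m%d}" for i in range(n_days)]
    params = urlencode({
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{solr_base}/admin/collections?{params}"


# Hypothetical host, alias, and window size.
print(rolling_alias_url("http://localhost:8983/solr", "logs_recent", date(2013, 5, 28), 2))
```

Searches then always go to the alias, and a daily job re-issues this call so the window slides forward by one day.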

Regards,
  Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, May 24, 2013 at 10:25 AM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> Hello Solr community folks,
> I am doing some investigative work around how to roll and manage indexes inside our solr configuration, to date I've come up with an architecture that separates a set of masters that are focused on writes and get replicated periodically and a set of slave shards strictly focused on reads, additionally for each master index the design contains partial purges which get performed on each of the slave shards as well as the master to keep the data current.   However the architecture seems a bit more complex than I'd like with a lot of moving pieces.  I was wondering if anyone has ever handled/designed an architecture around a "conveyor belt" or rolling window of indexes around n days of data and if there are best practices around this.  One thing I was thinking about was whether to keep a conveyor belt list of the slave shards and rotate them as needed and drop the master periodically and make its backup temporarily the master.
>
>
> Anyways would love to hear thoughts and usecases that are similar from the community.
>
> Regards

RE: Keeping a rolling window of indexes around solr

Posted by Saikat Kanjilal <sx...@hotmail.com>.

At first glance, unless I missed something, Hourglass will definitely not work for our use case, which just involves real-time inserts of new log data and no appends at all. However, I would like to examine the guts of Hourglass to see if we can customize it for our use case.

> From: arafalov@gmail.com
> Date: Mon, 27 May 2013 16:17:12 -0400
> Subject: Re: Keeping a rolling window of indexes around solr
> To: solr-user@lucene.apache.org
> 
> But how is Hourglass going to help Solr? Or is it a portable implementation?
> 
> Regards,
>    Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> 
> 
> On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic
> <ot...@gmail.com> wrote:
> > Hi,
> >
> > SolrCloud now has the same index aliasing as Elasticsearch.  I can't look up
> > the link now but Zoie from LinkedIn has Hourglass, which is used for a
> > circular buffer sort of index setup if I recall correctly.
> >
> > Otis
> > Solr & ElasticSearch Support
> > http://sematext.com/
> > On May 24, 2013 10:26 AM, "Saikat Kanjilal" <sx...@hotmail.com> wrote:
> >
> >> Hello Solr community folks,
> >> I am doing some investigative work around how to roll and manage indexes
> >> inside our solr configuration, to date I've come up with an architecture
> >> that separates a set of masters that are focused on writes and get
> >> replicated periodically and a set of slave shards strictly focused on
> >> reads, additionally for each master index the design contains partial
> >> purges which get performed on each of the slave shards as well as the
> >> master to keep the data current.   However the architecture seems a bit
> >> more complex than I'd like with a lot of moving pieces.  I was wondering if
> >> anyone has ever handled/designed an architecture around a "conveyor belt"
> >> or rolling window of indexes around n days of data and if there are best
> >> practices around this.  One thing I was thinking about was whether to keep
> >> a conveyor belt list of the slave shards and rotate them as needed and drop
> >> the master periodically and make its backup temporarily the master.
> >>
> >>
> >> Anyways would love to hear thoughts and usecases that are similar from the
> >> community.
> >>
> >> Regards

Re: Keeping a rolling window of indexes around solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

But how is Hourglass going to help Solr? Or is it a portable implementation?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, May 27, 2013 at 3:48 PM, Otis Gospodnetic
<ot...@gmail.com> wrote:
> Hi,
>
> SolrCloud now has the same index aliasing as Elasticsearch.  I can't look up
> the link now but Zoie from LinkedIn has Hourglass, which is used for a
> circular buffer sort of index setup if I recall correctly.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On May 24, 2013 10:26 AM, "Saikat Kanjilal" <sx...@hotmail.com> wrote:
>
>> Hello Solr community folks,
>> I am doing some investigative work around how to roll and manage indexes
>> inside our solr configuration, to date I've come up with an architecture
>> that separates a set of masters that are focused on writes and get
>> replicated periodically and a set of slave shards strictly focused on
>> reads, additionally for each master index the design contains partial
>> purges which get performed on each of the slave shards as well as the
>> master to keep the data current.   However the architecture seems a bit
>> more complex than I'd like with a lot of moving pieces.  I was wondering if
>> anyone has ever handled/designed an architecture around a "conveyor belt"
>> or rolling window of indexes around n days of data and if there are best
>> practices around this.  One thing I was thinking about was whether to keep
>> a conveyor belt list of the slave shards and rotate them as needed and drop
>> the master periodically and make its backup temporarily the master.
>>
>>
>> Anyways would love to hear thoughts and usecases that are similar from the
>> community.
>>
>> Regards

Re: Keeping a rolling window of indexes around solr

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

SolrCloud now has the same index aliasing as Elasticsearch.  I can't look up
the link now, but Zoie from LinkedIn has Hourglass, which is used for a
circular-buffer sort of index setup, if I recall correctly.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On May 24, 2013 10:26 AM, "Saikat Kanjilal" <sx...@hotmail.com> wrote:

> Hello Solr community folks,
> I am doing some investigative work around how to roll and manage indexes
> inside our solr configuration, to date I've come up with an architecture
> that separates a set of masters that are focused on writes and get
> replicated periodically and a set of slave shards strictly focused on
> reads, additionally for each master index the design contains partial
> purges which get performed on each of the slave shards as well as the
> master to keep the data current.   However the architecture seems a bit
> more complex than I'd like with a lot of moving pieces.  I was wondering if
> anyone has ever handled/designed an architecture around a "conveyor belt"
> or rolling window of indexes around n days of data and if there are best
> practices around this.  One thing I was thinking about was whether to keep
> a conveyor belt list of the slave shards and rotate them as needed and drop
> the master periodically and make its backup temporarily the master.
>
>
> Anyways would love to hear thoughts and usecases that are similar from the
> community.
>
> Regards