Posted to solr-user@lucene.apache.org by Upayavira <uv...@odoko.co.uk> on 2012/12/21 18:17:39 UTC

Dynamic collections in SolrCloud for log indexing

I'm working on a system for indexing logs. We're probably looking at
filling one core every month.

We'll maintain a short-term index containing the last 7 days - that one
is easy to handle.

For the longer-term stuff, we'd like to maintain a collection that will
query across all the historic data, but that means every month we need
to add another core to an existing collection, which, as I understand
it, is not possible in 4.0.

How do people handle this sort of situation where you have rolling new
content arriving? I'm sure I've heard of people using SolrCloud for this
sort of thing.

Given it is logs, distributed IDF has no real bearing.

Upayavira

Re: Dynamic collections in SolrCloud for log indexing

Posted by Otis Gospodnetic <ot...@gmail.com>.

Added https://issues.apache.org/jira/browse/SOLR-4237

Otis
--
Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html



On Tue, Dec 25, 2012 at 9:13 PM, Mark Miller <ma...@gmail.com> wrote:

> I've been thinking about aliases for a while as well. Seem very handy and
> fairly easy to implement. So far there has just always been higher priority
> things (need to finish collection api responses this week…) but this is
> something I'd def help work on.
>
> - Mark

Re: Dynamic collections in SolrCloud for log indexing

Posted by Mark Miller <ma...@gmail.com>.

I've been thinking about aliases for a while as well. They seem very handy
and fairly easy to implement. So far there have just always been
higher-priority things (I need to finish the Collections API responses this
week…), but this is something I'd definitely help work on.

- Mark

On Dec 25, 2012, at 1:49 AM, Otis Gospodnetic <ot...@gmail.com> wrote:

> Hi,
> 
> Right, this is not really about routing in the ElasticSearch sense.
> What's handy for indexing logs are index aliases, which I thought I had
> added to JIRA a while back, but it looks like I have not.
> Index aliases would let you keep a "last 7 days" alias fixed while
> underneath you push and pop an index every day without the client app
> having to adjust.

Re: Dynamic collections in SolrCloud for log indexing

Posted by Upayavira <uv...@odoko.co.uk>.

This is precisely it. It is a 'collections alias', allowing you to group
collections together into 'super-collections'.

You add a new collection (made up of a core on n hosts) every
day/week/month/whatever. When you do so, you add this collection to your
super-collection. Maybe you do a quick audit of the cores in your
short-term super-collections, but the net result is some names you can
use to address various subsets of your total content.
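
The "super-collection" idea can be modelled as a plain mapping from an alias name to the collections it groups. The following is only a minimal in-memory sketch of that bookkeeping, not anything Solr 4.0 provides; the class `AliasRegistry`, the alias names `historic` and `recent`, and the `logs_YYYY_MM` collection names are all hypothetical.

```python
from collections import defaultdict


class AliasRegistry:
    """In-memory model of 'super-collections': alias name -> collection names."""

    def __init__(self):
        self._aliases = defaultdict(list)

    def add(self, alias, collection):
        """Attach a collection to an alias, ignoring duplicates."""
        if collection not in self._aliases[alias]:
            self._aliases[alias].append(collection)

    def remove(self, alias, collection):
        """Detach a collection from an alias (the 'audit' step for short-term aliases)."""
        if collection in self._aliases[alias]:
            self._aliases[alias].remove(collection)

    def resolve(self, alias):
        """Return the collections a query against this alias should fan out to."""
        return list(self._aliases.get(alias, []))


# Hypothetical monthly collections grouped under two aliases.
registry = AliasRegistry()
for month in ("logs_2012_10", "logs_2012_11", "logs_2012_12"):
    registry.add("historic", month)
registry.add("recent", "logs_2012_12")
```

A client would call `registry.resolve("historic")` to get the list of collections to search, so adding next month's collection only touches the registry and not the client.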

Upayavira (whose kids are still asleep, so no excitement yet...)

On Tue, Dec 25, 2012, at 06:49 AM, Otis Gospodnetic wrote:
> Hi,
> 
> Right, this is not really about routing in the ElasticSearch sense.
> What's handy for indexing logs are index aliases, which I thought I had
> added to JIRA a while back, but it looks like I have not.
> Index aliases would let you keep a "last 7 days" alias fixed while
> underneath you push and pop an index every day without the client app
> having to adjust.

Re: Dynamic collections in SolrCloud for log indexing

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

Right, this is not really about routing in the ElasticSearch sense.
What's handy for indexing logs are index aliases, which I thought I had
added to JIRA a while back, but it looks like I have not.
Index aliases would let you keep a "last 7 days" alias fixed while
underneath you push and pop an index every day without the client app
having to adjust.
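
A rough sketch of the "push and pop an index every day" idea: compute the names of the daily collections in the window and re-point a fixed alias at them. This assumes one collection per day named `logs_YYYYMMDD` and an alias called `last7days`, both hypothetical. It also assumes the `CREATEALIAS` action of the Collections API, which is not available in Solr 4.0 and only appeared in later releases, so check the documentation for your version. The code only builds the request URL and does not send it.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(today, days=7, prefix="logs_"):
    """Names of the daily collections covering the last `days` days, oldest first."""
    return [
        f"{prefix}{(today - timedelta(days=offset)).strftime('%Y%m%d')}"
        for offset in range(days - 1, -1, -1)
    ]


def create_alias_url(base_url, alias, collections):
    """Build a Collections API request that (re)points `alias` at `collections`."""
    query = urlencode({
        "action": "CREATEALIAS",  # assumption: requires a release newer than 4.0
        "name": alias,
        "collections": ",".join(collections),
    })
    return f"{base_url}/admin/collections?{query}"


# Hypothetical host and alias name; run once a day after creating the new collection.
window = daily_collections(date(2012, 12, 25))
url = create_alias_url("http://localhost:8983/solr", "last7days", window)
```

Because the alias name stays fixed, the client keeps querying `last7days` while the window underneath it moves forward one collection per day.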

Otis



On Mon, Dec 24, 2012 at 4:30 AM, Per Steffensen <st...@designware.dk> wrote:

> I believe it is a misunderstanding to use custom routing (or sharding, as
> Erick calls it) for this kind of stuff. Custom routing is nice if you want
> to control which slice/shard under a collection a specific document goes to
> - mainly to be able to control that two (or more) documents are indexed on
> the same slice/shard, but also just to be able to control on which
> slice/shard a specific document is indexed. Knowing/controlling this kind
> of stuff can be used for a lot of nice purposes. But you don't want to move
> slices/shards around among collections or delete/add slices from/to a
> collection - unless it's for elasticity reasons.
>
> I think you should fill a collection every week/month and just keep those
> collections as is. Instead of ending up with a big "historic" collection
> containing many slices/shards/cores (one for each historic week/month), you
> will end up with many historic collections (one for each historic
> week/month). When searching historic data you will have to cross-search
> those historic collections, but that is no problem at all. If SolrCloud is
> made as it is supposed to be made (and I believe it is), it shouldn't
> require more resources or be harder in any way to cross-search X slices
> across many collections than it is to cross-search X slices under the same
> collection.
>
> Besides that, see my answer for topic "Will SolrCloud always slice by ID
> hash?" a few days back.
>
> Regards, Per Steffensen

Re: Dynamic collections in SolrCloud for log indexing

Posted by Per Steffensen <st...@designware.dk>.

I believe it is a misunderstanding to use custom routing (or sharding, as
Erick calls it) for this kind of stuff. Custom routing is nice if you
want to control which slice/shard under a collection a specific document
goes to - mainly to be able to control that two (or more) documents are
indexed on the same slice/shard, but also just to be able to control on
which slice/shard a specific document is indexed. Knowing/controlling
this kind of stuff can be used for a lot of nice purposes. But you don't
want to move slices/shards around among collections or delete/add slices
from/to a collection - unless it's for elasticity reasons.

I think you should fill a collection every week/month and just keep
those collections as is. Instead of ending up with a big "historic"
collection containing many slices/shards/cores (one for each historic
week/month), you will end up with many historic collections (one for
each historic week/month). When searching historic data you will have to
cross-search those historic collections, but that is no problem at all.
If SolrCloud is made as it is supposed to be made (and I believe it is),
it shouldn't require more resources or be harder in any way to
cross-search X slices across many collections than it is to
cross-search X slices under the same collection.
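
As an illustration of cross-searching, here is a small sketch that builds one query URL spanning several monthly collections. It assumes SolrCloud's `collection` request parameter, which takes a comma-separated list of collections to search; the host, the `logs_2012_MM` collection names and the `level:ERROR` query are made up for the example. The URL is only constructed, not sent.

```python
from urllib.parse import urlencode


def cross_collection_query_url(base_url, collections, query):
    """Build a select URL that searches several collections in one request."""
    if not collections:
        raise ValueError("at least one collection is required")
    params = urlencode({
        "q": query,
        "collection": ",".join(collections),
    })
    # The request is addressed to the first collection; the `collection`
    # parameter lists every collection that should take part in the search.
    return f"{base_url}/{collections[0]}/select?{params}"


# Hypothetical monthly collections for October to December 2012.
historic = [f"logs_2012_{month:02d}" for month in range(10, 13)]
url = cross_collection_query_url("http://localhost:8983/solr", historic, "level:ERROR")
```

With one collection per month, searching all historic data means extending the `historic` list, with no change to any existing collection.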

Besides that see my answer for topic "Will SolrCloud always slice by ID 
hash?" a few days back.

Regards, Per Steffensen

On 12/24/12 1:07 AM, Erick Erickson wrote:
> I think this is one of the primary use-cases for custom sharding. Solr 4.0
> doesn't really lend itself to this scenario, but I _believe_ that the patch
> for custom sharding has been committed...
>
> That said, I'm not quite sure how you drop off the old shard if you don't
> need to keep old data. I'd guess it's possible, but haven't implemented
> anything like that myself.
>
> FWIW,
> Erick

Re: Dynamic collections in SolrCloud for log indexing

Posted by Erick Erickson <er...@gmail.com>.

I think this is one of the primary use-cases for custom sharding. Solr 4.0
doesn't really lend itself to this scenario, but I _believe_ that the patch
for custom sharding has been committed...

That said, I'm not quite sure how you drop off the old shard if you don't
need to keep old data. I'd guess it's possible, but haven't implemented
anything like that myself.

FWIW,
Erick

On Fri, Dec 21, 2012 at 12:17 PM, Upayavira <uv...@odoko.co.uk> wrote:

> I'm working on a system for indexing logs. We're probably looking at
> filling one core every month.
>
> We'll maintain a short term index containing the last 7 days - that one
> is easy to handle.
>
> For the longer term stuff, we'd like to maintain a collection that will
> query across all the historic data, but that means every month we need
> to add another core to an existing collection, which as I understand it
> in 4.0 is not possible.
>
> How do people handle this sort of situation where you have rolling new
> content arriving? I'm sure I've heard people using SolrCloud for this
> sort of thing.
>
> Given it is logs, distributed IDF has no real bearing.
>
> Upayavira
>