Posted to solr-user@lucene.apache.org by Daniel Bruegge <da...@googlemail.com> on 2012/01/18 23:59:47 UTC

How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Hi,

I am just wondering how I can 'grow' a distributed Solr setup to an index
size of a couple of terabytes, when one of the distributed Solr limitations
is a maximum of about 4000 characters for the URI. See:

*The number of shards is limited by number of characters allowed for GET
> method's URI; most Web servers generally support at least 4000 characters,
> but many servers limit URI length to reduce their vulnerability to Denial
> of Service (DoS) attacks.
> *



> *(via
> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
> )*
>

Is the only way then to make multiple distributed Solr clusters, query
them independently and merge the results in application code?

Thanks. Daniel
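
As a rough illustration of the "merge in application code" idea from the question above, a SolrJ sketch could look like the following. This is only a sketch under assumptions: the cluster URLs are made up, the class names are from the SolrJ 3.x line, and re-sorting by "score" glosses over the fact that scores from independent clusters are not directly comparable; real merging would also have to deal with paging, faceting and duplicate keys.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class MergeAcrossClusters {
        public static void main(String[] args) throws Exception {
            // Hypothetical entry points of two independently operated Solr clusters.
            SolrServer clusterA = new CommonsHttpSolrServer("http://solr-a.example.com:8983/solr");
            SolrServer clusterB = new CommonsHttpSolrServer("http://solr-b.example.com:8983/solr");

            SolrQuery q = new SolrQuery("text:foo");
            q.setRows(10);
            q.setFields("id", "score");   // request the score so the results can be re-sorted

            // Query each cluster independently and collect the hits.
            List<SolrDocument> merged = new ArrayList<SolrDocument>();
            for (SolrServer cluster : new SolrServer[] { clusterA, clusterB }) {
                merged.addAll(cluster.query(q).getResults());
            }

            // Naive merge: re-sort everything by score, descending.
            Collections.sort(merged, new Comparator<SolrDocument>() {
                public int compare(SolrDocument a, SolrDocument b) {
                    return Float.compare((Float) b.getFieldValue("score"),
                                         (Float) a.getFieldValue("score"));
                }
            });

            // Print the top 10 of the merged list.
            for (SolrDocument doc : merged.subList(0, Math.min(10, merged.size()))) {
                System.out.println(doc.getFieldValue("id") + " " + doc.getFieldValue("score"));
            }
        }
    }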

Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Daniel,


----- Original Message -----
> From: Daniel Bruegge <da...@googlemail.com>
> To: solr-user@lucene.apache.org; Otis Gospodnetic <ot...@yahoo.com>
> Cc: 
> Sent: Thursday, January 19, 2012 5:49 AM
> Subject: Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
> 
> On Thu, Jan 19, 2012 at 4:51 AM, Otis Gospodnetic <
> otis_gospodnetic@yahoo.com> wrote:
>> 
>>  Huge is relative. ;)
>>  Huge Solr clusters also often have huge hardware. Servers with 16 cores
>>  and 32 GB RAM are becoming very common, for example.
>>  Another thing to keep in mind is that while lots of organizations have
>>  huge indices, only some portions of them may be hot at any one time.  We've
>>  had a number of clients who index social media or news data and while all
>>  of them have giant indices, typically only the most recent data is really
>>  actively searched.
>
> So let's say, if I have for example an index of 100GB with millions of
> documents, but 99% of the queries only hit the latest 200,000 documents in
> the index, can I easily handle this on a machine which is not so powerful?
> So with 'hot' you mean a subset of the whole index. You don't mean that
> there is e.g. one huge archive index and an active index in separate Solr
> instances?

That's correct, I'm not referring to one huge archive index and one smaller active index.

Otis

----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html



>>  > Because I also read often, that the Index size of one shard
>>  >should fit into RAM.
>> 
>>  Nah.  Don't take this as "the whole index needs to fit in RAM".  Just "the
>>  hot parts of the index should fit in RAM".  This is related to what I wrote
>>  above.
>> 
> 
> Ah, ok. Good to know. I always tried to split the index over multiple
> shards, because I noticed a big performance loss when I tried to put it
> on one machine. But maybe this is also connected to the 'hot' and 'not hot'
> parts. Thanks.
> 
> 
>> 
>>  > Or at least the heap size should be as big as the
>>  > index size. So I see a lot of limitations hardware-wise. Or am I on the
>>  > totally wrong track?
>> 
>>  Regarding heap - nah, that's not correct.  The heap is usually much
>>  smaller than the index and RAM is given to the OS to use for data caching.
>> 
> 
> Oh, ok. Thanks for this information. Maybe I can tweak the settings a bit
> then. But I got several GC errors etc., so I am always trying to adjust
> these heap/GC settings, but I haven't found the perfect settings so far.
> 
> Thanks.
> 
> Daniel
> 
> 
>> 
>>  Otis
>>  ----
>>  Performance Monitoring SaaS for Solr -
>>  http://sematext.com/spm/solr-performance-monitoring/index.html
>> 
>> 
>> 
>>  > On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller <ma...@gmail.com> wrote:
>>  >
>>  >> You can raise the limit to a point.
>>  >>
>>  >> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>>  >>
>>  >> > Hi,
>>  >> >
>>  >> > I am just wondering how I can 'grow' a distributed Solr setup to an index
>>  >> > size of a couple of terabytes, when one of the distributed Solr limitations
>>  >> > is max. 4000 characters in URI limitation. See:
>>  >> >
>>  >> > *The number of shards is limited by number of characters allowed for GET
>>  >> >> method's URI; most Web servers generally support at least 4000 characters,
>>  >> >> but many servers limit URI length to reduce their vulnerability to Denial
>>  >> >> of Service (DoS) attacks.
>>  >> >> *
>>  >> >
>>  >> >> *(via
>>  >> >> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>>  >> >> )*
>>  >> >>
>>  >> >
>>  >> > Is the only way then to make multiple distributed solr clusters and query
>>  >> > them independently and merge them in application code?
>>  >> >
>>  >> > Thanks. Daniel
>>  >>
>>  >> - Mark Miller
>>  >> lucidimagination.com

Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Daniel Bruegge <da...@googlemail.com>.
On Thu, Jan 19, 2012 at 4:51 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:
>
> Huge is relative. ;)
> Huge Solr clusters also often have huge hardware. Servers with 16 cores
> and 32 GB RAM are becoming very common, for example.
> Another thing to keep in mind is that while lots of organizations have
> huge indices, only some portions of them may be hot at any one time.  We've
> had a number of clients who index social media or news data and while all
> of them have giant indices, typically only the most recent data is really
> actively searched.
>

So let's say, if I have for example an index of 100GB with millions of
documents, but 99% of the queries only hit the latest 200,000 documents in
the index, can I easily handle this on a machine which is not so powerful?
So with 'hot' you mean a subset of the whole index. You don't mean that
there is e.g. one huge archive index and an active index in separate Solr
instances?


>
> > Because I also read often, that the Index size of one shard
> >should fit into RAM.
>
> Nah.  Don't take this as "the whole index needs to fit in RAM".  Just "the
> hot parts of the index should fit in RAM".  This is related to what I wrote
> above.
>

Ah, ok. Good to know. I always tried to split the index over multiple
shards, because I noticed a big performance loss when I tried to put it
on one machine. But maybe this is also connected to the 'hot' and 'not hot'
parts. Thanks.


>
> > Or at least the heap size should be as big as the
> > index size. So I see a lot of limitations hardware-wise. Or am I on the
> > totally wrong track?
>
> Regarding heap - nah, that's not correct.  The heap is usually much
> smaller than the index and RAM is given to the OS to use for data caching.
>

Oh, ok. Thanks for this information. Maybe I can tweak the settings a bit
then. But I got several GC errors etc., so I am always trying to adjust
these heap/GC settings, but I haven't found the perfect settings so far.

Thanks.

Daniel


>
> Otis
> ----
> Performance Monitoring SaaS for Solr -
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
> > On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller <ma...@gmail.com> wrote:
> >
> >> You can raise the limit to a point.
> >>
> >> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
> >>
> >> > Hi,
> >> >
> >> > I am just wondering how I can 'grow' a distributed Solr setup to an index
> >> > size of a couple of terabytes, when one of the distributed Solr limitations
> >> > is max. 4000 characters in URI limitation. See:
> >> >
> >> > *The number of shards is limited by number of characters allowed for GET
> >> >> method's URI; most Web servers generally support at least 4000 characters,
> >> >> but many servers limit URI length to reduce their vulnerability to Denial
> >> >> of Service (DoS) attacks.
> >> >> *
> >> >
> >> >> *(via
> >> >> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
> >> >> )*
> >> >>
> >> >
> >> > Is the only way then to make multiple distributed solr clusters and query
> >> > them independently and merge them in application code?
> >> >
> >> > Thanks. Daniel
> >>
> >> - Mark Miller
> >> lucidimagination.com
> >>
>

Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Daniel,

>________________________________
> From: Daniel Bruegge <da...@googlemail.com>
>Subject: Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
> 
>But you can read so often about huge solr clusters and I am wondering how
>they do this. 

Huge is relative. ;)
Huge Solr clusters also often have huge hardware. Servers with 16 cores and 32 GB RAM are becoming very common, for example.
Another thing to keep in mind is that while lots of organizations have huge indices, only some portions of them may be hot at any one time.  We've had a number of clients who index social media or news data and while all of them have giant indices, typically only the most recent data is really actively searched.
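
To make "hot" a bit more concrete: in setups like these, most queries are constrained to a recent time window, typically with a cached filter query, so only the newest slice of the index has to stay in memory. A minimal SolrJ sketch, assuming a local Solr instance and a date field called 'published_date' (both assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HotWindowQuery {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("text:foo");
            // Restrict matches to the last 7 days; 'published_date' is a made-up field name.
            // The filter is cached in the filterCache, and the matching documents are the
            // recently indexed ("hot") slice of the data.
            q.addFilterQuery("published_date:[NOW/DAY-7DAYS TO NOW]");
            q.setRows(10);

            QueryResponse rsp = solr.query(q);
            System.out.println("hits in the hot window: " + rsp.getResults().getNumFound());
        }
    }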

> Because I also read often, that the Index size of one shard 
>should fit into RAM. 

Nah.  Don't take this as "the whole index needs to fit in RAM".  Just "the hot parts of the index should fit in RAM".  This is related to what I wrote above.

> Or at least the heap size should be as big as the
> > index size. So I see a lot of limitations hardware-wise. Or am I on the
> totally wrong track?

Regarding heap - nah, that's not correct.  The heap is usually much smaller than the index and RAM is given to the OS to use for data caching.
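
For a rough feel of the ratio described here (heap much smaller than the index, the rest of RAM left to the OS page cache), a trivial check like the one below can be run on a Solr box. The index path is an assumption; point it at the core's data/index directory.

    import java.io.File;

    public class HeapVsIndex {
        public static void main(String[] args) {
            // Maximum heap the JVM was started with (-Xmx).
            long heapMax = Runtime.getRuntime().maxMemory();
            // Size of the Lucene index on disk (the default path here is only an example).
            long indexBytes = dirSize(new File(args.length > 0 ? args[0] : "solr/data/index"));
            System.out.printf("max heap: %d MB, index on disk: %d MB%n",
                    heapMax / (1024 * 1024), indexBytes / (1024 * 1024));
        }

        static long dirSize(File dir) {
            long total = 0;
            File[] files = dir.listFiles();
            if (files == null) return 0;
            for (File f : files) {
                total += f.isDirectory() ? dirSize(f) : f.length();
            }
            return total;
        }
    }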

Otis
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html



>On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller <ma...@gmail.com> wrote:
>
>> You can raise the limit to a point.
>>
>> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>>
>> > Hi,
>> >
>> > I am just wondering how I can 'grow' a distributed Solr setup to an index
>> > size of a couple of terabytes, when one of the distributed Solr limitations
>> > is max. 4000 characters in URI limitation. See:
>> >
>> > *The number of shards is limited by number of characters allowed for GET
>> >> method's URI; most Web servers generally support at least 4000 characters,
>> >> but many servers limit URI length to reduce their vulnerability to Denial
>> >> of Service (DoS) attacks.
>> >> *
>> >
>> >> *(via
>> >> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>> >> )*
>> >>
>> >
>> > Is the only way then to make multiple distributed solr clusters and query
>> > them independently and merge them in application code?
>> >
>> > Thanks. Daniel
>>
>> - Mark Miller
>> lucidimagination.com
>>
> 


Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Daniel Bruegge <da...@googlemail.com>.
But you can read so often about huge Solr clusters and I am wondering how
they do this. Because I also read often that the index size of one shard
should fit into RAM, or at least that the heap size should be as big as the
index size. So I see a lot of limitations hardware-wise. Or am I on the
totally wrong track?

Daniel

On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller <ma...@gmail.com> wrote:

> You can raise the limit to a point.
>
> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>
> > Hi,
> >
> > I am just wondering how I can 'grow' a distributed Solr setup to an index
> > size of a couple of terabytes, when one of the distributed Solr limitations
> > is max. 4000 characters in URI limitation. See:
> >
> > *The number of shards is limited by number of characters allowed for GET
> >> method's URI; most Web servers generally support at least 4000 characters,
> >> but many servers limit URI length to reduce their vulnerability to Denial
> >> of Service (DoS) attacks.
> >> *
> >
> >> *(via
> >> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
> >> )*
> >>
> >
> > Is the only way then to make multiple distributed solr clusters and query
> > them independently and merge them in application code?
> >
> > Thanks. Daniel
>
> - Mark Miller
> lucidimagination.com
>

Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Mark Miller <ma...@gmail.com>.
You can raise the limit to a point.
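
Besides raising the container's limit, another way to keep the client request itself under the GET URI limit, if queries go through SolrJ, is to send them as POST so the long shards parameter travels in the request body. A sketch under assumptions: the shard list and Solr URL are made up, and it presumes a SolrJ version whose SolrServer has query(params, METHOD).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PostDistributedQuery {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("text:foo");
            // A made-up shard list; with many shards this parameter alone can push
            // a GET URL past the servlet container's limit.
            q.set("shards",
                  "shard1.example.com:8983/solr,shard2.example.com:8983/solr," +
                  "shard3.example.com:8983/solr,shard4.example.com:8983/solr");

            // Sending the request as POST puts the parameters in the request body
            // instead of the URL, so this hop is not bound by the URI length.
            QueryResponse rsp = solr.query(q, SolrRequest.METHOD.POST);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
        }
    }

This only covers the request from the client to the coordinating Solr node, which is the GET URI the quoted documentation talks about; raising the container's limit, as suggested in this thread, is still the general answer.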

On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:

> Hi,
> 
> I am just wondering how I can 'grow' a distributed Solr setup to an index
> size of a couple of terabytes, when one of the distributed Solr limitations
> is max. 4000 characters in URI limitation. See:
> 
> *The number of shards is limited by number of characters allowed for GET
>> method's URI; most Web servers generally support at least 4000 characters,
>> but many servers limit URI length to reduce their vulnerability to Denial
>> of Service (DoS) attacks.
>> *
> 
> 
> 
>> *(via
>> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>> )*
>> 
> 
> Is the only way then to make multiple distributed solr clusters and query
> them independently and merge them in application code?
> 
> Thanks. Daniel

- Mark Miller
lucidimagination.com

Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

Posted by Darren Govoni <da...@ontrenet.com>.
Try changing the URI/HTTP/GET size limitation on your app server.

On 01/18/2012 05:59 PM, Daniel Bruegge wrote:
> Hi,
>
> I am just wondering how I can 'grow' a distributed Solr setup to an index
> size of a couple of terabytes, when one of the distributed Solr limitations
> is max. 4000 characters in URI limitation. See:
>
> *The number of shards is limited by number of characters allowed for GET
>> method's URI; most Web servers generally support at least 4000 characters,
>> but many servers limit URI length to reduce their vulnerability to Denial
>> of Service (DoS) attacks.
>> *
>
>
>> *(via
>> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>> )*
>>
> Is the only way then to make multiple distributed solr clusters and query
> them independently and merge them in application code?
>
> Thanks. Daniel
>