Posted to solr-user@lucene.apache.org by Srikant Jakilinki <sr...@gmail.com> on 2008/01/16 06:23:05 UTC

Solr in a distributed multi-machine high-performance environment

Hi All,

Our group needs to index and search several million documents (TREC)
in real time, with millisecond response times. For the moment we
prefer scale-out (throw more commodity machines at the problem) to
scale-up (faster disks, more RAM). This is in turn inspired by the
"Scale-out vs. Scale-up" paper (mail me if you want a copy), in which
it was proven that this kind of distribution scales better and is more
resilient.

So, are there any resources available (Wiki, tutorials, slides, README
etc.) that shed light on, and guide newbies through, running Solr in a
multi-machine scenario? I have gone through the mailing lists and the
site but could not really find any answers or hands-on material. An
ad hoc guideline to get things working with 2 machines might just be
enough, but to think out loud and solicit responses from the list,
here are my questions:

1) Solr that has to handle a fairly large index which has to be split
up across multiple disks (using Multicore?)
- Space is not a problem since we can use NFS, but that is not
recommended as we would only exploit 1 processor
2) Solr that has to handle a large collective index which has to be
split up across multiple machines
- The index is ever-increasing (TB scale) and dynamic, and all of it
has to be searchable at any point
3) Solr that has to exploit multiple machines because we have plenty
of them in a tightly coupled P2P scenario
- Machines are not a problem, but will they be if they are of varied
configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE
1.1 to 1.6)?
4) Solr that has to distribute load across several machines
- The index(es) could be shared though, say by using a distributed
filesystem (Hadoop?)

In each of the above cases (we might use all of these strategies in
various use cases), the application should use Solr strictly as a
backend and as a named service (IP or host:port) so that we can expose
this application (and the service) to the web or intranet. Machine
failures should be tolerated too. Also, does Solr manage load
balancing out of the box when it is configured to work with multiple
machines?

Maybe this is a superfluous question, but are Solr and/or Nutch the
only ways to use Lucene in a multi-machine environment? Or is there
some hidden document/project somewhere that makes it possible by
exposing a regular Lucene process over the network using RMI or
something? It is my understanding (could be wrong) that Nutch and, to
some extent, Solr do not perform well when there is a lot of indexing
activity in parallel with search. There is also batch processing, and
perhaps we can use Nutch/Solr there. Even so, we need multi-machine
directions.

I am sure that multiple machines make possible a lot of other
approaches that might meet the goal better, and that others have
practical experience with them. So any advice and tips are also very
welcome. We intend to document things and do some benchmarking along
the way, in the open spirit.

Really sorry for the length but I hope some answers are forthcoming.

Cheers,
Srikant

Re: Solr in a distributed multi-machine high-performance environment

Posted by Mike Klaas <mi...@gmail.com>.

On 15-Jan-08, at 9:23 PM, Srikant Jakilinki wrote:

> 2) Solr that has to handle a large collective index which has to be
> split up on multi-machines
> - The index is ever increasing (TB scale) and dynamic and all of it
> has to be searched at any point

This will require significant development on your part.  Nutch may be  
able to provide more of what you need OOB.

> 3) Solr that has to exploit multi-machines because we have plenty of
> them in a tightly coupled P2P scenario
> - Machines are not a problem but will they be if they are of varied
> configurations (PIII to Core2; Linux to Vista; 32-bit to 64-bit; J2SE
> 1.1 to 1.6)

Solr requires Java 1.5; Lucene requires Java 1.4. Also, there is
certainly no point in mixing PIIIs and modern CPUs: trying to achieve
the appropriate balance between machines of such disparate capability
will take much more effort than will be gained by using them.

-Mike

Re: Solr in a distributed multi-machine high-performance environment

Posted by Mike Klaas <mi...@gmail.com>.

On 16-Jan-08, at 11:09 AM, Srikant Jakilinki wrote:

> Thanks for that Shalin. Looks like I have to wait and keep track of  
> developments.
>
> Forgetting about indexes that cannot be fit on a single machine  
> (distributed search), any links to have Solr running in a 2-machine  
> environment? I want to measure how much improvement there will be  
> in performance with the addition of machines for computation (space  
> later) and I need a 2-machine setup for that.

If you are looking for automatic replication and load-balancing
across multiple machines, Solr does not provide that. The typical
strategy is as follows: index half the documents on one machine and
half on the other. Execute the query on both machines simultaneously
(using threads, for instance), and combine the results. You should
observe a speedup.
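
The strategy described above could be sketched roughly as follows. This
is only an illustration, not Solr code: each "shard" is a hypothetical
callable that takes a query and a row count and returns (score, doc_id)
pairs, and make_shard is an in-memory stand-in for an HTTP request to
one Solr machine. Note also that scores computed against separate
indexes are not strictly comparable, since term statistics differ per
index, so the merge by raw score is a simplifying assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def make_shard(docs):
    """Hypothetical in-memory stand-in for one Solr machine holding part of the index."""
    def search(query, rows):
        term = query.lower()
        # Toy scoring: number of times the query term occurs in the document.
        hits = [(text.lower().split().count(term), doc_id)
                for doc_id, text in docs.items()]
        hits = [hit for hit in hits if hit[0] > 0]
        return heapq.nlargest(rows, hits)
    return search


def search_all(shards, query, rows=10):
    """Run the query on every shard in parallel and merge the top `rows` hits by score."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda shard: shard(query, rows), shards))
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(rows, all_hits, key=lambda hit: hit[0])


if __name__ == "__main__":
    shard_a = make_shard({"a1": "solr solr lucene", "a2": "nutch"})
    shard_b = make_shard({"b1": "solr", "b2": "solr solr solr"})
    print(search_all([shard_a, shard_b], "solr", rows=2))
```

Each shard only has to return its own top `rows` hits, because the
overall top `rows` are always contained in the union of the per-shard
top `rows`.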

-Mike

Re: Solr in a distributed multi-machine high-performance environment

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Solr provides a few scripts to create a multiple-machine deployment. One box
is set up as the master (used primarily for writes) and the others as slaves.
Slaves are added as the application requires. The index is transferred
using rsync. See http://wiki.apache.org/solr/CollectionDistribution for
details.

You can put the slaves behind a load balancer or share the slaves among your
front-end servers to measure performance.
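
If no hardware load balancer is at hand, sharing the slaves among the
front-end servers could look something like the sketch below. It is a
minimal illustration under stated assumptions, not part of Solr: each
slave is a hypothetical callable that takes a query string and raises
ConnectionError when the machine is unreachable, and RoundRobinPool is
a made-up name. It is written for single-threaded use.

```python
import itertools


class RoundRobinPool:
    """Hypothetical client-side balancer: rotate over slaves, skipping ones that fail."""

    def __init__(self, slaves):
        self._slaves = list(slaves)
        if not self._slaves:
            raise ValueError("at least one slave is required")
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence of slave positions.
        self._positions = itertools.cycle(range(len(self._slaves)))

    def query(self, q):
        last_error = None
        # Try each slave at most once per query, starting from the next in rotation.
        for _ in range(len(self._slaves)):
            slave = self._slaves[next(self._positions)]
            try:
                return slave(q)
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("all slaves failed") from last_error
```

A failed slave is simply skipped for that query and tried again on a
later rotation, so a machine that comes back up rejoins the pool
without any extra bookkeeping.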

On Jan 17, 2008 12:39 AM, Srikant Jakilinki <sr...@bluebottle.com>
wrote:

> Thanks for that Shalin. Looks like I have to wait and keep track of
> developments.
>
> Forgetting about indexes that cannot be fit on a single machine
> (distributed search), any links to have Solr running in a 2-machine
> environment? I want to measure how much improvement there will be in
> performance with the addition of machines for computation (space later)
> and I need a 2-machine setup for that.
>
> Thanks
> Srikant
>
> Shalin Shekhar Mangar wrote:
> > Look at http://issues.apache.org/jira/browse/SOLR-303
> >
> > Please note that it is still a work in progress, so you may not be able
> > to use it immediately.
> >

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr in a distributed multi-machine high-performance environment

Posted by Srikant Jakilinki <sr...@bluebottle.com>.

Thanks for that Shalin. Looks like I have to wait and keep track of 
developments.

Setting aside indexes that cannot fit on a single machine
(distributed search), are there any links on getting Solr running in
a 2-machine environment? I want to measure how much performance
improves as machines are added for computation (space later), and I
need a 2-machine setup for that.

Thanks
Srikant

Shalin Shekhar Mangar wrote:
> Look at http://issues.apache.org/jira/browse/SOLR-303
>
> Please note that it is still a work in progress, so you may not be able to use
> it immediately.
>

Solr-303 Re: Solr in a distributed multi-machine high-performance environment

Posted by Srikant Jakilinki <sr...@bluebottle.com>.

Yes, 303 looks very promising, and I would like to get involved. I have
gone to the JIRA thread and am very impressed by the activity going on
there. It is THE hangout :-)

Following up, does anyone (especially Yonik or Sharad) have any
documentation of this feature? Such as goals, use cases, requirements,
design, implementation specification, task lists, install manual etc.
Just about anything at all... Even text notes would do, as I intend to
write it up properly.

If so, please pass them over by replying to my ID. I want to make Solr
work out of the box in such a way that when one sets it up on a bunch
of machines (only one config file need be touched), the 'distributed
cluster' (hereafter referred to as a 'node') is able to handle an
index the size of the combined RAM for the fastest response possible,
with near-real-time indexing, search and results. The maximum storage
would be the size of the combined HDDs, but that comes at a penalty in
responsiveness, which is nothing but the classic space-time tradeoff.
The user should be able to set a response limit which, when hit
(autonomic principle?), makes the node alert the user to add more RAM
or machines. The node should be fault-tolerant, and adding a machine
to the node should make the system scale linearly. Yada yada yada...
Solr is a black box for search. A Solr 'node' should be one too.

Of course, any requests can be sent too, and I will incorporate them
into the documentation.

So, guys, any documentation on SOLR-303 please,
Srikant

Shalin Shekhar Mangar wrote:
> Look at http://issues.apache.org/jira/browse/SOLR-303
>
> Please note that it is still a work in progress, so you may not be able to use
> it immediately.

Re: Solr in a distributed multi-machine high-performance environment

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Look at http://issues.apache.org/jira/browse/SOLR-303

Please note that it is still a work in progress, so you may not be able to use
it immediately.


-- 
Regards,
Shalin Shekhar Mangar.