Posted to solr-user@lucene.apache.org by souravm <SO...@infosys.com> on 2008/11/06 22:58:43 UTC

Solr Multicore ...

Hi,

Can I use the multicore feature to have multiple indexes (that is, each core would take care of one type of index) within a single Solr instance?

Will there be any performance impact from this type of setup?

Regards,
Sourav

**************** CAUTION - Disclaimer *****************
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely 
for the use of the addressee(s). If you are not the intended recipient, please 
notify the sender by e-mail and delete the original message. Further, you are not 
to copy, disclose, or distribute this e-mail or its contents to any other person and 
any such actions are unlawful. This e-mail may contain viruses. Infosys has taken 
every reasonable precaution to minimize this risk, but is not liable for any damage 
you may sustain as a result of any virus in this e-mail. You should carry out your 
own virus checks before opening the e-mail or attachment. Infosys reserves the 
right to monitor and review the content of all messages sent to or from this e-mail 
address. Messages sent to or from this e-mail address may be stored on the 
Infosys e-mail system.
***INFOSYS******** End of Disclaimer ********INFOSYS***

Re: Solr for large volume data processing with minimal full-text search

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

If you need anything close to real time (within a few seconds), Hadoop
and its ilk are not an option. Solr is fine, but be prepared to
dedicate a lot of hardware to it.

On Fri, Nov 7, 2008 at 10:53 PM, souravm <SO...@infosys.com> wrote:
> Hi Shalin,
>
> Thanks for your input.
>
> Yes, I agree that my application is not much about full-text search.
>
> A combination of Hive, Chukwa and Pig running on Hadoop can be a good bet, but where they fall short is in online querying of huge data sets.
>
> I am specifically talking about Pig here, whose benchmark figures are on the order of 3-10 minutes with 11 nodes for around 4 GB of data (200 M records), whereas for Solr I can see that the processing time is under a second on 1 node (but with more memory) for around 1 GB of data (0.5 M records).
>
> Since online query performance is one of the key requirements for my application (I think that, irrespective of the type of application, no user would like to wait at the screen for more than a minute), I'm in a dilemma.
>
> Regards,
> Sourav



-- 
--Noble Paul

Solr for large volume data processing with minimal full-text search

Posted by souravm <SO...@infosys.com>.

Hi Shalin,

Thanks for your input.

Yes, I agree that my application is not much about full-text search.

A combination of Hive, Chukwa and Pig running on Hadoop can be a good bet, but where they fall short is in online querying of huge data sets.

I am specifically talking about Pig here, whose benchmark figures are on the order of 3-10 minutes with 11 nodes for around 4 GB of data (200 M records), whereas for Solr I can see that the processing time is under a second on 1 node (but with more memory) for around 1 GB of data (0.5 M records).
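
As a rough sanity check, the two sets of figures can be reduced to records per second per node. This is only back-of-the-envelope arithmetic on the numbers quoted above. It assumes 1 GB = 10^9 bytes and treats "under a second" as exactly one second, so the Solr rate is a lower bound. The two workloads are not like-for-like (a batch scan versus a query against an index), so the result is an illustration, not a benchmark.

```python
def per_node_rate(records, seconds, nodes):
    """Records processed per second per node."""
    return records / seconds / nodes

# Pig: 200 M records in 3 to 10 minutes on 11 nodes (figures quoted above)
pig_fast = per_node_rate(200e6, 3 * 60, 11)
pig_slow = per_node_rate(200e6, 10 * 60, 11)

# Solr: 0.5 M records in under a second on 1 node (taken as 1 s, a lower bound)
solr_floor = per_node_rate(0.5e6, 1, 1)

# Average record size implied by each data set (assuming 1 GB = 1e9 bytes)
pig_bytes_per_record = 4e9 / 200e6
solr_bytes_per_record = 1e9 / 0.5e6

print(f"Pig:  {pig_slow:,.0f} to {pig_fast:,.0f} records/s/node, "
      f"{pig_bytes_per_record:.0f} bytes/record")
print(f"Solr: at least {solr_floor:,.0f} records/s/node, "
      f"{solr_bytes_per_record:.0f} bytes/record")
```

The implied record sizes differ by a factor of 100, which is one more reason to read the comparison with caution.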

Since online query performance is one of the key requirements for my application (I think that, irrespective of the type of application, no user would like to wait at the screen for more than a minute), I'm in a dilemma.

Regards,
Sourav



-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinmangar@gmail.com]
Sent: Friday, November 07, 2008 7:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Multicore ...

From what I can understand, you have little full-text search involved here.
You should probably look at Hadoop and its contrib and sub-projects such as
Pig, Hive and Chukwa.

http://wiki.apache.org/hadoop/
http://wiki.apache.org/hadoop/Hive
http://wiki.apache.org/hadoop/Chukwa
http://incubator.apache.org/pig/




--
Regards,
Shalin Shekhar Mangar.

Re: Solr Multicore ...

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

From what I can understand, you have little full-text search involved here.
You should probably look at Hadoop and its contrib and sub-projects such as
Pig, Hive and Chukwa.

http://wiki.apache.org/hadoop/
http://wiki.apache.org/hadoop/Hive
http://wiki.apache.org/hadoop/Chukwa
http://incubator.apache.org/pig/

On Fri, Nov 7, 2008 at 9:03 PM, souravm <SO...@infosys.com> wrote:

> Hi Guys,
>
> I'm struggling to decide whether Solr would be a fitting solution for me,
> and I would highly appreciate your input.
>
> The key requirements can be summarized as below:
>
> 1. Need to process a very high volume of data online from the log files of
> various applications: hundreds of millions of records, with a total size in
> the range of 30-40 GB.
>
> 2. Flexibility: log file formats differ between applications, and can also
> vary for the same application. However, the log files would be in XML, and
> if a new type has to be supported, its schema would be known beforehand.
>
> 3. The types of queries to be supported:
> a) Mostly aggregate statistics (min, max, average, standard deviation,
> count, etc.) of response times, sales numbers, etc.
> b) Ad hoc queries relating multiple fields in a given log file and joining
> similar fields across multiple log files.
>
> 4. Expected performance is around 10 to 20 seconds for the majority of
> queries; the rest may take somewhat longer.
>
> I'm planning to use Solr with its multicore and distributed search features.
> However, I'm also considering Hadoop with HBase, as that looks to be a
> natural solution for supporting multiple file formats and handling ad hoc
> queries.
>
> I would like to have your viewpoints on this: given the key requirements
> above, is Solr the right choice, or would Hadoop+HBase (or any other open
> source product) be better?
>
> Thanks in advance.
>
> Regards,
> Sourav



-- 
Regards,
Shalin Shekhar Mangar.

RE: Solr Multicore ...

Posted by souravm <SO...@infosys.com>.

Hi Guys,

I'm struggling to decide whether Solr would be a fitting solution for me, and I would highly appreciate your input.

The key requirements can be summarized as below:

1. Need to process a very high volume of data online from the log files of various applications: hundreds of millions of records, with a total size in the range of 30-40 GB.

2. Flexibility: log file formats differ between applications, and can also vary for the same application. However, the log files would be in XML, and if a new type has to be supported, its schema would be known beforehand.

3. The types of queries to be supported:
a) Mostly aggregate statistics (min, max, average, standard deviation, count, etc.) of response times, sales numbers, etc.
b) Ad hoc queries relating multiple fields in a given log file and joining similar fields across multiple log files.

4. Expected performance is around 10 to 20 seconds for the majority of queries; the rest may take somewhat longer.
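
To make the aggregate-statistics requirement concrete, here is a minimal sketch in plain Python, independent of Solr or Hadoop. The XML layout (a `request` element with `app` and `response_ms` attributes) is purely an assumption for illustration; the real log formats are not specified here.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical log format: element and attribute names are assumptions.
SAMPLE_LOG = """
<log>
  <request app="orders" response_ms="120"/>
  <request app="orders" response_ms="80"/>
  <request app="billing" response_ms="200"/>
  <request app="orders" response_ms="100"/>
</log>
"""

def response_times(xml_text, app):
    """Extract response times (ms) for one application from the XML log."""
    root = ET.fromstring(xml_text.strip())
    return [
        float(req.get("response_ms"))
        for req in root.iter("request")
        if req.get("app") == app
    ]

def summarize(values):
    """Return count, min, max, mean and sample standard deviation."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # stdev needs at least two data points
        "sd": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

stats = summarize(response_times(SAMPLE_LOG, "orders"))
print(stats)
```

At the stated volumes a single in-memory pass like this would not be enough on its own; the sketch only pins down what the queries are expected to compute.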

I'm planning to use Solr with its multicore and distributed search features. However, I'm also considering Hadoop with HBase, as that looks to be a natural solution for supporting multiple file formats and handling ad hoc queries.
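
For illustration, a multicore plus distributed search request could be assembled roughly as follows. This is a sketch under stated assumptions: the host names, core names (`applogs`, `saleslogs`) and the `response_ms` field are hypothetical, and the only Solr-specific parts relied on are the per-core `/solr/<core>/select` path and the `shards` request parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def core_select_url(host, core):
    """URL of the select handler for one core (host and core are hypothetical)."""
    return f"http://{host}/solr/{core}/select"

def distributed_query_url(entry_host, entry_core, shards, query, rows=10):
    """Build a query URL that fans out to every (host, core) pair in shards."""
    params = {
        "q": query,
        "rows": rows,
        # shards is a comma-separated list of host:port/path entries, no scheme
        "shards": ",".join(f"{host}/solr/{core}" for host, core in shards),
    }
    return core_select_url(entry_host, entry_core) + "?" + urlencode(params)

# One core per log type, spread over two hypothetical hosts
shards = [("host1:8983", "applogs"), ("host2:8983", "saleslogs")]
url = distributed_query_url(
    "host1:8983", "applogs", shards, "response_ms:[1000 TO *]"
)
print(url)

# Round-trip check: decode the query string to see the parameters sent
print(parse_qs(urlsplit(url).query))
```

Keeping one core per log type means each format can have its own schema, while the `shards` list decides which cores take part in a given query.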

I would like to have your viewpoints on this: given the key requirements above, is Solr the right choice, or would Hadoop+HBase (or any other open source product) be better?

Thanks in advance.

Regards,
Sourav


RE: Solr Multicore ...

Posted by souravm <SO...@infosys.com>.

Thanks Noble for your answer.

Regards,
Sourav

-----Original Message-----
From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:noble.paul@gmail.com]
Sent: Thursday, November 06, 2008 7:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Multicore ...

On Fri, Nov 7, 2008 at 3:28 AM, souravm <SO...@infosys.com> wrote:
>
> Hi,
>
> Can I use the multicore feature to have multiple indexes (that is, each core would take care of one type of index) within a single Solr instance?
Yes, and this is what it was conceived for.
>
> Will there be any performance impact from this type of setup?
No.
>
> Regards,
> Sourav



--
--Noble Paul

Re: Solr Multicore ...

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

On Fri, Nov 7, 2008 at 3:28 AM, souravm <SO...@infosys.com> wrote:
>
> Hi,
>
> Can I use the multicore feature to have multiple indexes (that is, each core would take care of one type of index) within a single Solr instance?
Yes, and this is what it was conceived for.
>
> Will there be any performance impact from this type of setup?
No.
>
> Regards,
> Sourav



-- 
--Noble Paul