Posted to solr-user@lucene.apache.org by Sandeep Gupta <sa...@gmail.com> on 2013/10/24 13:46:24 UTC

Solr subset searching in 100-million document index

Hi,

We have a Solr index of around 100 million documents, each assigned a region
id, growing at a rate of about 10 million documents per month - the average
document size being around 10KB of pure text. The total number of distinct
region ids is itself in the range of 2.5 million.

We want to run search queries restricted to a given list of region ids. The
number of region ids in this list is usually around 250-300 (most of the
time), but can be up to 500, with a maximum cap of around 2000 ids in one
request.


What is the best way to model such queries, besides putting an IN-style list
of ids into the query or using a filter query (fq)? Are there any other,
faster methods available?
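
For reference, this is roughly what the fq approach looks like from SolrJ
today - a minimal sketch, where the field name region_id, the core URL and
the text query are only placeholders:

import java.util.List;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RegionSubsetSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL; in reality this comes from our configuration.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            // Usually 250-300 ids per request, capped at around 2000.
            List<Integer> regionIds = List.of(101, 2045, 73820);

            // The region restriction goes into a filter query:
            // fq=region_id:(101 OR 2045 OR 73820)
            String fq = "region_id:(" + regionIds.stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(" OR ")) + ")";

            SolrQuery query = new SolrQuery("body:\"full text terms\"");
            query.addFilterQuery(fq);

            QueryResponse rsp = solr.query(query);
            System.out.println("Matches: " + rsp.getResults().getNumFound());
        }
    }
}

One practical limit of such a plain OR list: with up to 2000 ids it can run
into Solr's maxBooleanClauses setting (1024 by default), which is part of why
we are looking for alternatives.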


In case it helps: the index is on a VM with 4 virtual cores and currently has
4GB of Java heap allocated out of the machine's 16GB. The query rate does not
exceed 1 per minute for now. If needed, we can throw more hardware at the
index - but the index will still be on a single machine for at least 6 months.

Regards,
Sandeep Gupta

Re: Solr subset searching in 100-million document index

Posted by Aloke Ghoshal <al...@gmail.com>.
Hi Sandeep,

You are quite likely short on capacity with this current set-up:
http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache

A few things for you to confirm:
1. Which version of Solr are you using?
2. The size of your index:
- Are fields stored? How much are the stored fields contributing to the
overall index size? (File types:
http://lucene.apache.org/core/2_9_4/fileformats.html#file-names)
- That you are not bloating the index further with term vectors, norms,
ngrams, reverse wildcards, etc.
3. Response time (Solr & client side) for your typical queries, along with
utilization numbers for memory and CPU.

For your modelling, if possible, you could consider grouping the regions and
searching by a single regions-group-id in place of 250+ region ids (as an OR
query, not an "IN param").
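
A rough sketch of that idea in SolrJ, with placeholder field names
(region_id, region_group_id) and a placeholder mapping of regions to groups -
it only pays off if the id lists you query with line up with groups you can
define up front:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RegionGroupingSketch {
    // Placeholder mapping from region id to group id - in practice use
    // whatever grouping matches how the per-request id lists are built.
    static int groupOf(int regionId) {
        return regionId / 1000;
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            // Index time: store the group id alongside the region id.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("region_id", 2045);
            doc.addField("region_group_id", groupOf(2045));
            doc.addField("body", "document text ...");
            solr.add(doc);
            solr.commit();

            // Query time: filter by one group id instead of 250+ region ids.
            SolrQuery query = new SolrQuery("body:\"full text terms\"");
            query.addFilterQuery("region_group_id:" + groupOf(2045));
            System.out.println(solr.query(query).getResults().getNumFound());
        }
    }
}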

Regards,
Aloke




Re: Solr subset searching in 100-million document index

Posted by Sandeep Gupta <sa...@gmail.com>.
Hi Joel,

Thanks a lot for the information - I haven't worked with PostFilters before,
but found an example at
http://java.dzone.com/articles/custom-security-filtering-solr.

I will try it over the next few days and come back if I still have questions.

Thanks again!



Keep Walking,
~ Sandeep



Re: Solr subset searching in 100-million document index

Posted by Joel Bernstein <jo...@gmail.com>.
Sandeep,

This type of operation can often be expressed as a PostFilter very
efficiently. This is particularly true if the region ids are integer keys.

Joel
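
To make that concrete, here is a hedged sketch of what such a PostFilter
could look like against a recent Solr/Lucene API - the field name region_id
(an integer docValues field) and the class name are assumptions, and the
exact collector API varies between Solr versions:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryVisitor;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

/** Keeps only documents whose region_id (an int docValues field) is in the allowed set. */
public class RegionIdPostFilter extends ExtendedQueryBase implements PostFilter {

    private final Set<Integer> allowedRegions;

    public RegionIdPostFilter(Set<Integer> allowedRegions) {
        this.allowedRegions = allowedRegions;
        setCache(false);                   // post filters must not be cached
        setCost(Math.max(getCost(), 100)); // cost >= 100 makes Solr run it as a post filter
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private NumericDocValues regionIds;

            @Override
            public void doSetNextReader(LeafReaderContext context) throws IOException {
                regionIds = DocValues.getNumeric(context.reader(), "region_id");
                super.doSetNextReader(context);
            }

            @Override
            public void collect(int doc) throws IOException {
                // Only documents that matched the main query reach this point;
                // pass them down the collector chain if their region id is allowed.
                if (regionIds.advanceExact(doc)
                        && allowedRegions.contains((int) regionIds.longValue())) {
                    super.collect(doc);
                }
            }
        };
    }

    @Override
    public String toString(String field) {
        return "RegionIdPostFilter(" + allowedRegions.size() + " region ids)";
    }

    @Override
    public void visit(QueryVisitor visitor) {
        visitor.visitLeaf(this);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof RegionIdPostFilter
                && allowedRegions.equals(((RegionIdPostFilter) other).allowedRegions);
    }

    @Override
    public int hashCode() {
        return allowedRegions.hashCode();
    }
}

To use it in a request you would typically expose it through a small
QParserPlugin registered in solrconfig.xml and pass it as a non-cached filter
query, e.g. fq={!regionFilter cache=false cost=200}101,2045,... (the parser
name regionFilter is hypothetical); the DZone article referenced elsewhere in
this thread walks through that wiring for an ACL filter. Because it runs as a
post filter, the region check is applied only to documents that already match
the main query, instead of building a 250-2000 clause boolean filter up front.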
