Posted to solr-user@lucene.apache.org by David Hastings <ha...@gmail.com> on 2017/02/10 21:55:29 UTC

commongrams

Hey All,
I followed an old blog post about implementing CommonGrams and used a file
of the 400 most popular words on a subset of my data.  The original index
was 33GB with 2.2 million documents; with the 400-word list it grew to
96GB.  I scaled the list down to the 100 most common words and got it to
about 76GB, but cold phrase searches went from 4 seconds with 400 words to
6 seconds with 100.  This will not really scale well: the base index that
this subset comes from currently has 22 million documents and sits around
360GB, so at this rate the full index would be around a TB.  Is there a
common hardware/software configuration for handling TB-size indexes?
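
For reference, the field type is roughly the usual CommonGrams setup
(a sketch only; the tokenizer and the words file name are placeholders,
not my exact schema):

<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index time: emit common-word bigrams alongside the single terms -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- query time: keep the bigrams so phrase queries touch shorter postings lists -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
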
thanks,
DH

Re: commongrams

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/10/2017 2:55 PM, David Hastings wrote:
> the base index that this subset comes from currently has 22 million
> documents and sits around 360GB, so at this rate the full index would be
> around a TB. Is there a common hardware/software configuration for
> handling TB-size indexes?

Memory is the secret to Solr performance.  Lots and lots of memory, so
the OS can effectively cache the index and make sure that the system
doesn't have to actually read the disk for most queries.  The amount of
memory required is frequently surprising to people.  Very large memory
sizes are typically quite expensive, especially in a virtualized world
like Amazon AWS.

If the index reaches a terabyte, you'll probably want between 512GB and
1TB of total memory (across all Solr servers that contain the index). 
If you want to have one or more additional redundant copies of the index
for high availability, plan on adding the same amount of memory again
for each additional copy.
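
As rough arithmetic (same numbers as above), that works out to:

  total memory ~= (512GB to 1TB) x (number of copies of the index)

  1 copy   -> 512GB to 1TB across all servers holding the index
  2 copies -> roughly 1TB to 2TB
  3 copies -> roughly 1.5TB to 3TB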

I maintain a wiki page where this is discussed in greater detail:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn