Posted to solr-user@lucene.apache.org by Ali Nazemian <al...@gmail.com> on 2015/01/04 15:36:34 UTC

Hardware requirement for 500 million documents

Hi,
I was wondering what the hardware requirements are for indexing 500 million
documents in Solr. Suppose the maximum number of concurrent users at peak
time is 20.
Thank you very much.

-- 
A.Nazemian

RE: Hardware requirement for 500 million documents

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
Ali Nazemian [alinazemian@gmail.com] wrote:
> I was wondering what the hardware requirements are for indexing 500 million
> documents in Solr. Suppose the maximum number of concurrent users at peak
> time is 20.

The thread "How large is your solr index" might help, as might https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

We might be able to give some hints, but you really should build a prototype and measure. If we are to give hints, you need to give us more information:

* How large do you expect your index to be, measured in bytes?
* Will you be indexing while searching? How much, and how often?
* What is a typical query? What kind of faceting do you perform?
* Which response times are you looking for?

- Toke Eskildsen

Re: Hardware requirement for 500 million documents

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.
Hi Ali,
  the sizing is not just determined by the number of indexed documents
(and even less by the number of concurrent users).

- Document volume (number of documents, amount of text data to be
indexed with each document, number and types of fields, the cardinality
of fields) guides you to the number of primary shards or collections you
want to have in your environment.

- Query volume determines the replication factor needed to keep
response times acceptable.

- The amount of concurrency (e.g., do you primarily insert new
documents and then query, or is there also a significant deletion
process running in parallel? Partial updates count as
deletion+insertion) and the frequency of required index updates also
influence the sizing.

- Usually, processing (document to text, extractions, enrichment, ...)
will be handled outside Solr (and has to be taken into account when
sizing the hardware for the platform as a whole).
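To make the first two points concrete, here is a rough sketch in Python of how
index size could map to a shard count and query load to a replica count. The
function name and every number in it are assumptions for illustration only;
the per-shard size limit and per-replica throughput are exactly the figures a
prototype would have to measure.

```python
def estimate_layout(index_gb, max_shard_gb, peak_qps, qps_per_replica):
    """Return (shards, replicas_per_shard) for a rough first layout.

    All inputs are whole numbers. -(-a // b) is integer ceiling division.
    """
    shards = max(1, -(-index_gb // max_shard_gb))
    replicas = max(1, -(-peak_qps // qps_per_replica))
    return shards, replicas


# Hypothetical inputs, not recommendations:
shards, replicas = estimate_layout(
    index_gb=1000,       # assumed total index size
    max_shard_gb=100,    # assumed comfortable shard size (measure this)
    peak_qps=40,         # assumed peak queries per second
    qps_per_replica=15,  # assumed throughput of one replica (measure this)
)
print(shards, replicas, shards * replicas)  # shards, replicas, total cores
```

This deliberately ignores update load and preprocessing outside Solr, which
the other two points cover and which need their own measurements.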

Some figures you may want to know before tackling this project:

- Are there different types of documents (e.g., text, media, data) that
yield different amounts of indexable text relative to their raw size
(e.g., plain text ~100%, HTML ~90%, Microsoft Word ~15%, PDF ~10%, ...)?

- What are the size distributions (possibly over these types of documents)?

- What is the expected update frequency? Can you do incremental crawling?

- What types of attributes and facets are you planning to have for these
documents?

- How fresh an index do you need?

- Is this concurrent indexing and querying or will indexing happen,
e.g., at night, while during the day, users will query the platform?

- What are the types of typical queries issued by users?

- Will you have to take security into account (possibly leading to large
Boolean expressions added to queries to filter by entitlement groups)?
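On the security point, a minimal Python sketch of how such an entitlement
filter might be assembled, so you can see how it grows with the number of
groups a user belongs to. The field name "acl_groups" and the helper name are
hypothetical; the result is meant to be sent as a filter query (fq) alongside
the user's own query.

```python
def entitlement_filter(groups, field="acl_groups"):
    """Build a Boolean filter matching documents visible to any given group.

    `field` is a hypothetical multi-valued field holding group names.
    """
    if not groups:
        raise ValueError("at least one entitlement group is required")
    quoted = []
    # Sort and de-duplicate so the same user always yields the same
    # filter string, which helps filter caching.
    for group in sorted(set(groups)):
        # Inside a quoted phrase, backslash and double quote need escaping.
        escaped = group.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:({' OR '.join(quoted)})"


fq = entitlement_filter(["staff", "hr", "staff"])
print(fq)
print(len(entitlement_filter([f"group-{i}" for i in range(500)])))
```

A user in a few hundred groups already produces a filter of several thousand
characters, which is worth including in the prototype queries.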

This will give you a first direction. Then run a prototype to measure
representative figures for scaling, and make your estimates from those.
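As an example of such a first estimate, the text ratios per document type
mentioned above can be turned into a rough volume of indexable text. This is
only a sketch in Python: the ratios are the approximate ones from this mail,
and the document counts and average sizes are made up for illustration.

```python
# Approximate share of raw file size that ends up as indexable text.
TEXT_RATIO = {"plain": 1.00, "html": 0.90, "word": 0.15, "pdf": 0.10}


def indexable_text_bytes(corpus):
    """corpus maps document type -> (document_count, average_raw_bytes)."""
    return sum(
        count * avg_bytes * TEXT_RATIO[doc_type]
        for doc_type, (count, avg_bytes) in corpus.items()
    )


# Hypothetical mix adding up to 500 million documents:
corpus = {
    "html": (300_000_000, 20_000),
    "pdf": (150_000_000, 400_000),
    "word": (50_000_000, 100_000),
}
total = indexable_text_bytes(corpus)
print(f"{total / 1024**4:.1f} TiB of text to index")
```

The resulting figure is text volume, not index size; the ratio between the
two depends on the schema and has to come out of the prototype.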

Best regards,
--Jürgen





