Posted to solr-user@lucene.apache.org by Amit Jha <sh...@gmail.com> on 2014/01/09 19:55:25 UTC

Index size - to determine storage

Hi,

I would like to know: if I index a file, e.g. a PDF of 100KB, what would the size of the index be? What factors should be considered to determine the disk size?

Rgds
AJ

Re: Index size - to determine storage

Posted by Sumit Arora <su...@gmail.com>.
Hi Amit,

This Excel sheet will help you estimate the index size.

size-estimator-lucene-solr.xls
<http://lucene.472066.n3.nabble.com/file/n4111365/size-estimator-lucene-solr.xls>  




-----
Sumit Arora

Re: Index size - to determine storage

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Try running the PDF through standalone Tika and see what comes back. That's the
size of the input. It will usually be quite a small proportion of the PDF size,
possibly down to metadata only and no text if your PDF does not include a
text layer.
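
One rough way to put a number on this, assuming the PDF has already been run
through standalone Tika and the plain-text output saved to a file. The helper
name and both file paths are hypothetical, for illustration only:

```python
import os

def text_ratio(pdf_path, extracted_text_path):
    """Return the extracted-text size as a fraction of the original PDF size.

    Both paths are assumptions: the text file is whatever Tika produced
    for the PDF, saved to disk beforehand.
    """
    pdf_bytes = os.path.getsize(pdf_path)
    text_bytes = os.path.getsize(extracted_text_path)
    if pdf_bytes == 0:
        raise ValueError("PDF file is empty")
    return text_bytes / pdf_bytes
```

A ratio near zero would suggest the PDF is mostly images or has no text layer.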

Then, it depends on your storing and indexing options, your tokenizers,
whether you are using ngrams, synonyms or anything else that multiplies the
content. And so on.
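
To get a feel for how something like ngrams multiplies the content, here is a
toy sketch. It is not Solr's analysis chain: it assumes a plain whitespace
tokenizer and edge n-grams (prefixes) with made-up minimum and maximum lengths,
and only compares term counts:

```python
def edge_ngrams(token, min_len=2, max_len=5):
    """Yield the prefixes of token from min_len up to max_len characters."""
    for n in range(min_len, min(max_len, len(token)) + 1):
        yield token[:n]

def term_counts(text, min_len=2, max_len=5):
    """Return (plain token count, edge n-gram term count) for text."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    grams = [g for t in tokens for g in edge_ngrams(t, min_len, max_len)]
    return len(tokens), len(grams)
```

For the input "solr index size" with these defaults, the three tokens expand to
ten n-gram terms, so the number of indexed terms roughly triples for that field.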

And remember that you need (2? 3?) times more space on disk than a single
index takes, for when Solr does segment merges.
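
Put together, a back-of-envelope estimate might look like the sketch below.
Every input is an assumption you would have to measure or pick yourself: the
average extracted text size per document, the expansion factor from your
storing and indexing options, and the merge headroom (the 2 to 3 times above is
itself a guess):

```python
def estimate_disk_bytes(num_docs, avg_text_bytes, index_expansion, merge_headroom):
    """Rough disk estimate in bytes.

    index_expansion: index size relative to extracted text (measure it).
    merge_headroom: multiplier for extra space needed during segment merges.
    """
    index_bytes = num_docs * avg_text_bytes * index_expansion
    return index_bytes * merge_headroom

# Hypothetical inputs: 1000 docs, 10 KB of text each, 1.5x expansion, 3x headroom.
print(estimate_disk_bytes(1000, 10_000, 1.5, 3))
```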

Regards,
   Alex.




Re: Index size - to determine storage

Posted by Michael Della Bitta <mi...@appinions.com>.
Hi Amit,

It really boils down to how much of that 100KB is actually text, and how
you analyze and store the text. Meaning, it's really hard for us to say.
You're probably going to need to experiment to figure out what the storage
needs for your use case are.
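
A minimal sketch of such an experiment, assuming you index a representative
sample of documents and then measure the index directory on disk. The directory
path is an assumption (typically the core's data/index directory), and the
helper names are hypothetical:

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all files under path, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def bytes_per_doc(index_dir, num_docs_indexed):
    """Average on-disk index bytes per document for the indexed sample."""
    if num_docs_indexed <= 0:
        raise ValueError("num_docs_indexed must be positive")
    return dir_size_bytes(index_dir) / num_docs_indexed
```

Multiplying the per-document figure by the expected document count gives a
first estimate. It is only as good as the sample, and it does not include the
extra space needed during merges.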

Michael Della Bitta


