Posted to solr-user@lucene.apache.org by Mike <mi...@gmail.com> on 2013/04/25 22:00:53 UTC

Massive Positions Files

Hi All,

I'm indexing a pretty large collection of documents (about 500K relatively
long documents taking up >1TB space, mostly in MS Office formats), and am
confused about the file sizes in the index.  I've gotten through about 180K
documents, and the *.pos files add up to 325GB, while everything else
combined takes up less than 5GB--including some large stored fields and
term vectors.  It makes sense to me that the compression on stored fields
helps to keep that part down on large text fields, and that term vectors
wouldn't be too big since they don't need position information, but the
magnitude of the difference is alarming.  Is that to be expected?  Is there
any way to reduce the size of the positions index if phrase searching is a
requirement?

I am using Solr 4.2.1.  These documents have a number of small
metadata elements, along with the big content field.  Like the default
schema, I'm storing but not indexing the content field, and a lot of the
fields get put into a catchall that is indexed and uses term vectors, but
is not stored.
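
If it helps, at the Lucene level that setup boils down to roughly the
following (just a sketch of the Lucene 4.x field API; the class and field
names are placeholders for what my schema actually generates):

  import org.apache.lucene.document.FieldType;
  import org.apache.lucene.index.FieldInfo.IndexOptions;

  public class MyFieldTypes {

    // "content": stored for retrieval but not indexed, so it writes no
    // postings (and therefore no positions) at all.
    static FieldType contentType() {
      FieldType ft = new FieldType();
      ft.setStored(true);
      ft.setIndexed(false);
      ft.freeze();
      return ft;
    }

    // catchall field: indexed with full positions (needed for phrase
    // queries) plus term vectors, but not stored.
    static FieldType catchallType() {
      FieldType ft = new FieldType();
      ft.setStored(false);
      ft.setIndexed(true);
      ft.setTokenized(true);
      ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
      ft.setStoreTermVectors(true);
      ft.freeze();
      return ft;
    }
  }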

Thanks,
Mike

Re: Massive Positions Files

Posted by Jack Krupansky <ja...@basetechnology.com>.
These are the "postings" for all terms - the lists of positions for every 
occurrence of every term for all documents. Sounds to me like it could be 
huge.
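
If you want to see concretely what those files feed, something like this (a
rough sketch against the Lucene 4.x API; the index path and field name are
placeholders) walks the position list of one term:

  import java.io.File;

  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.DocsAndPositionsEnum;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.search.DocIdSetIterator;
  import org.apache.lucene.store.FSDirectory;

  public class ShowPositions {
    public static void main(String[] args) throws Exception {
      // Placeholders: point these at your index directory and catchall field.
      DirectoryReader reader =
          DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
      String field = "text";

      for (AtomicReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(field);
        if (terms == null) continue;
        TermsEnum te = terms.iterator(null);
        if (te.next() == null) continue;          // look at the first term only

        // The .pos data is what backs this enum: one position per occurrence
        // of the term in each document that contains it.
        DocsAndPositionsEnum dpe = te.docsAndPositions(null, null);
        if (dpe == null) continue;                // field indexed without positions
        while (dpe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          for (int i = 0; i < dpe.freq(); i++) {
            System.out.println(te.term().utf8ToString() + " doc=" + dpe.docID()
                + " pos=" + dpe.nextPosition());
          }
        }
        break;                                    // first segment is enough for a peek
      }
      reader.close();
    }
  }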

Did you try a back of the envelope calculation?

325 GB divided by 180K documents works out to roughly 1.8 MB of position
data per document.

How many "words" in a document? You say they are "long".

A delta-encoded position typically takes only a byte or two (a bit more for
rare terms), so 1.8 MB per document would correspond to something on the
order of a million term occurrences per "long" document. I have no idea how
much text an "average" document of yours actually yields, but with source
files averaging more than 2 MB apiece, these numbers do not seem at all
unreasonable.
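
If you want to measure instead of guessing, something along these lines
(again, just a sketch against the Lucene 4.x API; the index path is a
placeholder) prints the total number of indexed term occurrences per field,
which is essentially what feeds the .pos files - divide by the document
count to see occurrences per document:

  import java.io.File;

  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.FieldInfo;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.store.FSDirectory;

  public class PostingsStats {
    public static void main(String[] args) throws Exception {
      // Placeholder: point this at your index directory.
      DirectoryReader reader =
          DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));

      for (AtomicReaderContext ctx : reader.leaves()) {
        for (FieldInfo fi : ctx.reader().getFieldInfos()) {
          if (!fi.isIndexed()) continue;
          Terms terms = ctx.reader().terms(fi.name);
          if (terms == null) continue;
          // sumTotalTermFreq = total token count for the field in this
          // segment, i.e. roughly how many positions get written;
          // -1 if the codec does not track it.
          System.out.println("segment=" + ctx.ord
              + " field=" + fi.name
              + " indexOptions=" + fi.getIndexOptions()
              + " tokens=" + terms.getSumTotalTermFreq()
              + " docs=" + ctx.reader().maxDoc());
        }
      }
      reader.close();
    }
  }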

Now, let's see what kind of precise answer the Lucene guys give you!

-- Jack Krupansky
