Posted to solr-user@lucene.apache.org by Phillip Farber <pf...@umich.edu> on 2008/11/03 18:31:23 UTC

Huge increase in index size adding just 2 fields

Hi,

We're indexing a lot of dirty OCR, so the index is really huge due to 
the size of the position file.  We still get ok response time though, 
with a median of 100ms.  Phrase queries are a different matter, 
obviously.  But as we add a couple of fields we're seeing some really 
large increases in index size that do not make sense.

Our 500,000 document index is 120G. Its simple schema is:

<field name="id" type="string" indexed="true" stored="true" 
required="true"/>
<field name="ocr" type="Ocr" indexed="true" stored="false" required="true"/>
<field name="title" type="Ocr" indexed="true" stored="true" 
required="true"/>
<field name="author" type="Ocr" indexed="true" stored="true" 
required="true"/>
<field name="rights" type="sint" indexed="true" stored="true" 
required="true"/>

We added the following 2 fields to the above schema:

<field name="date" type="date" indexed="true" stored="true" 
required="true"/>
<field name="hlb" type="string" indexed="true" stored="true" 
multiValued="true"/>

where the "hlb" field consists of not more than 3-4 strings such as 
"Social Science".

Our 500,000 document index size increased to 166G!  This seems 
completely wrong.  Looking at the directory listings for each case it 
appears every one of the files grew in size.

How can this be?

Phil

===

120G index:

-rw-r--r--  1 tomcat admin     81023261 Sep 24 06:00 _fj.fdt
-rw-r--r--  1 tomcat admin      4000072 Sep 24 06:00 _fj.fdx
-rw-r--r--  1 tomcat admin           33 Sep 24 06:00 _fj.fnm
-rw-r--r--  1 tomcat admin  14069125169 Sep 24 06:16 _fj.frq
-rw-r--r--  1 tomcat admin      1500031 Sep 24 06:16 _fj.nrm
-rw-r--r--  1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
-rw-r--r--  1 tomcat admin     58677668 Sep 24 08:25 _fj.tii
-rw-r--r--  1 tomcat admin   4319853217 Sep 24 08:32 _fj.tis
-rw-r--r--  1 tomcat admin           42 Sep 24 08:32 segments_fo
-rw-r--r--  1 tomcat admin           20 Sep 24 08:32 segments.gen

166G index (+ 2 fields)

-rw-r--r-- 1 tomcat admin    113530692 Oct 21 10:42 _fh.fdt
-rw-r--r-- 1 tomcat admin      3960256 Oct 21 10:42 _fh.fdx
-rw-r--r-- 1 tomcat admin           44 Oct 21 10:42 _fh.fnm
-rw-r--r-- 1 tomcat admin  15242830112 Oct 21 12:58 _fh.frq
-rw-r--r-- 1 tomcat admin      1485100 Oct 21 12:58 _fh.nrm
-rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
-rw-r--r-- 1 tomcat admin     72760439 Oct 21 12:58 _fh.tii
-rw-r--r-- 1 tomcat admin   5337669551 Oct 21 12:58 _fh.tis
-rw-r--r-- 1 tomcat admin           42 Oct 21 12:58 segments_fk
-rw-r--r-- 1 tomcat admin           20 Oct 21 12:58 segments.gen
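
A quick way to see where the growth sits is to diff the two listings file by file. The sketch below (Python; sizes are copied from the listings above, and the dictionary and function names are only illustrative) prints the change per file extension. By these numbers most of the added bytes are in .prx, .frq and .tis, while .fdx and .nrm are slightly smaller in the second index. The listed files also sum to roughly 128 GB and 139 GB, so the 166G figure is not reproduced by the second listing on its own; the thread does not say where the remainder comes from.

```python
# File sizes in bytes, copied from the two directory listings above
# (the tiny segments_* and segments.gen files are left out).
before = {
    "fdt": 81023261, "fdx": 4000072, "fnm": 33, "frq": 14069125169,
    "nrm": 1500031, "prx": 109247382360, "tii": 58677668, "tis": 4319853217,
}
after = {
    "fdt": 113530692, "fdx": 3960256, "fnm": 44, "frq": 15242830112,
    "nrm": 1485100, "prx": 117927610810, "tii": 72760439, "tis": 5337669551,
}


def growth(before, after):
    """Return {extension: (delta_bytes, percent_change)} for each file."""
    return {
        ext: (after[ext] - before[ext],
              100.0 * (after[ext] - before[ext]) / before[ext])
        for ext in before
    }


deltas = growth(before, after)

# Largest absolute growth first.
for ext, (delta, pct) in sorted(deltas.items(), key=lambda kv: -kv[1][0]):
    print(f".{ext:<4} {delta:>+15,d} bytes  {pct:+6.1f}%")

total_before = sum(before.values())
total_after = sum(after.values())
print(f"total {total_before / 1e9:.1f} GB -> {total_after / 1e9:.1f} GB")
```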

Re: Huge increase in index size adding just 2 fields

Posted by Phillip Farber <pf...@umich.edu>.
Hi Otis and Hoss,

My dates are not too granular.  They're always YYYY-MM-DD 00:00:00, but I 
see that I did not set omitNorms on the date and hlb fields.  Thanks 
for pointing me in the right direction.

Phil



Re: Huge increase in index size adding just 2 fields

Posted by Chris Hostetter <ho...@fucit.org>.
: We added the following 2 fields to the above schema as follows:
: 
: <field name="date" type="date" indexed="true" stored="true" required="true"/>
: <field name="hlb" type="string" indexed="true" stored="true"
: multiValued="true"/>
: 
: where the "hlb" field consists of not more than 3-4 strings such as "Social
: Science".
: 
: Our 500,000 document index size increased to 166G!  This seems completely

if you don't need fieldNorms for these fields (it almost never makes sense 
for dates and based on your description of hlb it doesn't sound like you'd 
need it there either) make sure that's disabled (you might already be 
doing that in the fieldType declarations, but i'm not sure)

another way to reduce the amount of space (and improve date range query 
speed) is to reduce the granularity of the dates you index (i.e. round off 
to the nearest second, minute, hour, or day) so the number of unique terms 
in the field is reduced.

-Hoss
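
To get a feel for how much space norms can take, here is a rough sketch. It assumes the usual layout for Lucene indexes of this era: one norm byte per document per field that keeps norms, a 4-byte .nrm header, and one 8-byte pointer per document in .fdx. These format details are assumptions, not something stated in the thread, and the function name is illustrative. Under those assumptions both .nrm sizes from the listings fit three fields with norms, and norms for two more fields would add on the order of 1 MB.

```python
def norms_bytes(num_docs, fields_with_norms, header_bytes=4):
    """Estimated .nrm size: one byte per document per field that keeps norms.

    header_bytes=4 is an assumption about the .nrm file header.
    """
    return num_docs * fields_with_norms + header_bytes


# Assumption: .fdx holds one 8-byte pointer per document,
# so its size divided by 8 gives the document count of each index.
docs_before = 4000072 // 8
docs_after = 3960256 // 8

# Extra space if both new fields (date, hlb) kept norms.
extra = norms_bytes(docs_after, 2, header_bytes=0)
print(f"{docs_after:,} docs: about {extra / 1e6:.1f} MB of norms for 2 fields")

# Cross-check the estimate against the .nrm sizes in the two listings.
for docs, observed in ((docs_before, 1500031), (docs_after, 1485100)):
    print(docs, observed, norms_bytes(docs, 3) == observed)
```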


Re: Huge increase in index size adding just 2 fields

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I'll make a very wild guess and say that it's possible for this to happen if your dates are very granular (down to milliseconds).  All of a sudden you probably got 500,000 new terms there.  Wild guess.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
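
The effect of date granularity on the number of unique terms can be illustrated with made-up data. The sketch below generates 500,000 hypothetical millisecond-precision timestamps (not the poster's data), formats them in the ISO 8601 UTC style that Solr date fields accept, and counts distinct values before and after rounding down to the day. The helper names are illustrative only.

```python
import random
from datetime import datetime, timedelta, timezone


def solr_date(dt):
    """Format as ISO 8601 UTC with milliseconds, e.g. 2008-11-03T18:31:23.123Z."""
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def round_to_day(dt):
    """Drop the time-of-day part so all timestamps on one day share a term."""
    return dt.replace(hour=0, minute=0, second=0, microsecond=0)


random.seed(0)
start = datetime(2000, 1, 1, tzinfo=timezone.utc)
span_ms = 8 * 365 * 86_400_000  # hypothetical range: 8 * 365 days

# Hypothetical data: one millisecond-precision timestamp per document.
stamps = [start + timedelta(milliseconds=random.randrange(span_ms))
          for _ in range(500_000)]

fine = {solr_date(dt) for dt in stamps}
coarse = {solr_date(round_to_day(dt)) for dt in stamps}

print(len(fine), "unique terms at millisecond granularity")
print(len(coarse), "unique terms at day granularity")
```

With millisecond precision nearly every document contributes its own term, while day precision caps the count at the number of days in the range.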





Re: Huge increase in index size adding just 2 fields

Posted by Phillip Farber <pf...@umich.edu>.
May I ask again whether an index size increase from 120GB to 166GB is 
expected simply by adding a stored date and a stored repeating string 
field of length perhaps 20, with roughly 2 values per doc on average, 
for 500,000 docs?  Each doc is a large body of OCR and the position 
index dominates due to the large number of terms.

Thanks,

Phil
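
For scale, the stored values described above can be estimated with simple arithmetic. The sketch below uses the figures from this message (500,000 docs, a date, about 2 hlb values of about 20 characters each); the 20-character date length and the 3 bytes of per-value overhead are assumptions, and the function name is illustrative. The result is tens of megabytes, in the same range as the .fdt growth in the listings and far below the reported 46G difference.

```python
def stored_bytes(num_docs, date_len=20, hlb_len=20, hlb_values=2,
                 per_value_overhead=3):
    """Rough size of the added stored values.

    date_len=20 assumes a form like 2008-11-03T00:00:00Z.
    per_value_overhead=3 is an assumed cost per stored value
    (field number, flag byte, length prefix).
    """
    values_per_doc = 1 + hlb_values
    payload = date_len + hlb_len * hlb_values
    return num_docs * (payload + values_per_doc * per_value_overhead)


estimate = stored_bytes(500_000)
observed_fdt_growth = 113530692 - 81023261     # .fdt sizes from the listings
reported_total_growth = (166 - 120) * 10**9    # "120G" to "166G" as reported

print(f"estimated stored data: {estimate / 1e6:.1f} MB")
print(f"observed .fdt growth:  {observed_fdt_growth / 1e6:.1f} MB")
print(f"share of reported growth: {100 * estimate / reported_total_growth:.2f}%")
```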

