Posted to solr-user@lucene.apache.org by Phillip Farber <pf...@umich.edu> on 2008/11/03 18:31:23 UTC
Huge increase in index size adding just 2 fields
Hi,
We're indexing a lot of dirty OCR. So the index is really huge due to
the size of the position file. We still get ok response time though
with a median of 100ms. Phrase queries are a different matter
obviously. But after adding a couple of fields we're seeing some really
large increases in index size that do not make sense.
Our 500,000 document index is 120G. Its simple schema is:
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="ocr" type="Ocr" indexed="true" stored="false" required="true"/>
<field name="title" type="Ocr" indexed="true" stored="true"
required="true"/>
<field name="author" type="Ocr" indexed="true" stored="true"
required="true"/>
<field name="rights" type="sint" indexed="true" stored="true"
required="true"/>
We added the following 2 fields to the above schema:
<field name="date" type="date" indexed="true" stored="true"
required="true"/>
<field name="hlb" type="string" indexed="true" stored="true"
multiValued="true"/>
where the "hlb" field consists of not more than 3-4 strings such as
"Social Science".
Our 500,000 document index size increased to 166G! This seems
completely wrong. Looking at the directory listings for each case it
appears every one of the files grew in size.
How can this be?
Phil
===
120G index:
-rw-r--r-- 1 tomcat admin 81023261 Sep 24 06:00 _fj.fdt
-rw-r--r-- 1 tomcat admin 4000072 Sep 24 06:00 _fj.fdx
-rw-r--r-- 1 tomcat admin 33 Sep 24 06:00 _fj.fnm
-rw-r--r-- 1 tomcat admin 14069125169 Sep 24 06:16 _fj.frq
-rw-r--r-- 1 tomcat admin 1500031 Sep 24 06:16 _fj.nrm
-rw-r--r-- 1 tomcat admin 109247382360 Sep 24 08:25 _fj.prx
-rw-r--r-- 1 tomcat admin 58677668 Sep 24 08:25 _fj.tii
-rw-r--r-- 1 tomcat admin 4319853217 Sep 24 08:32 _fj.tis
-rw-r--r-- 1 tomcat admin 42 Sep 24 08:32 segments_fo
-rw-r--r-- 1 tomcat admin 20 Sep 24 08:32 segments.gen
166G index (+ 2 fields)
-rw-r--r-- 1 tomcat admin 113530692 Oct 21 10:42 _fh.fdt
-rw-r--r-- 1 tomcat admin 3960256 Oct 21 10:42 _fh.fdx
-rw-r--r-- 1 tomcat admin 44 Oct 21 10:42 _fh.fnm
-rw-r--r-- 1 tomcat admin 15242830112 Oct 21 12:58 _fh.frq
-rw-r--r-- 1 tomcat admin 1485100 Oct 21 12:58 _fh.nrm
-rw-r--r-- 1 tomcat admin 117927610810 Oct 21 12:58 _fh.prx
-rw-r--r-- 1 tomcat admin 72760439 Oct 21 12:58 _fh.tii
-rw-r--r-- 1 tomcat admin 5337669551 Oct 21 12:58 _fh.tis
-rw-r--r-- 1 tomcat admin 42 Oct 21 12:58 segments_fk
-rw-r--r-- 1 tomcat admin 20 Oct 21 12:58 segments.gen
Re: Huge increase in index size adding just 2 fields
Posted by Phillip Farber <pf...@umich.edu>.
Hi Otis and Hoss,
My dates are not too granular. They're always YYYY-MM-DD 00:00:00 but I
see that I did not omitNorms on the date field and hlb field. Thanks
for pointing me in the right direction.
Phil
Re: Huge increase in index size adding just 2 fields
Posted by Chris Hostetter <ho...@fucit.org>.
: We added the following 2 fields to the above schema as follows:
:
: <field name="date" type="date" indexed="true" stored="true" required="true"/>
: <field name="hlb" type="string" indexed="true" stored="true"
: multiValued="true"/>
:
: where the "hlb" field consists of not more than 3-4 strings such as "Social
: Science".
:
: Our 500,000 document index size increased to 166G! This seems completely
if you don't need fieldNorms for these fields (it almost never makes sense
for dates, and based on your description of hlb it doesn't sound like you'd
need it there either) make sure that's disabled (you might already be
doing that in the fieldType declarations, but i'm not sure)
another way to reduce the amount of space (and improve date range query
speed) is to reduce the granularity of the dates you index (i.e., round off
to the nearest second, minute, hour, or day) so the number of unique terms
in the field is reduced.
-Hoss
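As a sketch of the first suggestion above: norms can be switched off per field with the omitNorms attribute in schema.xml. The fragment below reuses the two field definitions from the original post and only adds omitNorms="true". Whether the "date" and "string" fieldType declarations in this particular schema already set it is not known from the thread, so treat this as an illustration rather than a confirmed fix.

```xml
<!-- Same two fields as in the original post, with norms explicitly disabled.
     Assumption: the fieldType declarations do not already set omitNorms. -->
<field name="date" type="date" indexed="true" stored="true"
       required="true" omitNorms="true"/>
<field name="hlb" type="string" indexed="true" stored="true"
       multiValued="true" omitNorms="true"/>
```

A schema change like this normally only affects documents indexed afterwards, so a full reindex would be needed before comparing index sizes.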
Re: Huge increase in index size adding just 2 fields
Posted by Otis Gospodnetic <ot...@yahoo.com>.
I'll make a very wild guess and say that it's possible for this to happen if your dates are very granular (down to milliseconds). All of a sudden you probably got 500,000 new terms there. Wild guess.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
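To illustrate the granularity point: a Solr date value is written as a UTC timestamp, and if the time part is truncated to the day before indexing, all documents from the same day share a single term in the date field. The update message below is a hypothetical sketch against the schema from the original post; every field value in it is made up for illustration.

```xml
<add>
  <doc>
    <!-- all values are hypothetical placeholders -->
    <field name="id">example-0001</field>
    <field name="ocr">full OCR text of the document</field>
    <field name="title">An Example Title</field>
    <field name="author">An Example Author</field>
    <field name="rights">1</field>
    <!-- day granularity: time part fixed at midnight UTC -->
    <field name="date">2008-11-03T00:00:00Z</field>
  </doc>
</add>
```

With day-granular values like this, the number of unique date terms is bounded by the number of distinct days in the collection rather than by the number of documents.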
Re: Huge increase in index size adding just 2 fields
Posted by Phillip Farber <pf...@umich.edu>.
May I ask again whether an index size increase from 120GB to 166GB is
expected simply from adding a stored date field and a stored repeating
string field, of length perhaps 20 and roughly 2 values per doc on
average, across 500,000 docs? Each doc is a large body of OCR and the
position index dominates due to the large number of terms.
Thanks,
Phil