Posted to java-user@lucene.apache.org by jm <jm...@gmail.com> on 2007/03/14 11:03:21 UTC

ways to minimize index size?

Hi,

I want to make my index as small as possible. I noticed
field.setOmitNorms(true); I read on the list that the difference is 1 byte
per field per doc, not huge but hey... Is the only effect that the score
is different? I hardly care about the score, so that would be ok.

And can I add docs without norms to an index that already has docs with norms?
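
To put the "1 byte per field per doc" figure in perspective, here is a rough back-of-the-envelope sketch. It takes that figure from the list at face value; the class name, the helper normsBytes and the example numbers (one million docs, ten fields) are made up for illustration and are not part of Lucene.

```java
public class NormsEstimate {

    // Hypothetical helper: estimated bytes used by norms, assuming
    // exactly 1 byte per field (with norms) per document.
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 1_000_000L; // assumed index size, for illustration only
        int fields = 10;        // assumed number of fields that keep norms
        System.out.println(normsBytes(docs, fields) + " bytes of norms");
    }
}
```

Under these assumptions the saving from omitting norms is about 10 MB, so whether it matters depends on how that compares with the total index size.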

Any other ways to minimize the size of the index? All of my fields but one
are Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO; one is
Field.Store.YES, Field.Index.UN_TOKENIZED and Field.TermVector.NO. I
tried compressing that one and the size was reduced by around 1% (it's a
small field), but I guess compression means worse performance, so I am
not sure about applying that.

thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: ways to minimize index size?

Posted by Erick Erickson <er...@gmail.com>.

OK, I caused more confusion than help with my stemming
statement. The only reason I mentioned it was to illustrate that
performance is not linearly related to size.

It took some effort to put stemming into the index, see
PorterStemmer etc. This is NOT the default. So I took it out
to see what the effect would be.

Why not stemming made the index smaller: we also have the
requirement that phrases (i.e. words in double quotes)
do NOT match the stemmed version. Thus, if we index
"running watching", the following searches have the
indicated results:
run - hits
watch - hits
running - hits
"run watch" does NOT hit.
"running watching" hits

So I indexed the following terms...

run
running$
watch
watching$

with the two forms of run indexed in the same position (0)
and the two forms of watch in the same position (1).
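
The scheme described above can be sketched in plain Java, without Lucene. This is only an illustration under stated assumptions: toyStem is a made-up stand-in for a real stemmer such as PorterStemmer, the class and method names are hypothetical, and the "term@position" strings are just a way to print the result. In Lucene itself the second form would presumably be emitted with a position increment of 0, but that part is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class DualFormTokens {

    // Toy stand-in for a real stemmer (NOT the Porter algorithm):
    // strips a trailing "ing" and collapses a doubled final letter.
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            String stem = word.substring(0, word.length() - 3);
            int n = stem.length();
            if (n >= 2 && stem.charAt(n - 1) == stem.charAt(n - 2)) {
                stem = stem.substring(0, n - 1);
            }
            return stem;
        }
        return word;
    }

    // For each word, emit the stemmed form and the original form marked
    // with '$', both at the same position.
    static List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            out.add(toyStem(words[pos]) + "@" + pos);
            out.add(words[pos] + "$@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [run@0, running$@0, watch@1, watching$@1]
        System.out.println(expand("running watching"));
    }
}
```

Every input word produces two indexed terms, which is why this particular stemmed index came out larger than the unstemmed one.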

I agree that if we didn't have the exact-phrase-match requirement
the stemmed version of the index should be smaller....


Sorry for the confusion
Erick


Re: ways to minimize index size?

Posted by jm <jm...@gmail.com>.

hi Erick,


Well, typically my application will start with some hundreds of
indexes... and then grow at a rate of several per day, forever. At
some point I know I can do some merging etc. if needed.

Size depends on the customer and could be up to 1G per index. That
is why I would like to minimize them. I am not worried about search
performance.

I don't understand how not stemming can reduce the size of an index... I
would think it happens the other way: doesn't stemming make the
words shorter? (I don't stem, so I never looked into it.)

thanks
On 3/14/07, Erick Erickson <er...@gmail.com> wrote:
> Store as little as possible, index as little as possible <G>.....
>
> How big is your index, and how much do you expect it to grow?
> I ask this because it's probably not worth your time to try to
> reduce the index size below some threshold... I found that
> reducing my index from 8G to 4G (through not stemming) gave
> me about a 10% performance improvement, so at some point
> it's just not worth the effort. Also, if you posted the index size,
> it would give folks a chance to say "there's not much you can
> gain by reducing things more". As it is, I don't have a clue
> whether your index is 100M or 100T. The former is in the
> "don't waste your time" class, and the latter is...er...
> different....
>
> I wouldn't bother compressing for 1%....
>
> Question for "the guys" so I can check an assumption....
> Is there any difference between these two?
> Field(Name, Value, Store, index)
> Field(Name, Value, Store, index, Field.TermVector.NO)
>
>
> Best
> Erick
>
