Posted to java-user@lucene.apache.org by Yuliya Palchaninava <yp...@solute.de> on 2010/01/07 17:23:08 UTC

Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Hi,

According to the API documentation: "In general, once the optimize completes, the total size of the index will be less than the size of the starting index. It could be quite a bit smaller (if there were many pending deletes) or just slightly smaller". In our case the index becomes not smaller but larger, namely three times as large.

The unoptimized index doesn't contain compressed fields, which might otherwise have explained the growth of the index during optimization. So we cannot explain what is happening.

Does anyone have an explanation for why the index grows during optimization?

Thanks,
Yuliya


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Simon Willnauer <si...@googlemail.com>.
Do you have a reader open on the index that was opened before your
index was optimized? Maybe there is a reader around holding on to
references to the merged segments.

simon



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Yuliya Palchaninava <yp...@solute.de>.
Mike,

thanks a lot!

That's exactly what we'll do.

Actually we have a lot of dynamic fields which are not analyzed and not involved in field/document boosting, so we can disable norms on these fields without problems. 

Thanks again.

Yuliya
 



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Michael McCandless <lu...@mikemccandless.com>.
Super!  Thanks for bringing closure.

Mike



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Yuliya Palchaninava <yp...@solute.de>.
Thanks again.

Disabling norms, where this was possible without affecting the search quality,
has solved the problem:
- The unoptimized index has become smaller.
- The optimized index is practically the same size as the unoptimized one.

Yuliya



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Michael McCandless <lu...@mikemccandless.com>.
Lucene stores 1 byte (disk and RAM, when searching that field) per
document for any field that has norms enabled, even for documents that
do not contain that field.

In your case, that's ~20 MB per field (once optimize is done), times
559 fields = ~11 GB of storage.

You should index these fields with Field.Index.ANALYZED_NO_NORMS to
turn off norms.  But this means that field/doc boosting, and the
length normalization Lucene normally does (shorter documents get a
better score), will be silently disabled.  Also: you must fully
re-index from scratch, otherwise the norms will turn themselves back
on when segments merge together.

Mike
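
The estimate above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes exactly one norm byte per document per field, as described in this message, and plugs in the document and field counts reported elsewhere in the thread. The class and method names are made up for the example and are not part of Lucene.

```java
public class NormsSizeEstimate {

    /** Norm bytes in a fully merged index: one byte per document per field with norms. */
    static long normsBytes(long numDocs, int fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long numDocs = 20845906L; // document count reported in the thread
        int numFields = 559;      // field count reported in the thread

        long perField = normsBytes(numDocs, 1);
        long total = normsBytes(numDocs, numFields);

        // Decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes).
        System.out.println("per field:  " + perField / 1000000.0 + " MB");
        System.out.println("all fields: " + total / 1000000000.0 + " GB");
    }
}
```

The total comes out at roughly 11.65 GB, which is in line with the growth from 4.1G to about 15G reported for the optimized index.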



Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Yuliya Palchaninava <yp...@solute.de>.
Thanks Michael.

You are probably right.

The unoptimized index is 4.1G; the optimized index is about 15G.

Yes, our documents do have many different indexed fields, and norms are enabled.
Number of fields: 559
Number of documents: 20845906
Number of terms: 25615389

Could you please give me a more detailed explanation of how the storage of norms affects the size of an index?
What exactly do you mean by "norms are not stored sparsely"?

Thanks,
Yuliya

> On Thu, Jan 7, 2010 at 11:50 AM, Yuliya Palchaninava <yp...@solute.de> wrote:
> > Otis,
> >
> > thanks for the answer.
> >
> > Unfortunately the index *directory* remains larger *after* the optimization.
> > In our case the optimization completed successfully and, as you say,
> > there is only one segment in the directory.
> >
> > Any other ideas?
> >
> > Thanks,
> > Yuliya
> >
> >> -----Original Message-----
> >> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> >> Sent: Thursday, 7 January 2010 17:35
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> >>
> >> Yuliya,
> >>
> >> The index *directory* will be larger *while* you are optimizing.
> >> After the optimization is completed successfully, the index directory
> >> will be smaller.  It is possible that your index directory is
> >> large(r) because you have some left-over segments (e.g. from some
> >> earlier failed/interrupted optimizations) that are not really a part
> >> of the index.  After optimizing, you should have only 1 segment, so
> >> if you see more than 1 segment, look at the ones with older
> >> timestamps.  Those can be (re)moved.
> >>
> >>  Otis
> >> --
> >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >>
> >>
> >>
> >> ----- Original Message ----
> >> > From: Yuliya Palchaninava <yp...@solute.de>
> >> > To: "java-user@lucene.apache.org" <ja...@lucene.apache.org>
> >> > Sent: Thu, January 7, 2010 11:23:08 AM
> >> > Subject: Lucene 2.9 and 3.0: Optimized index is thrice as
> >> large as the
> >> > not optimized index
> >> >
> >> > Hi,
> >> >
> >> > According to the api documentation: "In general, once 
> the optimize 
> >> > completes, the total size of the index will be less than
> >> the size of
> >> > the starting index. It could be quite a bit smaller (if 
> there were 
> >> > many pending deletes) or just slightly smaller". In our
> >> case the index
> >> > becomes not smaller but larger, namely thrice as large.
> >> >
> >> > The not optimized index doesn't contain compressed fields,
> >> what could
> >> > have caused the growth of the index due to the 
> otimization. So we 
> >> > cannot explain what happens.
> >> >
> >> > Does someone have an explanation for the index growth due
> >> to the optimization?
> >> >
> >> > Thanks,
> >> > Yuliya
> >> >
> >> >
> >> >
> >> 
> ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >> 
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 


Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Michael McCandless <lu...@mikemccandless.com>.
Do your documents have many different indexed fields?  If so, and
norms are enabled, that could be the cause (norms are not stored
sparsely).
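
To see why non-sparse norms can inflate a merged segment, here is a rough back-of-the-envelope sketch. It assumes the Lucene 2.9/3.0 behaviour of one norm byte per document for every indexed field that has norms enabled, so a merged segment pays for every such field on every document, including documents that never contained the field. The class and method names (NormsEstimate, normsBytes) and all sample numbers are made up for illustration.

```java
public class NormsEstimate {

    /**
     * Rough size of the norms data for one segment, assuming one byte
     * per document per indexed field with norms enabled. This is an
     * assumption about the on-disk format, not a measured value.
     */
    public static long normsBytes(long docCount, int fieldsWithNorms) {
        return docCount * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10 segments of 1,000,000 documents each,
        // where every segment uses its own disjoint set of 50 fields.
        int segments = 10;
        long docsPerSegment = 1_000_000L;
        int fieldsPerSegment = 50;

        // Before optimize: each segment only stores norms for its own fields.
        long before = segments * normsBytes(docsPerSegment, fieldsPerSegment);

        // After optimize: one segment holding all documents and all fields,
        // so every field carries a norm byte for every document.
        long after = normsBytes(segments * docsPerSegment,
                                segments * fieldsPerSegment);

        System.out.println("norms before optimize: " + before + " bytes");
        System.out.println("norms after optimize:  " + after + " bytes");
    }
}
```

If norms do turn out to be the cause, omitting them on fields that need neither length normalization nor index-time boosts (for example via Field.setOmitNorms(true)) is one thing to try.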

But: what actual sizes are we talking about?

Mike

On Thu, Jan 7, 2010 at 11:50 AM, Yuliya Palchaninava <yp...@solute.de> wrote:
> Otis,
>
> thanks for the answer.
>
> Unfortunately the index *directory* remains larger *after* the optimization.
> In our case the optimization completed successfully and, as you say,
> there is only one segment in the directory.
>
> Any other ideas?
>
> Thanks,
> Yuliya



Re: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Could you paste a directory listing of the index from before and after the optimization?
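
If a plain `ls -l` is not handy, such a listing can also be produced from Java. This is a minimal sketch, assuming Java 7 or later for java.nio.file; the class name IndexDirListing is made up, and the index path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class IndexDirListing {

    /** Returns file name -> size in bytes for each regular file in dir, sorted by name. */
    public static Map<String, Long> listSizes(Path dir) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    sizes.put(f.getFileName().toString(), Files.size(f));
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java IndexDirListing <index-directory>");
            return;
        }
        Path dir = Paths.get(args[0]);
        long total = 0;
        // Print one line per file, then the total size of the directory.
        for (Map.Entry<String, Long> e : listSizes(dir).entrySet()) {
            System.out.printf("%12d  %s%n", e.getValue(), e.getKey());
            total += e.getValue();
        }
        System.out.printf("%12d  total%n", total);
    }
}
```

Running it once before and once after the optimize and comparing the two outputs shows which file types account for the growth.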

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Yuliya Palchaninava <yp...@solute.de>
> To: "java-user@lucene.apache.org" <ja...@lucene.apache.org>
> Sent: Thu, January 7, 2010 11:50:29 AM
> Subject: AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> 
> Otis,
> 
> thanks for the answer. 
> 
> Unfortunately the index *directory* remains larger *after* the optimization.
> In our case the optimization completed successfully and, as you say,
> there is only one segment in the directory.
> 
> Any other ideas?
> 
> Thanks,
> Yuliya




AW: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Yuliya Palchaninava <yp...@solute.de>.
Otis,

thanks for the answer. 

Unfortunately the index *directory* remains larger *after* the optimization.
In our case the optimization completed successfully and, as you say,
there is only one segment in the directory.

Any other ideas?

Thanks,
Yuliya

> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Thursday, January 7, 2010 17:35
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9 and 3.0: Optimized index is thrice as
> large as the not optimized index
> 
> Yuliya,
> 
> The index *directory* will be larger *while* you are 
> optimizing.  After the optimization is completed 
> successfully, the index directory will be smaller.  It is 
> possible that your index directory is large(r) because you 
> have some left-over segments (e.g. from some earlier 
> failed/interrupted optimizations) that are not really a part 
> of the index.  After optimizing, you should have only 1 
> segment, so if you see more than 1 segment, look at the ones 
> with older timestamps.  Those can be (re)moved.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch


Re: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yuliya,

The index *directory* will be larger *while* you are optimizing.  After the optimization is completed successfully, the index directory will be smaller.  It is possible that your index directory is large(r) because you have some left-over segments (e.g. from some earlier failed/interrupted optimizations) that are not really a part of the index.  After optimizing, you should have only 1 segment, so if you see more than 1 segment, look at the ones with older timestamps.  Those can be (re)moved.
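
A quick way to check how many segments a directory really contains is to group its files by segment name. The sketch below assumes the usual naming of that era: per-segment files start with an underscore followed by the segment number (_3.fdt, _3.cfs, or _3_1.del with a generation suffix), while segments_N and segments.gen belong to the index as a whole. The class name SegmentGrouper and the sample file names are made up, and the code only reports what is there; it does not decide whether a file is safe to remove.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentGrouper {

    /** Groups file names such as "_3.fdt" or "_3_1.del" under their segment name "_3". */
    public static Map<String, List<String>> groupBySegment(Collection<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            // Skip index-level files (segments_N, segments.gen) and anything else.
            if (!name.startsWith("_")) {
                continue;
            }
            // The segment name ends at the first '.' or at a second '_'.
            int end = 1;
            while (end < name.length()
                    && name.charAt(end) != '.'
                    && name.charAt(end) != '_') {
                end++;
            }
            String segment = name.substring(0, end);
            bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
        }
        return bySegment;
    }

    public static void main(String[] args) {
        // Hypothetical listing: one current segment (_5) plus leftovers from _2.
        List<String> files = Arrays.asList(
                "_2.fdt", "_2.fdx", "_5.cfs", "_5_1.del", "segments_7", "segments.gen");
        for (Map.Entry<String, List<String>> e : groupBySegment(files).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

More than one segment name in the result after a successful optimize is a hint that leftover files are present.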

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Yuliya Palchaninava <yp...@solute.de>
> To: "java-user@lucene.apache.org" <ja...@lucene.apache.org>
> Sent: Thu, January 7, 2010 11:23:08 AM
> Subject: Lucene 2.9 and 3.0: Optimized index is thrice as large as the not optimized index
> 
> Hi,
> 
> According to the API documentation: "In general, once the optimize completes, 
> the total size of the index will be less than the size of the starting index. It 
> could be quite a bit smaller (if there were many pending deletes) or just 
> slightly smaller". In our case the index becomes not smaller but larger, namely 
> thrice as large. 
> 
> The not optimized index doesn't contain compressed fields, which could have 
> caused the growth of the index due to the optimization. So we cannot explain what 
> happens.
> 
> Does someone have an explanation for the index growth due to the optimization?
> 
> Thanks,
> Yuliya