Posted to java-user@lucene.apache.org by maureen tanuwidjaja <au...@yahoo.com> on 2007/03/13 09:04:16 UTC

Urgent : How much actually the disk space needed to optimize the index?

Dear All
  
  How much disk space is actually needed to optimize the index? The explanation given in the documentation seems to be very different from the practical situation.
  
  I have an index of size 18.6 GB and I am going to optimize it. I keep this index on a mobile hard disk with a capacity of 100 GB. I did not use any index reader; I merely call the index writer to optimize this index. However, to my surprise, while optimizing, the index has grown to occupy almost all of the free space. I am pretty sure it will later terminate because there is not sufficient disk space.
  
  This is the content of the index directory:
  ------------------------------------------------------------------------------------------
  Microsoft Windows XP [Version 5.1.2600]
  (C) Copyright 1985-2001 Microsoft Corp.
  F:\DI>dir
   Volume in drive F has no label.
   Volume Serial Number is 9454-C24E
  
   Directory of F:\DI
  
  03/13/2007  02:14 PM    <DIR>          .
  03/13/2007  02:14 PM    <DIR>          ..
  03/13/2007  02:14  PM                 20 segments.gen
  03/13/2007  02:14  PM                 67 segments_34s4
  03/13/2007  12:06  PM                  0 write.lock
  03/13/2007  02:14 PM    41,705,009,152 _1ke1.cfs
  03/13/2007  12:15 PM     1,638,320,227 _1ke1.fdt
  03/13/2007  12:15 PM         4,461,912 _1ke1.fdx
  03/13/2007  12:09 PM         6,295,065 _1ke1.fnm
  03/13/2007  12:26 PM       232,520,666 _1ke1.frq
  03/13/2007  02:08 PM    44,927,549,671 _1ke1.nrm
  03/13/2007  12:26 PM       170,766,513 _1ke1.prx
  03/13/2007  12:26 PM         1,281,924 _1ke1.tii
  03/13/2007  12:26 PM       103,094,835 _1ke1.tis
  03/13/2007  02:14 PM        51,688,575 _1ke1.tvd
  03/13/2007  02:14 PM       882,304,866 _1ke1.tvf
  03/13/2007  02:14 PM         4,461,916 _1ke1.tvx
  03/12/2007  03:24 PM     5,594,336,501 _8km.cfs
                16 File(s) 95,322,091,910 bytes
                  2 Dir(s)   3,915,960,320 bytes free
  
  F:\DI>
  
  
  ----------------------------------------------------------------------------------------
  
  
  I wonder what is happening... I read in the documentation that calling optimize needs available disk space of about 2 times the current index size. And I have more than 2 times 18.6 GB of free space !!!
  
  I am really confused and don't know what is going wrong. This is my code for optimizing:
  ----------------------------------------------------------------------------------------
  package edu.ntu.ce.maureen.index.optimize;
  
  import java.util.Date;
  
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  
  public class OptimizeDI {
  
      // Same analyzer that was used when the index was built.
      private static Analyzer analyzer =
          new SnowballAnalyzer("English", StandardAnalyzer.STOP_WORDS);
      private static IndexWriter writerOpt;
  
      // Open the existing index directory (create == false).
      public static void OpenIndexDir(String indexDir) throws Exception {
          try {
              writerOpt = new IndexWriter(indexDir, analyzer, false);
          } catch (Exception e) {
              System.out.println("Cannot create index writer");
              e.printStackTrace();
          }
      }
  
      // Merge all segments into one.
      public static void OptimizeIndex() throws Exception {
          try {
              System.out.println("Optimizing DI...");
              writerOpt.optimize();
          } catch (Exception e) {
              System.out.println("Exception in writerOpt.optimize()");
              e.printStackTrace();
          }
      }
  
      public static void closeIndex() throws Exception {
          try {
              writerOpt.close();
          } catch (Exception e) {
              System.out.println("Cannot close index writer");
              e.printStackTrace();
          }
      }
  
      public static void main(String[] args) {
          long start = new Date().getTime();
          try {
              OptimizeDI.OpenIndexDir("F:/DI");
              OptimizeDI.OptimizeIndex();
              OptimizeDI.closeIndex();
          } catch (Exception e) {
              System.out.println("Fail to optimize DI");
              e.printStackTrace();
          }
          long end = new Date().getTime();
          System.out.println("Optimized DI is created in " + (end - start) + " ms");
      }
  }
  
  ----------------------------------------------------------------------------------------
  
  Can somebody help me? Thanks a lot >_<
  
  
  Regards,
  Maureen
  

 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"Michael McCandless" <lu...@mikemccandless.com> wrote:

> The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store.

Actually one more, even simpler, workaround is to turn off norms
for these fields.

I've opened Jira issue 830 to track this:

    http://issues.apache.org/jira/browse/LUCENE-830

Mike



Re: How to disable lucene norm factor?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
OK Mike, I'll try it and see whether it works :) then I will proceed to optimize the index.
  Well then, I guess it's fine to use the default value for maxMergeDocs, which is Integer.MAX_VALUE?
  
  Thanks a lot
  
  Regards,
  Maureen
  

Michael McCandless <lu...@mikemccandless.com> wrote:  
"maureen tanuwidjaja"  wrote:

>   How to disable lucene norm factor?

Once you've created a Field, and before adding it to your Document,
just call field.setOmitNorms(true).

Note, however, that you must do this for all Field instances with that
same field name, because whenever Lucene merges segments, if even one
Document did not disable norms then this will "spread" so that all documents
keep their norms for that field name.

I.e., you must fully rebuild your index with the above code change to
truly not store norms.

Mike




 

Re: How to disable lucene norm factor?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"maureen tanuwidjaja" <au...@yahoo.com> wrote:

>   How to disable lucene norm factor?

Once you've created a Field, and before adding it to your Document,
just call field.setOmitNorms(true).

Note, however, that you must do this for all Field instances with that
same field name, because whenever Lucene merges segments, if even one
Document did not disable norms then this will "spread" so that all documents
keep their norms for that field name.

I.e., you must fully rebuild your index with the above code change to
truly not store norms.

Mike
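
A minimal sketch of that change (the field name "someXmlTag" and its value are just placeholders; the same call goes wherever a Field is constructed):

----------------------------------------------------------------------------------------
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

    Document doc = new Document();
    Field f = new Field("someXmlTag", "text for this tag",
                        Field.Store.NO, Field.Index.TOKENIZED);
    f.setOmitNorms(true);   // disable norms for this Field instance
    doc.add(f);             // repeat for every Field that uses this field name
    // ... then writer.addDocument(doc) as usual
----------------------------------------------------------------------------------------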



How to disable lucene norm factor?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Hi all,
  How to disable lucene norm factor?
  
  Thanks,
  Maureen



 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Hi Mike,
  
  
  How do I disable/turn off the norms? Is it done while indexing?
  
  Thanks,
  Maureen
  
 

lengthNorm accessible?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Hmmm... now I wonder whether it is possible to access this lengthNorm value so that it can be used as before, but without creating any nrm file --> setOmitNorms = true

  Any other suggestion on how I could get the same ranking as before by making use of this lengthNorm, but without creating the nrm file?
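
  For instance, something roughly along these lines (just an untested sketch, with a made-up field name): compute the same value DefaultSimilarity uses, 1/sqrt(numTokens), while indexing, and keep it in a stored-only field (stored fields do not create any nrm data), then fold it into the ranking at search time:

----------------------------------------------------------------------------------------
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

    int numTokens = 42;                                        // placeholder token count
    float lengthNorm = (float) (1.0 / Math.sqrt(numTokens));   // DefaultSimilarity's formula

    Document doc = new Document();
    // Stored but not indexed: creates neither postings nor a norms entry.
    doc.add(new Field("someXmlTag_lengthNorm", Float.toString(lengthNorm),
                      Field.Store.YES, Field.Index.NO));
----------------------------------------------------------------------------------------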
  
  
  Thanks,
  Maureen
  
  
  
  Xiaocheng Luan <je...@yahoo.com> wrote:  
You can store the fields in the index itself if you want, without  indexing them (just flag it as stored/unindexed). I believe storing  fields should not incur the "norms" size problem, please correct me if  I'm wrong.

Thanks,
Xiaocheng
maureen tanuwidjaja  wrote: Ya... I think I will store it in the database so that later it can be used in scoring/ranking for retrieval... :)

  Another thing I would like to see is whether the precision or recall will be much affected by this...
  
  Regards,
  Maureen

Xiaocheng  Luan wrote:One side-effect of turning off the norms may be that the  scoring/ranking will be different? Do you need to search by each of  these many fields? If not, you probably don't have to index these  fields (but store them for retrieval?).

Just a thought.
Xiaocheng

Michael McCandless  wrote: "maureen tanuwidjaja"  wrote:
   
> "The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store."
>   It is not possible for me to reduce the number of fields needed to
>   store...
>   
>   Could you recommend what is the maxMerge value that is small enough to
>   keep all segments small?
>   
>   I also would like to ask whether, if optimize is successful, it will then
>   perform significantly faster searching compared to the unoptimized
>   one?

I think you'd need to test different values for your situation.  Maybe
try 66,000 which will give you ~ 10 segments at your current number of
docs?

>   I have the searching result in 30 to 3 minutes, which is actually quite
>    unacceptable for the "search engine" I build...Is there any 
>   recommendation on how faster searching could be done? 

I think you'll need to turn off norms.  I expect a lot of the slowness is
in loading the large norms files for the first time.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Xiaocheng Luan <je...@yahoo.com>.
You can store the fields in the index itself if you want, without indexing them (just flag it as stored/unindexed). I believe storing fields should not incur the "norms" size problem, please correct me if I'm wrong.

Thanks,
Xiaocheng
maureen tanuwidjaja <au...@yahoo.com> wrote: Ya... I think I will store it in the database so that later it can be used in scoring/ranking for retrieval... :)

  Another thing I would like to see is whether the precision or recall will be much affected by this...
  
  Regards,
  Maureen

Xiaocheng Luan  wrote:One  side-effect of turning off the norms may be that the scoring/ranking  will be different? Do you need to search by each of these many fields?  If not, you probably don't have to index these fields (but store them  for retrieval?).

Just a thought.
Xiaocheng

Michael McCandless  wrote: "maureen tanuwidjaja"  wrote:
   
> "The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store."
>   It is not possible for me to reduce the number of fields needed to
>   store...
>   
>   Could you recommend what is the maxMerge value that is small enough to
>   keep all segments small?
>   
>   I also would like to ask whether, if optimize is successful, it will then
>   perform significantly faster searching compared to the unoptimized
>   one?

I think you'd need to test different values for your situation.  Maybe
try 66,000 which will give you ~ 10 segments at your current number of
docs?

>   I have the searching result in 30 to 3 minutes, which is actually quite
>    unacceptable for the "search engine" I build...Is there any 
>   recommendation on how faster searching could be done? 

I think you'll need to turn off norms.  I expect a lot of the slowness is
in loading the large norms files for the first time.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Ya... I think I will store it in the database so that later it can be used in scoring/ranking for retrieval... :)

  Another thing I would like to see is whether the precision or recall will be much affected by this...
  
  Regards,
  Maureen

Xiaocheng Luan <je...@yahoo.com> wrote:One  side-effect of turning off the norms may be that the scoring/ranking  will be different? Do you need to search by each of these many fields?  If not, you probably don't have to index these fields (but store them  for retrieval?).

Just a thought.
Xiaocheng

Michael McCandless  wrote: "maureen tanuwidjaja"  wrote:
   
> "The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store."
>   It is not possible for me to reduce the number of fields needed to
>   store...
>   
>   Could you recommend what is the maxMerge value that is small enough to
>   keep all segments small?
>   
>   I also would like to ask whether, if optimize is successful, it will then
>   perform significantly faster searching compared to the unoptimized
>   one?

I think you'd need to test different values for your situation.  Maybe
try 66,000 which will give you ~ 10 segments at your current number of
docs?

>   I have the searching result in 30 to 3 minutes, which is actually quite
>    unacceptable for the "search engine" I build...Is there any 
>   recommendation on how faster searching could be done? 

I think you'll need to turn off norms.  I expect a lot of the slowness is
in loading the large norms files for the first time.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Xiaocheng Luan <je...@yahoo.com>.
One side-effect of turning off the norms may be that the scoring/ranking will be different? Do you need to search by each of these many fields? If not, you probably don't have to index these fields (but store them for retrieval?).
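
For example (a rough sketch, with made-up field names), a field that only needs to come back with the search results can be added stored-but-not-indexed, which creates neither postings nor norms for it:

----------------------------------------------------------------------------------------
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

    Document doc = new Document();
    // A field you actually search on (indexed as before).
    doc.add(new Field("body", "text to search",
                      Field.Store.NO, Field.Index.TOKENIZED));
    // A retrieval-only field: stored, never indexed, so no norms are written for it.
    doc.add(new Field("someXmlTag", "value only needed at display time",
                      Field.Store.YES, Field.Index.NO));
----------------------------------------------------------------------------------------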

Just a thought.
Xiaocheng

Michael McCandless <lu...@mikemccandless.com> wrote: "maureen tanuwidjaja"  wrote:
   
> "The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store."
>   It is not possible for me to reduce the number of fields needed to
>   store...
>   
>   Could you recommend what is the maxMerge value that is small enough to
>   keep all segments small?
>   
>   I also would like to ask whether, if optimize is successful, it will then
>   perform significantly faster searching compared to the unoptimized
>   one?

I think you'd need to test different values for your situation.  Maybe
try 66,000 which will give you ~ 10 segments at your current number of
docs?

>   I have the searching result in 30 to 3 minutes, which is actually quite
>    unacceptable for the "search engine" I build...Is there any 
>   recommendation on how faster searching could be done? 

I think you'll need to turn off norms.  I expect a lot of the slowness is
in loading the large norms files for the first time.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"maureen tanuwidjaja" <au...@yahoo.com> wrote:
   
> "The only simple workaround I can think of is to set maxMergeDocs to
> keep all segments "small".  But then you may have too many segments
> with time.  Either that or find a way to reduce the number of unique
> fields that you actually need to store."
>   It is not possible for me to reduce the number of fields needed to
>   store...
>   
>   Could you recommend what is the maxMerge value that is small enough to
>   keep all segments small?
>   
>   I also would like to ask whether, if optimize is successful, it will then
>   perform significantly faster searching compared to the unoptimized
>   one?

I think you'd need to test different values for your situation.  Maybe
try 66,000 which will give you ~ 10 segments at your current number of
docs?

>   I have the searching result in 30 to 3 minutes, which is actually quite
>    unacceptable for the "search engine" I build...Is there any 
>   recommendation on how faster searching could be done? 

I think you'll need to turn off norms.  I expect a lot of the slowness is
in loading the large norms files for the first time.

Mike
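
In code, the suggestion above amounts to something like this (a sketch only; 66,000 is just the ballpark suggested above, so tune it for your own document count):

----------------------------------------------------------------------------------------
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

    Analyzer analyzer = new SnowballAnalyzer("English", StandardAnalyzer.STOP_WORDS);
    IndexWriter writer = new IndexWriter("F:/DI", analyzer, false);
    writer.setMaxMergeDocs(66000);   // keep any merged segment under ~66,000 docs
    // ... addDocument() calls as usual ...
    writer.close();
----------------------------------------------------------------------------------------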



Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Oops sorry, I mistyped..

  I get search results in 30 SECONDS to 3 minutes, which is actually 
quite unacceptable for the "search engine" I am building... Is there any 
recommendation on how searching could be made faster?
  

maureen tanuwidjaja <au...@yahoo.com> wrote:  Hi mike
  
  
"The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store."
  It is not possible for me to reduce the number of fields needed to store...
  
  Could you recommend a maxMergeDocs value that is small enough to keep all segments small?

  I also would like to ask whether, if optimize is successful, it will then perform significantly faster searching compared to the unoptimized one?

  I get search results in 30 to 3 minutes, which is actually quite unacceptable for the "search engine" I am building... Is there any recommendation on how searching could be made faster?
  
  Thanks,
  Maureen
  
  

Michael McCandless  wrote:  "maureen tanuwidjaja"  wrote:

>   "One thing that stands out in your listing is: your norms file
>   (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
>   many tiny docs where each doc has highly variable fields or
>   something?"
>   
>   Ya, I am also confused about why this nrm file is so tremendous in size.
>   I am indexing a total of 657739 XML documents.
>   The total number of fields is 37552 (I am using the XML tags as the
>   fields)

OK, this is going to be a problem for Lucene.

This case will definitely go over 2X disk usage during optimize.  I
will update the javadocs to call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document has that field (ie,
it's not a "sparse" representation).

When you have many small docs, and each doc has (somewhat) different
fields from the others, this results in a tremendously large storage
for the norms.

The thing is, within one segment it may be OK since that segment has a
subset of all docs and fields.  But then when segments are merged
(like optimize does) the product of #docs and #fields grows
"multiplicatively" and results in far far more storage required than
the sum of the individual segments.

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Hi mike
  
  
"The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store."
  It is not possible for me to reduce the number of fields needed to store...
  
  Could you recommend a maxMergeDocs value that is small enough to keep all segments small?

  I also would like to ask whether, if optimize is successful, it will then perform significantly faster searching compared to the unoptimized one?

  I get search results in 30 to 3 minutes, which is actually quite unacceptable for the "search engine" I am building... Is there any recommendation on how searching could be made faster?
  
  Thanks,
  Maureen
  
  

Michael McCandless <lu...@mikemccandless.com> wrote:  "maureen tanuwidjaja"  wrote:

>   "One thing that stands out in your listing is: your norms file
>   (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
>   many tiny docs where each doc has highly variable fields or
>   something?"
>   
>   Ya, I am also confused about why this nrm file is so tremendous in size.
>   I am indexing a total of 657739 XML documents.
>   The total number of fields is 37552 (I am using the XML tags as the
>   fields)

OK, this is going to be a problem for Lucene.

This case will definitely go over 2X disk usage during optimize.  I
will update the javadocs to call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document has that field (ie,
it's not a "sparse" representation).

When you have many small docs, and each doc has (somewhat) different
fields from the others, this results in a tremendously large storage
for the norms.

The thing is, within one segment it may be OK since that segment has a
subset of all docs and fields.  But then when segments are merged
(like optimize does) the product of #docs and #fields grows
"multiplicatively" and results in far far more storage required than
the sum of the individual segments.

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store.

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"maureen tanuwidjaja" <au...@yahoo.com> wrote:

>   "One thing that stands out in your listing is: your norms file
>   (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
>   many tiny docs where each doc has highly variable fields or
>   something?"
>   
>   Ya, I am also confused about why this nrm file is so tremendous in size.
>   I am indexing a total of 657739 XML documents.
>   The total number of fields is 37552 (I am using the XML tags as the
>   fields)

OK, this is going to be a problem for Lucene.

This case will definitely go over 2X disk usage during optimize.  I
will update the javadocs to call out this caveat.

The .nrm file (norms) requires 1 byte per document per unique field in
the segment, regardless of whether that document has that field (ie,
it's not a "sparse" representation).

When you have many small docs, and each doc has (somewhat) different
fields from the others, this results in a tremendously large storage
for the norms.

The thing is, within one segment it may be OK since that segment has a
subset of all docs and fields.  But then when segments are merged
(like optimize does) the product of #docs and #fields grows
"multiplicatively" and results in far far more storage required than
the sum of the individual segments.

The only simple workaround I can think of is to set maxMergeDocs to
keep all segments "small".  But then you may have too many segments
with time.  Either that or find a way to reduce the number of unique
fields that you actually need to store.

Mike
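
Plugging in the numbers quoted above shows the scale of the problem (back-of-the-envelope only):

----------------------------------------------------------------------------------------
    // 1 byte per document per unique indexed field, for one fully merged segment:
    long numDocs   = 657739L;              // documents, as reported above
    long numFields = 37552L;               // unique fields (one per XML tag)
    long normBytes = numDocs * numFields;  // = 24,699,414,928 bytes, roughly 23 GB
----------------------------------------------------------------------------------------

So a single fully merged segment carries tens of gigabytes of norms by itself, before any of the other index files are counted.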



Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by maureen tanuwidjaja <au...@yahoo.com>.
Hi Mike..
  
  "One thing that stands out in your listing is: your norms file
  (_1ke1.nrm) is enormous compared to all other files.  Are you indexing
  many tiny docs where each doc has highly variable fields or something?"
  
  Ya, I am also confused about why this nrm file is so tremendous in size.
  I am indexing a total of 657739 XML documents.
  The total number of fields is 37552 (I am using the XML tags as the fields)
  
  
  OK, this is the listing of the index directory before optimizing...
  
  D:\dual_index\DI>dir
   Volume in drive D is SELAB
   Volume Serial Number is 44A7-7D50
  
   Directory of D:\dual_index\DI
  
  03/13/2007  09:29 AM    <DIR>          .
  03/13/2007  09:29 AM    <DIR>          ..
  03/13/2007  05:56  AM                 20 segments.gen
  03/13/2007  05:56 AM               712 segments_34rz
  03/13/2007  01:56 AM     2,491,551,624 _16v6.cfs
  03/13/2007  04:30 AM     2,140,779,671 _1fft.cfs
  03/13/2007  04:42 AM        76,813,296 _1gao.cfs
  03/13/2007  04:53 AM        78,626,916 _1h5j.cfs
  03/13/2007  05:06 AM       101,981,232 _1i0e.cfs
  03/13/2007  05:24 AM       182,544,071 _1iv9.cfs
  03/13/2007  05:43 AM       185,825,480 _1jq4.cfs
  03/13/2007  05:44 AM        10,569,811 _1jt7.cfs
  03/13/2007  05:46 AM        12,100,629 _1jwa.cfs
  03/13/2007  05:48 AM        12,127,317 _1jzd.cfs
  03/13/2007  05:49 AM        11,478,747 _1k2g.cfs
  03/13/2007  05:51 AM        11,483,235 _1k5j.cfs
  03/13/2007  05:53 AM        11,864,730 _1k8m.cfs
  03/13/2007  05:54 AM        10,966,413 _1kbp.cfs
  03/13/2007  05:55 AM           936,961 _1kc0.cfs
  03/13/2007  05:55 AM         1,144,949 _1kcb.cfs
  03/13/2007  05:55 AM         1,314,375 _1kcm.cfs
  03/13/2007  05:55 AM           951,460 _1kcx.cfs
  03/13/2007  05:55 AM         1,175,376 _1kd8.cfs
  03/13/2007  05:55 AM         1,171,232 _1kdj.cfs
  03/13/2007  05:55 AM         1,176,141 _1kdu.cfs
  03/13/2007  05:56 AM           124,219 _1kdv.cfs
  03/13/2007  05:56 AM           117,425 _1kdw.cfs
  03/13/2007  05:56 AM           158,673 _1kdx.cfs
  03/13/2007  05:56 AM           117,591 _1kdy.cfs
  03/12/2007  03:24 PM     5,594,336,501 _8km.cfs
  03/12/2007  06:07 PM     3,322,027,221 _h59.cfs
  03/12/2007  08:51 PM     3,017,631,411 _ppw.cfs
  03/12/2007  11:25 PM     2,383,550,153 _yaj.cfs
                31 File(s) 19,664,647,592 bytes
                 2 Dir(s)  20,398,489,600 bytes free
  
----------------------------------------------------------------------------------------------
  
  And there is another thing I want to ask... will searching on the optimized index be significantly faster than searching on the unoptimized one?

  Searching this unoptimized index takes anywhere from 40 seconds to 3 minutes....

  How about memory consumption? Will the optimized index take a greater amount of memory?
  
  
  
  Thanks a lot
  
  Regards,
  Maureen
  
  
Michael McCandless <lu...@mikemccandless.com> wrote:  
"maureen tanuwidjaja"  wrote:

>   How much disk space is actually needed to optimize the index? The
>   explanation given in the documentation seems to be very different from
>   the practical situation.
>   
>   I have an index of size 18.6 GB and I am going to optimize it. I keep
>   this index on a mobile hard disk with a capacity of 100 GB. I did not
>   use any index reader; I merely call the index writer to optimize this
>   index. However, to my surprise, while optimizing, the index has grown
>   to occupy almost all of the free space. I am pretty sure it will later
>   terminate because there is not sufficient disk space.
>   
>   This is the content on the index file
>   ------------------------------------------------------------------------------------------
>   03/13/2007  02:14 PM              .
>   03/13/2007  02:14 PM              ..
>   03/13/2007  02:14  PM                 20 segments.gen
>   03/13/2007  02:14  PM                 67 segments_34s4
>   03/13/2007  12:06  PM                  0 write.lock
>   03/13/2007  02:14 PM    41,705,009,152 _1ke1.cfs
>   03/13/2007  12:15 PM     1,638,320,227 _1ke1.fdt
>   03/13/2007  12:15 PM         4,461,912 _1ke1.fdx
>   03/13/2007  12:09 PM         6,295,065 _1ke1.fnm
>   03/13/2007  12:26 PM       232,520,666 _1ke1.frq
>   03/13/2007  02:08 PM    44,927,549,671 _1ke1.nrm
>   03/13/2007  12:26 PM       170,766,513 _1ke1.prx
>   03/13/2007  12:26 PM         1,281,924 _1ke1.tii
>   03/13/2007  12:26 PM       103,094,835 _1ke1.tis
>   03/13/2007  02:14 PM        51,688,575 _1ke1.tvd
>   03/13/2007  02:14 PM       882,304,866 _1ke1.tvf
>   03/13/2007  02:14 PM         4,461,916 _1ke1.tvx
>   03/12/2007  03:24 PM     5,594,336,501 _8km.cfs


As best I know, it should only require 2X the disk space.  In your
case this means you should only have needed 18.6 GB of free space (ie,
1X is the current index, then another 1X in free space).

So something odd is happening here.

One thing that stands out in your listing is: your norms file
(_1ke1.nrm) is enormous compared to all other files.  Are you indexing
many tiny docs where each doc has highly variable fields or something?

Hmmm.  In fact if you are doing this, then on merge, the norms (which
are not stored "sparsely") could in fact grow far larger than 2X.

Can you send a listing of the 18.6 GB index before optimizing?

Mike




 

Re: Urgent : How much actually the disk space needed to optimize the index?

Posted by Michael McCandless <lu...@mikemccandless.com>.
"maureen tanuwidjaja" <au...@yahoo.com> wrote:

>   How much disk space is actually needed to optimize the index? The
>   explanation given in the documentation seems to be very different from
>   the practical situation.
>   
>   I have an index of size 18.6 GB and I am going to optimize it. I keep
>   this index on a mobile hard disk with a capacity of 100 GB. I did not
>   use any index reader; I merely call the index writer to optimize this
>   index. However, to my surprise, while optimizing, the index has grown
>   to occupy almost all of the free space. I am pretty sure it will later
>   terminate because there is not sufficient disk space.
>   
>   This is the content on the index file
>   ------------------------------------------------------------------------------------------
>   03/13/2007  02:14 PM    <DIR>          .
>   03/13/2007  02:14 PM    <DIR>          ..
>   03/13/2007  02:14  PM                 20 segments.gen
>   03/13/2007  02:14  PM                 67 segments_34s4
>   03/13/2007  12:06  PM                  0 write.lock
>   03/13/2007  02:14 PM    41,705,009,152 _1ke1.cfs
>   03/13/2007  12:15 PM     1,638,320,227 _1ke1.fdt
>   03/13/2007  12:15 PM         4,461,912 _1ke1.fdx
>   03/13/2007  12:09 PM         6,295,065 _1ke1.fnm
>   03/13/2007  12:26 PM       232,520,666 _1ke1.frq
>   03/13/2007  02:08 PM    44,927,549,671 _1ke1.nrm
>   03/13/2007  12:26 PM       170,766,513 _1ke1.prx
>   03/13/2007  12:26 PM         1,281,924 _1ke1.tii
>   03/13/2007  12:26 PM       103,094,835 _1ke1.tis
>   03/13/2007  02:14 PM        51,688,575 _1ke1.tvd
>   03/13/2007  02:14 PM       882,304,866 _1ke1.tvf
>   03/13/2007  02:14 PM         4,461,916 _1ke1.tvx
>   03/12/2007  03:24 PM     5,594,336,501 _8km.cfs


As best I know, it should only require 2X the disk space.  In your
case this means you should only have needed 18.6 GB of free space (ie,
1X is the current index, then another 1X in free space).

So something odd is happening here.

One thing that stands out in your listing is: your norms file
(_1ke1.nrm) is enormous compared to all other files.  Are you indexing
many tiny docs where each doc has highly variable fields or something?

Hmmm.  In fact if you are doing this, then on merge, the norms (which
are not stored "sparsely") could in fact grow far larger than 2X.

Can you send a listing of the 18.6 GB index before optimizing?

Mike
