Posted to java-user@lucene.apache.org by Koji Sekiguchi <ko...@m4.dion.ne.jp> on 2005/10/16 04:04:33 UTC

delete unnecessary files after optimize()

Hello,

My Tomcat application has several threads. These threads
share a single instance of IndexSearcher to search contents.

At some point in time, I have the following index directory:

-rwx------+ 1 admin admin 158622 Oct 16 10:21 _1pp.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _2kk.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _3ff.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _4aa.cfs
-rwx------+ 1 admin admin 158614 Oct 16 10:20 _uu.cfs
-rwx------+ 1 admin admin       4 Oct 16 10:21 deletable
-rwx------+ 1 admin admin     64 Oct 16 10:21 segments

At this point, I want to optimize() the index. I can do it safely
without interrupting the Tomcat process.
After optimizing the index, I get a new compound file, _4ab.cfs:

-rwx------+ 1 admin admin 158622 Oct 16 10:21 _1pp.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _2kk.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _3ff.cfs
-rwx------+ 1 admin admin 158622 Oct 16 10:21 _4aa.cfs
-rwx------+ 1 admin admin 791622 Oct 16 10:21 _4ab.cfs
-rwx------+ 1 admin admin 158614 Oct 16 10:20 _uu.cfs
-rwx------+ 1 admin admin     48 Oct 16 10:21 deletable
-rwx------+ 1 admin admin     29 Oct 16 10:21 segments

Now I can let the Tomcat threads know that we have a new compound
file so that the servlet can reopen the IndexSearcher to use the new
segment. But I want to delete the old and unnecessary files (the _1pp,
_2kk, _3ff, _4aa and _uu .cfs files) after reopening the IndexSearcher
to save disk space.

How can I get a list of the unnecessary files so that I can delete them?

regards,

Koji




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene in Action : example code -> document-parsing framework ...

Posted by ms...@aol.com.
Do you have the log4j.properties file on the classpath?
 

Lucene in Action : example code -> document-parsing framework ...

Posted by Patricio Galeas <ga...@informatik.uni-siegen.de>.
Hi all,
I am trying to run an example from the "Lucene in Action" book:

Chapter 7: Parsing Common Document Formats:
lia.handlingtypes.framework.FileIndexer

I have downloaded all the source code from www.manning.com/hatcher2
and created a Java project in Lucene 3.1.

I get the following error message when the PDF document is indexed:
---------------------------------------
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook-entry.xml
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\addressbook.xml
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\HTML.html
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\MSWord.doc
Indexing E:\Galeas\downloads\LuceneInAction\LuceneInAction\src\lia\handlingtypes\data\PDF.pdf
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Logger
    at org.pdfbox.pdfparser.BaseParser.<clinit>(BaseParser.java:70)
    at lia.handlingtypes.pdf.PDFBoxPDFHandler.parseDocument(PDFBoxPDFHandler.java:118)
    at lia.handlingtypes.pdf.PDFBoxPDFHandler.getDocument(PDFBoxPDFHandler.java:32)
    at lia.handlingtypes.framework.ExtensionFileHandler.getDocument(ExtensionFileHandler.java:39)
    at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:43)
    at lia.handlingtypes.framework.FileIndexer.index(FileIndexer.java:36)
    at lia.handlingtypes.framework.FileIndexer.main(FileIndexer.java:77)
---------------------------------------

Does anybody have an idea?
Thank You
Patricio
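
A note on the trace above: the NoClassDefFoundError names
org/apache/log4j/Logger, so the JVM could not load that class when
PDFBox's BaseParser was initialized. A missing log4j.properties would
account for the two log4j:WARN lines, but not for the exception. One way
to narrow it down is to ask the JVM whether the class is loadable and
where it would come from. The sketch below uses only the JDK; the class
name ClasspathCheck and its two helper methods are made up for
illustration, and it has to be run with the same classpath as
FileIndexer to tell you anything useful.

```java
import java.security.CodeSource;

class ClasspathCheck {
    /** Returns true if the named class can be loaded (static initializers are not run). */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory the class is loaded from, or a short note if that is unknown. */
    static String origin(String className) {
        try {
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Logger is the class named in the stack trace; Category is checked for comparison.
        String[] names = { "org.apache.log4j.Logger", "org.apache.log4j.Category" };
        for (String name : names) {
            System.out.println(name + " -> " + origin(name));
        }
    }
}
```

If Logger is reported as not loadable while Category is found, the log4j
jar on the classpath may be older than the one PDFBox expects; if neither
loads, the jar is probably missing from the run configuration.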





RE: delete unnecessary files after optimize()

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
> I've never used Lucene on Windows, but if I recall correctly from past
> discussions on this topic, the IndexWriter will try to delete any file
> listed in deletable whenever it does any segment merging (i.e., after
> adding some number of documents, when you call .optimize(), or when you
> call .close()).

You are correct.
Calling addDocument() removed the unnecessary files and brought the
size of the deletable file back to 4 bytes.

Thank you,

Koji



RE: delete unnecessary files after optimize()

Posted by Chris Hostetter <ho...@fucit.org>.
: > How can I get a list of unnecessary files to delete them?
:
: I can get such information from deletable file under Win32 environment,
: correct?

I've never used Lucene on Windows, but if I recall correctly from past
discussions on this topic, the IndexWriter will try to delete any file
listed in deletable whenever it does any segment merging (i.e., after
adding some number of documents, when you call .optimize(), or when you
call .close()).

The only reason those files won't be deleted is if some IndexReader has
them open -- in which case you won't be able to delete them either, so
don't worry about it.  The safest thing to do is make sure you
periodically reopen new IndexReaders, and if you're really in a hurry to
get rid of those files, periodically open/close a new IndexWriter too
(even if you don't need one) ... that should cause it to try to delete the
files again.
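
The "reopen" half of that advice can be sketched roughly as follows.
This is only an illustration written against the JDK, not Lucene code:
SearcherHolder and its Task interface are hypothetical names, and the
type parameter T stands in for IndexSearcher (anything Closeable will
do). Searches run under a read lock; the swap takes the write lock, so
the old instance is closed only once no search is still using it.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical holder for the one shared searcher; T stands in for IndexSearcher. */
class SearcherHolder<T extends Closeable> {
    /** A unit of work that uses the current searcher and returns a result. */
    interface Task<S, R> {
        R run(S searcher) throws IOException;
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private T current;

    SearcherHolder(T initial) {
        this.current = initial;
    }

    /** Runs the task under the read lock, so the searcher cannot be swapped out mid-search. */
    <R> R search(Task<T, R> task) throws IOException {
        lock.readLock().lock();
        try {
            return task.run(current);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Installs a freshly opened searcher, then closes the old one. */
    void swap(T fresh) throws IOException {
        T old;
        lock.writeLock().lock(); // waits until in-flight searches have finished
        try {
            old = current;
            current = fresh;
        } finally {
            lock.writeLock().unlock();
        }
        old.close(); // no thread can still be inside search() with the old instance
    }

    public static void main(String[] args) throws IOException {
        SearcherHolder<Closeable> holder =
                new SearcherHolder<>(() -> System.out.println("old searcher closed"));
        holder.swap(() -> System.out.println("new searcher closed"));
    }
}
```

One caveat with this design: anything that reads lazily from the searcher
has to be consumed inside the task, because the old instance may be
closed as soon as the task returns. Once the old searcher is closed,
opening and closing an IndexWriter as described above should give Lucene
another chance to delete the files.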





-Hoss




RE: delete unnecessary files after optimize()

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.
Hi again,

I've read http://lucene.apache.org/java/docs/fileformats.html
and now I think I understand the format of the deletable file.

> How can I get a list of unnecessary files to delete them?

I can get that information from the deletable file in a Win32
environment, correct?

Koji
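
For reference, a minimal sketch of reading that file with plain JDK
classes. It assumes the layout as I understand it from the file formats
page of that era: a 4-byte big-endian count, followed by that many
Strings, each stored as a VInt character count and then the characters
in Lucene's 1-3 byte encoding. The class name DeletableFile and its
methods are made up for illustration. The assumed layout is at least
consistent with the listings above: an empty file is 4 bytes, and five
names cost 4 + (4 x 9) + 8 = 48 bytes, since each 8-character name takes
9 bytes and "_uu.cfs" takes 8.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class DeletableFile {
    /** Reads a VInt: 7 data bits per byte, low-order group first, high bit set means more bytes follow. */
    static int readVInt(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Reads a String: a VInt character count, then that many characters of 1 to 3 bytes each. */
    static String readString(DataInputStream in) throws IOException {
        int length = readVInt(in);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int b = in.readUnsignedByte();
            if ((b & 0x80) == 0) {
                sb.append((char) b); // single byte, plain ASCII
            } else if ((b & 0xE0) == 0xC0) {
                sb.append((char) (((b & 0x1F) << 6) | (in.readUnsignedByte() & 0x3F)));
            } else {
                int b2 = in.readUnsignedByte();
                int b3 = in.readUnsignedByte();
                sb.append((char) (((b & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)));
            }
        }
        return sb.toString();
    }

    /** Returns the file names listed in a stream that follows the assumed deletable layout. */
    static List<String> readNames(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int count = in.readInt(); // assumed: 4-byte big-endian count of names
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(readString(in));
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the "deletable" file inside the index directory
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : readNames(in)) {
                System.out.println(name);
            }
        }
    }
}
```

This only lists the names. Whether a listed file can actually be removed
still depends on no reader having it open.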
