Posted to java-user@lucene.apache.org by saisantoshi <sa...@gmail.com> on 2013/01/31 19:39:37 UTC

IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

I am using the following to create the IndexWriter for my indexing:

IndexWriterConfig indexWriterConfig = new IndexWriterConfig(
        Version.LUCENE_40,
        new LimitTokenCountAnalyzer(analyzer, MAX_FIELD_SCAN_LENGTH));

// create is true when indexing for the first time and false when
// updating the existing index.
if (create) {
    indexWriterConfig.setOpenMode(OpenMode.CREATE);
} else {
    indexWriterConfig.setOpenMode(OpenMode.APPEND);
}

TieredMergePolicy mergePolicy = new TieredMergePolicy();
indexWriterConfig.setMergePolicy(mergePolicy.setNoCFSRatio(1.0d));
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

When the index is created for the first time, it produces the following files:

_0.cfe
_0.cfs
_0.csi
segments.gen
segments_1

On subsequent updates, it creates additional files, like these:

_0.cfe
_0.cfs
_0.csi
_1.cfe
_1.cfs
_1.csi
_0_1.del
_2.cfe
_2.cfs
_2.csi
segments.gen
segments_3

I assumed it would overwrite the files created by the initial indexing,
so that only the following files remain:

_2.cfe
_2.cfs
_2.csi
segments.gen
segments_3

Can someone please let me know whether this is the intended behavior or
whether something is wrong with the way I am creating the
IndexWriterConfig?

I would appreciate it if you could shed some light on this, as the docs
are not very clear.

Thanks,
Sai.

--
View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
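
A small sketch that may help when reading directory listings like the ones above: it groups index file names by the segment they belong to. It is plain Java, not part of Lucene. The class and method names (IndexFileGrouper, segmentOf, group) are hypothetical, and the naming rule it assumes is that per-segment files look like _<segment>.<ext> or _<segment>_<generation>.<ext>, while segments_N and segments.gen describe commits rather than a single segment. The sample listing uses .si, following the typo correction made later in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups Lucene-style index file names by segment (illustrative helper, not a Lucene API). */
class IndexFileGrouper {

    /**
     * Returns the segment name for a per-segment file such as "_0.cfs" or
     * "_0_1.del", or null for files that do not start with "_", such as
     * "segments_3" and "segments.gen".
     */
    static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = fileName.length();
        int dot = fileName.indexOf('.');
        if (dot >= 0) {
            end = dot;
        }
        // A second underscore separates the segment name from a generation suffix.
        int underscore = fileName.indexOf('_', 1);
        if (underscore >= 0 && underscore < end) {
            end = underscore;
        }
        return fileName.substring(0, end);
    }

    /** Maps each segment name to the files that belong to it, sorted by segment name. */
    static Map<String, List<String>> group(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentOf(name);
            if (segment != null) {
                bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        return bySegment;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "_0.cfe", "_0.cfs", "_0.si", "_1.cfe", "_1.cfs", "_1.si", "_0_1.del",
                "_2.cfe", "_2.cfs", "_2.si", "segments.gen", "segments_3");
        // Print one line per segment with the files that belong to it.
        group(listing).forEach((segment, files) ->
                System.out.println(segment + " -> " + files));
    }
}
```

Read this way, the second listing contains three segments (_0, _1 and _2) plus one deletions file that belongs to _0, rather than three copies of the same index.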


Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by saisantoshi <sa...@gmail.com>.
> Are you closing or committing your IndexWriter after each added
> document?  Because if you add 100 docs you should not see 100 versions
> of these files, only one set of files in the end (many docs are
> written to one segment).

No. What I meant is that if 100 users update the document separately, it
produces more files in the index (with the current version of Lucene,
4.0). My question is: why does it not overwrite the existing index files
when updating?

I reverted my code to the 2.4 code base and updated a document a couple
of times; it produces only a minimal set of files. One thing I should
mention is that I was using the optimize() method, which might be the
reason so few files are kept.

Both use the compound file format:

2.4 (say I update the document 5 times separately)
====
_5.cfs
segments.gen
segments_5

4.0 (the same number of updates to the same document produces more files)
====
Unfortunately, I had to comment out the optimize() call, as this method
has been removed in 4.0. Please let me know if there is an alternative
to it.

_0.cfe
_0.cfs
_0.si
_4.cfe
_4.cfs
_4.si
segments.gen
segments_5

Compared to 2.4, it uses far too many files. If I add a couple more new
documents, it creates additional files.

Maybe the optimize() method is what does the trick in 2.4. How can we use
this method in 4.0?

BTW, I am using the book, but it only covers 3.0. I have read the Lucene
index format section, and it seems consistent with 2.4 but not with 4.0.



Thanks,
Sai.

--
View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766p4038007.html


Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by Michael McCandless <lu...@mikemccandless.com>.
It is by design, and 2.4 works the same way.

Are you closing or committing your IndexWriter after each added
document?  Because if you add 100 docs you should not see 100 versions
of these files, only one set of files in the end (many docs are
written to one segment).

Each segment holds the documents that were indexed by that instance of
IndexWriter.

When too many segments are in the index, they are merged according to
the merge policy.

Maybe you should pick up a copy of Lucene in Action (NOTE: I am a
co-author) ... it explains the basics of how Lucene works.

Mike McCandless

http://blog.mikemccandless.com
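
As a rough illustration of the point about merging, here is a toy model in plain Java (not Lucene code) in which every commit adds one segment and segments are merged once there are too many. The class ToyMergeModel, its maxSegments threshold, and the "merge the two smallest segments" rule are assumptions made for illustration only; Lucene's TieredMergePolicy uses its own, more elaborate selection logic. The forceMerge method in the sketch is named after IndexWriter.forceMerge(int maxNumSegments), which is what Lucene 4.0 provides in place of the removed optimize().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of segment merging; it does not reproduce Lucene's actual merge logic. */
class ToyMergeModel {

    /** Document counts, one entry per segment. */
    final List<Integer> segments = new ArrayList<>();

    /** Hypothetical threshold: the most segments tolerated before merging. */
    final int maxSegments;

    ToyMergeModel(int maxSegments) {
        if (maxSegments < 1) {
            throw new IllegalArgumentException("maxSegments must be >= 1");
        }
        this.maxSegments = maxSegments;
    }

    /** Each commit writes the buffered documents as one new segment. */
    void commit(int docCount) {
        segments.add(docCount);
        // "Too many segments" triggers a merge, as a merge policy would.
        forceMerge(maxSegments);
    }

    /** Merges the two smallest segments repeatedly until at most maxNumSegments remain. */
    void forceMerge(int maxNumSegments) {
        if (maxNumSegments < 1) {
            throw new IllegalArgumentException("maxNumSegments must be >= 1");
        }
        while (segments.size() > maxNumSegments) {
            Collections.sort(segments);
            int merged = segments.remove(0) + segments.remove(0);
            segments.add(merged);
        }
    }

    public static void main(String[] args) {
        ToyMergeModel index = new ToyMergeModel(3);
        for (int i = 0; i < 5; i++) {
            index.commit(1); // five separate single-document commits
        }
        System.out.println("after commits: " + index.segments);
        index.forceMerge(1); // comparable in spirit to the old optimize()
        System.out.println("after forceMerge(1): " + index.segments);
    }
}
```

In this model the number of segments stays bounded without any explicit call, and forcing a merge down to one segment is only needed if a single-segment index is really wanted.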

On Thu, Jan 31, 2013 at 3:36 PM, saisantoshi <sa...@gmail.com> wrote:
> Is it by design? The older API (2.4) does not have this problem. Say I
> have 100 updates or so: it will then create 100 versions of those files
> in the index. This would increase the number of files in the index
> directory and might cause file-related problems.
>
> It would be good to just have the following instead:
>
> _100.cfe
> _100.cfs
> _100.csi
> segments.gen
> segments_1
>
> Why are the older files still in the index folder? Are they used for
> anything? BTW, is there any other option that simply overwrites the
> existing files, so that there are fewer files in the index directory?
>
> Thanks,
> Sai.



Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by saisantoshi <sa...@gmail.com>.
Is it by design? The older API (2.4) does not have this problem. Say I
have 100 updates or so: it will then create 100 versions of those files
in the index. This would increase the number of files in the index
directory and might cause file-related problems.

It would be good to just have the following instead:

_100.cfe
_100.cfs
_100.csi
segments.gen
segments_1

Why are the older files still in the index folder? Are they used for
anything? BTW, is there any other option that simply overwrites the
existing files, so that there are fewer files in the index directory?

Thanks,
Sai.



--
View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766p4037796.html


Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by Michael McCandless <lu...@mikemccandless.com>.
Then those files are expected.

Your 2nd open was with APPEND, which means newly indexed documents are
written into a new set of files.

Lucene is segment-based, so your first batch of documents is in
segment _0, while your second batch is in _1 and _2.

Mike McCandless

http://blog.mikemccandless.com
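
The difference between the two open modes described above can be sketched with a toy model in plain Java. This is not Lucene code: ToyIndexDirectory, its session method, and the nested OpenMode enum are hypothetical names that only mirror the idea that CREATE starts from an empty index while APPEND keeps the existing segments and writes new ones next to them. The base-36 segment counter and the choice to keep counting after CREATE are simplifying assumptions of the model.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an index directory; it only tracks which segment names are live. */
class ToyIndexDirectory {

    /** Hypothetical stand-in for the two open modes discussed in this thread. */
    enum OpenMode { CREATE, APPEND }

    private final List<String> liveSegments = new ArrayList<>();
    private int counter = 0; // next segment number (assumption: never reset)

    /**
     * One writer session: open with the given mode, then commit the given
     * number of times, each commit writing one new segment.
     */
    void session(OpenMode mode, int commits) {
        if (mode == OpenMode.CREATE) {
            // CREATE starts a fresh index: earlier segments are no longer part of it.
            liveSegments.clear();
        }
        // APPEND falls through: existing segments stay, new ones are added.
        for (int i = 0; i < commits; i++) {
            liveSegments.add("_" + Integer.toString(counter++, Character.MAX_RADIX));
        }
    }

    /** Returns a copy of the live segment names in creation order. */
    List<String> liveSegments() {
        return new ArrayList<>(liveSegments);
    }

    public static void main(String[] args) {
        ToyIndexDirectory dir = new ToyIndexDirectory();
        dir.session(OpenMode.CREATE, 1); // first batch
        dir.session(OpenMode.APPEND, 2); // second batch, two commits
        System.out.println(dir.liveSegments());
    }
}
```

Under these assumptions, the first session leaves one segment and the APPEND session adds two more, which matches the shape of the listing in the original question.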

On Thu, Jan 31, 2013 at 2:54 PM, saisantoshi <sa...@gmail.com> wrote:
> It's _0.si (typo).
>
> For the second update, create = false.
>
> Thanks,
> Sai.



Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by saisantoshi <sa...@gmail.com>.
It's _0.si (typo).

For the second update, create = false.

Thanks,
Sai.



--
View this message in context: http://lucene.472066.n3.nabble.com/IndexWriterConfig-OpenMode-CREATE-vs-OpenMode-APPEND-index-files-tp4037766p4037785.html


Re: IndexWriterConfig.OpenMode.CREATE vs OpenMode.APPEND (index files)

Posted by Michael McCandless <lu...@mikemccandless.com>.
I don't know what _0.csi is ... was that supposed to be _0.si?

Did you pass create=true or false for the 2nd update?

Mike McCandless

http://blog.mikemccandless.com
