Posted to oak-dev@jackrabbit.apache.org by Chetan Mehrotra <ch...@gmail.com> on 2014/10/20 11:59:49 UTC

Does Lucene modifies existing file in an index

While working on copy-on-read directory support (OAK-1724), I was
checking how Lucene manages its index files. The following observations
can be made from various test runs.

A - Small index uses the compound file format
------------------

If the index contains only a few entries, Lucene seems to use the
compound file format, as the directory listing shows only the following
files (filename - size)

_0.cfs - 621
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266

If the index gets updated, the size of the _0.cfs file changes and the
others remain the same.
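
Listings like the one above can be compared mechanically between test
runs. The following is only a minimal sketch, assuming the index files
sit in a plain filesystem directory (in Oak they live behind
OakDirectory, so this is illustrative). The class and method names are
hypothetical and not part of Oak or Lucene.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Oak or Lucene.
class IndexListing {

    // Returns "file name -> size in bytes" for all regular files in dir.
    static Map<String, Long> snapshot(Path dir) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    sizes.put(f.getFileName().toString(), Files.size(f));
                }
            }
        }
        return sizes;
    }

    // Returns the files present in both snapshots whose size differs,
    // mapped to their new size.
    static Map<String, Long> changedSizes(Map<String, Long> before,
                                          Map<String, Long> after) {
        Map<String, Long> changed = new TreeMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            Long old = before.get(e.getKey());
            if (old != null && !old.equals(e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }
}
```

A size comparison only catches changes that alter the file length. A
rewrite that keeps the length would need a checksum comparison instead.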

B - Large index stores index files separately
--------------------

For a large index (not sure of the threshold) Lucene seems to store the
various index files separately. In that case the files probably do not
get modified and only new files get created.

Questions
-------------
1. Is this switch from the cfs format to storing in separate files
automatic, done by Lucene once the index reaches a certain size? Or is
this something done specifically in Oak?
2. Lucene would not modify an existing file in a directory, except:
  a. With compound storage the cfs file would get modified. Would the
modification there also be append-only?
  b. segments.gen - this would get modified every time
  c. If separate files are used, no file would ever be modified
and only new files would be created

Chetan Mehrotra
PS: The question is probably more appropriate for the Lucene mailing
list, but I am checking here first to see whether something in Oak
differs from the default

Re: Does Lucene modifies existing file in an index

Posted by Chetan Mehrotra <ch...@gmail.com>.
To close the thread: with the modified testcase [1] the Lucene index
files do not get updated.

> Is this switch from cfs format to storing in separate files is
> automatic and done by Lucene after index reaches certain size. Or this
> done something specifically in Oak?

This happens because, in the default setup, Lucene uses the following merge policy

mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0,
floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0,
segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12,
noCFSRatio=0.1

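With noCFSRatio=0.1, a merged segment is written in compound (CFS)
format only while its size does not exceed 10% of the total index size.
Larger merged segments are stored as separate files.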
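
The size-ratio check behind noCFSRatio can be sketched as follows. This
is a simplified model of how the merge policy in Lucene 4.x appears to
decide, not Lucene's actual code. The class, method and parameter names
are hypothetical. Freshly flushed segments are not covered here; they
seem to be governed by the useCompoundFile setting of the writer config.

```java
// Sketch of the size-ratio rule; names are hypothetical, not Lucene's API.
class CfsRule {

    // True if a newly merged segment would be stored in compound (.cfs) form.
    static boolean useCompoundFile(long mergedSegmentBytes, long totalIndexBytes,
                                   double noCFSRatio, long maxCFSSegmentBytes) {
        if (noCFSRatio == 0.0) {
            return false; // compound files disabled for merged segments
        }
        if (mergedSegmentBytes > maxCFSSegmentBytes) {
            return false; // segment too large in absolute terms
        }
        if (noCFSRatio >= 1.0) {
            return true;  // always use the compound format
        }
        // Compound only while the segment is a small fraction of the index.
        return mergedSegmentBytes <= noCFSRatio * totalIndexBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 5 MB segment in a 100 MB index: within 10%, prints true
        System.out.println(useCompoundFile(5 * mb, 100 * mb, 0.1, Long.MAX_VALUE));
        // 20 MB segment in a 100 MB index: above 10%, prints false
        System.out.println(useCompoundFile(20 * mb, 100 * mb, 0.1, Long.MAX_VALUE));
    }
}
```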

Below is the IndexWriterConfig state used to create the index

------------------------
dir=OakDirectory@79a302e9
lockFactory=org.apache.lucene.store.NoLockFactory@183ee75d
index=
version=4.7.1 1582953 - sarowe - 2014-03-29 00:33:55
matchVersion=LUCENE_47
analyzer=org.apache.jackrabbit.oak.plugins.index.lucene.OakAnalyzer
ramBufferSizeMB=16.0
maxBufferedDocs=-1
maxBufferedDeleteTerms=-1
mergedSegmentWarmer=null
readerTermsIndexDivisor=1
termIndexInterval=32
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE_OR_APPEND
similarity=org.apache.lucene.search.similarities.DefaultSimilarity
mergeScheduler=org.apache.lucene.index.SerialMergeScheduler@7fe47c41
default WRITE_LOCK_TIMEOUT=1000
writeLockTimeout=1000
codec=Lucene46
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0,
floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0,
segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12,
noCFSRatio=0.1
indexerThreadPool=org.apache.lucene.index.ThreadAffinityDocumentsWriterThreadPool@7199d0ff
readerPooling=false
perThreadHardLimitMB=1945
useCompoundFile=true
------------------------------

Chetan Mehrotra
[1] http://svn.apache.org/r1634504

Re: Does Lucene modifies existing file in an index

Posted by Chetan Mehrotra <ch...@gmail.com>.
> The index indeed gets rebuilt. In IndexUpdate.collectIndexEditors() the
> provider

I did not realize that. So it can safely be assumed that files created
in a Lucene index

1. Never get updated
2. Names never get reused

That would simplify the logic for CopyOnRead a lot. The only thing left
to take care of is the reindex case, where the Lucene replica on the
local disk has to be cleared. That was discussed in another thread [1].
I will try to implement what Thomas suggested there
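
If files are write-once and names are never reused, a copy-on-read
lookup can be keyed on the file name alone. The following is only a
minimal sketch of that idea and not the OAK-1724 implementation. It
assumes both the remote index and the local replica are plain
filesystem directories, and the class name is hypothetical.
segments.gen is excluded because, as noted elsewhere in this thread, it
is the one file that gets rewritten.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch; not the OAK-1724 implementation.
class CopyOnReadCache {
    private final Path remoteDir; // stands in for the remote (Oak) directory
    private final Path localDir;  // local replica

    CopyOnReadCache(Path remoteDir, Path localDir) {
        this.remoteDir = remoteDir;
        this.localDir = localDir;
    }

    // Returns a path to read the file from, copying it locally on first access.
    Path open(String name) throws IOException {
        if ("segments.gen".equals(name)) {
            // Rewritten on every commit, so it must never be served from the replica.
            return remoteDir.resolve(name);
        }
        Path local = localDir.resolve(name);
        if (!Files.exists(local)) {
            // Copy under a temporary name first so that a partial copy
            // is never mistaken for a complete file.
            Path tmp = localDir.resolve(name + ".tmp");
            Files.copy(remoteDir.resolve(name), tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, local);
        }
        return local;
    }
}
```

Because an existing local file is trusted without any further check,
this only works as long as the write-once assumption holds, which is
why the reindex case needs separate handling.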

Thanks Marcel !

Chetan Mehrotra
[1] http://markmail.org/thread/dzvy7zumcdkegrgz


On Tue, Oct 21, 2014 at 12:20 PM, Alex Parvulescu
<al...@gmail.com> wrote:
>> The index indeed gets rebuilt. In IndexUpdate.collectIndexEditors() the
>> provider does not return any editors and the following code is executed
>
> OAK-2203

Re: Does Lucene modifies existing file in an index

Posted by Alex Parvulescu <al...@gmail.com>.
> The index indeed gets rebuilt. In IndexUpdate.collectIndexEditors() the
> provider does not return any editors and the following code is executed

OAK-2203

On Tue, Oct 21, 2014 at 8:37 AM, Marcel Reutegger <mr...@adobe.com>
wrote:

> Hi,
>
> this is the output when I run it on my machine within IntelliJ:
>
> 17:13:10.035 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
> will be performed for following indexes: [/oak:index/lucene]
> 17:13:10.172 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 1
> nodes, done.
> ================
> _0.cfs - 621
> _0.cfe - 194
> segments.gen - 20
> segments_1 - 81
> _0.si - 252
> 17:13:10.187 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
> will be performed for following indexes: [/oak:index/lucene]
> 17:13:10.200 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 2
> nodes, done.
> ================
> _0.cfs - 789
> _0.cfe - 194
> segments.gen - 20
> segments_1 - 81
> _0.si - 252
> 17:13:10.204 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
> will be performed for following indexes: [/oak:index/lucene]
> 17:13:10.220 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 3
> nodes, done.
> ================
> _0.cfs - 952
> _0.cfe - 194
> segments.gen - 20
> segments_1 - 81
> _0.si - 252
> 17:13:10.223 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
> will be performed for following indexes: [/oak:index/lucene]
> 17:13:10.238 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 2
> nodes, done.
> ================
> _0.cfs - 789
> _0.cfe - 194
> segments.gen - 20
> segments_1 - 81
> _0.si - 252
> 17:13:10.241 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
> will be performed for following indexes: [/oak:index/lucene]
> 17:13:10.256 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 3
> nodes, done.
> ================
> _0.cfs - 955
> _0.cfe - 194
> segments.gen - 20
> segments_1 - 81
> _0.si - 252
>
>
>
>
>
> The index indeed gets rebuilt. In IndexUpdate.collectIndexEditors() the
> provider
> does not return any editors and the following code is executed:
>
> Editor editor = provider.getIndexEditor(type, definition, root,
> updateCallback);
> if (editor == null) {
>     // trigger reindexing when an indexer becomes available
>     definition.setProperty(REINDEX_PROPERTY_NAME, true);
> } else ...
>
>
> We need to detect a re-index and clear the lucene replica on the local
> disk.
> As we can see, lucene will start with generation zero again and increment
> it
> with every modification. This will eventually lead to a collision with the
> replica on the local disk. In this extreme case, it even happens with every
> modification ;)
>
> Regards
>  Marcel

Re: Does Lucene modifies existing file in an index

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

this is the output when I run it on my machine within IntelliJ:

17:13:10.035 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
will be performed for following indexes: [/oak:index/lucene]
17:13:10.172 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 1
nodes, done.
================
_0.cfs - 621 
_0.cfe - 194 
segments.gen - 20 
segments_1 - 81 
_0.si - 252 
17:13:10.187 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
will be performed for following indexes: [/oak:index/lucene]
17:13:10.200 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 2
nodes, done.
================
_0.cfs - 789 
_0.cfe - 194 
segments.gen - 20 
segments_1 - 81 
_0.si - 252 
17:13:10.204 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
will be performed for following indexes: [/oak:index/lucene]
17:13:10.220 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 3
nodes, done.
================
_0.cfs - 952 
_0.cfe - 194 
segments.gen - 20 
segments_1 - 81 
_0.si - 252 
17:13:10.223 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
will be performed for following indexes: [/oak:index/lucene]
17:13:10.238 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 2
nodes, done.
================
_0.cfs - 789 
_0.cfe - 194 
segments.gen - 20 
segments_1 - 81 
_0.si - 252 
17:13:10.241 [main] INFO  o.a.j.oak.plugins.index.IndexUpdate - Reindexing
will be performed for following indexes: [/oak:index/lucene]
17:13:10.256 [main] DEBUG o.a.j.o.p.i.lucene.LuceneIndexEditor - Indexed 3
nodes, done.
================
_0.cfs - 955 
_0.cfe - 194 
segments.gen - 20 
segments_1 - 81 
_0.si - 252 





The index indeed gets rebuilt. In IndexUpdate.collectIndexEditors() the
provider does not return any editors and the following code is executed:

Editor editor = provider.getIndexEditor(type, definition, root,
        updateCallback);
if (editor == null) {
    // trigger reindexing when an indexer becomes available
    definition.setProperty(REINDEX_PROPERTY_NAME, true);
} else ...


We need to detect a re-index and clear the Lucene replica on the local
disk. As we can see, Lucene will start with generation zero again and
increment it with every modification. This will eventually lead to a
collision with the replica on the local disk. In this extreme case, it
even happens with every modification ;)
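
One possible way to detect a re-index on the local side is to tag the
replica with an identifier of the index it was copied from and wipe it
when that identifier changes. The following is a minimal sketch under
that assumption only. The "incarnation" value is hypothetical (anything
that changes on every re-index would do), as are the class name and the
marker file name, and the replica is assumed to be a flat directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the incarnation id is an assumed value that
// changes whenever the index is rebuilt from scratch.
class LocalReplica {
    private static final String MARKER = "replica.incarnation";

    // Clears the replica if it belongs to a different incarnation of the
    // index. Returns true if any files were removed.
    static boolean syncIncarnation(Path localDir, String currentIncarnation)
            throws IOException {
        Path marker = localDir.resolve(MARKER);
        if (Files.exists(marker)) {
            String stored = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
            if (stored.equals(currentIncarnation)) {
                return false; // same index, cached files remain valid
            }
        }
        // Collect first, then delete, to avoid modifying the directory
        // while it is being iterated.
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(localDir)) {
            for (Path f : files) {
                stale.add(f);
            }
        }
        for (Path f : stale) {
            Files.delete(f);
        }
        Files.write(marker, currentIncarnation.getBytes(StandardCharsets.UTF_8));
        return !stale.isEmpty();
    }
}
```

Calling this before the first read from the replica would avoid serving
an old _0.cfs after the generation counter has restarted at zero.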

Regards
 Marcel

On 20/10/14 14:24, "Chetan Mehrotra" <ch...@gmail.com> wrote:

>Hi Marcel,
>
>> in my experience .cfs files are written once
>and never modified
>
>I have checked in a testcase with [1] and if you run that you would
>see following output which indicate that same file is getting updated.
>
>----
>================
>_0.cfs - 621
>_0.cfe - 194
>segments.gen - 20
>segments_1 - 81
>_0.si - 266
>================
>_0.cfs - 789
>_0.cfe - 194
>segments.gen - 20
>segments_1 - 81
>_0.si - 266
>================
>_0.cfs - 952
>_0.cfe - 194
>segments.gen - 20
>segments_1 - 81
>_0.si - 266
>================
>_0.cfs - 789
>_0.cfe - 194
>segments.gen - 20
>segments_1 - 81
>_0.si - 266
>================
>_0.cfs - 955
>_0.cfe - 194
>segments.gen - 20
>segments_1 - 81
>_0.si - 266
>---------
>
>Chetan Mehrotra
>[1] http://svn.apache.org/r1633123


Re: Does Lucene modifies existing file in an index

Posted by Chetan Mehrotra <ch...@gmail.com>.
Hi Marcel,

> in my experience .cfs files are written once
> and never modified

I have checked in a testcase with [1]. If you run it, you will see the
following output, which indicates that the same file is getting updated.

----
================
_0.cfs - 621
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266
================
_0.cfs - 789
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266
================
_0.cfs - 952
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266
================
_0.cfs - 789
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266
================
_0.cfs - 955
_0.cfe - 194
segments.gen - 20
segments_1 - 81
_0.si - 266
---------

Chetan Mehrotra
[1] http://svn.apache.org/r1633123


On Mon, Oct 20, 2014 at 5:34 PM, Thomas Mueller <mu...@adobe.com> wrote:
> Hi,
>
> This blog post is interesting: they are using a physical switch (similar
> to a christmas light timer) to test a Lucene index doesn't get corrupt on
> power failure. It would be nice if we can do something similar with the
> Segment storage at some point.
>
> Regards,
> Thomas

Re: Does Lucene modifies existing file in an index

Posted by Thomas Mueller <mu...@adobe.com>.
Hi,

This blog post is interesting: they are using a physical switch (similar
to a christmas light timer) to test that a Lucene index doesn't get
corrupted on power failure. It would be nice if we could do something
similar with the Segment storage at some point.

Regards,
Thomas



On 20/10/14 13:36, "Marcel Reutegger" <mr...@adobe.com> wrote:

>Hi,
>
>this is very strange. in my experience .cfs files are written once
>and never modified. this write-once pattern is actually used for
>almost all files, except the segments.gen file you mentioned. E.g.
>see [0] by Mike McCandless when he talks about LUCENE-5574.
>
>is it possible the entire lucene index is replaced by oak?
>
>regards
> marcel
>
>[0] 
>http://blog.mikemccandless.com/2014/04/testing-lucenes-index-durability-after.html


Re: Does Lucene modifies existing file in an index

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

this is very strange. in my experience .cfs files are written once
and never modified. this write-once pattern is actually used for
almost all files, except the segments.gen file you mentioned. E.g.
see [0] by Mike McCandless when he talks about LUCENE-5574.

is it possible the entire lucene index is replaced by oak?

regards
 marcel

[0] 
http://blog.mikemccandless.com/2014/04/testing-lucenes-index-durability-after.html
