Posted to java-user@lucene.apache.org by David Lee <da...@gmail.com> on 2008/08/23 01:35:26 UTC

Clarification about segments

So from what I understand, is it true that if mergeFactor is 10, then when I
index my first 9 documents, I have 9 separate segments, each containing 1
document? And when searching, it will search through every segment?

Thanks!
David

Re: Clarification about segments

Posted by Michael McCandless <lu...@mikemccandless.com>.

Before 2.3, each doc was in fact a separate segment in memory, and  
then these segments were merged together to flush a single segment in  
the Directory.

As of 2.3, IndexWriter now writes directly into RAM the data  
structures that are needed to create the segment, and then flushing  
the segment is a matter of copying these data structures into the  
Directory.  This gave a substantial speedup to indexing throughput,  
much better RAM efficiency (documents per MB that IndexWriter can  
buffer), etc.

In any event, in all versions of Lucene, each flush adds a single new
segment to the index.

Mike
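
The behaviour described above can be sketched with a toy model. This is
only an illustration in plain Java, not Lucene code: ToyWriter, its
maxBufferedDocs trigger and the list-of-lists "index" are all hypothetical
stand-ins. It shows that documents accumulate in a RAM buffer and that a
flush, whether triggered automatically or called explicitly, writes
everything buffered so far as one new segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of buffering and flushing. NOT Lucene code; all names are hypothetical. */
class ToyWriter {
    private final int maxBufferedDocs;                              // assumed flush trigger
    private final List<String> buffer = new ArrayList<>();          // docs held in RAM
    private final List<List<String>> segments = new ArrayList<>();  // flushed segments

    ToyWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers a document; flushes automatically once the buffer is full. */
    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    /** Writes everything buffered so far as ONE new segment. */
    void flush() {
        if (buffer.isEmpty()) {
            return; // nothing buffered, so no segment is created
        }
        segments.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    int segmentCount() {
        return segments.size();
    }

    int docsInSegment(int i) {
        return segments.get(i).size();
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter(10);
        for (int i = 1; i <= 9; i++) {
            w.addDocument("doc" + i);
        }
        System.out.println("segments before flush: " + w.segmentCount());
        w.flush();
        System.out.println("segments after flush: " + w.segmentCount()
                + ", docs in it: " + w.docsInSegment(0));
    }
}
```

In this model, nine buffered documents followed by one flush give one
segment holding nine documents, not nine segments.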

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Clarification about segments

Posted by Hajiz <am...@gmail.com>.

Hi all,

I'm new to Lucene and to this forum too. I'm a little confused about the
commit and flush concepts in Lucene. You said that after we call "commit",
the documents/segments in RAM will be written to disk (the Directory). So
what is the "minMergeDocs" factor? What does it mean?

To put my question another way: under which conditions will Lucene write
in-memory (cached) documents to the Directory?

Thanks in advance.



-- 
View this message in context: http://www.nabble.com/Clarification-about-segments-tp19117115p25434739.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Luke issues "Unknown format version: -6"

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think you need to triple-check your CLASSPATH.  It seems like you
are somehow getting an older version of Luke.

The file format definitely did not change from 2.3.0 --> 2.3.2.

Mike
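
One way to act on the CLASSPATH suggestion is to ask the JVM where a
class was actually loaded from. The sketch below uses only standard JDK
calls (Class.forName, ProtectionDomain, CodeSource). ClasspathCheck and
whereLoadedFrom are hypothetical names, and
org.apache.lucene.index.IndexReader is just an example of a Lucene class
to look up; any class name can be passed instead.

```java
import java.security.CodeSource;

/** Prints which jar or directory a class is loaded from. Helper names are hypothetical. */
class ClasspathCheck {

    /** Returns the location a class was loaded from, or a short note if unknown. */
    static String whereLoadedFrom(String className) {
        try {
            // Load without running static initializers; we only need the Class object.
            Class<?> c = Class.forName(className, false,
                    ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // Typically the case for classes that ship with the JDK itself.
                return "(no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Default class name is only an example; pass another as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.lucene.index.IndexReader";
        System.out.println(name + " -> " + whereLoadedFrom(name));
    }
}
```

Running this with the same classpath used to start Luke should show which
jar supplies the class, which may help spot an unexpected older copy.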


Luke issues "Unknown format version: -6"

Posted by "Jiao, Jason (NSN - CN/Cheng Du)" <ja...@nsn.com>.

Hi there,
	I am using Luke v0.8.1, which is built on Lucene 2.3.0. First, I ran
lucene/demo/IndexFiles to build an index, which succeeded. Then I tried to
open the index with Luke, but Luke reports "Unknown format version: -6". I
checked the Lucene documentation, which says "lucene 2.3.2 does not contain
any new features, API or file format changes, which makes it fully
compatible to 2.3.0 and 2.3.1".

Any hints?

Thanks in advance.


Jason Jiao


Re: Clarification about segments

Posted by David Lee <da...@gmail.com>.

ok, thanks. I knew that the documents were buffered in memory until they
were flushed, but I thought that in memory, they were still separate
documents/segments until they were merged together at the appropriate time
(dependent on the mergeFactor).

Do you mean that when the IndexWriter flushes the documents in memory to the
disk, it will merge all the documents in that flush to one segment?

Thanks!
David

Re: Clarification about segments

Posted by "Karsten F." <ka...@fiz-technik.de>.

Hi David,

This is not true; please take a look at
IndexWriter#setRAMBufferSizeMB
and
IndexWriter#setMaxBufferedDocs

But you can produce 9 segments (each with only one document) if you call
IndexWriter#flush
or
IndexWriter#commit
after each addDocument.

So as far as I know, there is no difference between
#flush
and
#optimize(getMergeFactor())
(btw #optimize() is equal to optimize(1) ).


Best regards
  Karsten

p.s. and yes, searching goes through every segment.
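
The two points above (a flush after every addDocument yields
one-document segments, and a search visits every segment) can be
illustrated with a small toy model. This is plain Java, not Lucene code:
ToySegmentSearch, indexWithFlushEvery and search are hypothetical names,
and each inner list merely stands in for one flushed segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segments and per-segment search. NOT Lucene code. */
class ToySegmentSearch {

    /** Adds docCount documents, flushing a new segment after every flushEvery adds. */
    static List<List<String>> indexWithFlushEvery(int docCount, int flushEvery) {
        List<List<String>> segments = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (int i = 1; i <= docCount; i++) {
            buffer.add("doc" + i);
            if (i % flushEvery == 0) {
                segments.add(buffer);        // one flush adds one segment
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {
            segments.add(buffer);            // final flush of whatever is left
        }
        return segments;
    }

    /** Looks in every segment in turn and collects the documents that match. */
    static List<String> search(List<List<String>> segments, String term) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            for (String doc : segment) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> perDoc = indexWithFlushEvery(9, 1);
        List<List<String>> buffered = indexWithFlushEvery(9, 10);
        System.out.println("flush after each add: " + perDoc.size() + " segments");
        System.out.println("single final flush:   " + buffered.size() + " segment(s)");
        System.out.println("hits for 'doc': " + search(perDoc, "doc").size());
    }
}
```

Both layouts return the same hits; the difference in this model is only
how many segments the search loop has to walk through.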


-- 
View this message in context: http://www.nabble.com/Clarification-about-segments-tp19117115p19120086.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

