Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2006/07/12 00:02:30 UTC

[jira] Created: (LUCENE-624) Segment size limit for compound files

Segment size limit for compound files
-------------------------------------

         Key: LUCENE-624
         URL: http://issues.apache.org/jira/browse/LUCENE-624
     Project: Lucene - Java
        Type: Improvement

  Components: Index  
    Reporter: Michael Busch
    Priority: Minor


Hello everyone,

I implemented an improvement targeting compound file usage. Compound files are used to decrease the number of index files, because operating systems can only handle a limited number of open file descriptors. On the other hand, a disadvantage of the compound file format is its worse performance compared to multi-file indexes:

http://www.gossamer-threads.com/lists/lucene/java-user/8950

The book "Lucene in Action" states that the compound file format is about 5-10% slower than the multi-file format.


The patch I'm proposing here adds the ability to the IndexWriter to use the compound format only for segments that do not contain more documents than a configurable limit, "CompoundFileSegmentSizeLimit", which the user can set.

Due to the exponential merge policy, a Lucene index usually contains only a few very big segments but many more small segments. The best performance is really only needed for the big segments, whereas slightly worse performance for the small segments shouldn't play a big role in overall search performance.
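
To make the rule concrete: when a new segment is written, it would be packed into a .cfs file only if its document count stays at or below the configured limit. A minimal sketch of that decision, with hypothetical class and field names (the real change would live inside IndexWriter's merge code):

    // Hypothetical helper illustrating the proposed rule; not part of Lucene.
    public class CompoundFilePolicy {

        private final boolean useCompoundFile;   // existing global CFS switch
        private final int segmentSizeLimit;      // proposed limit, in documents

        public CompoundFilePolicy(boolean useCompoundFile, int segmentSizeLimit) {
            this.useCompoundFile = useCompoundFile;
            this.segmentSizeLimit = segmentSizeLimit;
        }

        /** True if a newly written segment with the given document count
         *  should be packed into a single .cfs file. */
        public boolean useCompoundFileFor(int segmentDocCount) {
            return useCompoundFile && segmentDocCount <= segmentSizeLimit;
        }
    }

Segments above the limit simply keep their individual files.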


Consider the following example:
Index size (documents):   1,500,000
Merge factor:             10
Max buffered docs:        100
Number of indexed fields: 10
Max OS file descriptors:  1024

In the worst case, a non-optimized index could contain the following segments (count x documents per segment):
1 x 1,000,000
9 x   100,000
9 x    10,000
9 x     1,000
9 x       100

That's 37 segments. A multi-file format index would have:
37 segments * (7 files per segment + 10 files for indexed fields) = 629 files ==> only one or two open indexes per machine could be handled by the operating system

A compound-file format index would have:
37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be handled by the operating system, but performance would be 5-10% worse.

A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 would have:
36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open indexes could be handled by the OS


The OS can now handle about 20 open indexes instead of just one or two, while the largest segment keeps its multi-file format performance.
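
As a quick sanity check, the file counts above can be reproduced with a few lines of arithmetic (the 7 + 10 files per multi-file segment and the 1024 descriptor limit are the assumptions stated in the example):

    // Recomputes the example's file counts; numbers only, no Lucene involved.
    public class FileCountExample {
        public static void main(String[] args) {
            int segments = 1 + 9 + 9 + 9 + 9;        // 37 segments
            int filesPerMultiFileSegment = 7 + 10;   // index files + field files
            int fdLimit = 1024;                      // OS file descriptor limit

            int multiFile = segments * filesPerMultiFileSegment;       // 629 files
            int compoundOnly = segments;                               // 37 files
            int limited = (segments - 1) + filesPerMultiFileSegment;   // 53 files

            // Integer division rounds down; the figures in the text are rounded.
            System.out.println("multi-file: " + multiFile + " files, fits " + (fdLimit / multiFile) + " open index");
            System.out.println("compound:   " + compoundOnly + " files, fits " + (fdLimit / compoundOnly) + " open indexes");
            System.out.println("with limit: " + limited + " files, fits " + (fdLimit / limited) + " open indexes");
        }
    }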

I'm going to create diffs on the current HEAD and will attach the patch files soon. Please let me know what you think about this improvement.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Probably not during indexing, which is what Michael was referring to in his last email, if I understood him correctly.
I suppose indexing with compound format would be a bit slower because the individual index files have to be compounded into a .cfs file, and that'll consume a bit of extra time.
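
For illustration only (this is not Lucene's actual CompoundFileWriter): building a compound file essentially means re-reading every per-segment file that was just written and copying its bytes into one container, which is where the extra indexing time goes. The real .cfs format also records a small directory of entry names and offsets so the parts can be located again; that bookkeeping is omitted here.

    import java.io.*;

    // Naive sketch of the extra copy pass implied by compound files.
    public class NaiveCompound {
        public static void compound(File[] parts, File cfs) throws IOException {
            byte[] buf = new byte[8192];
            OutputStream out = new FileOutputStream(cfs);
            try {
                for (int i = 0; i < parts.length; i++) {
                    InputStream in = new FileInputStream(parts[i]);
                    try {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);  // data is written a second time
                        }
                    } finally {
                        in.close();
                    }
                }
            } finally {
                out.close();
            }
        }
    }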

Otis

----- Original Message ----
From: robert engels <re...@ix.netcom.com>
To: java-dev@lucene.apache.org
Sent: Thursday, July 27, 2006 8:48:53 PM
Subject: Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

In my experience, the more segment files the worse the performance  
(thus the optimize method).

On Jul 27, 2006, at 7:44 PM, Michael Busch wrote:

> robert engels wrote:
>> Why do more segment files improve search performance? I can see  
>> that if you have many smaller files, the merge process for  
>> incremental adds might be faster, but more segments should  
>> actually make searching slower.
> Robert,
>
> I did not run my own performance experiments, but after reading  
> some threads about compound performance again, I think you are  
> right. Compound file format does not affect search performance  
> significantly, but it slows down indexing time by 5-10%. So this  
> tiny patch should improve indexing speed while keeping the number  
> of segment files relatively low. If I find some time I will run  
> performance experiments to get some numbers.
>
> Michael
>
>> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>>
>>>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>>
>>> Michael Busch updated LUCENE-624:
>>> ---------------------------------
>>>
>>>     Attachment: cfs_seg_size_limit.patch
>>>
>>> I attach the patch file for this improvement.
>>>
>>> This patch adds two new methods to the API of IndexWriter and  
>>> IndexModifier:
>>>   /** Get the current value of the compound file segment size limit.
>>>    *  Note that this just returns the value you set with  
>>> setCompoundFileSegmentSizeLimit(int)
>>>    *  or the default. You cannot use this to query the status of  
>>> an existing index.
>>>    *  @see #setCompoundFileSegmentSizeLimit(int)
>>>    */
>>>   public int getCompoundFileSegmentSizeLimit();
>>>
>>>   /** Sets the limit of documents a segment can have, so that
>>>    *  compound format is being used for that segment. A high
>>>    *  limit will decrease the number of files per index, whereas
>>>    *  a lower limit will improve search performance but
>>>    *  increase the number of files.
>>>    */
>>>   public void setCompoundFileSegmentSizeLimit(int value);
>>>
>>> Furthermore I added a constant to IndexWriter:
>>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT  
>>> = Integer.MAX_VALUE;
>>>
>>> Since the default value is set to Integer.MAX_VALUE, the behavior  
>>> of IndexWriter/IndexModifier only changes if the user uses  
>>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>>>
>>>> Segment size limit for compound files
>>>> -------------------------------------
>>>>
>>>>                 Key: LUCENE-624
>>>>                 URL: http://issues.apache.org/jira/browse/ 
>>>> LUCENE-624
>>>>             Project: Lucene - Java
>>>>          Issue Type: Improvement
>>>>          Components: Index
>>>>            Reporter: Michael Busch
>>>>            Priority: Minor
>>>>         Attachments: cfs_seg_size_limit.patch
>>>>
>>>>
>>>> Hello everyone,
>>>> I implemented an improvement targeting compound file usage.  
>>>> Compound files are used to decrease the number of index files,  
>>>> because operating systems can't handle too many open file  
>>>> descriptors. On the other hand, a disadvantage of compound file  
>>>> format is the worse performance compared to multi-file indexes:
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>>> In the book "Lucene in Action" it's said that compound file  
>>>> format is about 5-10% slower than multi-file format.
>>>> The patch I'm proposing here adds the ability to the IndexWriter  
>>>> to use compound format only for segments that do not contain  
>>>> more documents than a specific limit  
>>>> "CompoundFileSegmentSizeLimit", which the user can set.
>>>> Due to the exponential merges, a Lucene index usually contains  
>>>> only a few very big segments, but many more small segments. The  
>>>> best performance is actually just needed for the big segments,  
>>>> whereas slightly worse performance for small segments shouldn't  
>>>> play a big role in the overall search performance.
>>>> Consider the following example:
>>>> Index Size:                            1,500,000
>>>> Merge factor:                        10
>>>> Max buffered docs:             100
>>>> Number of indexed fields: 10
>>>> Max. OS file descriptors:    1024
>>>> In the worst case, a non-optimized index could contain the  
>>>> following segments (count x documents per segment):
>>>> 1 x 1,000,000
>>>> 9 x   100,000
>>>> 9 x    10,000
>>>> 9 x     1,000
>>>> 9 x       100
>>>> That's 37 segments. A multi-file format index would have:
>>>> 37 segments * (7 files per segment + 10 files for indexed  
>>>> fields) = 629 files ==> only about 2 open indexes per machine  
>>>> could be handled by the operating system
>>>> A compound-file format index would have:
>>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes  
>>>> could be handled by the operating system, but performance would  
>>>> be 5-10% worse.
>>>> A compound-file format index with CompoundFileSegmentSizeLimit =  
>>>> 1,000,000 would have:
>>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>  
>>>> about 20 open indexes could be handled by the OS
>>>> The OS can handle now 20 instead of just 2 open indexes, while  
>>>> maintaining the multi-file format performance.
>>>> I'm going to create diffs on the current HEAD and will attach  
>>>> the patch files soon. Please let me know what you think about  
>>>> this improvement.
>>>
>>> --This message is automatically generated by JIRA.
>>> -
>>> If you think it was sent incorrectly contact one of the  
>>> administrators: http://issues.apache.org/jira/secure/ 
>>> Administrators.jspa
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/ 
>>> software/jira
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

Posted by robert engels <re...@ix.netcom.com>.
In my experience, the more segment files the worse the performance  
(thus the optimize method).

On Jul 27, 2006, at 7:44 PM, Michael Busch wrote:

> robert engels wrote:
>> Why do more segment files improve search performance? I can see  
>> that if you have many smaller files, the merge process for  
>> incremental adds might be faster, but more segments should  
>> actually make searching slower.
> Robert,
>
> I did not run my own performance experiments, but after reading  
> some threads about compound performance again, I think you are  
> right. Compound file format does not affect search performance  
> significantly, but it slows down indexing time by 5-10%. So this  
> tiny patch should improve indexing speed while keeping the number  
> of segment files relatively low. If I find some time I will run  
> performance experiments to get some numbers.
>
> Michael
>
>> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>>
>>>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>>
>>> Michael Busch updated LUCENE-624:
>>> ---------------------------------
>>>
>>>     Attachment: cfs_seg_size_limit.patch
>>>
>>> I attach the patch file for this improvement.
>>>
>>> This patch adds two new methods to the API of IndexWriter and  
>>> IndexModifier:
>>>   /** Get the current value of the compound file segment size limit.
>>>    *  Note that this just returns the value you set with  
>>> setCompoundFileSegmentSizeLimit(int)
>>>    *  or the default. You cannot use this to query the status of  
>>> an existing index.
>>>    *  @see #setCompoundFileSegmentSizeLimit(int)
>>>    */
>>>   public int getCompoundFileSegmentSizeLimit();
>>>
>>>   /** Sets the limit of documents a segment can have, so that
>>>    *  compound format is being used for that segment. A high
>>>    *  limit will decrease the number of files per index, whereas
>>>    *  a lower limit will improve search performance but
>>>    *  increase the number of files.
>>>    */
>>>   public void setCompoundFileSegmentSizeLimit(int value);
>>>
>>> Furthermore I added a constant to IndexWriter:
>>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT  
>>> = Integer.MAX_VALUE;
>>>
>>> Since the default value is set to Integer.MAX_VALUE, the behavior  
>>> of IndexWriter/IndexModifier only changes if the user uses  
>>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>>>
>>>> Segment size limit for compound files
>>>> -------------------------------------
>>>>
>>>>                 Key: LUCENE-624
>>>>                 URL: http://issues.apache.org/jira/browse/ 
>>>> LUCENE-624
>>>>             Project: Lucene - Java
>>>>          Issue Type: Improvement
>>>>          Components: Index
>>>>            Reporter: Michael Busch
>>>>            Priority: Minor
>>>>         Attachments: cfs_seg_size_limit.patch
>>>>
>>>>
>>>> Hello everyone,
>>>> I implemented an improvement targeting compound file usage.  
>>>> Compound files are used to decrease the number of index files,  
>>>> because operating systems can't handle too many open file  
>>>> descriptors. On the other hand, a disadvantage of compound file  
>>>> format is the worse performance compared to multi-file indexes:
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>>> In the book "Lucene in Action" it's said that compound file  
>>>> format is about 5-10% slower than multi-file format.
>>>> The patch I'm proposing here adds the ability to the IndexWriter  
>>>> to use compound format only for segments that do not contain  
>>>> more documents than a specific limit  
>>>> "CompoundFileSegmentSizeLimit", which the user can set.
>>>> Due to the exponential merges, a Lucene index usually contains  
>>>> only a few very big segments, but many more small segments. The  
>>>> best performance is actually just needed for the big segments,  
>>>> whereas slightly worse performance for small segments shouldn't  
>>>> play a big role in the overall search performance.
>>>> Consider the following example:
>>>> Index Size:                            1,500,000
>>>> Merge factor:                        10
>>>> Max buffered docs:             100
>>>> Number of indexed fields: 10
>>>> Max. OS file descriptors:    1024
>>>> In the worst case, a non-optimized index could contain the  
>>>> following segments (count x documents per segment):
>>>> 1 x 1,000,000
>>>> 9 x   100,000
>>>> 9 x    10,000
>>>> 9 x     1,000
>>>> 9 x       100
>>>> That's 37 segments. A multi-file format index would have:
>>>> 37 segments * (7 files per segment + 10 files for indexed  
>>>> fields) = 629 files ==> only about 2 open indexes per machine  
>>>> could be handled by the operating system
>>>> A compound-file format index would have:
>>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes  
>>>> could be handled by the operating system, but performance would  
>>>> be 5-10% worse.
>>>> A compound-file format index with CompoundFileSegmentSizeLimit =  
>>>> 1,000,000 would have:
>>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>  
>>>> about 20 open indexes could be handled by the OS
>>>> The OS can handle now 20 instead of just 2 open indexes, while  
>>>> maintaining the multi-file format performance.
>>>> I'm going to create diffs on the current HEAD and will attach  
>>>> the patch files soon. Please let me know what you think about  
>>>> this improvement.
>>>
>>> --This message is automatically generated by JIRA.
>>> -
>>> If you think it was sent incorrectly contact one of the  
>>> administrators: http://issues.apache.org/jira/secure/ 
>>> Administrators.jspa
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/ 
>>> software/jira
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

Posted by Michael Busch <bu...@gmail.com>.
robert engels wrote:
> Why do more segment files improve search performance? I can see that 
> if you have many smaller files, the merge process for incremental adds 
> might be faster, but more segments should actually make searching slower.
Robert,

I did not run my own performance experiments, but after reading some 
threads about compound performance again, I think you are right. Compound 
file format does not affect search performance significantly, but it 
slows down indexing time by 5-10%. So this tiny patch should improve 
indexing speed while keeping the number of segment files relatively low. 
If I find some time I will run performance experiments to get some numbers.
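
A simple way to get such numbers would be to index the same documents twice, once with and once without compound files, and compare wall-clock times. A rough sketch against the IndexWriter API of that time (paths, document contents, and the document count are made up; this is not a rigorous benchmark):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // Times indexing with and without compound files.
    public class CfsIndexingTimer {
        static long index(String path, boolean compound, int numDocs) throws Exception {
            long start = System.currentTimeMillis();
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
            writer.setUseCompoundFile(compound);
            for (int i = 0; i < numDocs; i++) {
                Document doc = new Document();
                doc.add(new Field("body", "document number " + i,
                                  Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.close();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws Exception {
            System.out.println("multi-file: " + index("/tmp/idx-multi", false, 100000) + " ms");
            System.out.println("compound:   " + index("/tmp/idx-cfs", true, 100000) + " ms");
        }
    }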

Michael

> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>
>>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>
>> Michael Busch updated LUCENE-624:
>> ---------------------------------
>>
>>     Attachment: cfs_seg_size_limit.patch
>>
>> I attach the patch file for this improvement.
>>
>> This patch adds two new methods to the API of IndexWriter and 
>> IndexModifier:
>>   /** Get the current value of the compound file segment size limit.
>>    *  Note that this just returns the value you set with 
>> setCompoundFileSegmentSizeLimit(int)
>>    *  or the default. You cannot use this to query the status of an 
>> existing index.
>>    *  @see #setCompoundFileSegmentSizeLimit(int)
>>    */
>>   public int getCompoundFileSegmentSizeLimit();
>>
>>   /** Sets the limit of documents a segment can have, so that
>>    *  compound format is being used for that segment. A high
>>    *  limit will decrease the number of files per index, whereas
>>    *  a lower limit will improve search performance but
>>    *  increase the number of files.
>>    */
>>   public void setCompoundFileSegmentSizeLimit(int value);
>>
>> Furthermore I added a constant to IndexWriter:
>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = 
>> Integer.MAX_VALUE;
>>
>> Since the default value is set to Integer.MAX_VALUE, the behavior of 
>> IndexWriter/IndexModifier only changes if the user uses 
>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>>
>>> Segment size limit for compound files
>>> -------------------------------------
>>>
>>>                 Key: LUCENE-624
>>>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>>>             Project: Lucene - Java
>>>          Issue Type: Improvement
>>>          Components: Index
>>>            Reporter: Michael Busch
>>>            Priority: Minor
>>>         Attachments: cfs_seg_size_limit.patch
>>>
>>>
>>> Hello everyone,
>>> I implemented an improvement targeting compound file usage. Compound 
>>> files are used to decrease the number of index files, because 
>>> operating systems can't handle too many open file descriptors. On 
>>> the other hand, a disadvantage of compound file format is the worse 
>>> performance compared to multi-file indexes:
>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>> In the book "Lucene in Action" it's said that compound file format 
>>> is about 5-10% slower than multi-file format.
>>> The patch I'm proposing here adds the ability to the IndexWriter to 
>>> use compound format only for segments that do not contain more 
>>> documents than a specific limit "CompoundFileSegmentSizeLimit", 
>>> which the user can set.
>>> Due to the exponential merges, a Lucene index usually contains only 
>>> a few very big segments, but many more small segments. The best 
>>> performance is actually just needed for the big segments, whereas 
>>> slightly worse performance for small segments shouldn't play a big 
>>> role in the overall search performance.
>>> Consider the following example:
>>> Index Size:                            1,500,000
>>> Merge factor:                        10
>>> Max buffered docs:             100
>>> Number of indexed fields: 10
>>> Max. OS file descriptors:    1024
>>> In the worst case, a non-optimized index could contain the following 
>>> segments (count x documents per segment):
>>> 1 x 1,000,000
>>> 9 x   100,000
>>> 9 x    10,000
>>> 9 x     1,000
>>> 9 x       100
>>> That's 37 segments. A multi-file format index would have:
>>> 37 segments * (7 files per segment + 10 files for indexed fields) = 
>>> 629 files ==> only about 2 open indexes per machine could be handled 
>>> by the operating system
>>> A compound-file format index would have:
>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could 
>>> be handled by the operating system, but performance would be 5-10% 
>>> worse.
>>> A compound-file format index with CompoundFileSegmentSizeLimit = 
>>> 1,000,000 would have:
>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 
>>> 20 open indexes could be handled by the OS
>>> The OS can handle now 20 instead of just 2 open indexes, while 
>>> maintaining the multi-file format performance.
>>> I'm going to create diffs on the current HEAD and will attach the 
>>> patch files soon. Please let me know what you think about this 
>>> improvement.
>>
>> --This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the 
>> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see: 
>> http://www.atlassian.com/software/jira
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

Posted by robert engels <re...@ix.netcom.com>.
Why do more segment files improve search performance? I can see  
that if you have many smaller files, the merge process for  
incremental adds might be faster, but more segments should actually  
make searching slower.

On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:

>      [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>
> Michael Busch updated LUCENE-624:
> ---------------------------------
>
>     Attachment: cfs_seg_size_limit.patch
>
> I attach the patch file for this improvement.
>
> This patch adds two new methods to the API of IndexWriter and  
> IndexModifier:
>   /** Get the current value of the compound file segment size limit.
>    *  Note that this just returns the value you set with  
> setCompoundFileSegmentSizeLimit(int)
>    *  or the default. You cannot use this to query the status of an  
> existing index.
>    *  @see #setCompoundFileSegmentSizeLimit(int)
>    */
>   public int getCompoundFileSegmentSizeLimit();
>
>   /** Sets the limit of documents a segment can have, so that
>    *  compound format is being used for that segment. A high
>    *  limit will decrease the number of files per index, whereas
>    *  a lower limit will improve search performance but
>    *  increase the number of files.
>    */
>   public void setCompoundFileSegmentSizeLimit(int value);
>
> Furthermore I added a constant to IndexWriter:
> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT =  
> Integer.MAX_VALUE;
>
> Since the default value is set to Integer.MAX_VALUE, the behavior  
> of IndexWriter/IndexModifier only changes if the user uses  
> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>
>> Segment size limit for compound files
>> -------------------------------------
>>
>>                 Key: LUCENE-624
>>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: Index
>>            Reporter: Michael Busch
>>            Priority: Minor
>>         Attachments: cfs_seg_size_limit.patch
>>
>>
>> Hello everyone,
>> I implemented an improvement targeting compound file usage.  
>> Compound files are used to decrease the number of index files,  
>> because operating systems can't handle too many open file  
>> descriptors. On the other hand, a disadvantage of compound file  
>> format is the worse performance compared to multi-file indexes:
>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>> In the book "Lucene in Action" it's said that compound file format  
>> is about 5-10% slower than multi-file format.
>> The patch I'm proposing here adds the ability to the IndexWriter  
>> to use compound format only for segments that do not contain more  
>> documents than a specific limit "CompoundFileSegmentSizeLimit",  
>> which the user can set.
>> Due to the exponential merges, a Lucene index usually contains  
>> only a few very big segments, but many more small segments. The  
>> best performance is actually just needed for the big segments,  
>> whereas slightly worse performance for small segments shouldn't  
>> play a big role in the overall search performance.
>> Consider the following example:
>> Index Size:                            1,500,000
>> Merge factor:                        10
>> Max buffered docs:             100
>> Number of indexed fields: 10
>> Max. OS file descriptors:    1024
>> In the worst case, a non-optimized index could contain the  
>> following segments (count x documents per segment):
>> 1 x 1,000,000
>> 9 x   100,000
>> 9 x    10,000
>> 9 x     1,000
>> 9 x       100
>> That's 37 segments. A multi-file format index would have:
>> 37 segments * (7 files per segment + 10 files for indexed fields)  
>> = 629 files ==> only about 2 open indexes per machine could be  
>> handled by the operating system
>> A compound-file format index would have:
>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes  
>> could be handled by the operating system, but performance would be  
>> 5-10% worse.
>> A compound-file format index with CompoundFileSegmentSizeLimit =  
>> 1,000,000 would have:
>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>  
>> about 20 open indexes could be handled by the OS
>> The OS can handle now 20 instead of just 2 open indexes, while  
>> maintaining the multi-file format performance.
>> I'm going to create diffs on the current HEAD and will attach the  
>> patch files soon. Please let me know what you think about this  
>> improvement.
>
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators: http://issues.apache.org/jira/secure/ 
> Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/ 
> software/jira
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-624) Segment size limit for compound files

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]

Michael Busch updated LUCENE-624:
---------------------------------

    Attachment: cfs_seg_size_limit.patch

I attach the patch file for this improvement.

This patch adds two new methods to the API of IndexWriter and IndexModifier:
  /** Get the current value of the compound file segment size limit.
   *  Note that this just returns the value you set with setCompoundFileSegmentSizeLimit(int)
   *  or the default. You cannot use this to query the status of an existing index.
   *  @see #setCompoundFileSegmentSizeLimit(int)
   */
  public int getCompoundFileSegmentSizeLimit();
    
  /** Sets the limit of documents a segment can have, so that
   *  compound format is being used for that segment. A high
   *  limit will decrease the number of files per index, whereas
   *  a lower limit will improve search performance but 
   *  increase the number of files.
   */
  public void setCompoundFileSegmentSizeLimit(int value);

Furthermore I added a constant to IndexWriter:
public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

Since the default value is Integer.MAX_VALUE, the behavior of IndexWriter/IndexModifier only changes if the user explicitly changes the value with setCompoundFileSegmentSizeLimit(int). 
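
For illustration, this is how the new setter would be combined with the existing IndexWriter settings from the example in the issue description (the size-limit method comes from the attached patch, not from a released Lucene version; the other calls are the standard IndexWriter API of that time):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class SizeLimitExample {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);

            writer.setUseCompoundFile(true);   // keep compound files in general
            writer.setMergeFactor(10);
            writer.setMaxBufferedDocs(100);

            // From the patch: segments with more than 1,000,000 documents keep
            // the multi-file format; smaller segments are packed into a .cfs.
            writer.setCompoundFileSegmentSizeLimit(1000000);

            Document doc = new Document();
            doc.add(new Field("body", "hello lucene", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
        }
    }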

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files are used to decrease the number of index files, because operating systems can't handle too many open file descriptors. On the other hand, a disadvantage of compound file format is the worse performance compared to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> In the book "Lucene in Action" it's said that compound file format is about 5-10% slower than multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use compound format only for segments that do not contain more documents than a specific limit "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few very big segments, but many more small segments. The best performance is actually just needed for the big segments, whereas slightly worse performance for small segments shouldn't play a big role in the overall search performance.
> Consider the following example:
> Index Size:                            1,500,000
> Merge factor:                        10
> Max buffered docs:             100
> Number of indexed fields: 10
> Max. OS file descriptors:    1024
> In the worst case, a non-optimized index could contain the following segments (count x documents per segment):
> 1 x 1,000,000
> 9 x   100,000
> 9 x    10,000
> 9 x     1,000
> 9 x       100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files ==> only about 2 open indexes per machine could be handled by the operating system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open indexes could be handled by the OS
> The OS can handle now 20 instead of just 2 open indexes, while maintaining the multi-file format performance.
> I'm going to create diffs on the current HEAD and will attach the patch files soon. Please let me know what you think about this improvement.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Closed: (LUCENE-624) Segment size limit for compound files

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]

Michael Busch closed LUCENE-624.
--------------------------------

    Resolution: Won't Fix
      Assignee: Michael Busch

I'm closing this issue because:
- there have been no votes or comments for almost half a year
- only indexing performance benefits slightly from this feature
- another config parameter in IndexWriter would probably confuse users more than help them

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files are used to decrease the number of index files, because operating systems can't handle too many open file descriptors. On the other hand, a disadvantage of compound file format is the worse performance compared to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> In the book "Lucene in Action" it's said that compound file format is about 5-10% slower than multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use compound format only for segments that do not contain more documents than a specific limit "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few very big segments, but many more small segments. The best performance is actually just needed for the big segments, whereas slightly worse performance for small segments shouldn't play a big role in the overall search performance.
> Consider the following example:
> Index Size:                            1,500,000
> Merge factor:                        10
> Max buffered docs:             100
> Number of indexed fields: 10
> Max. OS file descriptors:    1024
> In the worst case, a non-optimized index could contain the following segments (count x documents per segment):
> 1 x 1,000,000
> 9 x   100,000
> 9 x    10,000
> 9 x     1,000
> 9 x       100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files ==> only about 2 open indexes per machine could be handled by the operating system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open indexes could be handled by the OS
> The OS can handle now 20 instead of just 2 open indexes, while maintaining the multi-file format performance.
> I'm going to create diffs on the current HEAD and will attach the patch files soon. Please let me know what you think about this improvement.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org