Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2011/04/12 02:54:13 UTC

Re: partial optimize does not reduce the segment number to maxNumSegments

: I have a core with 120+ segment files and I tried partial optimize specify
: maxNumSegments=10, after the optimize the segment files reduced to 64 files;

a) the option you want to specify is "maxSegments" .. not "maxNumSegments"

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22

b) i can't reproduce this ... i just created an index with 200 segments 
and when i hit the example url from the wiki...

curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false'

...my index was correctly optimized down to 10 segments.

is it possible that you just didn't wait long enough and you were 
observing the number of segments while the optimize was still taking 
place?


-Hoss

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Renee,

Here's what I'd do:
* Check how many open files your system is set up for (ulimit -n).  You likely 
want to increase that (1024 seems to be a common default under Linux, and in the 
past I've set that to 30k+ without issues)
* Look at your mergeFactor.  If it's high, consider lowering it (will slow down 
indexing a bit)
* Consider using cfs (the compound file format), but if you do the above right, you can avoid using it.
* Consider a better Solr monitoring tool
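
A minimal sketch of the first check in the list above, assuming a Unix-like system with Python available. The function names and the 0.8 warning fraction are hypothetical and only for illustration. Note that the limit reported is the one for the process running the script; the limit that matters is the one the Solr JVM was started with (on Linux that can be read from /proc/<pid>/limits).

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def near_limit(file_count, soft_limit, fraction=0.8):
    """True when file_count is at or above `fraction` of the soft limit.

    The 0.8 default is an arbitrary illustrative margin, not a recommendation.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False  # unlimited: nothing to warn about
    return file_count >= fraction * soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is what `ulimit -n` prints in a shell; the hard limit is the ceiling the soft limit can be raised to without extra privileges.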

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Renee Sun <re...@mcafee.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, April 15, 2011 3:41:28 PM
> Subject: Re: partial optimize does not reduce the segment number to 
>maxNumSegments
> 
> sorry I should elaborate that earlier...
> 
> in our production environment,  we have multiple cores and the ingest
> continuously all day long; we only do  optimize periodically, and optimize
> once a day in mid night.
> 
> So  sometimes we could see 'too many open files' error. To prevent it  from
> happening, in production we maintain a script to monitor the segment  files
> total with all cores, and send out warnings if that number exceed  a
> threshold... it is kind of preventive measurement.  Currently we are  using
> the linux command to count the files. We are wondering if we can simply  use
> some formula to figure out this number, it will be better that way. Seems  we
> could use the stat url to get segment number and multiply it by 8 (that  is
> what we have given our schema).
> 
> Any better way to approach this?  thanks a lot!
> Renee
> 
> --
> View this message in context: 
>http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2825736.html
>
> Sent  from the Solr - User mailing list archive at Nabble.com.
> 

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Renee Sun <re...@mcafee.com>.
sorry, I should have elaborated on that earlier...

in our production environment, we have multiple cores that ingest
continuously all day long; we only optimize periodically, once a day
at midnight.

So sometimes we see 'too many open files' errors. To prevent that from
happening, in production we maintain a script that monitors the total number
of segment files across all cores and sends out warnings if that number
exceeds a threshold... it is a kind of preventive measure.  Currently we are
using a linux command to count the files. We are wondering if we can simply
use some formula to figure out this number; that would be better. It seems we
could use the stats url to get the segment number and multiply it by 8 (that
is what we get, given our schema).

Any better way to approach this? thanks a lot!
Renee
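
A sketch of the kind of monitoring script described above, counting files directly rather than deriving the number from a formula. It assumes each core keeps its index under <data_root>/<core>/index, as in the /xxx/solr/data/32455077/index path quoted elsewhere in this thread; the function names and the threshold value are hypothetical.

```python
import os


def count_index_files(data_root):
    """Map each core name to the number of regular files in <data_root>/<core>/index."""
    counts = {}
    for core in sorted(os.listdir(data_root)):
        index_dir = os.path.join(data_root, core, "index")
        if os.path.isdir(index_dir):
            counts[core] = sum(1 for entry in os.scandir(index_dir) if entry.is_file())
    return counts


def check_threshold(counts, threshold):
    """Return (total file count, True if the total exceeds the threshold)."""
    total = sum(counts.values())
    return total, total > threshold


if __name__ == "__main__":
    import sys

    # Usage (illustrative): python count_files.py /path/to/solr/data 5000
    total, too_many = check_threshold(count_index_files(sys.argv[1]), int(sys.argv[2]))
    print(f"{total} index files" + (" (WARNING: above threshold)" if too_many else ""))
```

Counting the directory entries stays correct if the set of per-segment files changes with the schema or the Lucene version, which a fixed multiplier would not.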

--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2825736.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Erick Erickson <er...@gmail.com>.
Why do you care? You haven't outlined why having the precise numbers
here is necessary. Perhaps with a higher-level statement of the problem
you're trying to solve we could make some better suggestions....

Best
Erick

On Wed, Apr 13, 2011 at 5:23 PM, Renee Sun <re...@mcafee.com> wrote:

> yeah, I can figure out the segment number by going to stat page of solr...
> but my question was how to figure out exact total number of files in
> 'index'
> folder for each core.
>
> Like I mentioned in previous message, I currently have 8 files per segment
> (.prx .tii etc), but it seems this might change if I use term vector for
> example.  So I need suggestions on how to accurately figure out the total
> file number.
>
> thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2817912.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Renee Sun <re...@mcafee.com>.
yeah, I can figure out the segment number by going to the stats page of solr...
but my question was how to figure out the exact total number of files in the
'index' folder for each core.

Like I mentioned in my previous message, I currently have 8 files per segment
(.prx .tii etc), but it seems this might change if I use term vectors, for
example.  So I need suggestions on how to accurately figure out the total
number of files.

thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2817912.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Jay Hill <ja...@gmail.com>.
As Hoss mentioned earlier in the thread, you can use the statistics page
from the admin console to view the current number of segments. But if you
want to know by looking at the files, each segment will have a unique
prefix, such as "_u". There will be one unique prefix for every segment in
the index.
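
A small sketch of counting segments by prefix, as described above. It assumes segment files are named with an underscore followed by lowercase letters and digits (like "_u"), optionally followed by a generation suffix or an extension; that naming pattern is an assumption based on the file names mentioned in this thread, and the function name is hypothetical.

```python
import re

# Leading "_<name>" of a per-segment file, e.g. "_u" in "_u.fnm" or "_u_1.del".
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)")


def segment_prefixes(filenames):
    """Return the set of distinct segment prefixes found in a list of file names."""
    prefixes = set()
    for name in filenames:
        match = SEGMENT_PREFIX.match(name)
        if match:  # index-level files without a "_" prefix are skipped
            prefixes.add(match.group(1))
    return prefixes


if __name__ == "__main__":
    import os
    import sys

    # Usage (illustrative): python count_segments.py /path/to/core/index
    print(len(segment_prefixes(os.listdir(sys.argv[1]))), "segments")
```

While a merge or optimize is running, files for segments that are not yet committed may also be present, so the count can briefly be higher than what the stats page reports.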

-Jay


On Tue, Apr 12, 2011 at 3:16 PM, Renee Sun <re...@mcafee.com> wrote:

> ok I dug more into this and realize the file extensions can vary depending
> on
> schema, right?
> for instance we dont have *.tvx, *.tvd, *.tvf (not using term vector)...
> and
> I suspect the file extensions
> may change with future lucene releases?
>
> now it seems we can't just count the file using any formula, we have to
> list
> all files in that directory and count that way... any insight will be
> appreciated.
> thanks
> Renee
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813561.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Renee Sun <re...@mcafee.com>.
ok I dug more into this and realized the file extensions can vary depending on
the schema, right?
for instance we don't have *.tvx, *.tvd, *.tvf (not using term vectors)... and
I suspect the file extensions may change with future lucene releases?

now it seems we can't just count the files using any formula, we have to list
all files in that directory and count that way... any insight will be
appreciated.
thanks
Renee

--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813561.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Renee Sun <re...@mcafee.com>.
thanks! 

It seems the file count in the index directory is the segment# * 8 in my dev
environment...

I see there are .fnm .frq .fdt .fdx .nrm .prx .tii .tis (8) file extensions,
and each has as many files as there are segments.

Is it always safe to calculate the file count by multiplying the segment number
by 8? of course this excludes the segments_N, segments.gen and xxx_del files.

I found that most of the cores have a file count that can be calculated using
the above formula, but a few cores do not match...

thanks
Renee

--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2813419.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Chris Hostetter <ho...@fucit.org>.
: /tmp # ls /xxx/solr/data/32455077/index | wc   ---------------> this is the start point, 150 seg files
:      150      150     946
: /tmp # time curl


the number of files in the index directory is not the "number of
segments"

the number of segments is an internal lucene concept that impacts the
number of files, but it is not an actual file count.  A segment can
consist of multiple files depending on how your schema.xml is configured
(and whether you are using the compound file format)

You can see the current number of segments by looking at the stats page...

http://localhost:8983/solr/admin/stats.jsp
SolrIndexReader{this=64a7c45e,r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10} 

...that's from the solr example, where the index directory at the 
time of that request actually contained 93 files.
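
A sketch of pulling the segment count out of the stats page text, assuming the reader description contains a "segments=N" token as in the line above. The function names are hypothetical, and the fetch helper assumes the page is reachable without authentication.

```python
import re
from urllib.request import urlopen

SEGMENTS_RE = re.compile(r"\bsegments=(\d+)")


def parse_segment_count(stats_text):
    """Return the first 'segments=N' value found in the text, or None if absent."""
    match = SEGMENTS_RE.search(stats_text)
    return int(match.group(1)) if match else None


def fetch_segment_count(stats_url):
    """Download a stats page and parse the segment count out of it."""
    with urlopen(stats_url, timeout=10) as response:
        text = response.read().decode("utf-8", errors="replace")
    return parse_segment_count(text)


if __name__ == "__main__":
    sample = ("SolrIndexReader{this=64a7c45e,"
              "r=ReadOnlyDirectoryReader@64a7c45e,refCnt=1,segments=10}")
    print(parse_segment_count(sample))
```

If the page describes more than one reader, only the first match is returned, so a multi-core setup would need one request per core.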


-Hoss

Re: partial optimize does not reduce the segment number to maxNumSegments

Posted by Renee Sun <re...@mcafee.com>.
Hi Hoss,
thanks for your response...

you are right, I had a typo in my question, but I did use maxSegments, and
here is the exact url I used:

 curl
'http://localhost:8080/solr/97/update?optimize=true&maxSegments=10&waitFlush=true'

I used jconsole and du -sk to monitor each partial optimize, and I am sure
the optimize was done; it always reduces the segment files from 130+ to 65+
when I start with maxSegments=10; when I run it again with maxSegments=9, it
reduces them to somewhere in the 50s.

when I use maxSegments=2, it always reduces the segment files to 18; and
maxSegments=1 (full optimize) always reduces the core to 10 segment files.

this has been repeated about a dozen times.

I think the resulting number of files depends on the size of the core. I
have a core that takes 10GB of disk space and has 4 million documents.

It perhaps also depends on other solr/lucene configurations? let me know if
I should give you any data from our solr config.

Here is the actual data from the test I ran lately, for your reference; you
can see that each partial optimize definitely finished, and the time spent is
also included (please note I am using a core id there which is different from
yours):

/tmp # ls /xxx/solr/data/32455077/index | wc   ---------------> this is the start point, 150 seg files
     150      150     946
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=10&waitFlush=true'
real    0m36.050s
user    0m0.002s
sys    0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc    ---------> after first partial optimize (10), reduce to 82
     82      82     746
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=9&waitFlush=true'
real    1m54.364s
user    0m0.003s
sys    0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
     74      74     674
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=8&waitFlush=true'
real    2m0.443s
user    0m0.002s
sys    0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
     66      66     602
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=7&waitFlush=true'
<?xml version="1.0" encoding="UTF-8"?>
real    3m22.201s
user    0m0.002s
sys    0m0ls 

/tmp # ls /xxx/solr/data/32455077/index | wc
     58      58     530
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=6&w 
real    3m29.277s
user    0m0.001s
sys    0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
     50      50     458
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=5&w 
real    3m41.514s
user    0m0.003s
sys    0m0.003s

/tmp # ls /xxx/solr/data/32455077/index | wc
     42      42     386
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=4&w 
real    5m35.697s
user    0m0.003s
sys    0m0.004s

/tmp # ls /xxx/solr/data/32455077/index | wc
     34      34     314
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=3wa 
real    7m8.773s
user    0m0.003s
sys    0m0.002s

/tmp # ls /xxx/solr/data/32455077/index | wc
     26      26     242
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=2&w 
real    9m18.814s
user    0m0.004s
sys    0m0.001s

/tmp # ls /xxx/solr/data/32455077/index | wc
     18      18     170
/tmp # time curl 'http://localhost:8080/solr/32455077/update?optimize=true&maxSegments=1&w
(full optimize)
real    16m6.599s
user    0m0.003s
sys    0m0.004s

Disk Space Usage:
first 3 runs took about 20% extra 
middle couple runs took about 50% extra 
last full optimize took 100% extra
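
One reading that fits the listings above (an inference, not something confirmed in this thread): each file count equals 8 files per segment plus 2 index-level files, which would presumably be the segments_N and segments.gen files mentioned elsewhere in the thread. Under that assumption the optimize did reach maxSegments every time, and the `ls | wc` numbers are file counts rather than segment counts. The constant and function names below are hypothetical; the numbers are copied from the listings.

```python
FILES_PER_SEGMENT = 8  # .fnm .frq .fdt .fdx .nrm .prx .tii .tis, as listed in this thread
INDEX_LEVEL_FILES = 2  # assumed: segments_N and segments.gen

# maxSegments value -> file count reported by `ls | wc` after that optimize
observed = {10: 82, 9: 74, 8: 66, 7: 58, 6: 50, 5: 42, 4: 34, 3: 26, 2: 18}


def expected_files(segments):
    """File count predicted for an index with the given number of segments."""
    return FILES_PER_SEGMENT * segments + INDEX_LEVEL_FILES


# Collect every run whose observed count differs from the prediction.
mismatches = {s: n for s, n in observed.items() if expected_files(s) != n}
print(mismatches)  # prints {}
```

The same formula gives 10 files for a single segment, which matches the 10 files reported after a full optimize. It would not hold for an index with deletion files, term vector files, or the compound file format.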


--
View this message in context: http://lucene.472066.n3.nabble.com/partial-optimize-does-not-reduce-the-segment-number-to-maxNumSegments-tp2682195p2812415.html
Sent from the Solr - User mailing list archive at Nabble.com.