Posted to java-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2010/01/13 02:42:05 UTC

Supported way to get segment from IndexWriter?

A conversation with someone earlier today got me thinking about cranking 
out a patch for SOLR-1559 (in which the goal is to allow for rules to 
determine the input to optimize(maxNumSegments) instead of requiring a fixed 
integer value as input) when i realized that i wasn't certain what 
"approved" methods there might be for determining the current number of 
segments from an IndexWriter.

I see IndexWriter.getSegmentCount() but it's package protected (with a 
comment that it exists for tests).  So my best guess using only public 
APIs would be something like...

  int numCurrentSegments = -1;
  IndexReader r = writer.getReader();
  try {
    // null means the reader has no sequential sub-readers (a single segment)
    IndexReader[] tmp = r.getSequentialSubReaders();
    numCurrentSegments = (tmp == null) ? 1 : tmp.length;
  } finally {
    r.close();
  }

Is there a better way?

(My main concern about this approach is that my intuition (which seems 
supported by the javadocs) says getReader might be a little 
expensive/excessive just to count the segments)

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Supported way to get segment from IndexWriter?

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Jan 14, 2010 at 7:57 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Since SegmentInfos is now public, you could use SegmentInfos.read to
> : read the current segments_N file, and then call its .size() method?
> :
> : But, this will only count as of the last commit... which is probably
> : not sufficient for SOLR-1559?
>
> Honestly: i have no idea, I'm a little out of touch with what "commit"
> means in Lucene-Java these days.

Commit just means that a new segments_N file is written into the
index, so that an external reader on doing an open/reopen would see
all changes made with the IndexWriter prior to the commit.  (commit
also makes the changes "durable", ie, will survive a crash, power
loss, etc, by syncing the necessary files).
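
As a rough illustration of "a new segments_N file is written", the sketch
below picks the most recent commit out of a list of index file names. It
assumes the N suffix is the commit generation written in base 36; the class
and method names (LatestCommit, latestGeneration) are hypothetical and are
not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper, not a Lucene class.
final class LatestCommit {
    private static final String PREFIX = "segments_";

    private LatestCommit() {}

    /**
     * Returns the highest commit generation found among the given file names,
     * or an empty result if no segments_N file is present.
     * Assumption: N is the generation encoded in base 36.
     */
    static OptionalLong latestGeneration(List<String> fileNames) {
        long best = -1;
        for (String name : fileNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not a commit file (e.g. segments.gen, _0.cfs)
            }
            try {
                long gen = Long.parseLong(name.substring(PREFIX.length()),
                                          Character.MAX_RADIX);
                best = Math.max(best, gen);
            } catch (NumberFormatException e) {
                // suffix is not a base-36 number; skip this file
            }
        }
        return best < 0 ? OptionalLong.empty() : OptionalLong.of(best);
    }

    public static void main(String[] args) {
        List<String> files =
            Arrays.asList("_0.cfs", "segments.gen", "segments_2", "segments_a");
        System.out.println(latestGeneration(files));
    }
}
```

An external reader would only see the segments listed in that latest commit
file, which is why counting from it misses anything not yet committed.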

> The goal is to be able to compute a maxNumberOfSegments relative to "the
> current number of segments", some people might perceive that as the
> current number of "committed" segments -- but really it comes down to
> what optimize is going to do with the resulting number.
>
> if someone has the goal of making iterative micro optimizations to their
> index, so they say "optimize to $currentSegmentCount-1 segments" but the
> number of committed segments is 3 and the number of uncommitted segments is
> 27 higher (because of active indexing) so the app starts thrashing as it
> tries to optimize down from 27 to 3, that doesn't feel like "Do what i
> mean"

Right.  Merging/optimizing currently always run against all (committed
& uncommitted) segments...

> : We could simply make getSegmentCount public / expert / not only for tests?
>
> I was considering doing that in the SolrIndexWriter class that already
> exists as a subclass of IndexWriter, but i didn't want to go that route if
> there was a good reason why IndexWriter.getSegmentCount isn't already
> public (ie: if there was an expectation that the way IndexWriter manages
> segments was subject to refactoring getSegmentCount out of existence)

I think it's fine to make this public and mark it as expert, as well
as note that one can't rely on "when" IW makes new segments.  Eg,
today when you get a near real-time reader, IW creates a new segment,
but in the future it's possible it will not.  So this expert method
lets you peek into the index structure, but the count it returns
after a given series of calls on IW is subject to change.

Mike



Re: Supported way to get segment from IndexWriter?

Posted by Chris Hostetter <ho...@fucit.org>.
: Since SegmentInfos is now public, you could use SegmentInfos.read to
: read the current segments_N file, and then call its .size() method?
: 
: But, this will only count as of the last commit... which is probably
: not sufficient for SOLR-1559?

Honestly: i have no idea, I'm a little out of touch with what "commit" 
means in Lucene-Java these days.

The goal is to be able to compute a maxNumberOfSegments relative to "the 
current number of segments", some people might perceive that as the 
current number of "committed" segments -- but really it comes down to 
what optimize is going to do with the resulting number.

if someone has the goal of making iterative micro optimizations to their 
index, so they say "optimize to $currentSegmentCount-1 segments" but the 
number of committed segments is 3 and the number of uncommitted segments is 
27 higher (because of active indexing) so the app starts thrashing as it 
tries to optimize down from 27 to 3, that doesn't feel like "Do what i 
mean"
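
To make the mismatch concrete, here is a minimal sketch of a
"$currentSegmentCount-1" style rule. Everything in it is hypothetical: the
class and method names are made up for illustration, and the counts in main
(3 committed segments plus 27 uncommitted ones) are example numbers, not
measurements. It does not call Lucene at all; it only shows how much the
computed target depends on which segment count is fed into the rule.

```java
// Hypothetical sketch of a relative maxNumSegments rule; not Solr or Lucene code.
final class SegmentTargetRule {
    private SegmentTargetRule() {}

    /**
     * Computes a target for optimize(maxNumSegments): reduceBy fewer segments
     * than the current count, but never fewer than one segment.
     */
    static int targetSegments(int currentSegmentCount, int reduceBy) {
        if (currentSegmentCount < 0 || reduceBy < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return Math.max(1, currentSegmentCount - reduceBy);
    }

    public static void main(String[] args) {
        int committed = 3;     // example: segments in the last commit
        int uncommitted = 27;  // example: extra segments from active indexing
        int total = committed + uncommitted;

        // Target derived from the committed count only.
        System.out.println("from committed count: " + targetSegments(committed, 1));
        // Target derived from everything the writer currently holds.
        System.out.println("from total count: " + targetSegments(total, 1));
    }
}
```

With the committed count as input the rule asks for a far more aggressive
merge than "one fewer segment than we have right now", which is the
surprise described above.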

: We could simply make getSegmentCount public / expert / not only for tests?

I was considering doing that in the SolrIndexWriter class that already 
exists as a subclass of IndexWriter, but i didn't want to go that route if 
there was a good reason why IndexWriter.getSegmentCount isn't already 
public (ie: if there was an expectation that the way IndexWriter manages 
segments was subject to refactoring getSegmentCount out of existence)




-Hoss




Re: Supported way to get segment from IndexWriter?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Indeed, getReader is an expensive way to get the segment count (it
flushes the current RAM buffer to disk as a new segment).

Since SegmentInfos is now public, you could use SegmentInfos.read to
read the current segments_N file, and then call its .size() method?

But, this will only count as of the last commit... which is probably
not sufficient for SOLR-1559?

We could simply make getSegmentCount public / expert / not only for tests?

Mike

