Posted to users@solr.apache.org by Rahul Goswami <ra...@gmail.com> on 2023/08/31 20:45:58 UTC

[Solr] Reindexing leaving behind 0 live doc segments

Hello,
I tried floating this question on the Lucene list as well, but thought
the answer might also lie in how Solr handles its IndexReader. Hence
posting here.

I am trying to execute a program that reads documents segment-by-segment and
reindexes them to the same index. I am reading using the Lucene apis and
indexing using the Solr api (in a core that is currently loaded) to preserve
the field analysis and automatically take care of deletions. The segments
being processed *do not* participate in merges.

The reindexing code is executing in the same JVM *inside* of Solr.

What I am observing is that even after a segment has been fully processed
and an autoCommit (as well as autoSoftCommit) has kicked in, the segment
with 0 live docs gets left behind, bloating the index. *Upon Solr
restart, the segment gets cleared successfully.*

I tried to replicate the same thing outside my program by indexing 3 docs on
an empty test core and then reindexing the same docs. There, the older
segment gets deleted as soon as the softCommit interval hits or an explicit
commit=true is called.

Here is the high-level code, but it leaves undeleted segments behind
until I restart Solr and reload the core.

try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
     IndexReader reader = DirectoryReader.open(dir)) {
    for (LeafReaderContext lrc : reader.leaves()) {
        // read live docs from each leaf, create a SolrInputDocument
        // out of each Document, and index using the Solr api
    }
} catch (Exception e) {
    // log and handle the exception instead of swallowing it
}

Would opening an IndexReader this way interfere with how Solr manages
IndexReader and file refCounts, thereby preventing file deletion? What am I
missing?

Happy to provide more details if required, code or otherwise. Help would be
much appreciated!

Thanks,
Rahul

Re: [Solr] Reindexing leaving behind 0 live doc segments

Posted by Alex Deparvu <st...@apache.org>.
Hi.

Just a thought: should you be calling IndexReader#decRef in your code once
you are done working with the directory?
I see it happening in Solr on close
https://github.com/apache/solr/blob/ec8f23622b04de80c1dcb85638a73f9d9566d1bf/solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java#L575
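
For illustration, a minimal sketch of leaning on Solr's own ref counting
instead of opening a second DirectoryReader over the live index. This is my
own sketch, not your code; the class and method names are made up, but
RefCounted does exactly the incref/decref bookkeeping linked above:

    import java.io.IOException;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.util.RefCounted;

    public class SegmentWalker {
        public static void walkSegments(SolrCore core) throws IOException {
            // getSearcher() increments the reference count on the searcher,
            // so we must decref() when done or Solr cannot release files.
            RefCounted<SolrIndexSearcher> searcherRef = core.getSearcher();
            try {
                DirectoryReader reader = searcherRef.get().getIndexReader();
                for (LeafReaderContext lrc : reader.leaves()) {
                    // read live docs from each leaf and reindex via the Solr api
                }
            } finally {
                searcherRef.decref(); // return the reference to Solr
            }
        }
    }

That way the reader's ref counts stay entirely under Solr's control.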

best,
alex


On Sun, Sep 10, 2023 at 7:06 PM Rahul Goswami <ra...@gmail.com> wrote:

> Shawn ,
> Thanks for looking further into this. Although many of our Solr instances
> do run on Windows servers, for testing this particular reindexing program,
> I have been running it on Linux to get the OS variable out of the equation
> for now. The behavior I described in my original email occurs on Linux.
> After enough troubleshooting (and code reading), it seems like there is a
> ref count maintained internally at the Lucene level which is not going down
> to 0, thereby making the segments ineligible for deletion.
>
> What is baffling is that even after the reader is closed and I am done
> processing all the required segments, when I issue a commit through the
> code, it still doesn't have any effect.
>
> Only two things help with the cleanup: i) Solr restart, ii) core reload.
> And unfortunately neither of these approaches is practical for my use case
> since I can't wait for the whole processing to finish before reclaiming the
> space, especially when some of the cores are 3-4 TB.
>
> Thanks,
> Rahul
>
> On Sat, Sep 2, 2023 at 4:45 PM Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 9/1/23 16:30, Rahul Goswami wrote:
> > > Thanks for your response. To your question about locking, I am not doing
> > > anything explicitly here. If you are alluding to deleting the write.lock
> > > file and opening a new IndexWriter, I am not doing that. Only an
> > > IndexReader.
> > >
> > > Are you suggesting opening an IndexReader from within Solr could interfere
> > > with Solr's working and in turn file deletions? I think an answer to this
> > > question would really help me understand what is going wrong.
> >
> > I don't know what exactly the effects are of opening just a reader with
> > Lucene.
> >
> > I had another thought, and then I did a little searching on my list
> > archive to see if I could answer a question:  What OS is this on?
> >
> > Other messages you've written say that you're running on Windows.
> >
> > Windows does something that on the surface sounds like a good thing:  If
> > a file is open in ANY mode, including read-only, Windows will not allow
> > that file to be deleted.
> >
> > So I think the problem here is that you've got a Lucene program keeping
> > those segment files open, so when the Lucene code running in Solr tries
> > to delete them as a normal part of commit operations, it can't.
> >
> > If you were running this on pretty much any other OS, you probably
> > wouldn't be having this problem.  Other operating systems like Linux
> > allow file deletion even if the file is open elsewhere.  The file will
> > continue to exist on the filesystem until the last program that has it
> > open exits or closes the file, at which time the filesystem will finish
> > the deletion.
> >
> > If you have to stick with Windows, then you're going to have to do
> > something after your program closes its reader to trigger Lucene's
> > auto-cleanup of segments.  I believe a Solr index reload would
> > accomplish that.  Another way might be to index a dummy document, delete
> > that document, and issue a commit.
> >
> > Thanks,
> > Shawn
> >
> >
>

Re: [Solr] Reindexing leaving behind 0 live doc segments

Posted by Rahul Goswami <ra...@gmail.com>.
Shawn ,
Thanks for looking further into this. Although many of our Solr instances
do run on Windows servers, for testing this particular reindexing program,
I have been running it on Linux to get the OS variable out of the equation
for now. The behavior I described in my original email occurs on Linux.
After enough troubleshooting (and code reading), it seems like there is a
ref count maintained internally at the Lucene level which is not going down
to 0, thereby making the segments ineligible for deletion.

What is baffling is that even after the reader is closed and I am done
processing all the required segments, when I issue a commit through the
code, it still doesn't have any effect.

Only two things help with the cleanup: i) Solr restart, ii) core reload.
And unfortunately neither of these approaches is practical for my use case
since I can't wait for the whole processing to finish before reclaiming the
space, especially when some of the cores are 3-4 TB.

Thanks,
Rahul

On Sat, Sep 2, 2023 at 4:45 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/1/23 16:30, Rahul Goswami wrote:
> > Thanks for your response. To your question about locking, I am not doing
> > anything explicitly here. If you are alluding to deleting the write.lock
> > file and opening a new IndexWriter, I am not doing that. Only an
> > IndexReader.
> >
> > Are you suggesting opening an IndexReader from within Solr could interfere
> > with Solr's working and in turn file deletions? I think an answer to this
> > question would really help me understand what is going wrong.
>
> I don't know what exactly the effects are of opening just a reader with
> Lucene.
>
> I had another thought, and then I did a little searching on my list
> archive to see if I could answer a question:  What OS is this on?
>
> Other messages you've written say that you're running on Windows.
>
> Windows does something that on the surface sounds like a good thing:  If
> a file is open in ANY mode, including read-only, Windows will not allow
> that file to be deleted.
>
> So I think the problem here is that you've got a Lucene program keeping
> those segment files open, so when the Lucene code running in Solr tries
> to delete them as a normal part of commit operations, it can't.
>
> If you were running this on pretty much any other OS, you probably
> wouldn't be having this problem.  Other operating systems like Linux
> allow file deletion even if the file is open elsewhere.  The file will
> continue to exist on the filesystem until the last program that has it
> open exits or closes the file, at which time the filesystem will finish
> the deletion.
>
> If you have to stick with Windows, then you're going to have to do
> something after your program closes its reader to trigger Lucene's
> auto-cleanup of segments.  I believe a Solr index reload would
> accomplish that.  Another way might be to index a dummy document, delete
> that document, and issue a commit.
>
> Thanks,
> Shawn
>
>

Re: [Solr] Reindexing leaving behind 0 live doc segments

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/1/23 16:30, Rahul Goswami wrote:
> Thanks for your response. To your question about locking, I am not doing
> anything explicitly here. If you are alluding to deleting the write.lock
> file and opening a new IndexWriter, I am not doing that. Only an
> IndexReader.
> 
> Are you suggesting opening an IndexReader from within Solr could interfere
> with Solr's working and in turn file deletions? I think an answer to this
> question would really help me understand what is going wrong.

I don't know what exactly the effects are of opening just a reader with 
Lucene.

I had another thought, and then I did a little searching on my list 
archive to see if I could answer a question:  What OS is this on?

Other messages you've written say that you're running on Windows.

Windows does something that on the surface sounds like a good thing:  If 
a file is open in ANY mode, including read-only, Windows will not allow 
that file to be deleted.

So I think the problem here is that you've got a Lucene program keeping 
those segment files open, so when the Lucene code running in Solr tries 
to delete them as a normal part of commit operations, it can't.

If you were running this on pretty much any other OS, you probably 
wouldn't be having this problem.  Other operating systems like Linux 
allow file deletion even if the file is open elsewhere.  The file will 
continue to exist on the filesystem until the last program that has it 
open exits or closes the file, at which time the filesystem will finish 
the deletion.

If you have to stick with Windows, then you're going to have to do 
something after your program closes its reader to trigger Lucene's 
auto-cleanup of segments.  I believe a Solr index reload would 
accomplish that.  Another way might be to index a dummy document, delete 
that document, and issue a commit.
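
For instance, a rough SolrJ sketch of that dummy-document trick; the URL,
core name, and dummy id are placeholders to adapt, not a definitive recipe:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TriggerCleanup {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "dummy-cleanup-trigger");
                client.add(doc);                            // index a dummy document
                client.deleteById("dummy-cleanup-trigger"); // delete it again
                client.commit();                            // commit may let Lucene drop dead segments
            }
        }
    }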

Thanks,
Shawn


Re: [Solr] Reindexing leaving behind 0 live doc segments

Posted by Rahul Goswami <ra...@gmail.com>.
Shawn,
Thanks for your response. To your question about locking, I am not doing
anything explicitly here. If you are alluding to deleting the write.lock
file and opening a new IndexWriter, I am not doing that. Only an
IndexReader.

Are you suggesting opening an IndexReader from within Solr could interfere
with Solr's working and in turn file deletions? I think an answer to this
question would really help me understand what is going wrong.

I am running Solr 8.11.1 in *standalone mode* and the index is a mix of 7.x
and 8.x segments. I am only reindexing 7.x segments, excluding them from
participating in merges through a custom merge policy. Also please note
that the reindexing process finishes successfully: if I check the
/admin/segments Solr endpoint afterwards, all the segments show version
8.11, all data integrity checks pass, and searches work fine, except that
these 0-live-doc 7.x segments get left behind and cause index bloat.
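
For reference, a minimal sketch of the kind of merge policy meant here; the
class name, the delegate policy, and the version check are illustrative
rather than the actual implementation:

    import java.io.IOException;

    import org.apache.lucene.index.FilterMergePolicy;
    import org.apache.lucene.index.MergeTrigger;
    import org.apache.lucene.index.SegmentCommitInfo;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.index.TieredMergePolicy;

    public class ExcludeOldSegmentsMergePolicy extends FilterMergePolicy {

        public ExcludeOldSegmentsMergePolicy() {
            super(new TieredMergePolicy());
        }

        // Only regular merges are shown; findForcedMerges and
        // findForcedDeletesMerges would need the same filtering.
        @Override
        public MergeSpecification findMerges(MergeTrigger trigger,
                SegmentInfos infos, MergeContext context) throws IOException {
            // Offer the delegate policy only segments written by Lucene 8.x,
            // so the old 7.x segments are never selected for merging.
            SegmentInfos eligible = new SegmentInfos(infos.getIndexCreatedVersionMajor());
            for (SegmentCommitInfo sci : infos) {
                if (sci.info.getVersion().major >= 8) {
                    eligible.add(sci);
                }
            }
            return super.findMerges(trigger, eligible, context);
        }
    }

(In Solr this would be plugged in through a custom MergePolicyFactory in
solrconfig.xml.)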

Thanks,
Rahul


On Thu, Aug 31, 2023 at 10:22 PM Shawn Heisey <el...@elyograg.org> wrote:

> On 8/31/23 14:45, Rahul Goswami wrote:
> > I am trying to execute a program that reads documents segment-by-segment and
> > reindexes them to the same index. I am reading using the Lucene apis and
> > indexing using the Solr api (in a core that is currently loaded) to preserve
> > the field analysis and automatically take care of deletions. The segments
> > being processed *do not* participate in merges.
>
> In order to make this even possible, you must have changed the locking
> to 'none'.  Lucene normally prevents opening the same index more than
> once, and it does so for good reason.  Operation is undefined in that
> situation.  Your index could easily become corrupted.
>
> What you should do is make a copy of the index directory and index from
> there to Solr.  If you make one pass with rsync and then repeat the
> rsync, the second pass will complete VERY quickly and should produce a
> good index.  I would use "rsync -avH --delete /path/to/source/
> /path/to/target/" for that.
>
> <snip>
>
> > Would opening an IndexReader this way interfere with how Solr manages
> > IndexReader and file refCounts, thereby preventing file deletion? What am I
> > missing?
>
> As I mentioned above, Lucene can make no guarantees about how things
> work if the index is opened more than once.  The Lucene program would
> likely not interfere with Solr's reference counts, but it is still a
> REALLY bad idea to have both Solr and your program open the index.
>
> You could try reloading the core rather than restarting Solr.  That will
> happen very quickly and might cause Lucene to delete empty segments.  If
> the index is not large, you could ask Solr to optimize the index with
> maxSegments=1.  You would not want to do that on a really large index.
>
> Thanks,
> Shawn
>
>

Re: [Solr] Reindexing leaving behind 0 live doc segments

Posted by Shawn Heisey <el...@elyograg.org>.
On 8/31/23 14:45, Rahul Goswami wrote:
> I am trying to execute a program that reads documents segment-by-segment and
> reindexes them to the same index. I am reading using the Lucene apis and
> indexing using the Solr api (in a core that is currently loaded) to preserve
> the field analysis and automatically take care of deletions. The segments
> being processed *do not* participate in merges.

In order to make this even possible, you must have changed the locking 
to 'none'.  Lucene normally prevents opening the same index more than 
once, and it does so for good reason.  Operation is undefined in that 
situation.  Your index could easily become corrupted.
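
(For reference, the lock type is set in the indexConfig section of
solrconfig.xml; 'none' would look like the snippet below, which is not
something to leave in place permanently:

    <indexConfig>
      <lockType>none</lockType>
    </indexConfig>
)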

What you should do is make a copy of the index directory and index from 
there to Solr.  If you make one pass with rsync and then repeat the 
rsync, the second pass will complete VERY quickly and should produce a 
good index.  I would use "rsync -avH --delete /path/to/source/ 
/path/to/target/" for that.

<snip>

> Would opening an IndexReader this way interfere with how Solr manages
> IndexReader and file refCounts, thereby preventing file deletion? What am I
> missing?

As I mentioned above, Lucene can make no guarantees about how things
work if the index is opened more than once.  The Lucene program would 
likely not interfere with Solr's reference counts, but it is still a 
REALLY bad idea to have both Solr and your program open the index.

You could try reloading the core rather than restarting Solr.  That will 
happen very quickly and might cause Lucene to delete empty segments.  If 
the index is not large, you could ask Solr to optimize the index with 
maxSegments=1.  You would not want to do that on a really large index.
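
For instance, a minimal SolrJ sketch of both options; the URL and core name
are placeholders for your setup:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class ReloadOrOptimize {
        public static void main(String[] args) throws Exception {
            // A core reload goes through the core admin API.
            try (HttpSolrClient adminClient = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr").build()) {
                CoreAdminRequest.reloadCore("mycore", adminClient);
            }
            // Optimize down to one segment -- avoid on very large indexes.
            try (HttpSolrClient coreClient = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                coreClient.optimize(true, true, 1); // waitFlush, waitSearcher, maxSegments
            }
        }
    }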

Thanks,
Shawn