Posted to java-user@lucene.apache.org by Trejkaz <tr...@trypticon.org> on 2011/08/23 06:17:51 UTC

IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

Hi all.

We are using IndexWriter with no limits set and managing the commits
ourselves, mainly so that we can ensure they are done at the same time
as other (non-Lucene) commits.

After upgrading from 3.0 to 3.3, we are seeing a change in
ramSizeInBytes() behaviour where it is no longer resetting to zero
after a commit().  The end result is that after a while, the code
wants to commit after adding even a single document.

I boiled it down to a test case (though I'm obviously just using JUnit
as a helper here):

    @Test
    public void testIndexWriterByteCount() throws Exception
    {
        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory,
                new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
        System.out.println("At start: " + writer.ramSizeInBytes());

        for (int j = 0; j < 3; j++)
        {
            for (int i = 0; i < 5; i++)
            {
                Document document = new Document();
                document.add(new Field("text", "a", Field.Store.YES,
                        Field.Index.ANALYZED));
                writer.addDocument(document);
            }
            System.out.println("After adding some docs: " +
                    writer.ramSizeInBytes());

            writer.commit();
            System.out.println("After commit: " + writer.ramSizeInBytes());
        }

        writer.close();
        directory.close();
    }

The results on Lucene 3.3.0:

    At start: 0
    After adding some docs: 99400
    After commit: 99344
    After adding some docs: 99400
    After commit: 99344
    After adding some docs: 99400
    After commit: 99344

The results of running more or less the same test on Lucene 3.0.3:

    At start: 0
    After adding some docs: 115712
    After commit: 0
    After adding some docs: 50176
    After commit: 0
    After adding some docs: 50176
    After commit: 0

Questions:

(1) Is Lucene now caching more than it used to, which would account
for the extra space usage, or is this simply a bug where the count
isn't being updated correctly?

(2) Is checking ramSizeInBytes() still the recommended way to
determine whether it's time to commit()?

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

Posted by Trejkaz <tr...@trypticon.org>.
On Wed, Aug 24, 2011 at 4:45 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Hmm... this looks like a side-effect of LUCENE-2680, which was merged
> back from trunk to 3.1.
>
> So the problem is, IW recycles the RAM it has allocated, and so this
> method is returning the allocated RAM, even if those buffers are not
> in fact in use right now (ie, filled with postings data).  I think
> it's important that it does this, ie, it should be honest that it is
> in fact tying up RAM.
>
> Maybe we could fix this by adding a new method that tells you how much
> of the buffers are really in-use... but I don't think we directly
> track that now; it'd have to be computed from the free buffers lists
> inside DocumentsWriter.
>
> BTW, why not have IW flush by RAM itself?  This way it will flush (but
> not commit) the postings to disk... commit is rather costly since it
> fsyncs all the newly written files.

I think we're worried about the consequences of leaving around
partially written segments, particularly in the case where the
indexing process crashes multiple times.

Although, even if we turned on flushing, it seems like we still need
to know when to commit(), because we commit Lucene and other things at
the same time.  We were determining an appropriate time based on the
amount of data which wasn't committed yet, but it isn't possible to do
that with the current version as far as I can tell (you can get the
number of documents, but real-world documents vary so much in size
that the count isn't useful).
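
One way to approximate "amount of data not committed yet" without
relying on ramSizeInBytes() is to count, on the application side, the
bytes handed to addDocument() since the last commit. The following is
only a minimal sketch: UncommittedBytesTracker and its methods are
hypothetical names, not part of Lucene, and it assumes the UTF-8 length
of the field text is a good enough proxy for document size.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not a Lucene class): counts bytes added since the last commit. */
final class UncommittedBytesTracker {
    private final long thresholdBytes;
    private long uncommittedBytes;

    UncommittedBytesTracker(long thresholdBytes) {
        if (thresholdBytes <= 0) {
            throw new IllegalArgumentException("thresholdBytes must be positive");
        }
        this.thresholdBytes = thresholdBytes;
    }

    /** Record the field values of one document that was just added to the index. */
    void recordDocument(String... fieldValues) {
        for (String value : fieldValues) {
            // Assumption: UTF-8 length of the raw text approximates the document's cost.
            uncommittedBytes += value.getBytes(StandardCharsets.UTF_8).length;
        }
    }

    /** True once the uncommitted data has reached the configured threshold. */
    boolean shouldCommit() {
        return uncommittedBytes >= thresholdBytes;
    }

    /** Reset the counter; call this after all stores have committed. */
    void committed() {
        uncommittedBytes = 0;
    }

    long uncommittedBytes() {
        return uncommittedBytes;
    }
}
```

The caller would invoke recordDocument() after each
writer.addDocument(), and once shouldCommit() returns true, commit
Lucene together with the other stores and then call committed().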

I might have a try at adding a method to DocumentsWriter to compute
the amount of actual used space.

TX



Re: IndexWriter.ramSizeInBytes() no longer returns to 0 after commit()?

Posted by Michael McCandless <lu...@mikemccandless.com>.
Hmm... this looks like a side-effect of LUCENE-2680, which was merged
back from trunk to 3.1.

So the problem is, IW recycles the RAM it has allocated, and so this
method is returning the allocated RAM, even if those buffers are not
in fact in use right now (ie, filled with postings data).  I think
it's important that it does this, ie, it should be honest that it is
in fact tying up RAM.

Maybe we could fix this by adding a new method that tells you how much
of the buffers are really in-use... but I don't think we directly
track that now; it'd have to be computed from the free buffers lists
inside DocumentsWriter.
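
To illustrate the difference between allocated and in-use RAM, here is
a toy model of a recycling buffer pool. It is not the DocumentsWriter
code; RecyclingBufferPool and its methods are hypothetical names, and
the fixed block size is an assumption made to keep the sketch short.
It shows why the allocated figure stays flat after buffers are
released, and how an in-use figure could be derived from the free
list.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (not Lucene code): a pool that recycles fixed-size blocks. */
final class RecyclingBufferPool {
    private final int blockSize;
    private final Deque<byte[]> freeBlocks = new ArrayDeque<byte[]>();
    private long allocatedBytes;

    RecyclingBufferPool(int blockSize) {
        this.blockSize = blockSize;
    }

    /** Reuse a free block if one exists; otherwise allocate a new one. */
    byte[] acquire() {
        byte[] block = freeBlocks.poll();
        if (block == null) {
            block = new byte[blockSize];
            allocatedBytes += blockSize;
        }
        return block;
    }

    /** Return a block to the free list; the memory stays allocated. */
    void release(byte[] block) {
        freeBlocks.push(block);
    }

    /** Everything the pool is holding on to, whether filled or idle. */
    long allocatedBytes() {
        return allocatedBytes;
    }

    /** Allocated bytes minus the blocks currently sitting on the free list. */
    long inUseBytes() {
        return allocatedBytes - (long) freeBlocks.size() * blockSize;
    }
}
```

In this model, releasing every block (as a flush or commit would)
leaves allocatedBytes() unchanged while inUseBytes() drops to zero.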

BTW, why not have IW flush by RAM itself?  This way it will flush (but
not commit) the postings to disk... commit is rather costly since it
fsyncs all the newly written files.

Mike McCandless

http://blog.mikemccandless.com
