Posted to java-user@lucene.apache.org by Mindaugas Žakšauskas <mi...@gmail.com> on 2014/01/16 12:30:31 UTC

Presence of uncommitted changes

Hi,

I was wondering what the best approach would be when some documents
are deleted and it is unclear whether the deletions have left any
changes pending a commit.

In a single-threaded scenario this seems to be as simple as

1  indexWriter.deleteDocuments(query); // same for terms arg
2  if (indexWriter.hasUncommittedChanges()) {
3     indexWriter.commit();
4  }

However, I am not sure this works as nicely in a multi-threaded
environment, where a race condition can occur between lines
2 & 3.

How feasible would it be for .deleteDocuments() to return a number
telling how many documents have actually been deleted?

Also, how cheap is .hasUncommittedChanges()? My quick code scan
shows that no I/O is involved, but maybe somebody can confirm
this? And if it is cheap, why isn't this method invoked at the
beginning of indexWriter.commit()?

Thank you.

m.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Presence of uncommitted changes

Posted by Erick Erickson <er...@gmail.com>.
You might want to look at the soft/hard commit options for ensuring
data integrity vs. latency.
Here's a blog post on this topic at the Solr level, but all the Solr
stuff is realized at the Lucene level
eventually, so....

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Although this is written with SolrCloud in mind, I don't _think_
there's any problem with
doing this on a regular Lucene index....

Best,
Erick

On Fri, Jan 17, 2014 at 8:34 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Fri, Jan 17, 2014 at 7:42 AM, Mindaugas Žakšauskas <mi...@gmail.com> wrote:
>> On Fri, Jan 17, 2014 at 12:13 PM, Michael McCandless
>>> Backing up, what is your app doing, that it so strongly relies on
>>> knowing whether commit() would do anything?  Usually, commit is
>>> something you call rarely, for "safety" purposes to ensure if the
>>> world comes crashing down, you'll have a known state in the index on
>>> restart.
>>
>> We use a quite conservative commit policy - we commit almost every
>> time a new document is added to the index (or updated/deleted) -
>> hence the need to know if commit() is necessary.
>>
>> This might sound sub-optimal, but I think it is justifiable because in
>> our application the incoming data stream is not really intense: we
>> normally get just a handful of documents added in a minute. The
>> ability to see those newly added (updated, deleted) documents
>> instantly is far more important.
>
> Seeing newly added documents instantly (in search) is what
> near-real-time readers are for.
>
> Opening an NRT reader from a writer is far faster and less costly than
> doing a commit + reopen.
>
>> Committing often also gives extra security: if the system crashes,
>> we are pretty sure we haven't lost anything, which matters because
>> rebuilding the index can take days. We could, of course, reindex
>> just the missing documents, but finding out exactly what is missing
>> is not trivial.
>
> OK.  Committing should only be used for this purpose (ensuring the
> index is in a known state if the world comes crashing down).
>
> Still, committing after every document is rather insane: performance
> will be awful.  But since your app seems to be very low traffic, maybe
> it's OK in your case ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com


Re: Presence of uncommitted changes

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Jan 17, 2014 at 7:42 AM, Mindaugas Žakšauskas <mi...@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 12:13 PM, Michael McCandless
>> Backing up, what is your app doing, that it so strongly relies on
>> knowing whether commit() would do anything?  Usually, commit is
>> something you call rarely, for "safety" purposes to ensure if the
>> world comes crashing down, you'll have a known state in the index on
>> restart.
>
> We use a quite conservative commit policy - we commit almost every
> time a new document is added to the index (or updated/deleted) -
> hence the need to know if commit() is necessary.
>
> This might sound sub-optimal, but I think it is justifiable because in
> our application the incoming data stream is not really intense: we
> normally get just a handful of documents added in a minute. The
> ability to see those newly added (updated, deleted) documents
> instantly is far more important.

Seeing newly added documents instantly (in search) is what
near-real-time readers are for.

Opening an NRT reader from a writer is far faster and less costly than
doing a commit + reopen.

> Committing often also gives extra security: if the system crashes,
> we are pretty sure we haven't lost anything, which matters because
> rebuilding the index can take days. We could, of course, reindex
> just the missing documents, but finding out exactly what is missing
> is not trivial.

OK.  Committing should only be used for this purpose (ensuring the
index is in a known state if the world comes crashing down).

Still, committing after every document is rather insane: performance
will be awful.  But since your app seems to be very low traffic, maybe
it's OK in your case ...

Mike McCandless

http://blog.mikemccandless.com



Re: Presence of uncommitted changes

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
On Fri, Jan 17, 2014 at 12:13 PM, Michael McCandless
> Backing up, what is your app doing, that it so strongly relies on
> knowing whether commit() would do anything?  Usually, commit is
> something you call rarely, for "safety" purposes to ensure if the
> world comes crashing down, you'll have a known state in the index on
> restart.

We use a quite conservative commit policy - we commit almost every
time a new document is added to the index (or updated/deleted) -
hence the need to know if commit() is necessary.

This might sound sub-optimal, but I think it is justifiable because in
our application the incoming data stream is not really intense: we
normally get just a handful of documents added in a minute. The
ability to see those newly added (updated, deleted) documents
instantly is far more important.

Committing often also gives extra security: if the system crashes,
we are pretty sure we haven't lost anything, which matters because
rebuilding the index can take days. We could, of course, reindex
just the missing documents, but finding out exactly what is missing
is not trivial.

m.



Re: Presence of uncommitted changes

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Jan 17, 2014 at 4:59 AM, Mindaugas Žakšauskas <mi...@gmail.com> wrote:
> Hi,
>
>>> 1  indexWriter.deleteDocuments(query); // same for terms arg
>>> 2  if (indexWriter.hasUncommittedChanges()) {
>>> 3     indexWriter.commit();
>>> 4  }
>>
>> hasUncommittedChanges will return true if you deleted (by Term or
>> Query), even if that Term or Query matches no documents.
>
> Mhm, this is surprising (in a negative sense). So is there any way to
> know if these deletions have actually affected any documents?

I suppose you could pull an NRT reader and do a search by that
Term/Query and see if the totalHits is non-zero?

But that is of course costly in general, and this cost is the reason
why IW can't tell you.  It only buffers up a bunch of deletions and
then later applies them in bulk.
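A minimal sketch of that check-first idea, written against a
hypothetical CountableIndex interface rather than the real Lucene
classes. Here countMatches stands in for "open an NRT reader, run the
Term/Query, read totalHits"; all names in the block are made up for
illustration, and the real IndexWriter calls throw IOException, which
the stand-in omits.

```java
/** Hypothetical stand-in; countMatches plays the role of an NRT search's totalHits. */
interface CountableIndex {
    int countMatches(String query);
    void deleteDocuments(String query);
    void commit();
}

final class DeleteIfPresent {
    private DeleteIfPresent() {}

    /**
     * Deletes and commits only when the query currently matches something.
     * Returns the number of matches seen before the delete was issued.
     */
    static int deleteAndCommitIfMatched(CountableIndex index, String query) {
        // The costly step: this is a real search, not a cheap flag check.
        int hits = index.countMatches(query);
        if (hits > 0) {
            index.deleteDocuments(query);
            index.commit();
        }
        return hits;
    }
}
```

Note that the count is only a snapshot: another thread can add or
delete matching documents between the count and the delete, so the
returned number is a hint rather than a guarantee.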

> How about indexWriter.hasDeletions()?

hasDeletions will also return true if you have just deleted by a Term
or Query that in fact matches no documents.

> Contracts of these methods are not
> documented well enough and I don't feel too confident to make
> independent decisions just by looking at the source code :-(

Good point; I'll improve the javadocs for hasUncommittedChanges and
hasDeletions.

Backing up, what is your app doing, that it so strongly relies on
knowing whether commit() would do anything?  Usually, commit is
something you call rarely, for "safety" purposes to ensure if the
world comes crashing down, you'll have a known state in the index on
restart.

Mike McCandless

http://blog.mikemccandless.com



Re: Presence of uncommitted changes

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.
Hi,

>> 1  indexWriter.deleteDocuments(query); // same for terms arg
>> 2  if (indexWriter.hasUncommittedChanges()) {
>> 3     indexWriter.commit();
>> 4  }
>
> hasUncommittedChanges will return true if you deleted (by Term or
> Query), even if that Term or Query matches no documents.

Mhm, this is surprising (in a negative sense). So is there any way to
know if these deletions have actually affected any documents? How
about indexWriter.hasDeletions()? Contracts of these methods are not
documented well enough and I don't feel too confident to make
independent decisions just by looking at the source code :-(

m.



Re: Presence of uncommitted changes

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Thu, Jan 16, 2014 at 6:30 AM, Mindaugas Žakšauskas <mi...@gmail.com> wrote:
> Hi,
>
> I was wondering what the best approach would be when some documents
> are deleted and it is unclear whether the deletions have left any
> changes pending a commit.
>
> In a single-threaded scenario this seems to be as simple as
>
> 1  indexWriter.deleteDocuments(query); // same for terms arg
> 2  if (indexWriter.hasUncommittedChanges()) {
> 3     indexWriter.commit();
> 4  }

hasUncommittedChanges will return true if you deleted (by Term or
Query), even if that Term or Query matches no documents.

> However, I am not sure this works as nicely in a multi-threaded
> environment, where a race condition can occur between lines
> 2 & 3.

Yeah.
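
One way to close that window, sketched under assumptions: funnel every
"check, then commit" through a single lock so no other thread using the
same helper can slip in between the two calls. ChangeTrackingWriter and
GuardedCommitter are hypothetical names, not Lucene classes; the
interface only mirrors the two IndexWriter methods discussed here and
leaves out the IOException that the real commit() declares.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the two IndexWriter calls used in this thread. */
interface ChangeTrackingWriter {
    boolean hasUncommittedChanges();
    void commit();
}

/** Serializes "check, then commit" among all threads that share this helper. */
final class GuardedCommitter {
    private final ReentrantLock commitLock = new ReentrantLock();
    private final ChangeTrackingWriter writer;

    GuardedCommitter(ChangeTrackingWriter writer) {
        this.writer = writer;
    }

    /** Returns true if this call issued a commit, false if there was nothing to do. */
    boolean commitIfNeeded() {
        commitLock.lock();
        try {
            if (!writer.hasUncommittedChanges()) {
                return false;
            }
            writer.commit();
            return true;
        } finally {
            commitLock.unlock();
        }
    }
}
```

The lock only covers callers that go through commitIfNeeded(). Threads
that add or delete documents directly can still change the writer
between the check and the commit, so this narrows the race between
committers rather than freezing the index.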

> How feasible would it be for .deleteDocuments() to return a number
> telling how many documents have actually been deleted?

This isn't really possible: IndexWriter just buffers up any pending
deleted Term/Query, because resolving those to the actual docIDs is so
costly (it requires opening a SegmentReader for each segment, then
looking up every deleted Term or running every deleted Query).  It only
resolves them once too much RAM is used by the buffered deletes, or
once commit is called, or when a merge needs to kick off.
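
To make the buffering behaviour concrete, here is a toy model (plain
Java, not Lucene code) of a writer that only records deletes and
resolves them in bulk at commit time. BufferingWriterModel and its
String document ids are assumptions made for illustration. Unlike this
model, the real commit() does not report a deletion count.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a writer that buffers deletes and resolves them lazily. */
final class BufferingWriterModel {
    private final Set<String> liveDocs = new HashSet<>();
    private final List<String> bufferedDeletes = new ArrayList<>();
    private boolean changedSinceCommit = false;

    void addDocument(String id) {
        liveDocs.add(id);
        changedSinceCommit = true;
    }

    /** Only records the delete; nothing is looked up, so no count is available here. */
    void deleteDocuments(String id) {
        bufferedDeletes.add(id);
        changedSinceCommit = true;
    }

    /** True after any buffered operation, even a delete that will match nothing. */
    boolean hasUncommittedChanges() {
        return changedSinceCommit;
    }

    /** Resolves buffered deletes in bulk; returns how many documents were removed. */
    int commit() {
        int removed = 0;
        for (String id : bufferedDeletes) {
            if (liveDocs.remove(id)) {
                removed++;
            }
        }
        bufferedDeletes.clear();
        changedSinceCommit = false;
        return removed;
    }
}
```

In this model, deleting an id that was never added still flips
hasUncommittedChanges() to true, which mirrors the behaviour described
above for a Term or Query that matches no documents.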

> Also, how cheap is .hasUncommittedChanges()? My quick code scan
> shows that no I/O is involved, but maybe somebody can confirm
> this?

It should be cheap ... just several volatile reads (e.g. AtomicInt/Long).

> And if it is cheap, why isn't this method invoked at the
> beginning of indexWriter.commit()?

Just history :)  That method was added waay after IW.commit was added.

But then maybe also paranoia/defensive: it would be bad if a bug in
hasUncommittedChanges meant that commit() failed to write stuff to
disk.  Better to isolate the bug in that case.

> Thank you.

You're welcome!

Mike McCandless

http://blog.mikemccandless.com
