Posted to dev@lucene.apache.org by Adam Retter <ad...@googlemail.com> on 2014/12/16 14:54:22 UTC

Transactions multiplexing IndexWriter

Hey devs,

I am looking at making use of the two-phase commit approach available
in IndexWriter but the current architecture there does not quite fit
with what we want to achieve. It seems that I could build this atop
IndexWriter, however I wonder if there is either an existing
alternative that I have not discovered or whether it would be better
to contribute a patch to the Lucene project itself?

In my system I have many concurrent transactions and each of them
needs to make modifications to a single Lucene index in an atomic and
consistent manner.
If I had a single thread (i.e. only one transaction at a time), for
example, I could access the IndexWriter instance, call addDocument or
whatever as many times as I like, call prepareCommit and then
commit.

The main issue that I have is that if I let all concurrent
transactions use the same IndexWriter then I lose isolation, as a
commit of one transaction may write the partial pending updates of
another transaction.

Now I can see a naive solution for my application: I could add
all updates that I want to make to the index to a `pending list`,
then take an exclusive lock for the index writer, apply my
pending list to the index writer and commit, finally releasing
the exclusive lock. Whilst I could get this working, the downside is
that I have to implement and manage this `pending list`, and apply
it to the index, myself, and it comes at the cost of memory (or even
paging it to disk).
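
A minimal sketch of the pending-list approach described above, in Java.
To keep it runnable without Lucene on the classpath, `Index` and
`InMemoryIndex` are hypothetical stand-ins for IndexWriter, documents are
plain strings, and the names `PendingList` and `commitTo` are invented
for illustration; none of them are Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class PendingListSketch {
    /** Hypothetical stand-in for the parts of IndexWriter used here. */
    interface Index {
        void addDocument(String doc);
        void commit();
    }

    /** In-memory stand-in so that the sketch runs without Lucene. */
    static class InMemoryIndex implements Index {
        final List<String> uncommitted = new ArrayList<>();
        final List<String> committed = new ArrayList<>();

        public void addDocument(String doc) { uncommitted.add(doc); }

        public void commit() {
            committed.addAll(uncommitted);
            uncommitted.clear();
        }
    }

    /** One transaction's private buffer of updates (assumed name). */
    static class PendingList {
        private final List<String> docs = new ArrayList<>();

        void add(String doc) { docs.add(doc); }

        /** Applies the buffered updates and commits while holding the writer lock. */
        void commitTo(Index index, ReentrantLock writerLock) {
            writerLock.lock();
            try {
                for (String doc : docs) {
                    index.addDocument(doc);
                }
                index.commit();
                docs.clear();
            } finally {
                writerLock.unlock();
            }
        }
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        ReentrantLock writerLock = new ReentrantLock();

        // Two interleaved transactions, each buffering privately.
        PendingList txnA = new PendingList();
        PendingList txnB = new PendingList();
        txnA.add("a1");
        txnB.add("b1");
        txnA.add("a2");

        txnA.commitTo(index, writerLock);
        System.out.println(index.committed); // [a1, a2]
        txnB.commitTo(index, writerLock);
        System.out.println(index.committed); // [a1, a2, b1]
    }
}
```

Because nothing reaches the shared index until `commitTo` holds the lock,
a commit by one transaction cannot flush another transaction's partial
updates. The price is the one already noted: every buffer stays in memory
until its transaction commits.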

It seems likely to me that others before me have also had such a
requirement. Does anything like this already exist in Lucene, or would
it be desirable for me to contribute something? At a rough guess I
would imagine separating the IndexWriter from transaction content,
something like:

try (Transaction txn = indexWriter.beginTransaction()) {
    indexWriter.addDocument(txn, doc1);
    indexWriter.addDocument(txn, doc2);
    indexWriter.addDocument(txn, doc3);

    indexWriter.commit(txn);
}

The transaction could be automatically rolled-back in the close()
method called by try-with-resources if it has not been committed,
which would allow any exceptions to be cleanly handled by the caller.
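
The rollback-on-close behaviour can be sketched with AutoCloseable. This
only illustrates the proposed semantics: the `Transaction` class and the
in-memory `List<String>` standing in for the index are hypothetical, and
IndexWriter has no such API.

```java
import java.util.ArrayList;
import java.util.List;

class TxnSketch {
    /** Hypothetical transaction handle that buffers documents until commit. */
    static class Transaction implements AutoCloseable {
        private final List<String> buffered = new ArrayList<>();
        private final List<String> target;
        private boolean committed = false;

        Transaction(List<String> target) { this.target = target; }

        void addDocument(String doc) { buffered.add(doc); }

        /** Publishes all buffered documents to the shared target in one step. */
        void commit() {
            synchronized (target) {
                target.addAll(buffered);
            }
            buffered.clear();
            committed = true;
        }

        /** Called by try-with-resources; discards the buffer if commit() never ran. */
        @Override
        public void close() {
            if (!committed) {
                buffered.clear(); // roll back
            }
        }
    }

    public static void main(String[] args) {
        List<String> index = new ArrayList<>();

        try (Transaction txn = new Transaction(index)) {
            txn.addDocument("doc1");
            txn.addDocument("doc2");
            txn.commit();
        }
        System.out.println(index); // [doc1, doc2]

        try (Transaction txn = new Transaction(index)) {
            txn.addDocument("doc3");
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // close() has already run here and discarded doc3
        }
        System.out.println(index); // [doc1, doc2]
    }
}
```

Declaring close() without a throws clause means the caller is not forced
to catch anything just to use try-with-resources.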

Does that make any sense, or am I way off?

Cheers Adam.

-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Transactions multiplexing IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Thu, Jul 23, 2015 at 8:39 AM, Adam Retter <ad...@googlemail.com> wrote:

> Sorry for the very delayed reply. I had to put this on hold whilst I
> focused on another project. I am now revisiting this.

No problem, the internet never forgets!

> Would I be correct in thinking that IndexWriter.addIndexes only allows
> me to add entries to an existing index, i.e. if I wanted to remove
> entries this would not work for me?

That's correct.

> Also, out of interest, what happens
> for updates, where there are entries for the same key in the core
> index and in the new index for which I call addIndexes?

Then you would have 2 docs for a given key, i.e. the old one is not replaced.

It's logically as if you called IW.addDocument for all documents in
the index passed to IW.addIndexes.
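
The append-only behaviour described above can be modelled in a few lines
of Java. This is a toy model, not Lucene code: `ToyIndex`, `Doc`,
`deleteByKey` and `count` are invented names, and `addIndexes` here only
mimics the "as if you called addDocument for every document" semantics.
The delete-first step in the second half is one possible way to get
replace semantics and is an assumption, not something proposed in this
thread; with a real IndexWriter the rough analogue would be
deleteDocuments(Term...) on the key field before calling addIndexes.

```java
import java.util.ArrayList;
import java.util.List;

class AddIndexesModel {
    /** A document reduced to a key and a value; stand-in for a Lucene Document. */
    static final class Doc {
        final String key;
        final String value;

        Doc(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Toy index: a list of documents with append and delete-by-key. */
    static class ToyIndex {
        final List<Doc> docs = new ArrayList<>();

        void addDocument(Doc d) { docs.add(d); }

        /** Models addIndexes: every document is appended, nothing is replaced. */
        void addIndexes(ToyIndex other) {
            for (Doc d : other.docs) {
                addDocument(d);
            }
        }

        void deleteByKey(String key) { docs.removeIf(d -> d.key.equals(key)); }

        long count(String key) {
            return docs.stream().filter(d -> d.key.equals(key)).count();
        }
    }

    public static void main(String[] args) {
        ToyIndex incoming = new ToyIndex();
        incoming.addDocument(new Doc("k1", "new"));

        // Plain addIndexes: the old and new documents for k1 both survive.
        ToyIndex core = new ToyIndex();
        core.addDocument(new Doc("k1", "old"));
        core.addIndexes(incoming);
        System.out.println(core.count("k1")); // 2

        // Delete matching keys first, then fold in: only the new document remains.
        ToyIndex core2 = new ToyIndex();
        core2.addDocument(new Doc("k1", "old"));
        for (Doc d : incoming.docs) {
            core2.deleteByKey(d.key);
        }
        core2.addIndexes(incoming);
        System.out.println(core2.count("k1")); // 1
    }
}
```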

Mike McCandless

http://blog.mikemccandless.com

Re: Transactions multiplexing IndexWriter

Posted by Adam Retter <ad...@googlemail.com>.

Thanks Michael,

Sorry for the very delayed reply. I had to put this on hold whilst I
focused on another project. I am now revisiting this.

Would I be correct in thinking that IndexWriter.addIndexes only allows
me to add entries to an existing index, i.e. if I wanted to remove
entries this would not work for me? Also, out of interest, what happens
for updates, where there are entries for the same key in the core
index and in the new index for which I call addIndexes?

On 16 December 2014 at 19:06, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Lucene's IndexWriter only allows one transaction at a time.  Fixing
> this would be challenging I think.
>
> One workaround might be to let your separate transactions write into
> private directories, and then when complete, use
> IndexWriter.addIndexes (on the main writer) to fold those changes in.
> That part would still be single-transaction, but the addIndexes call
> should be faster than doing N separate indexing ops.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

-- 
Adam Retter

skype: adam.retter
tweet: adamretter
http://www.adamretter.org.uk

Re: Transactions multiplexing IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.

Lucene's IndexWriter only allows one transaction at a time.  Fixing
this would be challenging I think.

One workaround might be to let your separate transactions write into
private directories, and then when complete, use
IndexWriter.addIndexes (on the main writer) to fold those changes in.
That part would still be single-transaction, but the addIndexes call
should be faster than doing N separate indexing ops.

Mike McCandless

http://blog.mikemccandless.com


