Posted to users@jena.apache.org by Chris Dollin <ch...@epimorphics.com> on 2016/01/08 16:34:35 UTC

transactions and docProducers

Dear All

(Not sure if this is really an @dev or @users question)

When Fuseki handles a query (or update), is that query
(or update) handled by a single thread or might it
be handled by multiple threads over the lifetime of
the query (or update)?

I ask because

* we have a TextDocProducer implementation called
   TextDocProducerBatch. It (hence) follows the
   DatasetChanges interface, tracking adds and
   removes and updating a Lucene index.

* The "Batch" part is because it accumulates
   quads with the same subject and, when the subject
   changes, makes a single Entity for the subject
   rather than entities for each quad.

* The accumulating quads are held in a data structure.

* It's possible that read queries are running in
   parallel with updates. The read queries also
   go through the TextDocProducerBatch. To prevent
   the read query performing operations on the update
   state [1] we're holding the state as a thread-local
   variable.

* This is only sound if all the TextDocProducer(Batch)
   operations for a given query (or update) are handled
   by a single thread. Which seems plausible but I
   can't point to anything that actually says so.

* So: is it the case?

* An alternative I considered was, given that there can
   be at most one concurrent write transaction, to only
   perform the batch-and-update-index operations when
   inside a write transaction. However, starting from
   a TextDocProducerBatch, which is initialised with just
   a TextIndex[Lucene] and a DatasetGraph[Transaction],
   there doesn't seem to be any way to find out what the
   current transaction is; you can find out that you are
   (or are not) *in* a transaction but not whether it's a
   read or write [2].

* Have I missed something?

Chris

[1] An actual problem that happened

[2] Yes, we could have a divergent version of Jena with
     patches to access the transaction, but then we end
     up using SNAPSHOT versions of Jena and gnashing teeth.
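
A minimal sketch of the thread-local batching described in the bullets above, in plain Java. It is not the real TextDocProducerBatch and does not use Jena or Lucene: the class BatchSketch, its add/finish/indexed methods, the String-based "quads" and the in-memory list standing in for the index are all hypothetical names chosen for illustration. It only shows the idea of accumulating values per subject in per-thread state and emitting one entry when the subject changes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a batching doc producer; not Jena's API. */
final class BatchSketch {

    /** Per-thread accumulation state: the current subject and its pending values. */
    private static final class State {
        String subject;
        final List<String> pending = new ArrayList<>();
    }

    // Each thread sees its own State, so a reader thread cannot disturb
    // the batch being built by the writer thread.
    private final ThreadLocal<State> state = ThreadLocal.withInitial(State::new);

    // Stand-in for the text index: one entry per flushed subject.
    private final List<String> index = new ArrayList<>();

    /** Record one value for a subject; flush the batch if the subject changed. */
    void add(String subject, String value) {
        State s = state.get();
        if (s.subject != null && !s.subject.equals(subject)) {
            flush(s);
        }
        s.subject = subject;
        s.pending.add(value);
    }

    /** Flush whatever is still pending for the calling thread. */
    void finish() {
        State s = state.get();
        if (s.subject != null) {
            flush(s);
        }
    }

    /** Emit a single entry for the whole batch, then reset the batch. */
    private void flush(State s) {
        synchronized (index) {
            index.add(s.subject + "=" + s.pending);
        }
        s.pending.clear();
        s.subject = null;
    }

    /** Snapshot of everything flushed so far. */
    List<String> indexed() {
        synchronized (index) {
            return new ArrayList<>(index);
        }
    }
}
```

The sketch is only sound under the same assumption the message asks about: every add and finish call for one update arrives on the same thread. A real implementation would also want to clear the per-thread state when the transaction ends.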


Re: transactions and docProducers

Posted by Chris Dollin <ch...@epimorphics.com>.
Thanks to Andy for his reply.

Chris "over-eager on the DELETE button" Dollin

Re: transactions and docProducers

Posted by Andy Seaborne <an...@apache.org>.
On 08/01/16 15:34, Chris Dollin wrote:
> Dear All
>
> (Not sure if this is really an @dev or @users question)
>
> When Fuseki handles a query (or update), is that query
> (or update) handled by a single thread or might it
> be handled by multiple threads over the lifetime of
> the query (or update)?

Single thread per Fuseki request.

What you seem to be relying on is that the update changes are all
handled by a single thread per transaction, which is true, although for
any part that will touch the text index, query and update are both
single-threaded.

From experience, just remember to remove the thread local (as well as
nulling it out) each transaction, otherwise there is memory growth.  It's
not bad in Fuseki, where threads come from a Jetty-managed pool, but the
pool does not seem to guarantee that it reuses only a fixed number of
threads, or that it isn't deleting and creating new threads, especially
under load.  That makes the number of ThreadLocals grow.

	Andy

[1]
You are using TDB for the triplestore.
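
The advice about removing the thread local can be sketched roughly as follows, in plain Java with no Jena dependency. The class TxnScopedState and its start/change/finish methods are hypothetical; they loosely mirror a start/change/finish lifecycle and are an assumption for illustration, not the actual docProducer code. The point is only that finish() calls ThreadLocal.remove() in a finally block, so a pooled thread does not keep the batch alive after the transaction ends.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-transaction state holder; names are illustrative only. */
final class TxnScopedState {

    // No initial value: a thread has state only between start() and finish().
    private final ThreadLocal<List<String>> pending = new ThreadLocal<>();

    /** Begin a transaction on the calling thread. */
    void start() {
        pending.set(new ArrayList<>());
    }

    /** Record one change; only valid between start() and finish(). */
    void change(String quad) {
        List<String> p = pending.get();
        if (p == null) {
            throw new IllegalStateException("change() called outside start()/finish()");
        }
        p.add(quad);
    }

    /** End the transaction: report how many changes were seen, then drop the state. */
    int finish() {
        try {
            List<String> p = pending.get();
            return p == null ? 0 : p.size();
        } finally {
            // remove() drops this thread's entry entirely, rather than
            // leaving a nulled-out value attached to a pooled thread.
            pending.remove();
        }
    }

    /** True if the calling thread currently holds transaction state (for checks only). */
    boolean hasPending() {
        return pending.get() != null;
    }
}
```

Putting the remove() in a finally block means the entry is cleared even if the work done in finish() throws.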
