You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Zak Mc Kracken <za...@yahoo.it.INVALID> on 2021/04/08 11:02:42 UTC
Jena 3.17, TDB performance and behaviour with read-only transaction
Hi all,
I've upgraded the Jena version that I use with this tool:
https://github.com/Rothamsted/rdf2pg
Now I'm seeing performance problems with the TDB used in read-only
transactions, as explained by the documentation:
https://github.com/Rothamsted/rdf2pg/blob/44f2bd16b27a6f13f447d1070f6abcea45f3d492/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L153
As you can see, the approach is: begin RO transaction, query, end
transaction, all done in parallel threads (8 to 32, depending on the
underlining system).
Using VisualVM, I see the threads running the code above often go in the
"monitor" state, ie, they wait for a Java synchronized object to be
freed up, most of the time they wait 1-3 seconds for that. While it's
hard to know where exactly this happens, I commented all actions around
and I left the above TDB reading only, and then they block each-other
more often.
Moreover, VisualVM also allows me to see that the threads spend a lot of
time with *org.apache.jena.dboe.transaction.txn.Transaction.end ()*,
drilling down the later, I can see that
*org.apache.jena.dboe.transaction.txn.journal.Journal.sync ()* is the
method consuming most of the time.
I don't understand this: all the operations are read-only, why are they
run into synchronized sections? Why does Jena spend so much time
synchronising the journal? Would abort() make any difference?
My intuition is that even RO operations must ensure that no writing
transaction has changed the TDB, but if that's the case, isn't there a
way to tell that I never write anything anywhere, and hence it shouldn't
waste time with the journal or anything that check for changes?
My rdf2pg tool writes into TDB only during a possible initial stage,
when RDF data are loaded from files, then the TDB is re-opened and a
long conversion stage is run that is entirely read-only. I guess this is
a pretty common behaviour and maybe I'm doing something wrong.
Furthermore, It used to be much faster with past Jena versions (with the
same code): https://github.com/Rothamsted/graphdb-benchmarks#test-results
Thanks in advance for any help,
Marco.
Re: Jena 3.17, TDB performance and behaviour with read-only
transaction
Posted by Andy Seaborne <an...@apache.org>.
Marco,
I've reproduced it on my system.
Recorded as:
https://issues.apache.org/jira/browse/JENA-2086
Not sure what is going on but it does not look right (it has been awhile
since I looked at that code).
Thanks
Andy
On 08/04/2021 23:44, Zak Mc Kracken wrote:
> Hi Andy,
>
> thank you for your quick reply.
>
> On 08/04/2021 18:59, Andy Seaborne wrote:
>>
>>>
>>> Now I'm seeing performance problems with the TDB used in read-only
>>> transactions, as explained by the documentation:
>>>
>>> https://github.com/Rothamsted/rdf2pg/blob/44f2bd16b27a6f13f447d1070f6abcea45f3d492/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L153
>>
>>
>>
>> Is there an outer transaction?
>
> Not in the case at issue. I wrote it that way cause it's a generic
> utility, which I happened to use in nested transactions.
>
>>
>> (BTW there is "Txn.executeRead" to do this "if in transaction" pattern.)
>>
> Thanks, I've rewritten everything with this approach:
>
> https://github.com/Rothamsted/rdf2pg/blob/d240d22a4f237297ae931aaccf8a4db10e3d19c3/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L155
>
>
> Now the code is more readable, but it didn't become much faster. And
> Journal.sync() is still taking a lot of time. Inside this, the method
> that consumes the consumes most of time is
> sun.nio.ch.FileChannelImpl.force ().
>
>>> As you can see, the approach is: begin RO transaction, query, end
>>> transaction, all done in parallel threads (8 to 32, depending on the
>>> underlining system).
>>>
>>> Using VisualVM, I see the threads running the code above often go in
>>> the "monitor" state, ie, they wait for a Java synchronized object to be
>>
>> Did you happen to notice on which object they are "synchronized" on?
>>
> I can't find a profile that is able to show me that, however, by
> exclusion, I see synchronized sections are met by the Journal only,
> namely, by FileChannelImpl.
>
>>> freed up, most of the time they wait 1-3 seconds for that. While it's
>>> hard to know where exactly this happens, I commented all actions
>>> around and I left the above TDB reading only, and then they block
>>> each-other more often.
>>>
>>> Moreover, VisualVM also allows me to see that the threads spend a lot
>>> of time with
>>>> *org.apache.jena.dboe.transaction.txn.Transaction.end ()*,
>>> drilling down the later, I can see that
>>> *org.apache.jena.dboe.transaction.txn.journal.Journal.sync ()* is the
>>> method consuming most of the time.
>>
>> That does not sound right.
>>
>> Do you have a call trace for this?
>
> Please, see the attachment, hope it helps.
>
>> In TDB2, a RO operation shoudl eb purely read-only. The writer pays
>> for everything (unlike TDB1).
>
> I can't get why the sync() above takes so much time. I'm using TDB2
>
>>
>>>
>>> Furthermore, It used to be much faster with past Jena versions (with
>>> the same code):
>>> https://github.com/Rothamsted/graphdb-benchmarks#test-results
>>
>> The version information isn't jumping out of that page.
>>
>> Which version of Jena?
>> (and was it TDB1 or TDB2 at the time?)
>>
> Sorry. Currently I'm using TDB2 with Jena 3.17. Recently, I've upgraded
> from 3.14 and before this, I remember it was working with good performance.
> That benchmark I linked was done in 2018 with Jena 3.9.0 (I've
> reconstructed that from the git commits, not 100% sure, but, let's say
> 95%).
>
> Thanks again,
> Marco.
>
>
Re: Jena 3.17, TDB performance and behaviour with read-only
transaction
Posted by Zak Mc Kracken <za...@yahoo.it.INVALID>.
Hi Andy,
thank you for your quick reply.
On 08/04/2021 18:59, Andy Seaborne wrote:
>
>>
>> Now I'm seeing performance problems with the TDB used in read-only
>> transactions, as explained by the documentation:
>>
>> https://github.com/Rothamsted/rdf2pg/blob/44f2bd16b27a6f13f447d1070f6abcea45f3d492/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L153
>
>
> Is there an outer transaction?
Not in the case at issue. I wrote it that way cause it's a generic
utility, which I happened to use in nested transactions.
>
> (BTW there is "Txn.executeRead" to do this "if in transaction" pattern.)
>
Thanks, I've rewritten everything with this approach:
https://github.com/Rothamsted/rdf2pg/blob/d240d22a4f237297ae931aaccf8a4db10e3d19c3/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L155
Now the code is more readable, but it didn't become much faster. And
Journal.sync() is still taking a lot of time. Inside this, the method
that consumes the consumes most of time is
sun.nio.ch.FileChannelImpl.force ().
>> As you can see, the approach is: begin RO transaction, query, end
>> transaction, all done in parallel threads (8 to 32, depending on the
>> underlining system).
>>
>> Using VisualVM, I see the threads running the code above often go in
>> the "monitor" state, ie, they wait for a Java synchronized object to be
>
> Did you happen to notice on which object they are "synchronized" on?
>
I can't find a profile that is able to show me that, however, by
exclusion, I see synchronized sections are met by the Journal only,
namely, by FileChannelImpl.
>> freed up, most of the time they wait 1-3 seconds for that. While it's
>> hard to know where exactly this happens, I commented all actions
>> around and I left the above TDB reading only, and then they block
>> each-other more often.
>>
>> Moreover, VisualVM also allows me to see that the threads spend a lot
>> of time with
>>> *org.apache.jena.dboe.transaction.txn.Transaction.end ()*,
>> drilling down the later, I can see that
>> *org.apache.jena.dboe.transaction.txn.journal.Journal.sync ()* is the
>> method consuming most of the time.
>
> That does not sound right.
>
> Do you have a call trace for this?
Please, see the attachment, hope it helps.
> In TDB2, a RO operation shoudl eb purely read-only. The writer pays
> for everything (unlike TDB1).
I can't get why the sync() above takes so much time. I'm using TDB2
>
>>
>> Furthermore, It used to be much faster with past Jena versions (with
>> the same code):
>> https://github.com/Rothamsted/graphdb-benchmarks#test-results
>
> The version information isn't jumping out of that page.
>
> Which version of Jena?
> (and was it TDB1 or TDB2 at the time?)
>
Sorry. Currently I'm using TDB2 with Jena 3.17. Recently, I've upgraded
from 3.14 and before this, I remember it was working with good performance.
That benchmark I linked was done in 2018 with Jena 3.9.0 (I've
reconstructed that from the git commits, not 100% sure, but, let's say 95%).
Thanks again,
Marco.
Re: Jena 3.17, TDB performance and behaviour with read-only
transaction
Posted by Andy Seaborne <an...@apache.org>.
Hi Marco,
Thanks for the report.
Before I dive into the code - a couple of questions:
On 08/04/2021 12:02, Zak Mc Kracken wrote:
> Hi all,
>
> I've upgraded the Jena version that I use with this tool:
>
> https://github.com/Rothamsted/rdf2pg
>
> Now I'm seeing performance problems with the TDB used in read-only
> transactions, as explained by the documentation:
>
> https://github.com/Rothamsted/rdf2pg/blob/44f2bd16b27a6f13f447d1070f6abcea45f3d492/rdf2pg-core/src/main/java/uk/ac/rothamsted/kg/rdf2pg/pgmaker/support/rdf/RdfDataManager.java#L153
Is there an outer transaction?
(BTW there is "Txn.executeRead" to do this "if in transaction" pattern.)
> As you can see, the approach is: begin RO transaction, query, end
> transaction, all done in parallel threads (8 to 32, depending on the
> underlining system).
>
> Using VisualVM, I see the threads running the code above often go in the
> "monitor" state, ie, they wait for a Java synchronized object to be
Did you happen to notice on which object they are "synchronized" on?
> freed up, most of the time they wait 1-3 seconds for that. While it's
> hard to know where exactly this happens, I commented all actions around
> and I left the above TDB reading only, and then they block each-other
> more often.
>
> Moreover, VisualVM also allows me to see that the threads spend a lot of
> time with
>> *org.apache.jena.dboe.transaction.txn.Transaction.end ()*,
> drilling down the later, I can see that
> *org.apache.jena.dboe.transaction.txn.journal.Journal.sync ()*
> is the method consuming most of the time.
That does not sound right.
Do you have a call trace for this?
>
> I don't understand this: all the operations are read-only, why are they
> run into synchronized sections? Why does Jena spend so much time
> synchronising the journal? Would abort() make any difference?
>
> My intuition is that even RO operations must ensure that no writing
> transaction has changed the TDB,
In TDB2, a RO operation shoudl eb purely read-only. The writer pays for
everything (unlike TDB1).
> but if that's the case, isn't there a
> way to tell that I never write anything anywhere, and hence it shouldn't
> waste time with the journal or anything that check for changes?
>
> My rdf2pg tool writes into TDB only during a possible initial stage,
> when RDF data are loaded from files, then the TDB is re-opened and a
> long conversion stage is run that is entirely read-only. I guess this is
> a pretty common behaviour and maybe I'm doing something wrong.
>
> Furthermore, It used to be much faster with past Jena versions (with the
> same code): https://github.com/Rothamsted/graphdb-benchmarks#test-results
The version information isn't jumping out of that page.
Which version of Jena?
(and was it TDB1 or TDB2 at the time?)
Andy
>
> Thanks in advance for any help,
> Marco.
>
>
>
>