Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2012/01/08 20:27:05 UTC

TDB: release process

The release of core/ARQ etc. hasn't led to any immediate disasters (but 
there is still time!) so we can move on to TDB.

As far as I'm concerned, the code in the current snapshot and in SVN is 
release candidate code (JENA-102 is fixed) and if people don't test it 
(I've pinged jena-users@), then they risk it taking longer to get a 
released version with fixes.

I need to write the transaction API documentation, and there is something 
odd in the prefix handling, but as far as I can see it's been odd for 
some time, maybe all the time; it needs reworking, not fixing, so it 
shouldn't block a release.

     Andy

PS Fuseki snapshot is using TDB transactions now.
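
For reference, the transaction API mentioned above is used roughly like
this (a minimal sketch against the 0.9.x-era package names; "DB" and
"data.ttl" are assumed example names):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TxnSketch {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("DB");

        dataset.begin(ReadWrite.WRITE);          // start a write transaction
        try {
            Model model = dataset.getDefaultModel();
            model.read("file:data.ttl");         // any updates go here
            dataset.commit();                    // make the changes durable
        } finally {
            dataset.end();                       // always close the transaction
        }

        dataset.begin(ReadWrite.READ);           // readers are not blocked by writers
        try {
            System.out.println(dataset.getDefaultModel().size());
        } finally {
            dataset.end();
        }
    }
}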

Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Andy

Andy Seaborne wrote:
> On 02/02/12 08:34, Paolo Castagna wrote:
>> Hi Andy
>>
>> Andy Seaborne wrote:
>>>> Do you have a plan for LARQ?
>>>
>>> No plan whatsoever.  I am at the limit of the number of things I can
>>> manage.  I was hoping you would deal with LARQ.
>>
>> Ack. I know you are busy.
> 
> It is not about me being busy.  I should not be the only do doing releases.
> 
> Getting the first release of Fuseki out, Apache licensed, is important
> as it establishes the codebase as clean.
> 
> Referring to snapshots and the version confusion has not been helping on
> jena-users@
> 
>>>> In relation to Fuseki, JENA-63 is still open/pending:
>>>> https://issues.apache.org/jira/browse/JENA-63
>>>> But, if Fuseki is released first... LARQ cannot be included in it
>>>> and JENA-63 can only be closed with the next Fuseki release.
>>>
>>> Do you want the release of Fuseki held up?
> 
> I have staged TDB and pulled the Fuseki release but that's because there
> is an issue with the zip build of Fuseki.

Ack.

> Talis want TDB released surely?

I let Talis answer that. ;-)

As an individual, I often use TDB/Fuseki as well as LARQ, and I want to help
with the release process.

> 
>> Well, my colleagues who have used Fuseki to manage/explore their
>> data on their machines, found free-text "searching" very useful
>> (and I needed to tell them to always patch Fuseki to add LARQ to it).
>>
>> Patch is currently tiny and just a dependency on the Fuseki's pom.xml:
> 
> Arguments based on "colleagues" don't work for me.
> 
> If Talis/Kasabi is managing to usefully use it for their business then
> great but it's all a black box to me, except for support time when you
> direct people to jena-users@.
> 
> It is a few lines of maven for LARQ to depend on Fuseki and build it's
> own extended jar.
> 
> This unlinks the release dependency.
> 
> It allows LARQ to release to a different cycle to Fuseki. We can't
> release Fuseki just for bug and enhancements in LARQ.
> 
> I hope that updateable indexes happen - but making Fuseki have to
> release for such a new feature seems a bad way to do it.

So, one way forward is:

 - we close JENA-63 as "Won't Fix" or "Not a Problem"
 - we park JENA-164 until there is clear demand for it
 - people who want to have Fuseki with LARQ included can package it themselves

No dependency links between Fuseki and LARQ from a release point of view.

Shall I do that?

I still want to have LARQ released; I'll do it myself and/or help with it.

> See also JENA-190.
> 
> .. patch to Fuseki POM only ..
> 
>> We also had people asking for LARQ (and Fuseki) on the mailing list
>> since we moved to Apache.
> 
> We have more people asking about protocol and Fuseki.  Getting a Apache
> Fuseki out is important to me.

Agree.

>>
>> The best scenario for me would be:
>>
>>   1. TDB is released first
>>   2. LARQ and Fuseki are released soon after (with JENA-63 fixed/closed)
>>
>>> LARQ does not work SPARQL Update or with SPARQL Graph Store protocol.
>>
>> Yes, we discussed about this already.
>>
>> This is a known (and important) limitation. There is an open issue on
>> this: https://issues.apache.org/jira/browse/JENA-164
>> It isn't a blocker to me my mind.
> 
> JENA-164 blocks JENA-63 (your entry in JIRA on 11/Nov/11).
> 
>> I'll document the known limitation.
>>
>> Re: JENA-164, you know the "update" route into Fuseki much better than
>> I do, if you could just add a small comment (i.e. one or two paragraphs)
> 
> The JIRA points to an email thread from Oct 2011 that deals with the
> bulkloader problem.
> 
> And I've suggested LARQ could create a DatasetGraph and catch every
> add(quad)/delete(quad).
> 
> A LARQ assember could simply name the dataset description it wraps.
> Assemble the LARQ assember assembles the inner dataset.  Fuseki service
> points to LARQ.
> 
> Seems quite practical to try to me.

I'll try that.

> But I'm not going to do it.

Sure.

>> on how this could be done... without big changes on the
>> event/notification
>> system, I'll do it. I have time tomorrow and probably next week.
>> The problem I have with JENA-164 is that I do not see how I can
>> "intercept"
>> changes without changing Fuseki code.
>>
>>> We see on jena-users@ that people are using Fuseki via the update
>>> protocols.
>>
>> Yep.
>>
>> All the people I saw using Fuseki (and LARQ) at work they were loading
>> stuff with tdbloader(2) and then "exploring" the data (i.e. a read-only
>> scenario).
>>
>> RDF publishing systems are often used in a mostly read
>> scenarios, with little/few updates. In this case, rebuilding the Lucene
>> index nightly would be a reasonable workaround, until JENA-164 is fixed.
>>
>>> Could LARQ be released separately as a bolt-on to Fuseki, with
>>> instructions on how to build and maintain the index?  I presume you want
>>> to say its for read-only publishing at the moment.
>>
>> Yeah.
>>
>> I am not sure what you exactly mean "as a bolt-on to Fuseki".
>>
>> My colleagues love the fact that Fuseki is just a single jar file (with
>> all the dependencies). LARQ is an extension which can simply added to
>> the classpath (together with Lucene) (i.e. two jars).
>>
>> People wanting to use LARQ with Fuseki will need to repackage Fuseki
>> if they want the single jar file with LARQ in it.
>>
>> A similar scenario will arise for GeoSPARQL (i.e. another cool SPARQL
>> extension I/we would love to see/have and use in Fuseki).
>> I can see how this can become a problem.
> 
> Well, there is no activity on GeoSPARQL so the whole issue there is moot
> for this release cycle.

Sure.

Paolo

> 
>     Andy
> 
>> On the other hand, Fuseki is so easy to checkout/build/package that
>> even if LARQ isn't included in it... people can package it themselves
>> or third parties could distribute a pre-packaged version with all the
>> cool extensions in it (not my preferred options for various reasons).
>>
>>> I'll hold things up for a day while we discuss this.
>>
>> Thanks.
>>
>> Paolo
>>
>>>
>>>      Andy
>>>
>>>>
>>>> Paolo
>>>>
>>>>>
>>>>>       Andy
>>>>
>>>
> 

Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 02/02/12 08:34, Paolo Castagna wrote:
> Hi Andy
>
> Andy Seaborne wrote:
>>> Do you have a plan for LARQ?
>>
>> No plan whatsoever.  I am at the limit of the number of things I can
>> manage.  I was hoping you would deal with LARQ.
>
> Ack. I know you are busy.

It is not about me being busy.  I should not be the only one doing 
releases.

Getting the first release of Fuseki out, Apache licensed, is important 
as it establishes the codebase as clean.

Referring to snapshots, and the version confusion, has not been helping on 
jena-users@.

>>> In relation to Fuseki, JENA-63 is still open/pending:
>>> https://issues.apache.org/jira/browse/JENA-63
>>> But, if Fuseki is released first... LARQ cannot be included in it
>>> and JENA-63 can only be closed with the next Fuseki release.
>>
>> Do you want the release of Fuseki held up?

I have staged TDB and pulled the Fuseki release but that's because there 
is an issue with the zip build of Fuseki.

Talis want TDB released surely?

> Well, my colleagues who have used Fuseki to manage/explore their
> data on their machines, found free-text "searching" very useful
> (and I needed to tell them to always patch Fuseki to add LARQ to it).
>
> Patch is currently tiny and just a dependency on the Fuseki's pom.xml:

Arguments based on "colleagues" don't work for me.

If Talis/Kasabi is managing to usefully use it for their business then 
great but it's all a black box to me, except for support time when you 
direct people to jena-users@.

It is a few lines of Maven for LARQ to depend on Fuseki and build its 
own extended jar.

This unlinks the release dependency.

It allows LARQ to release on a different cycle from Fuseki. We can't 
release Fuseki just for bug fixes and enhancements in LARQ.

I hope that updateable indexes happen - but making Fuseki have to 
release for such a new feature seems a bad way to do it.

See also JENA-190.

.. patch to Fuseki POM only ..

> We also had people asking for LARQ (and Fuseki) on the mailing list
> since we moved to Apache.

We have more people asking about the protocol and Fuseki.  Getting an Apache 
Fuseki out is important to me.

 >
> The best scenario for me would be:
>
>   1. TDB is released first
>   2. LARQ and Fuseki are released soon after (with JENA-63 fixed/closed)
>
>> LARQ does not work SPARQL Update or with SPARQL Graph Store protocol.
>
> Yes, we discussed about this already.
 >
> This is a known (and important) limitation. There is an open issue on
> this: https://issues.apache.org/jira/browse/JENA-164
> It isn't a blocker to me my mind.

JENA-164 blocks JENA-63 (your entry in JIRA on 11/Nov/11).

> I'll document the known limitation.
>
> Re: JENA-164, you know the "update" route into Fuseki much better than
> I do, if you could just add a small comment (i.e. one or two paragraphs)

The JIRA points to an email thread from Oct 2011 that deals with the 
bulkloader problem.

And I've suggested LARQ could create a DatasetGraph and catch every 
add(quad)/delete(quad).

A LARQ assembler could simply name the dataset description it wraps. 
Assembling the LARQ assembler assembles the inner dataset.  The Fuseki 
service points to LARQ.

Seems quite practical to try to me.

But I'm not going to do it.
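
For concreteness, a minimal sketch of such an intercepting wrapper,
assuming a delegating base class like ARQ's DatasetGraphWrapper (or an
equivalent hand-written delegator); TextIndexer is a hypothetical
stand-in for the LARQ-side index maintenance code, not an existing class:

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.core.DatasetGraph;
import com.hp.hpl.jena.sparql.core.DatasetGraphWrapper;
import com.hp.hpl.jena.sparql.core.Quad;

public class DatasetGraphLARQ extends DatasetGraphWrapper {

    /** Hypothetical stand-in for LARQ-side index maintenance. */
    public interface TextIndexer {
        void index(Node literal);
        void unindex(Node literal);
    }

    private final TextIndexer indexer;

    public DatasetGraphLARQ(DatasetGraph base, TextIndexer indexer) {
        super(base);
        this.indexer = indexer;
    }

    @Override public void add(Quad quad) {
        super.add(quad);                          // store the quad as usual
        if (quad.getObject().isLiteral())
            indexer.index(quad.getObject());      // keep the Lucene index in step
    }

    @Override public void delete(Quad quad) {
        super.delete(quad);
        if (quad.getObject().isLiteral())
            indexer.unindex(quad.getObject());    // drop the literal from the index
    }
}

Wrapped this way, updates arriving via SPARQL Update or the Graph Store
protocol would pass through add/delete and reach the index without
changing Fuseki itself.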

> on how this could be done... without big changes on the event/notification
> system, I'll do it. I have time tomorrow and probably next week.
> The problem I have with JENA-164 is that I do not see how I can "intercept"
> changes without changing Fuseki code.
>
>> We see on jena-users@ that people are using Fuseki via the update
>> protocols.
>
> Yep.
>
> All the people I saw using Fuseki (and LARQ) at work they were loading
> stuff with tdbloader(2) and then "exploring" the data (i.e. a read-only
> scenario).
>
> RDF publishing systems are often used in a mostly read
> scenarios, with little/few updates. In this case, rebuilding the Lucene
> index nightly would be a reasonable workaround, until JENA-164 is fixed.
>
>> Could LARQ be released separately as a bolt-on to Fuseki, with
>> instructions on how to build and maintain the index?  I presume you want
>> to say its for read-only publishing at the moment.
>
> Yeah.
>
> I am not sure what you exactly mean "as a bolt-on to Fuseki".
>
> My colleagues love the fact that Fuseki is just a single jar file (with
> all the dependencies). LARQ is an extension which can simply added to
> the classpath (together with Lucene) (i.e. two jars).
>
> People wanting to use LARQ with Fuseki will need to repackage Fuseki
> if they want the single jar file with LARQ in it.
 >
> A similar scenario will arise for GeoSPARQL (i.e. another cool SPARQL
> extension I/we would love to see/have and use in Fuseki).
> I can see how this can become a problem.

Well, there is no activity on GeoSPARQL so the whole issue there is moot 
for this release cycle.

	Andy

> On the other hand, Fuseki is so easy to checkout/build/package that
> even if LARQ isn't included in it... people can package it themselves
> or third parties could distribute a pre-packaged version with all the
> cool extensions in it (not my preferred options for various reasons).
>
>> I'll hold things up for a day while we discuss this.
>
> Thanks.
>
> Paolo
>
>>
>>      Andy
>>
>>>
>>> Paolo
>>>
>>>>
>>>>       Andy
>>>
>>


Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Andy

Andy Seaborne wrote:
>> Do you have a plan for LARQ?
> 
> No plan whatsoever.  I am at the limit of the number of things I can
> manage.  I was hoping you would deal with LARQ.

Ack. I know you are busy.

I am going to try publishing LARQ artifacts on the Apache staging
repository, so I can check everything is fine.

Since LARQ, at the moment, depends on TDB (for testing), the LARQ release
must follow the TDB release. (I am planning to remove that dependency,
since it adds little value and it creates this sort of trouble.)
I'll try this tomorrow.

>> It's just a small extension for ARQ, therefore it does not need a full
>> .zip distribution. Or, that is necessary anyway?
> 
> ARQ does not have a distribution zip.

Ack. It will be the same for LARQ.

>> Probably, just the
>> -source-release.zip is necessary as per jena-iri, for example.
>> Do you agree?
> 
> The source-release is absolutely required.  It *is* the release.
> Everything else is additional in Apache process.

Ack.

>> In relation to Fuseki, JENA-63 is still open/pending:
>> https://issues.apache.org/jira/browse/JENA-63
>> But, if Fuseki is released first... LARQ cannot be included in it
>> and JENA-63 can only be closed with the next Fuseki release.
> 
> Do you want the release of Fuseki held up?

Well, my colleagues who have used Fuseki to manage/explore their
data on their machines found free-text "searching" very useful
(and I needed to tell them to always patch Fuseki to add LARQ to it).

The patch is currently tiny: just a dependency in Fuseki's pom.xml:

Index: pom.xml
===================================================================
--- pom.xml	(revision 1203107)
+++ pom.xml	(working copy)
@@ -53,6 +53,7 @@
     <ver.jena>2.6.5-incubating-SNAPSHOT</ver.jena>
     <ver.arq>2.8.9-incubating-SNAPSHOT</ver.arq>
     <ver.tdb>0.9.0-incubating-SNAPSHOT</ver.tdb>
+    <ver.larq>1.0.0-incubating-SNAPSHOT</ver.larq>

     <!-- These two go together -->
     <ver.jetty>7.2.1.v20101111</ver.jetty>
@@ -75,6 +76,12 @@

     <dependency>
       <groupId>org.apache.jena</groupId>
+      <artifactId>jena-larq</artifactId>
+      <version>${ver.larq}</version>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.jena</groupId>
       <artifactId>jena-arq</artifactId>
       <version>${ver.arq}</version>
       <classifier>tests</classifier>

We also had people asking for LARQ (and Fuseki) on the mailing list
since we moved to Apache.

The best scenario for me would be:

 1. TDB is released first
 2. LARQ and Fuseki are released soon after (with JENA-63 fixed/closed)

> LARQ does not work SPARQL Update or with SPARQL Graph Store protocol.

Yes, we have discussed this already.

This is a known (and important) limitation. There is an open issue for
it: https://issues.apache.org/jira/browse/JENA-164
It isn't a blocker in my mind.

I'll document the known limitation.

Re: JENA-164, you know the "update" route into Fuseki much better than
I do, if you could just add a small comment (i.e. one or two paragraphs)
on how this could be done... without big changes on the event/notification
system, I'll do it. I have time tomorrow and probably next week.
The problem I have with JENA-164 is that I do not see how I can "intercept"
changes without changing Fuseki code.

> We see on jena-users@ that people are using Fuseki via the update
> protocols.

Yep.

All the people I saw using Fuseki (and LARQ) at work were loading data
with tdbloader(2) and then "exploring" it (i.e. a read-only scenario).
RDF publishing systems are often used in mostly-read scenarios, with few
updates. In that case, rebuilding the Lucene index nightly (see the
sketch below) would be a reasonable workaround until JENA-164 is fixed.
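
A minimal sketch of such a nightly rebuild, assuming the
org.apache.jena.larq API (IndexBuilderString / IndexLARQ / LARQ) and
example locations "DB" (TDB) and "lucene" (Lucene index); only the
default graph is indexed here for brevity:

import java.io.File;

import org.apache.jena.larq.IndexBuilderString;
import org.apache.jena.larq.IndexLARQ;
import org.apache.jena.larq.LARQ;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDBFactory;

public class RebuildTextIndex {
    public static void main(String[] args) {
        Dataset dataset = TDBFactory.createDataset("DB");

        // Re-index all string literals in the default graph.
        IndexBuilderString builder = new IndexBuilderString(new File("lucene"));
        builder.indexStatements(dataset.getDefaultModel().listStatements());
        builder.closeWriter();

        // Register the fresh index so LARQ queries in this JVM use it.
        IndexLARQ index = builder.getIndex();
        LARQ.setDefaultIndex(index);
    }
}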

> Could LARQ be released separately as a bolt-on to Fuseki, with
> instructions on how to build and maintain the index?  I presume you want
> to say its for read-only publishing at the moment.

Yeah.

I am not sure what exactly you mean by "as a bolt-on to Fuseki".

My colleagues love the fact that Fuseki is just a single jar file (with
all the dependencies). LARQ is an extension which can simply be added to
the classpath (together with Lucene), i.e. two jars.
People wanting to use LARQ with Fuseki will need to repackage Fuseki
if they want the single jar file with LARQ in it.

A similar scenario will arise for GeoSPARQL (i.e. another cool SPARQL
extension I/we would love to see/have and use in Fuseki).
I can see how this can become a problem.

On the other hand, Fuseki is so easy to checkout/build/package that
even if LARQ isn't included in it... people can package it themselves
or third parties could distribute a pre-packaged version with all the
cool extensions in it (not my preferred options for various reasons).

> I'll hold things up for a day while we discuss this.

Thanks.

Paolo

> 
>     Andy
> 
>>
>> Paolo
>>
>>>
>>>      Andy
>>
> 

Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 01/02/12 18:39, Paolo Castagna wrote:
> Andy Seaborne wrote:
>> On 08/01/12 19:27, Andy Seaborne wrote:
>>> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
>>> there is still time!) so we can move on to TDB.
>>>
>>> As far as I'm concerned, the code in the current snapshot and in SVN is
>>> release candidate code (JENA-102 is fixed) and if people don't test it
>>> (I've pinged jena-users@), then they risk it taking longer to get a
>>> released version with fixes.
>>>
>>> I need to write the transaction API documentation and there is something
>>> odd in the prefix handling but as far as I can see, it's been odd for
>>> some time, maybe all time; it needs reworking, not fixing so shouldn't
>>> block a release.
>>>
>>> Andy
>>>
>>> PS Fuseki snapshot is using TDB transactions now.
>>
>> I'm going to start the release process for TDB and Fuseki.  I'll call
>> the vote as soon as possible.
>
> Thank you Andy.
>
> Do you have a plan for LARQ?

No plan whatsoever.  I am at the limit of the number of things I can 
manage.  I was hoping you would deal with LARQ.

> It's just a small extension for ARQ, therefore it does not need a full
> .zip distribution. Or, that is necessary anyway?

ARQ does not have a distribution zip.

The only zip currently is apache-jena.  Fuseki adds one (except it's 
broken and needs fixing :-()

See
http://www.apache.org/dist/incubator/jena/
for what we have released as well as maven release repository.

> Probably, just the
> -source-release.zip is necessary as per jena-iri, for example.
> Do you agree?

The source-release is absolutely required.  It *is* the release. 
Everything else is additional in the Apache process.

> In relation to Fuseki, JENA-63 is still open/pending:
> https://issues.apache.org/jira/browse/JENA-63
> But, if Fuseki is released first... LARQ cannot be included in it
> and JENA-63 can only be closed with the next Fuseki release.

Do you want the release of Fuseki held up?

LARQ does not work with SPARQL Update or with the SPARQL Graph Store protocol.

We see on jena-users@ that people are using Fuseki via the update protocols.

Could LARQ be released separately as a bolt-on to Fuseki, with 
instructions on how to build and maintain the index?  I presume you want 
to say it's for read-only publishing at the moment.

I'll hold things up for a day while we discuss this.

	Andy

>
> Paolo
>
>>
>>      Andy
>


Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> On 08/01/12 19:27, Andy Seaborne wrote:
>> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
>> there is still time!) so we can move on to TDB.
>>
>> As far as I'm concerned, the code in the current snapshot and in SVN is
>> release candidate code (JENA-102 is fixed) and if people don't test it
>> (I've pinged jena-users@), then they risk it taking longer to get a
>> released version with fixes.
>>
>> I need to write the transaction API documentation and there is something
>> odd in the prefix handling but as far as I can see, it's been odd for
>> some time, maybe all time; it needs reworking, not fixing so shouldn't
>> block a release.
>>
>> Andy
>>
>> PS Fuseki snapshot is using TDB transactions now.
> 
> I'm going to start the release process for TDB and Fuseki.  I'll call
> the vote as soon as possible.

Thank you Andy.

Do you have a plan for LARQ?

It's just a small extension for ARQ, therefore it does not need a full
.zip distribution. Or is that necessary anyway? Probably just the
-source-release.zip is necessary, as for jena-iri, for example.
Do you agree?

In relation to Fuseki, JENA-63 is still open/pending:
https://issues.apache.org/jira/browse/JENA-63
But, if Fuseki is released first... LARQ cannot be included in it
and JENA-63 can only be closed with the next Fuseki release.

Paolo

> 
>     Andy


Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 08/01/12 19:27, Andy Seaborne wrote:
> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
> there is still time!) so we can move on to TDB.
>
> As far as I'm concerned, the code in the current snapshot and in SVN is
> release candidate code (JENA-102 is fixed) and if people don't test it
> (I've pinged jena-users@), then they risk it taking longer to get a
> released version with fixes.
>
> I need to write the transaction API documentation and there is something
> odd in the prefix handling but as far as I can see, it's been odd for
> some time, maybe all time; it needs reworking, not fixing so shouldn't
> block a release.
>
> Andy
>
> PS Fuseki snapshot is using TDB transactions now.

I'm going to start the release process for TDB and Fuseki.  I'll call 
the vote as soon as possible.

	Andy

Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> On 09/01/12 07:59, Paolo Castagna wrote:
>> Hi Andy,
>> do you have a date in mind for the release?
> 
> As soon as time permits.  I don't know of anything other than
> documentation that needs doing because I did clearing up while the main
> release was being done.
> 
>>
>> I'd like to release LARQ as well, if possible (at the same time).
> 
> Why does LARQ depend on TDB?

Maybe that was a mistake, but it's because of the assembly dataset stuff
and related tests.

>> There is an open bug/new feature for LARQ:
>>
>>   - JENA-164: LARQ needs to update the Lucene index when a
>>     SPARQL Update request is received
>>
>> If you agree, I'd like to release LARQ anyway (since JENA-164 isn't
>> a trivial fix/task and it's a new feature, not a bug).
> 
> Your decision.

Ok, I'll go for a release without that feature.

People have already used LARQ in the past without any update capability.

If users want to update their dataset via SPARQL Update, they will need
to be informed that this is a known limitation (i.e. I'll add a comment
to the website/documentation) and they will need to re-index as they
see fit.

Many datasets/use cases are mostly read-only, so this will not have an
impact there (and they can benefit from having LARQ if they want
free-text searches).

>> I also need to spend a couple of hours to double check the NOTICE.txt
>> file and make sure it is correct and following criteria used in the
>> other modules.
> 
> The README has "openjena" and "sourceforge" in it.

Thanks. I'll fix all these.

Paolo

>>
>> I'll have a double check on the pom.xml and see if it doing something
>> different from the other modoules with the aim at reducing diversity
>> between modules (=>  lower cost for the release manager).
>>
>> Other than this, I do not see other tasks pending for LARQ.
>>
>> Paolo
>>
>> Andy Seaborne wrote:
>>> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
>>> there is still time!) so we can move on to TDB.
>>>
>>> As far as I'm concerned, the code in the current snapshot and in SVN is
>>> release candidate code (JENA-102 is fixed) and if people don't test it
>>> (I've pinged jena-users@), then they risk it taking longer to get a
>>> released version with fixes.
>>>
>>> I need to write the transaction API documentation and there is something
>>> odd in the prefix handling but as  far as I can see, it's been odd for
>>> some time, maybe all time; it needs reworking, not fixing so shouldn't
>>> block a release.
>>>
>>>      Andy
>>>
>>> PS Fuseki snapshot is using TDB transactions now.
>>
> 


Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 09/01/12 07:59, Paolo Castagna wrote:
> Hi Andy,
> do you have a date in mind for the release?

As soon as time permits.  I don't know of anything other than 
documentation that needs doing, because I did the clean-up while the main 
release was being done.

>
> I'd like to release LARQ as well, if possible (at the same time).

Why does LARQ depend on TDB?

> There is an open bug/new feature for LARQ:
>
>   - JENA-164: LARQ needs to update the Lucene index when a
>     SPARQL Update request is received
>
> If you agree, I'd like to release LARQ anyway (since JENA-164 isn't
> a trivial fix/task and it's a new feature, not a bug).

Your decision.

> I also need to spend a couple of hours to double check the NOTICE.txt
> file and make sure it is correct and following criteria used in the
> other modules.

The README has "openjena" and "sourceforge" in it.

>
> I'll have a double check on the pom.xml and see if it doing something
> different from the other modoules with the aim at reducing diversity
> between modules (=>  lower cost for the release manager).
>
> Other than this, I do not see other tasks pending for LARQ.
>
> Paolo
>
> Andy Seaborne wrote:
>> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
>> there is still time!) so we can move on to TDB.
>>
>> As far as I'm concerned, the code in the current snapshot and in SVN is
>> release candidate code (JENA-102 is fixed) and if people don't test it
>> (I've pinged jena-users@), then they risk it taking longer to get a
>> released version with fixes.
>>
>> I need to write the transaction API documentation and there is something
>> odd in the prefix handling but as  far as I can see, it's been odd for
>> some time, maybe all time; it needs reworking, not fixing so shouldn't
>> block a release.
>>
>>      Andy
>>
>> PS Fuseki snapshot is using TDB transactions now.
>


Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> I'll have a double check on the pom.xml and see if it doing something
> different from the other modoules with the aim at reducing diversity
> between modules (=> lower cost for the release manager).

Done.

I don't see anything which might get in the way of doing a release.
However, I did not test the "release" process itself (even if only to
the staging repo).

Hopefully, when the time comes there will be no problems.

Paolo

Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> I also need to spend a couple of hours to double check the NOTICE.txt
> file and make sure it is correct and following criteria used in the
> other modules.

FYI:

(a couple of hours was a wrong estimate!)

svn co https://svn.apache.org/repos/asf/incubator/jena/Import/Jena-SVN/LARQ/trunk/ larq
cd larq
grep -r --include="*.java" Copyright * | cut --delimiter=* -f 2 | sort -u

The result (although not perfect) is:

 (c) Copyright 2006, 2007, 2008, 2009 Hewlett-Packard Development Company, LP
 (c) Copyright 2006, 2007, 2008, 2009 Hewlett-Packard Development Company, LP
 (c) Copyright 2007, 2008, 2009 Hewlett-Packard Development Company, LP
 (c) Copyright 2008, 2009 Hewlett-Packard Development Company, LP
 (c) Copyright 2009 Hewlett-Packard Development Company, LP
 (c) Copyright 2010 Epimorphics Ltd.
 (c) Copyright 2010 Epimorphics Ltd.
 (c) Copyright 2010 Talis Information Ltd.
 (c) Copyright 2010 Talis Information Ltd.
 (c) Copyright 2010 Talis Information Ltd
 (c) Copyright 2010 Talis Systems Ltd.
 (c) Copyright 2010 Talis Systems Ltd.
 (c) Copyright 2011 Talis Systems Ltd.

In particular:

src/main/java/org/apache/jena/larq/IndexLARQ.java: * (c) Copyright 2010 Epimorphics Ltd.
src/main/java/org/apache/jena/larq/IndexLARQ.java: * (c) Copyright 2010 Epimorphics Ltd.
src/test/java/dev/Report_LARQ_Concurrent.java: * (c) Copyright 2010 Epimorphics Ltd.
src/test/java/dev/Report_LARQ_Concurrent.java: * (c) Copyright 2010 Epimorphics Ltd.

So, I changed LARQ's NOTICE.txt file to include: Copyright 2010 Epimorphics Ltd.

Andy, are you ok with that? (given [1], I should have double checked at the time).

It should be ok now,
Paolo

 [1] http://markmail.org/message/6suh3xyfytdqqh2i

Re: TDB: release process

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Andy,
do you have a date in mind for the release?

I'd like to release LARQ as well, if possible (at the same time).

There is an open bug/new feature for LARQ:

 - JENA-164: LARQ needs to update the Lucene index when a
   SPARQL Update request is received

If you agree, I'd like to release LARQ anyway (since JENA-164 isn't
a trivial fix/task and it's a new feature, not a bug).

I also need to spend a couple of hours to double check the NOTICE.txt
file and make sure it is correct and follows the criteria used in the
other modules.

I'll double check the pom.xml and see if it is doing something different
from the other modules, with the aim of reducing diversity between
modules (=> lower cost for the release manager).

Other than this, I do not see other tasks pending for LARQ.

Paolo

Andy Seaborne wrote:
> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
> there is still time!) so we can move on to TDB.
> 
> As far as I'm concerned, the code in the current snapshot and in SVN is
> release candidate code (JENA-102 is fixed) and if people don't test it
> (I've pinged jena-users@), then they risk it taking longer to get a
> released version with fixes.
> 
> I need to write the transaction API documentation and there is something
> odd in the prefix handling but as  far as I can see, it's been odd for
> some time, maybe all time; it needs reworking, not fixing so shouldn't
> block a release.
> 
>     Andy
> 
> PS Fuseki snapshot is using TDB transactions now.


Re: TDB: release process

Posted by Simon Helsen <sh...@ca.ibm.com>.
Andy,

it is tricky for me to provide the suite because it is embedded in a 
larger framework. Yet, the numbers are clean IMO because the times I 
provided are taken around the calls into Jena. Moreover, the absolute 
numbers don't matter very much. Some of the queries are somewhat contrived 
in their complexity and the suite was designed to be very configurable 
(making it harder to determine what the expected results have to look 
like). The difference between the 2 tests is the usage of Jena and the 
bound TDB, so whatever difference in times I see is mostly attributable to 
this, not the framework.

For us, the key is the relative numbers against vanilla TDB (up to 0.8.x). 
It is surprising that, if reads are not blocked by writers, the read 
requests take as long as they do. In the vanilla numbers, we keep track of 
how much time we "wait" and it is quite significant. I reran the tests 
with a reduced length to get more stable numbers. Instead of 
copy-pasting the numbers, I am attaching them as images this time. 

There are 4 files, 2 for TDB and 2 for TxTDB, and for each one there is 
a pair of tables for indexing operations (using standard Jena APIs, not 
SPARQL INSERT) and a pair of tables for query operations.

I circled the relevant rows (you can ignore the 3 other rows in each 
table). I put a red box around a few relevant numbers such as total query 
time, which is important because in TDB (where we use our own locks) we 
measure wait times versus actual execution times. In TxTDB, we cannot do 
this, so the total time is more or less the total time for each type of 
query (in this case DESCRIBE and SELECT). In TDB, you can see the time 
both write and read operations have to wait for each other. In TxTDB, 
there is no such thing. For the write operations, we distinguish between 
bulk and non-bulk. During the actual scalability test, not much is bulked, 
so in practice you can add these numbers. The reason is that we start the 
test by writing out about 2000 resources (named graphs), so you'll have 
some bulking there. In TxTDB, bulking means just that we combine write 
operations in one transaction. Operations are combined in a transaction 
when they happen quickly after each other (knowing that it delays the 
visibility of data). But again, this rarely happens during the scalability 
test itself. Finally, you'll notice a slight difference in the number of 
queries executed in TDB and some additional columns (such as overtaking 
and abort and reset). This has to do with a slight variation of the 
standard exclusive write algorithm we employed. It improves the average 
query time a bit in multi-threaded tests compared to the naive exclusive 
write locking mechanism. But it should not be able to beat what you're 
trying to achieve with TxTDB.

I think one can say with these numbers that on average TxTDB needs about 
2.5 times longer to finish a query transaction in the given test scenario 
(50 clients, 2s wait time between operations and a ratio of 7/1 
read/write). There could be a few reasons, but since transactions are more 
opaque than vanilla TDB, it is hard for me to tell where the time goes, 
i.e. whether there are locks inside the TxTDB code or whether it is 
genuine CPU overhead. Note that in the numbers I provided, the actual 
parallelism in TxTDB is higher, so perhaps this slows down the bottom 
line?

To answer your original question: it would be great if you had a Jena 
test framework, but I am currently not in a position to contribute 
extensively. I think the moment we actually adopt TxTDB, this may change. 
However, I can offer to run these scalability tests locally whenever you 
have improvements or algorithmic changes and then return the results.

Simon





From: Andy Seaborne <an...@apache.org>
To: Simon Helsen/Toronto/IBM@IBMCA
Cc: jena-dev@incubator.apache.org
Date: 01/18/2012 03:22 AM
Subject: Re: TDB: release process



On 17/01/12 22:14, Simon Helsen wrote:
> 4) I understand what you're new strategy is, but could this not lead to
> starvation of read transactions?

No - a reader can't be overtaken by a later writer, so no starvation.  A 
reader sees the state of the database as of the last committed write 
transaction, and does not see any changes from any later writers (the 
isolation level is "serialized" even though there is concurrency). 
Readers are not blocked by writers, unlike TDB up to 0.8.x.


Can you provide us with something to run the tests ourselves?  You have 
said in the past it's part of a larger test framework, but if it can't be 
separated out, doesn't that indicate the rest of the test framework is 
caught up in the numbers?

Maybe an addition to the emerging performance framework JenaPerf [1]? 
(It's in Scala, but writing "better Java" is a good way to start Scala.)

https://svn.apache.org/repos/asf/incubator/jena/Experimental/JenaPerf/trunk/ 



                 Andy




Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 17/01/12 22:14, Simon Helsen wrote:
> 4) I understand what you're new strategy is, but could this not lead to
> starvation of read transactions?

No - a reader can't be overtaken by a later writer, so no starvation.  A 
reader sees the state of the database as of the last committed write 
transaction, and does not see any changes from any later writers (the 
isolation level is "serialized" even though there is concurrency). 
Readers are not blocked by writers, unlike TDB up to 0.8.x.
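
A minimal sketch of that behaviour (an assumed on-disk dataset at "DB";
the reader opened first keeps seeing its snapshot even after a
concurrent writer commits):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.ReadWrite;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.vocabulary.RDFS;

public class SnapshotReadSketch {
    public static void main(String[] args) throws InterruptedException {
        final Dataset ds = TDBFactory.createDataset("DB");

        ds.begin(ReadWrite.READ);                 // reader opens its snapshot
        long sizeAtStart = ds.getDefaultModel().size();

        Thread writer = new Thread(new Runnable() {
            public void run() {
                ds.begin(ReadWrite.WRITE);        // not blocked by the open reader
                try {
                    Model m = ds.getDefaultModel();
                    m.add(ResourceFactory.createResource("http://example/x"),
                          RDFS.label, "added while a reader is open");
                    ds.commit();                  // committed to the journal
                } finally {
                    ds.end();
                }
            }
        });
        writer.start();
        writer.join();

        // The reader still sees the state as of its begin(): size is unchanged.
        System.out.println(sizeAtStart == ds.getDefaultModel().size());
        ds.end();
    }
}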


Can you provide us with something to run the tests ourselves?  You have 
said in the past it's part of a larger test framework, but if it can't be 
separated out, doesn't that indicate the rest of the test framework is 
caught up in the numbers?

Maybe an addition to the emerging performance framework JenaPerf [1]? 
(It's in Scala, but writing "better Java" is a good way to start Scala.)

https://svn.apache.org/repos/asf/incubator/jena/Experimental/JenaPerf/trunk/ 


	Andy

Re: TDB: release process

Posted by Simon Helsen <sh...@ca.ibm.com>.
Thanks for the elaborate response Andy (and sorry for the late reaction).

1) This was on Windows, in direct mode.
2) I understand the motivation since we are suffering from this in vanilla 
TDB, i.e. one write blocks reads, even before the write can start because 
the moment you "request" the write lock, all new reads are scheduled to 
run *after* the write, even if the write is still waiting for previous 
reads to finish. 
3) Our test environment is also read-heavy, which is why I put a ratio of 
7 to 1. In practice, the ratio may even be higher, but we have no reliable 
numbers from customer scenarios at this point.
4) I understand what your new strategy is, but could this not lead to 
starvation of read transactions?
5) With regard to the improved strategy, yes, I agree, but I think the 
half-way proposal is more stable, i.e. you have a write thread which is 
"triggered" by a commit of the write transaction. That is predictable and 
will improve performance without hitting the readers.
6) The current implementation is too weak on the readers (compared to 
vanilla TDB) for us, so we'll probably hold off on true adoption until 
the improved strategy is in place. As always, I am happy to test this 
stuff and provide numbers.

thanks

Simon

PS: I understand the transactional model has a separate benefit as you 
point out: index consistency and stability in the case of an outage. That 
is quite a benefit in itself, so if (for 50+ threads) performance was even 
only comparable to TDB (not worse, but not necessarily better), it would 
be enough incentive to adopt. Of course, we hope that with 50+ threads, 
performance would improve compared to TDB



From: Andy Seaborne <an...@apache.org>
To: jena-dev@incubator.apache.org
Cc: Simon Helsen/Toronto/IBM@IBMCA
Date: 01/12/2012 10:25 AM
Subject: Re: TDB: release process



Simon,

Thank you for the figures.  Very useful.

Is this on Linux or Windows?

One of the design points of transactions is when to write-back the 
changes to the main database.  At the moment, write-back is done by any 
transaction (read included :-) that finds the database quiescent when 
the transaction clears up.  A commit in a writer writes to the log, not 
the main DB.  The journal is written back later.  This will hit some 
readers.  This is also why your write times are better.

The advantage of the writeback policy currently is that it is 
predictable and easy to calculate when it will happen.  What I hope to 
do when the system is proven reliable is to have a write-back thread, 
taking all database writes off the transaction end code path. SQLite 
and similar do this - but you need to be very careful that write-back 
does occur and the log doesn't just continue to grow.

An obvious half-way design is to fork the write-back at the end of the 
transaction that decides it can do the final changes but immediately 
return from the transaction.

Also, with the change in locking strategy, different ordering may be 
happening.  In testing, I did see, when the system was totally loaded, 
some read transactions seemingly scheduled aside when there were writers 
(Linux) even accounting for the time when changes were written to the 
main database.  However, that was in a system that was tuned to be 
maxed out and transactions that did zero, or near-zero, work.

During my testing, which is for a different workload, query has seemed 
the same as 0.8.10 in normal use. The workload is read-dominated. 
Writes are infrequent.  We're continuing to test.

One of the drivers for the transaction was to reduce the stutter effect 
of global locking for a writer to run - using MRSW locking, the writer 
means the readers are locked out, resulting in high latency for some 
reads.  The write-back needs further tuning for that although it is 
already better because (1) it waits until the DB is quiet, (2) it 
amalgamates writes and especially sync() calls.  sync() is the expensive 
point.

Another driver for the transactions has been to make the data more 
durable.

                 Andy

PS Early work-in-progress

https://svn.apache.org/repos/asf/incubator/jena/Experimental/JenaPerf/trunk/


A performance framework for running query (and later update) mixes and 
reporting.  Unfinished.



On 10/01/12 23:24, Simon Helsen wrote:
> Andy,
>
> yes, I'll look into it as soon as I have cycles again. And no, I have
> not yet tried with non-transactional API in 2.7.0. I actually want to do
> that at some point to have a cleaner baseline.
>
> In the mean time, here is a summary of the results I found:
>
> 1) when I run with 1 client, query and store execution is comparable to
> each other. I have detailed numbers, but they help much
> 2) things become interesting when I start scaling up the number of
> clients (one of the principal motivations to move to TDB Tx). The data
> below is for the following scenario:
>
> * 50 clients
> * the operations of each client is a mixture of queries and write
> operations, where I execute a write operation for every 7th query
> * the queries are deterministically taken from a pool of about 35
> queries with varying complexity. When run in 1 client, they take
> anywhere from a few ms to almost 2 seconds for most intense query
> * between each operation, I wait 2s
> * there is plenty of memory/heap available. I use a 64 bit machine with
> 8Gb of memory where 4 is used for the java heap.
>
> Note that in TDB we use an exclusive write lock for write operations and
> shared read locks for read operations. In TDBTx, I just use transactions
> (i.e. we don't lock ourselves):
>
> A) Here are the numbers for TDB (0.8.7 etc):
>
> - total write time = 1345594ms, so about 1346s
>
> cnt | avg | max | min | dev | tot
> ======================================================================================================================
> DESCRIBE (ms) 402 | 466 | 4,859 | 0 | 609 | 187,609
> SELECT (ms) 4,618 | 4,809 | 93,453 | 0 | 9,621 | 22,211,907
> ----------------------------------------------------------------------------------------------------------------------
> PARALLELISM 5,020 | 14 | 41 | 0 | 8 | 79,066
>
> quite note about parallelism: this indicates effectively how much
> parallel activity was going on. For instance, on average, there were 14
> queries running at the same time, but maximum 41. The total indicates
> how heavily query activity was running in parallel.
>
> B) Here are the numbers of TDBTx:
>
> - total write time = 166047ms, so about 166s
>
> cnt | avg | max | min | dev | tot
> ==================================================================================================================
> DESCRIBE (ms) 168 | 2,557 | 9,219 | 31 | 1,769 | 429,645
> SELECT (ms) 1,853 | 38,866 | 392,282 | 0 | 74,008 | 72,020,224
> -------------------------------------------------------------------------------------------------------------------
> PARALLELISM 2,021 | 35 | 49 | 0 | 10 | 71,791
>
>
> note that although the test suite are running in the same way, The long
> query times in TDBTx caused several timeouts, which indicates the
> substantially smaller amount of completed queries. Even so, the total
> query time was still almost 4 times higher
>
> So, it seems that in this multi-client scenario, TDBTx is way better in
> avoiding lock contention around write operations, but, it is behaving
> significantly weaker for queries. One thing that is interesting is TDBTx
> has a higher number
> of average parallel running queries and a higher max. So, perhaps this
> is an important cause in the slowdown.
>
> Hopefully these are useful. Does any of you have done any performance
> measurements with transactional TDB?
>
> Simon
>
>
> From:                  Andy Seaborne <an...@apache.org>
> To:            jena-dev@incubator.apache.org
> Date:                  01/10/2012 02:04 PM
> Subject:               Re: TDB: release process
>
>
> ------------------------------------------------------------------------
>
>
>
> On 10/01/12 13:45, Andy Seaborne wrote:
>  > On 09/01/12 15:07, Simon Helsen wrote:
>  >> Andy, others,
>  >>
>  >> I have been testing TxTDB on my end and functionally, things are 
looking
>  >> good. I am not able to see any immediate problems anymore. Of 
course,
>  >> there may still be more exotic things left, but those can probably
>  >> managed
>  >> in am minor release. However, now that it is getting good on the
>  >> functional end, I am starting to check the non-functional
>  >> characteristics,
>  >> especially speed and scalability (in terms of multiple clients). For
> this
>  >> I use a test suite with about 35 different queries and I compare the
>  >> performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that 
is
>  >> the
>  >> version we currently use in the release of our product.. I am 
comparing
>  >> these numbers then with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and 
the
>  >> transaction API. I realize this partially comparing apples to pears 
but
>  >> from our perspective, we need to see how the bottomline changes in 
terms
>  >> of query speed when we increase the number of concurrent clients.
>  >>
>  >> I have detailed numbers, but before I start sharing these, I want to
> know
>  >> if there is anything I could/should do to tune ARQ/TxTDB in terms of
>  >> performance. For instance, I wonder if there are still a whole range 
of
>  >> checks active which I can/should turn off now that we are 
functionally
>  >> more sound. For completeness, I should add that we don't use any
>  >> optimization (i.e. we run with none.opt )
>  >>
>  >> thanks
>  >>
>  >> Simon
>  >
>  > Simon,
>  >
>  > Figure would be good. If you use TDB without touching the transaction
>  > system then it should be the same as before (with the obvious chances 
of
>  > unintended changes). Have you run this way?
>  >
>  > Just creating a transaction, especially one that allows write is a 
cost
>  > and if the granularity is small then it's going to make a big
>  > difference. (This is one reason there isn't an "autocommit" mode - it
>  > only seems to end in trouble one way or another). Read transactions 
are
>  > cheaper but not free.
>  >
>  > In terms of tuning, TDB 0.9 needs more heap as the transaction
>  > intermediate state is in-RAM , with no proper spill-to-disk yet.
>  >
>  > There shouldn't be the internal consistency checking enabled. Hmm -
>  > better check yet again!
>  >
>  > Andy
>  >
>
> Simon,
>
> Could you profile the tests and pass on the results? Any testing code
> left should show as hotspots.
>
> Andy
>
>
>




Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
Simon,

Thank you for the figures.  Very useful.

Is this on Linux or Windows?

One of the design points of transactions is when to write-back the 
changes to the main database.  At the moment, write-back is done by any 
transaction (read included :-) that finds the database quiescent when 
the transaction clears up.  A commit in a writer writes to the log, not 
the main DB.  The journal is written back later.  This will hit some 
readers.  This is also why your write times are better.

The advantage of the writeback policy currently is that it is 
predictable and easy to calculate when it will happen.  What I hope to 
do when the system is proven reliable is to have a write-back thread, 
taking all database writes off the transaction end code path. SQLite 
and similar do this - but you need to be very careful that write-back 
does occur and the log doesn't just continue to grow.

An obvious half-way design is to fork the write-back at the end of the 
transaction that decides it can do the final changes but immediately 
return from the transaction.
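
A generic illustration of that half-way idea (not TDB code; the class
and method names here are invented for the sketch): the last transaction
out hands the journal flush to a single background thread and returns
immediately.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class DeferredWriteback {
    private final ExecutorService writeback = Executors.newSingleThreadExecutor();
    private final AtomicInteger active = new AtomicInteger();

    public void begin() { active.incrementAndGet(); }

    public void end() {
        // If this was the last active transaction, the database is quiescent:
        // schedule the journal flush but do not wait for it.
        if (active.decrementAndGet() == 0) {
            writeback.submit(new Runnable() {
                public void run() { flushJournalToMainDatabase(); }
            });
        }
    }

    private void flushJournalToMainDatabase() {
        // Placeholder: replay committed journal records into the main database,
        // then truncate the journal so it does not grow without bound.
    }
}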

Also, with the change in locking strategy, different ordering may be 
happening.  In testing, I did see, when the system was totally loaded, 
some read transactions seemingly scheduled aside when there were writers 
(Linux) even accounting for the time when changes were written to the 
main database.  However, that was in a system that was tuned to be 
maxed out and transactions that did zero, or near-zero, work.

During my testing, which is for a different workload, query has seemed 
the same as 0.8.10 in normal use. The workload is read-dominated. 
Writes are infrequent.  We're continuing to test.

One of the drivers for the transaction was to reduce the stutter effect 
of global locking for a writer to run - using MRSW locking, the writer 
means the readers are locked out, resulting in high latency for some 
reads.  The write-back needs further tuning for that although it is 
already better because (1) it waits until the DB is quiet, (2) it 
amalgamates writes and especially sync() calls.  sync() is the expensive 
point.

Another driver for the transactions has been to make the data more durable.

	Andy

PS Early work-in-progress

https://svn.apache.org/repos/asf/incubator/jena/Experimental/JenaPerf/trunk/

A performance framework for running query (and later update) mixes and 
reporting.  Unfinished.



On 10/01/12 23:24, Simon Helsen wrote:
> Andy,
>
> yes, I'll look into it as soon as I have cycles again. And no, I have
> not yet tried with non-transactional API in 2.7.0. I actually want to do
> that at some point to have a cleaner baseline.
>
> In the mean time, here is a summary of the results I found:
>
> 1) when I run with 1 client, query and store execution is comparable to
> each other. I have detailed numbers, but they help much
> 2) things become interesting when I start scaling up the number of
> clients (one of the principal motivations to move to TDB Tx). The data
> below is for the following scenario:
>
> * 50 clients
> * the operations of each client is a mixture of queries and write
> operations, where I execute a write operation for every 7th query
> * the queries are deterministically taken from a pool of about 35
> queries with varying complexity. When run in 1 client, they take
> anywhere from a few ms to almost 2 seconds for most intense query
> * between each operation, I wait 2s
> * there is plenty of memory/heap available. I use a 64 bit machine with
> 8Gb of memory where 4 is used for the java heap.
>
> Note that in TDB we use an exclusive write lock for write operations and
> shared read locks for read operations. In TDBTx, I just use transactions
> (i.e. we don't lock ourselves):
>
> A) Here are the numbers for TDB (0.8.7 etc):
>
> - total write time = 1345594ms, so about 1346s
>
> cnt | avg | max | min | dev | tot
> ======================================================================================================================
>
> DESCRIBE (ms) 402 | 466 | 4,859 | 0 | 609 | 187,609
> SELECT (ms) 4,618 | 4,809 | 93,453 | 0 | 9,621 | 22,211,907
> ----------------------------------------------------------------------------------------------------------------------
>
> PARALLELISM 5,020 | 14 | 41 | 0 | 8 | 79,066
>
> quite note about parallelism: this indicates effectively how much
> parallel activity was going on. For instance, on average, there were 14
> queries running at the same time, but maximum 41. The total indicates
> how heavily query activity was running in parallel.
>
> B) Here are the numbers of TDBTx:
>
> - total write time = 166047ms, so about 166s
>
> cnt | avg | max | min | dev | tot
> ==================================================================================================================
>
> DESCRIBE (ms) 168 | 2,557 | 9,219 | 31 | 1,769 | 429,645
> SELECT (ms) 1,853 | 38,866 | 392,282 | 0 | 74,008 | 72,020,224
> -------------------------------------------------------------------------------------------------------------------
>
> PARALLELISM 2,021 | 35 | 49 | 0 | 10 | 71,791
>
>
> note that although the test suite are running in the same way, The long
> query times in TDBTx caused several timeouts, which indicates the
> substantially smaller amount of completed queries. Even so, the total
> query time was still almost 4 times higher
>
> So, it seems that in this multi-client scenario, TDBTx is way better in
> avoiding lock contention around write operations, but, it is behaving
> significantly weaker for queries. One thing that is interesting is TDBTx
> has a higher number
> of average parallel running queries and a higher max. So, perhaps this
> is an important cause in the slowdown.
>
> Hopefully these are useful. Does any of you have done any performance
> measurements with transactional TDB?
>
> Simon
>
>
> From: 	Andy Seaborne <an...@apache.org>
> To: 	jena-dev@incubator.apache.org
> Date: 	01/10/2012 02:04 PM
> Subject: 	Re: TDB: release process
>
>
> ------------------------------------------------------------------------
>
>
>
> On 10/01/12 13:45, Andy Seaborne wrote:
>  > On 09/01/12 15:07, Simon Helsen wrote:
>  >> Andy, others,
>  >>
>  >> I have been testing TxTDB on my end and functionally, things are looking
>  >> good. I am not able to see any immediate problems anymore. Of course,
>  >> there may still be more exotic things left, but those can probably
>  >> managed
>  >> in am minor release. However, now that it is getting good on the
>  >> functional end, I am starting to check the non-functional
>  >> characteristics,
>  >> especially speed and scalability (in terms of multiple clients). For
> this
>  >> I use a test suite with about 35 different queries and I compare the
>  >> performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that is
>  >> the
>  >> version we currently use in the release of our product.. I am comparing
>  >> these numbers then with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and the
>  >> transaction API. I realize this partially comparing apples to pears but
>  >> from our perspective, we need to see how the bottomline changes in terms
>  >> of query speed when we increase the number of concurrent clients.
>  >>
>  >> I have detailed numbers, but before I start sharing these, I want to
> know
>  >> if there is anything I could/should do to tune ARQ/TxTDB in terms of
>  >> performance. For instance, I wonder if there are still a whole range of
>  >> checks active which I can/should turn off now that we are functionally
>  >> more sound. For completeness, I should add that we don't use any
>  >> optimization (i.e. we run with none.opt )
>  >>
>  >> thanks
>  >>
>  >> Simon
>  >
>  > Simon,
>  >
>  > Figure would be good. If you use TDB without touching the transaction
>  > system then it should be the same as before (with the obvious chances of
>  > unintended changes). Have you run this way?
>  >
>  > Just creating a transaction, especially one that allows write is a cost
>  > and if the granularity is small then it's going to make a big
>  > difference. (This is one reason there isn't an "autocommit" mode - it
>  > only seems to end in trouble one way or another). Read transactions are
>  > cheaper but not free.
>  >
>  > In terms of tuning, TDB 0.9 needs more heap as the transaction
>  > intermediate state is in-RAM , with no proper spill-to-disk yet.
>  >
>  > There shouldn't be the internal consistency checking enabled. Hmm -
>  > better check yet again!
>  >
>  > Andy
>  >
>
> Simon,
>
> Could you profile the tests and pass on the results? Any testing code
> left should show as hotspots.
>
> Andy
>
>
>


Re: TDB: release process

Posted by Simon Helsen <sh...@ca.ibm.com>.
Andy,

yes, I'll look into it as soon as I have cycles again. And no, I have not 
yet tried the non-transactional API in 2.7.0. I actually want to do that 
at some point to have a cleaner baseline. 

In the mean time, here is a summary of the results I found:

1) when I run with 1 client, query and store execution is comparable to 
each other. I have detailed numbers, but they don't help much
2) things become interesting when I start scaling up the number of clients 
(one of the principal motivations to move to TDB Tx). The data below is 
for the following scenario:

* 50 clients
* the operations of each client are a mixture of queries and write 
operations, where I execute a write operation for every 7th query
* the queries are deterministically taken from a pool of about 35 queries 
with varying complexity. When run in 1 client, they take anywhere from a 
few ms to almost 2 seconds for the most intense query
* between each operation, I wait 2s
* there is plenty of memory/heap available. I use a 64 bit machine with 
8Gb of memory where 4 is used for the java heap.
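
For reference, each client runs roughly the following loop (a minimal 
sketch only: the Store interface and the class names are made up here, and 
the real harness also does the timing and reporting):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LoadTest {

        /** Hypothetical hook onto the store under test (TDB 0.8.7 or TxTDB). */
        interface Store {
            void query(String sparql);   // timed DESCRIBE/SELECT
            void write();                // timed write operation
        }

        static class Client implements Runnable {
            private final Store store;
            private final List<String> queryPool;   // ~35 queries of varying complexity

            Client(Store store, List<String> queryPool) {
                this.store = store;
                this.queryPool = queryPool;
            }

            public void run() {
                try {
                    for (int i = 0; ; i++) {
                        if (i % 8 == 7) {
                            store.write();     // a write operation after every 7th query
                        } else {
                            store.query(queryPool.get(i % queryPool.size()));
                        }
                        Thread.sleep(2000);    // 2s pause between operations
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // stop when the run is over
                }
            }
        }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(50);   // 50 concurrent clients
            // ... build the Store and the query pool, submit 50 Client instances ...
            pool.shutdown();
        }
    }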

Note that in TDB we use an exclusive write lock for write operations and 
shared read locks for read operations. In TDBTx, I just use transactions 
(i.e. we don't lock ourselves); a sketch of both access patterns follows 
the numbers below:

A) Here are the numbers for TDB (0.8.7 etc):

- total write time = 1345594ms, so about 1346s

                   cnt   |     avg  |      max  |  min  |     dev  |         tot
================================================================================
DESCRIBE (ms)      402   |     466  |    4,859  |    0  |     609  |     187,609
SELECT   (ms)    4,618   |   4,809  |   93,453  |    0  |   9,621  |  22,211,907
--------------------------------------------------------------------------------
PARALLELISM      5,020   |      14  |       41  |    0  |       8  |      79,066

A quick note about parallelism: this indicates effectively how much parallel 
activity was going on. For instance, on average there were 14 queries 
running at the same time, with a maximum of 41. The total indicates how 
heavily query activity was running in parallel. 

B) Here are the numbers for TDBTx: 

- total write time = 166047ms, so about 166s

                   cnt   |      avg  |      max  |  min  |     dev  |         tot
=================================================================================
DESCRIBE (ms)      168   |    2,557  |    9,219  |   31  |   1,769  |     429,645
SELECT   (ms)    1,853   |   38,866  |  392,282  |    0  |  74,008  |  72,020,224
---------------------------------------------------------------------------------
PARALLELISM      2,021   |       35  |       49  |    0  |      10  |      71,791


Note that although the test suites were run in the same way, the long 
query times in TDBTx caused several timeouts, which explains the 
substantially smaller number of completed queries. Even so, the total 
query time was still almost 4 times higher.

So, it seems that in this multi-client scenario, TDBTx is much better at 
avoiding lock contention around write operations, but it performs 
significantly worse for queries. One interesting point is that TDBTx has a 
higher average number of queries running in parallel, and a higher maximum, 
so perhaps this is an important cause of the slowdown. 
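
For clarity, the two access patterns being compared are roughly the 
following (a minimal sketch: dataset setup, query execution and error 
handling are elided, and the exact calls may differ slightly from what our 
code actually does):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;
    import com.hp.hpl.jena.shared.Lock;

    public class AccessPatterns {

        // A) TDB 0.8.7: application-level locking - a shared read lock
        //    around queries, an exclusive write lock around updates.
        static void lockedWrite(Dataset dataset) {
            dataset.getLock().enterCriticalSection(Lock.WRITE);
            try {
                // ... perform the update ...
            } finally {
                dataset.getLock().leaveCriticalSection();
            }
        }

        // B) TxTDB (TDB 0.9.0): transactions instead of locks.
        static void transactionalWrite(Dataset dataset) {
            dataset.begin(ReadWrite.WRITE);
            try {
                // ... perform the update ...
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }

Queries follow the same shape, with Lock.READ and ReadWrite.READ 
respectively.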

Hopefully these numbers are useful. Have any of you done any performance 
measurements with transactional TDB?

Simon



From:
Andy Seaborne <an...@apache.org>
To:
jena-dev@incubator.apache.org
Date:
01/10/2012 02:04 PM
Subject:
Re: TDB: release process



On 10/01/12 13:45, Andy Seaborne wrote:
> On 09/01/12 15:07, Simon Helsen wrote:
>> Andy, others,
>>
>> I have been testing TxTDB on my end and functionally, things are 
looking
>> good. I am not able to see any immediate problems anymore. Of course,
>> there may still be more exotic things left, but those can probably
>> managed
>> in am minor release. However, now that it is getting good on the
>> functional end, I am starting to check the non-functional
>> characteristics,
>> especially speed and scalability (in terms of multiple clients). For 
this
>> I use a test suite with about 35 different queries and I compare the
>> performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that is
>> the
>> version we currently use in the release of our product.. I am comparing
>> these numbers then with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and the
>> transaction API. I realize this partially comparing apples to pears but
>> from our perspective, we need to see how the bottomline changes in 
terms
>> of query speed when we increase the number of concurrent clients.
>>
>> I have detailed numbers, but before I start sharing these, I want to 
know
>> if there is anything I could/should do to tune ARQ/TxTDB in terms of
>> performance. For instance, I wonder if there are still a whole range of
>> checks active which I can/should turn off now that we are functionally
>> more sound. For completeness, I should add that we don't use any
>> optimization (i.e. we run with none.opt )
>>
>> thanks
>>
>> Simon
>
> Simon,
>
> Figure would be good. If you use TDB without touching the transaction
> system then it should be the same as before (with the obvious chances of
> unintended changes). Have you run this way?
>
> Just creating a transaction, especially one that allows write is a cost
> and if the granularity is small then it's going to make a big
> difference. (This is one reason there isn't an "autocommit" mode - it
> only seems to end in trouble one way or another). Read transactions are
> cheaper but not free.
>
> In terms of tuning, TDB 0.9 needs more heap as the transaction
> intermediate state is in-RAM , with no proper spill-to-disk yet.
>
> There shouldn't be the internal consistency checking enabled. Hmm -
> better check yet again!
>
> Andy
>

Simon,

Could you profile the tests and pass on the results?  Any testing code 
left should show as hotspots.

                 Andy




Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 10/01/12 13:45, Andy Seaborne wrote:
> On 09/01/12 15:07, Simon Helsen wrote:
>> Andy, others,
>>
>> I have been testing TxTDB on my end and functionally, things are looking
>> good. I am not able to see any immediate problems anymore. Of course,
>> there may still be more exotic things left, but those can probably
>> managed
>> in am minor release. However, now that it is getting good on the
>> functional end, I am starting to check the non-functional
>> characteristics,
>> especially speed and scalability (in terms of multiple clients). For this
>> I use a test suite with about 35 different queries and I compare the
>> performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that is
>> the
>> version we currently use in the release of our product.. I am comparing
>> these numbers then with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and the
>> transaction API. I realize this partially comparing apples to pears but
>> from our perspective, we need to see how the bottomline changes in terms
>> of query speed when we increase the number of concurrent clients.
>>
>> I have detailed numbers, but before I start sharing these, I want to know
>> if there is anything I could/should do to tune ARQ/TxTDB in terms of
>> performance. For instance, I wonder if there are still a whole range of
>> checks active which I can/should turn off now that we are functionally
>> more sound. For completeness, I should add that we don't use any
>> optimization (i.e. we run with none.opt )
>>
>> thanks
>>
>> Simon
>
> Simon,
>
> Figure would be good. If you use TDB without touching the transaction
> system then it should be the same as before (with the obvious chances of
> unintended changes). Have you run this way?
>
> Just creating a transaction, especially one that allows write is a cost
> and if the granularity is small then it's going to make a big
> difference. (This is one reason there isn't an "autocommit" mode - it
> only seems to end in trouble one way or another). Read transactions are
> cheaper but not free.
>
> In terms of tuning, TDB 0.9 needs more heap as the transaction
> intermediate state is in-RAM , with no proper spill-to-disk yet.
>
> There shouldn't be the internal consistency checking enabled. Hmm -
> better check yet again!
>
> Andy
>

Simon,

Could you profile the tests and pass on the results?  Any testing code 
left should show as hotspots.
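
(For example, a flat CPU sample from the JDK's hprof agent would be enough 
to show them - the exact options and the runner class name below are only a 
suggestion:

    java -agentlib:hprof=cpu=samples,interval=10,depth=12 -cp ... YourTestRunner

or attach a sampling profiler such as VisualVM to the running JVM.)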

	Andy


Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 09/01/12 15:07, Simon Helsen wrote:
> Andy, others,
>
> I have been testing TxTDB on my end and functionally, things are looking
> good. I am not able to see any immediate problems anymore. Of course,
> there may still be more exotic things left, but those can probably managed
> in am minor release. However, now that it is getting good on the
> functional end, I am starting to check the non-functional characteristics,
> especially speed and scalability (in terms of multiple clients). For this
> I use a test suite with about 35 different queries and I compare the
> performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that is the
> version we currently use in the release of our product.. I am comparing
> these numbers then with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and the
> transaction API. I realize this partially comparing apples to pears but
> from our perspective, we need to see how the bottomline changes in terms
> of query speed when we increase the number of concurrent clients.
>
> I have detailed numbers, but before I start sharing these, I want to know
> if there is anything I could/should do to tune ARQ/TxTDB in terms of
> performance. For instance, I wonder if there are still a whole range of
> checks active which I can/should turn off now that we are functionally
> more sound. For completeness, I should add that we don't use any
> optimization (i.e. we run with none.opt )
>
> thanks
>
> Simon

Simon,

Figures would be good.  If you use TDB without touching the transaction 
system then it should be the same as before (with the obvious chance of 
unintended changes).  Have you run this way?

Just creating a transaction, especially one that allows writes, is a cost, 
and if the granularity is small then it's going to make a big 
difference.  (This is one reason there isn't an "autocommit" mode - it 
only seems to end in trouble one way or another.)  Read transactions are 
cheaper but not free.
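
Concretely, the difference is between giving every small operation its own 
write transaction and batching a group of updates into one.  A sketch (the 
update steps themselves are elided):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;

    public class TxGranularity {

        // Costly when the granularity is small: one write transaction per update.
        static void perOperation(Dataset ds, Iterable<Runnable> updates) {
            for (Runnable update : updates) {
                ds.begin(ReadWrite.WRITE);
                try {
                    update.run();
                    ds.commit();
                } finally {
                    ds.end();
                }
            }
        }

        // Cheaper: amortise the begin/commit overhead over a batch of updates.
        static void batched(Dataset ds, Iterable<Runnable> updates) {
            ds.begin(ReadWrite.WRITE);
            try {
                for (Runnable update : updates) {
                    update.run();
                }
                ds.commit();
            } finally {
                ds.end();
            }
        }
    }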

In terms of tuning, TDB 0.9 needs more heap as the transaction 
intermediate state is held in RAM, with no proper spill-to-disk yet.
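
(For example - the figure is illustrative only - run with something like 

    java -Xmx4G ... 

rather than the default heap if there are large or many concurrent write 
transactions.)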

There shouldn't be any internal consistency checking enabled.  Hmm - 
better check yet again!

	Andy


Re: TDB: release process

Posted by Simon Helsen <sh...@ca.ibm.com>.
Andy, others,

I have been testing TxTDB on my end and functionally, things are looking 
good. I am not able to see any immediate problems anymore. Of course, 
there may still be more exotic things left, but those can probably be 
managed in a minor release. However, now that it is getting good on the 
functional end, I am starting to check the non-functional characteristics, 
especially speed and scalability (in terms of multiple clients). For this 
I use a test suite with about 35 different queries, and I compare the 
performance against Jena 2.6.3/ARQ 2.8.5 and TDB 0.8.7 because that is the 
version we currently use in the release of our product. I am comparing 
these numbers with Jena/ARQ 2.7.0 and TDB 0.9.0 (20111229) and the 
transaction API. I realize this is partially comparing apples to pears, but 
from our perspective we need to see how the bottom line changes in terms 
of query speed when we increase the number of concurrent clients.

I have detailed numbers, but before I start sharing these, I want to know 
if there is anything I could/should do to tune ARQ/TxTDB in terms of 
performance. For instance, I wonder if there is still a whole range of 
checks active which I can/should turn off now that we are functionally 
more sound. For completeness, I should add that we don't use any 
optimization (i.e. we run with none.opt).

thanks

Simon




From:
Andy Seaborne <an...@apache.org>
To:
jena-dev@incubator.apache.org
Date:
01/08/2012 02:28 PM
Subject:
TDB: release process



The release of core/ARQ etc. hasn't lead to any immediate disasters (but 
there is still time!) so we can move on to TDB.

As far as I'm concerned, the code in the current snapshot and in SVN is 
release candidate code (JENA-102 is fixed) and if people don't test it 
(I've pinged jena-users@), then they risk it taking longer to get a 
released version with fixes.

I need to write the transaction API documentation and there is something 
odd in the prefix handling but as  far as I can see, it's been odd for 
some time, maybe all time; it needs reworking, not fixing so shouldn't 
block a release.

     Andy

PS Fuseki snapshot is using TDB transactions now.




Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 16/01/12 09:19, Andy Seaborne wrote:
> On 08/01/12 19:27, Andy Seaborne wrote:
>> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
>> there is still time!) so we can move on to TDB.
>>
>> As far as I'm concerned, the code in the current snapshot and in SVN is
>> release candidate code (JENA-102 is fixed) and if people don't test it
>> (I've pinged jena-users@), then they risk it taking longer to get a
>> released version with fixes.
>>
>> I need to write the transaction API documentation and there is something
>> odd in the prefix handling but as far as I can see, it's been odd for
>> some time, maybe all time; it needs reworking, not fixing so shouldn't
>> block a release.
>>
>> Andy
>>
>> PS Fuseki snapshot is using TDB transactions now.
>
> Slight delay - the reports on jena-users about Fuseki might be
> TDB-related or might be Fuseki-related. I want to investigate them first.
>
> Other than that, I just about the stage the release ... :-(
>
> Andy

OK - back on track, hopefully.  It turns out that not all the 
transaction code was properly wired in within TDB.  It was not a Fuseki 
problem.

As part of much cleanup, the original, rather detailed and manual setup 
for transactions has been shuffled out of the main package, and a 
DatasetGraph equivalent of the Dataset API is presented:

http://incubator.apache.org/jena/documentation/tdb/tdb_transactions.html#api_for_transactions
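
A minimal example in the Dataset form (the store location here is made up; 
see the page above for the details):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class TxExample {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("MyDatabases/DB1");

            // Write transaction: changes become visible to readers at commit().
            dataset.begin(ReadWrite.WRITE);
            try {
                dataset.getDefaultModel()
                       .createResource("http://example/book1")
                       .addProperty(RDFS.label, "example");
                dataset.commit();
            } finally {
                dataset.end();
            }

            // Read transaction: a consistent view; cheaper than a write
            // transaction, but not free.
            dataset.begin(ReadWrite.READ);
            try {
                System.out.println(dataset.getDefaultModel().size());
            } finally {
                dataset.end();
            }
        }
    }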

Incidentally - Simon - it is possible that the issues around query 
performance are improved.  It was using the general query engine. 
Whether this caught your performance tests or not depends on how you were 
creating transactions in your code.

	Andy

Re: TDB: release process

Posted by Andy Seaborne <an...@apache.org>.
On 08/01/12 19:27, Andy Seaborne wrote:
> The release of core/ARQ etc. hasn't lead to any immediate disasters (but
> there is still time!) so we can move on to TDB.
>
> As far as I'm concerned, the code in the current snapshot and in SVN is
> release candidate code (JENA-102 is fixed) and if people don't test it
> (I've pinged jena-users@), then they risk it taking longer to get a
> released version with fixes.
>
> I need to write the transaction API documentation and there is something
> odd in the prefix handling but as far as I can see, it's been odd for
> some time, maybe all time; it needs reworking, not fixing so shouldn't
> block a release.
>
> Andy
>
> PS Fuseki snapshot is using TDB transactions now.

Slight delay - the reports on jena-users about Fuseki might be 
TDB-related or might be Fuseki-related.  I want to investigate them first.

Other than that, I was just about to stage the release ... :-(

	Andy