Posted to dev@clerezza.apache.org by Olivier Grisel <og...@nuxeo.com> on 2011/03/17 11:50:20 UTC

A question about Jena Sparql Engine and the TDB Store

Hi all,

I am reading the source code of the Jena SPARQL / commons / storage /
tdb packages of Clerezza and it seems that the Jena SPARQL engine is
passed a generic Dataset implementation that wraps arbitrary Clerezza
triple collections, even when the TC implementation is based on a TDB
implementation.

I am afraid that this double wrapping / unwrapping mechanism will
prevent ARQ from working efficiently at scale on large TDB triple stores.

Is my reading correct? Would it be possible to refactor the API of
the JenaSparqlEngine to pass the native TDB dataset implementation
to the SPARQL engine whenever the store is based on Jena, instead of
always passing the generic Clerezza wrapper?

Cheers,

-- 
Olivier

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Reto Bachmann-Gmuer <re...@trialox.org>.
I was convinced that sparql-fastlane was already an issue, but it doesn't
seem to be.

Thanks for adding the issues.

Reto

On Fri, Mar 18, 2011 at 12:08 PM, Olivier Grisel <og...@nuxeo.com> wrote:

> Hi Reto,
>
> Thanks for your reply,
>
> >> Ok thanks for your reply. Security and locking checks aside, making
> >> the Jena SPARQL engine able to fetch a TdbDataset instead of the
> >> generic TcDataset wrapper would require introducing a new dependency
> >> between rdf.jena.tdb.storage and rdf.jena.sparql one way or another,
> >> maybe by moving the TcDataset class as an interface into
> >> rdf.jena.commons to avoid cyclic or unwanted dependencies.
> >
> > No, there must not be such a dependency. I think the solution should
> > roughly look like this: a TcProvider can implement an additional
> > subinterface (say SparqlTcProvider) that allows SPARQL queries to be
> > passed to it. If a SPARQL query affects only graphs provided by the
> > same SparqlTcProvider, the query is forwarded to that provider;
> > otherwise the current process applies.
>
> Understood, I opened this:
>
>  https://issues.apache.org/jira/browse/CLEREZZA-466
>
> >> Also it would be great to have fine-grained support for default and
> >> named graphs in SPARQL queries, with named graphs mapped to the
> >> Clerezza graph ids used by the TcProvider. In Jena this seems to be
> >> provided by:
> >>
> >>  http://www.openjena.org/wiki/TDB/DynamicDatasets
> >>
> >> I wonder if this is compatible with the current directory structure
> >> implied by the TdbTcProvider#getMGraph implementation.
> >
> > I agree. Possibly changes in tdb.storage are needed.
>
> Ok this improvement is tracked separately here:
>
>  https://issues.apache.org/jira/browse/CLEREZZA-467
>
> I probably won't have time to work on this in the short term, but other
> Stanbol developers might need it sooner than I do. I will tell them to
> submit patches to those two issues if needed.
>
> --
> Olivier
>

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Andy Seaborne <an...@epimorphics.com>.
>> I might be wrong but I think the Jena TDB store will be able to
>> perform queries over multiple named graphs much more efficiently if
>> all the graphs involved in the query belong to the same TDB store
>> rather than going through a generic indirection involving a generic
>> wrapper. It would be worth checking that claim by reading the Jena
>> source code though.

True.

> My assumption is that a query cannot be forwarded to TDB if the
> graphs do not all belong to the same TDB store, so that 466 cannot be
> resolved without the changes described in 467. We will see...

ARQ can also query across a dataset made up of graphs backed by
different storage subsystems. It's not as efficient as having
everything in one TDB store, though.

	Andy

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Reto Bachmann-Gmuer <re...@trialox.org>.
On Fri, Mar 18, 2011 at 2:39 PM, Olivier Grisel <og...@nuxeo.com> wrote:
> On 18 March 2011 14:19, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
>> On Fri, Mar 18, 2011 at 12:31 PM, Olivier Grisel <og...@nuxeo.com> wrote:
>>>
>>> On 18 March 2011 12:22, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
>>> > I was convinced that sparql-fastlane was already an issue, but it doesn't
>>> > seem to be.
>>> >
>>> > Thanks for adding the issues.
>>>
>>> Actually I found one:
>>>
>>>  https://issues.apache.org/jira/browse/CLEREZZA-194
>>>
>>> But for some reason it was closed as "won't fix". I linked it to
>>> CLEREZZA-466 for reference.
>>
>> I understood the issue CLEREZZA-194 as a suggestion to replace the
>> ARQ-based implementation with another one, so basically as providing
>> another implementation of
>> org.apache.clerezza.rdf.core.sparql.QueryEngine. The fastlane approach
>> (mentioned in the comments on issue 194) is different, as it allows
>> support for endpoints that are tied to the storage.
>
> Ok.
>
>> I've added CLEREZZA-468, which comprises the needed changes in SCB core,
>> and marked 466 as depending on it. I'm not sure about CLEREZZA-467: it
>> is currently possible (afaik) to run a query against multiple graphs
>> using FROM and FROM NAMED clauses. With a resolution of CLEREZZA-468
>> and 466, such a query would be forwarded to the tdb.storage provider
>> and would have to be handled there. I'm not sure what the state between
>> the resolution of 466 and 467 is supposed to look like.
>
> I might be wrong but I think the Jena TDB store will be able to
> perform queries over multiple named graphs much more efficiently if
> all the graphs involved in the query belong to the same TDB store
> rather than going through a generic indirection involving a generic
> wrapper. It would be worth checking that claim by reading the Jena
> source code though.

My assumption is that a query cannot be forwarded to TDB if the
graphs do not all belong to the same TDB store, so that 466 cannot be
resolved without the changes described in 467. We will see...

Cheers,
Reto

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Olivier Grisel <og...@nuxeo.com>.
On 18 March 2011 14:19, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
> On Fri, Mar 18, 2011 at 12:31 PM, Olivier Grisel <og...@nuxeo.com> wrote:
>>
>> On 18 March 2011 12:22, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
>> > I was convinced that sparql-fastlane was already an issue, but it doesn't
>> > seem to be.
>> >
>> > Thanks for adding the issues.
>>
>> Actually I found one:
>>
>>  https://issues.apache.org/jira/browse/CLEREZZA-194
>>
>> But for some reason it was closed as "won't fix". I linked it to
>> CLEREZZA-466 for reference.
>
> I understood the issue CLEREZZA-194 as a suggestion to replace the
> ARQ-based implementation with another one, so basically as providing
> another implementation of
> org.apache.clerezza.rdf.core.sparql.QueryEngine. The fastlane approach
> (mentioned in the comments on issue 194) is different, as it allows
> support for endpoints that are tied to the storage.

Ok.

> I've added CLEREZZA-468, which comprises the needed changes in SCB core,
> and marked 466 as depending on it. I'm not sure about CLEREZZA-467: it
> is currently possible (afaik) to run a query against multiple graphs
> using FROM and FROM NAMED clauses. With a resolution of CLEREZZA-468
> and 466, such a query would be forwarded to the tdb.storage provider
> and would have to be handled there. I'm not sure what the state between
> the resolution of 466 and 467 is supposed to look like.

I might be wrong but I think the Jena TDB store will be able to
perform queries over multiple named graphs much more efficiently if
all the graphs involved in the query belong to the same TDB store
rather than going through a generic indirection involving a generic
wrapper. It would be worth checking that claim by reading the Jena
source code though.

-- 
Olivier

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Reto Bachmann-Gmuer <re...@trialox.org>.
On Fri, Mar 18, 2011 at 12:31 PM, Olivier Grisel <og...@nuxeo.com> wrote:
>
> On 18 March 2011 12:22, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
> > I was convinced that sparql-fastlane was already an issue, but it doesn't
> > seem to be.
> >
> > Thanks for adding the issues.
>
> Actually I found one:
>
>  https://issues.apache.org/jira/browse/CLEREZZA-194
>
> But for some reason it was closed as "won't fix". I linked it to
> CLEREZZA-466 for reference.

I understood the issue CLEREZZA-194 as a suggestion to replace the
ARQ-based implementation with another one, so basically as providing
another implementation of
org.apache.clerezza.rdf.core.sparql.QueryEngine. The fastlane approach
(mentioned in the comments on issue 194) is different, as it allows
support for endpoints that are tied to the storage.

I've added CLEREZZA-468, which comprises the needed changes in SCB core,
and marked 466 as depending on it. I'm not sure about CLEREZZA-467: it
is currently possible (afaik) to run a query against multiple graphs
using FROM and FROM NAMED clauses. With a resolution of CLEREZZA-468
and 466, such a query would be forwarded to the tdb.storage provider
and would have to be handled there. I'm not sure what the state between
the resolution of 466 and 467 is supposed to look like.

Reto

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Olivier Grisel <og...@nuxeo.com>.
On 18 March 2011 12:22, Reto Bachmann-Gmuer <re...@trialox.org> wrote:
> I was convinced that sparql-fastlane was already an issue, but it doesn't
> seem to be.
>
> Thanks for adding the issues.

Actually I found one:

  https://issues.apache.org/jira/browse/CLEREZZA-194

But for some reason it was closed as "won't fix". I linked it to
CLEREZZA-466 for reference.

-- 
Olivier

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Olivier Grisel <og...@nuxeo.com>.
Hi Reto,

Thanks for your reply,

>> Ok thanks for your reply. Security and locking checks aside, making
>> the Jena SPARQL engine able to fetch a TdbDataset instead of the
>> generic TcDataset wrapper would require introducing a new dependency
>> between rdf.jena.tdb.storage and rdf.jena.sparql one way or another,
>> maybe by moving the TcDataset class as an interface into
>> rdf.jena.commons to avoid cyclic or unwanted dependencies.
>
> No, there must not be such a dependency. I think the solution should
> roughly look like this: a TcProvider can implement an additional
> subinterface (say SparqlTcProvider) that allows SPARQL queries to be
> passed to it. If a SPARQL query affects only graphs provided by the
> same SparqlTcProvider, the query is forwarded to that provider;
> otherwise the current process applies.

Understood, I opened this:

 https://issues.apache.org/jira/browse/CLEREZZA-466

>> Also it would be great to have fine-grained support for default and
>> named graphs in SPARQL queries, with named graphs mapped to the
>> Clerezza graph ids used by the TcProvider. In Jena this seems to be
>> provided by:
>>
>>  http://www.openjena.org/wiki/TDB/DynamicDatasets
>>
>> I wonder if this is compatible with the current directory structure
>> implied by the TdbTcProvider#getMGraph implementation.
>
> I agree. Possibly changes in tdb.storage are needed.

Ok this improvement is tracked separately here:

 https://issues.apache.org/jira/browse/CLEREZZA-467

I probably won't have time to work on this in the short term, but other
Stanbol developers might need it sooner than I do. I will tell them to
submit patches to those two issues if needed.

-- 
Olivier

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Reto Bachmann-Gmuer <re...@trialox.org>.
On Thu, Mar 17, 2011 at 5:18 PM, Olivier Grisel <og...@nuxeo.com> wrote:

> On 17 March 2011 16:37, Hasan Hasan <ha...@trialox.org> wrote:
> > Hi Olivier
> >
> > Indeed, this "fastlane" support in executing SPARQL queries, where the
> > underlying triple store is Jena, is something that is still on our
> > to-do list. However, if you would like to contribute and provide a
> > patch, that would be really appreciated.
> > Not sure about the effort needed to get the native TDB Dataset
> > encapsulated in the TripleCollection though. There may be some other
> > wrappers in between, which are inserted to provide security and
> > locking mechanisms.
>
> Ok thanks for your reply. Security and locking checks aside, making
> the Jena SPARQL engine able to fetch a TdbDataset instead of the
> generic TcDataset wrapper would require introducing a new dependency
> between rdf.jena.tdb.storage and rdf.jena.sparql one way or another,
> maybe by moving the TcDataset class as an interface into
> rdf.jena.commons to avoid cyclic or unwanted dependencies.
>
No, there must not be such a dependency. I think the solution should roughly
look like this: a TcProvider can implement an additional subinterface (say
SparqlTcProvider) that allows SPARQL queries to be passed to it. If a SPARQL
query affects only graphs provided by the same SparqlTcProvider, the query is
forwarded to that provider; otherwise the current process applies.
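
A minimal sketch of the routing rule described above, to make the idea
concrete. Everything in it is hypothetical: the TcProvider and
SparqlTcProvider interfaces below are simplified stand-ins rather than the
real Clerezza types, graphs are identified by plain strings, and query
results are reduced to a String. It assumes the set of graphs a query
touches has already been extracted from the query.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these types only mimic the idea, they are not the Clerezza API.
class FastlaneSketch {

    /** Stand-in for a TcProvider: all we need here is which graphs it stores. */
    interface TcProvider {
        Set<String> graphNames();
    }

    /** The proposed subinterface: a provider that can answer SPARQL natively. */
    interface SparqlTcProvider extends TcProvider {
        String executeSparql(String query); // result type simplified to a String
    }

    /** Illustrative provider without native SPARQL support. */
    static class PlainProvider implements TcProvider {
        private final Set<String> graphs;

        PlainProvider(String... graphs) {
            this.graphs = new HashSet<>(Arrays.asList(graphs));
        }

        @Override
        public Set<String> graphNames() {
            return graphs;
        }
    }

    /** Illustrative provider with native SPARQL support (for example one backed by TDB). */
    static class NativeProvider extends PlainProvider implements SparqlTcProvider {
        NativeProvider(String... graphs) {
            super(graphs);
        }

        @Override
        public String executeSparql(String query) {
            return "native:" + query;
        }
    }

    /** Forwards a query when one SPARQL-capable provider owns every graph it touches. */
    static class QueryRouter {
        private final List<TcProvider> providers;

        QueryRouter(List<TcProvider> providers) {
            this.providers = providers;
        }

        String execute(String query, Set<String> graphsInQuery) {
            if (!graphsInQuery.isEmpty()) {
                for (TcProvider p : providers) {
                    if (p instanceof SparqlTcProvider
                            && p.graphNames().containsAll(graphsInQuery)) {
                        return ((SparqlTcProvider) p).executeSparql(query);
                    }
                }
            }
            // Otherwise the current process applies: the generic wrapper path.
            return "generic:" + query;
        }
    }

    public static void main(String[] args) {
        QueryRouter router = new QueryRouter(Arrays.asList(
                new NativeProvider("urn:g1", "urn:g2"),
                new PlainProvider("urn:g3")));
        Set<String> sameStore = new HashSet<>(Arrays.asList("urn:g1", "urn:g2"));
        Set<String> mixed = new HashSet<>(Arrays.asList("urn:g1", "urn:g3"));
        System.out.println(router.execute("Q1", sameStore)); // prints "native:Q1"
        System.out.println(router.execute("Q2", mixed));     // prints "generic:Q2"
    }
}
```

The point of the design is that the SPARQL module only ever sees the
subinterface, so it needs no dependency on any particular storage module.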


>
> Also it would be great to have fine-grained support for default and
> named graphs in SPARQL queries, with named graphs mapped to the
> Clerezza graph ids used by the TcProvider. In Jena this seems to be
> provided by:
>
>  http://www.openjena.org/wiki/TDB/DynamicDatasets
>
> I wonder if this is compatible with the current directory structure
> implied by the TdbTcProvider#getMGraph implementation.
>
I agree. Possibly changes in tdb.storage are needed.

Cheers,
Reto

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Olivier Grisel <og...@nuxeo.com>.
On 17 March 2011 16:37, Hasan Hasan <ha...@trialox.org> wrote:
> Hi Olivier
>
> Indeed, this "fastlane" support in executing SPARQL queries, where the
> underlying triple store is Jena, is something that is still on our to-do list.
> However, if you would like to contribute and provide a patch, that would be
> really appreciated.
> Not sure about the effort needed to get the native TDB Dataset encapsulated
> in the TripleCollection though. There may be some other wrappers in between,
> which are inserted to provide security and locking mechanisms.

Ok thanks for your reply. Security and locking checks aside, making
the Jena SPARQL engine able to fetch a TdbDataset instead of the
generic TcDataset wrapper would require introducing a new dependency
between rdf.jena.tdb.storage and rdf.jena.sparql one way or another,
maybe by moving the TcDataset class as an interface into
rdf.jena.commons to avoid cyclic or unwanted dependencies.

Also it would be great to have fine-grained support for default and
named graphs in SPARQL queries, with named graphs mapped to the
Clerezza graph ids used by the TcProvider. In Jena this seems to be
provided by:

  http://www.openjena.org/wiki/TDB/DynamicDatasets

I wonder if this is compatible with the current directory structure
implied by the TdbTcProvider#getMGraph implementation.

-- 
Olivier

Re: A question about Jena Sparql Engine and the TDB Store

Posted by Hasan Hasan <ha...@trialox.org>.
Hi Olivier

Indeed, this "fastlane" support in executing SPARQL queries, where the
underlying triple store is Jena, is something that is still on our to-do list.
However, if you would like to contribute and provide a patch, that would be
really appreciated.
Not sure about the effort needed to get the native TDB Dataset encapsulated
in the TripleCollection though. There may be some other wrappers in between,
which are inserted to provide security and locking mechanisms.

cheers
hasan

On Thu, Mar 17, 2011 at 11:50 AM, Olivier Grisel <og...@nuxeo.com> wrote:

> Hi all,
>
> I am reading the source code of the Jena SPARQL / commons / storage /
> tdb packages of Clerezza and it seems that the Jena SPARQL engine is
> passed a generic Dataset implementation that wraps arbitrary Clerezza
> triple collections, even when the TC implementation is based on a TDB
> implementation.
>
> I am afraid that this double wrapping / unwrapping mechanism will
> prevent ARQ from working efficiently at scale on large TDB triple stores.
>
> Is my reading correct? Would it be possible to refactor the API of
> the JenaSparqlEngine to pass the native TDB dataset implementation
> to the SPARQL engine whenever the store is based on Jena, instead of
> always passing the generic Clerezza wrapper?
>
> Cheers,
>
> --
> Olivier
>