Posted to users@jena.apache.org by Niels Andersen <ni...@thinkiq.com> on 2016/11/13 06:59:37 UTC

How do I do a join between multiple model.listStatments calls?

Dear user community,

Our current approach to joining multiple model.listStatements (with SimpleSelector) calls is to take the content of the iterators returned and add them to separate HashSets and then use functions such as retainAll to find the intersection between the two sets.

This works relatively well when model.listStatements returns a small to medium number of statements.

My problem is that this seems to be a very inefficient way of joining two sets of data that are already ordered in TDB. I assume that there must be a better way to do this. I have searched the web, but all uses of listStatements that I have found are very simple.

I have also not found an effective way to do filtering (for instance, a literal less than 5) without comparing every statement that listStatements returns.
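
For reference, the approach described above can be sketched in plain Java. This is only an illustration under assumptions: the Triple class, the method names and the sample data are hypothetical stand-ins, and no Jena classes are used. It shows the two costs in question: both result sets are fully materialized before retainAll runs, and the filter inspects every statement.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for the approach described above; no Jena classes are used.
class HashSetJoinSketch {

    // Hypothetical stand-in for a statement: subject, predicate, integer-valued object.
    static final class Triple {
        final String subject;
        final String predicate;
        final int value;

        Triple(String subject, String predicate, int value) {
            this.subject = subject;
            this.predicate = predicate;
            this.value = value;
        }
    }

    // Stand-in for a listStatements call plus a client-side filter:
    // every triple is inspected, and the subjects of the matches are copied into a set.
    static Set<String> subjectsWhere(List<Triple> data, String predicate, int lessThan) {
        Set<String> subjects = new HashSet<>();
        for (Triple t : data) {
            if (t.predicate.equals(predicate) && t.value < lessThan) {
                subjects.add(t.subject);
            }
        }
        return subjects;
    }

    // The join: materialize both result sets, then intersect them on the subject.
    static Set<String> join(List<Triple> data, String p1, int max1, String p2, int max2) {
        Set<String> left = subjectsWhere(data, p1, max1);
        Set<String> right = subjectsWhere(data, p2, max2);
        left.retainAll(right); // keeps only the subjects present in both sets
        return left;
    }

    public static void main(String[] args) {
        List<Triple> data = Arrays.asList(
            new Triple("ex:a", "ex:size", 3),
            new Triple("ex:b", "ex:size", 9),
            new Triple("ex:a", "ex:weight", 4),
            new Triple("ex:c", "ex:weight", 2));
        // Subjects with ex:size < 5 and ex:weight < 5
        System.out.println(join(data, "ex:size", 5, "ex:weight", 5));
    }
}
```

With this sample data only ex:a satisfies both conditions, so the intersection contains a single subject.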

My questions are:

* What is the recommended way to do a join between two lists of statements?

* What is the recommended way to implement filtering?

* Is there anything other than SimpleSelector? Are there any advanced selectors?

Thanks in advance,
Niels





Re: How do I do a join between multiple model.listStatments calls?

Posted by Martynas Jusevičius <ma...@graphity.org>.
Why not SPARQL FILTER?
https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#expressions


Re: How do I do a join between multiple model.listStatments calls?

Posted by Andy Seaborne <an...@apache.org>.
ARQ is either as fast at joins as listStatements (because it uses
the underlying Graph.find that backs listStatements) or faster, because
it avoids churning a lot of unnecessary bytes.

As many NoSQL applications have discovered, reinventing joins on the
client side results in a lot of data transfer from data storage to the client.

	Andy
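
The data-transfer point can be pictured with a toy model. The sketch below is plain Java with hypothetical names (TransferSketch, clientSideJoin, storeSideJoin); it is not Jena or ARQ code and makes no claim about how ARQ is implemented. It only counts how many items cross the boundary between the "store" and the "client" when the same join is done on either side.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model, not Jena code: counts the items that cross the store/client boundary.
class TransferSketch {

    // predicate -> subjects carrying that predicate; stands in for an indexed store
    private final Map<String, Set<String>> index = new HashMap<>();

    // number of items handed from the "store" to the "client" so far
    int transferred = 0;

    void add(String subject, String predicate) {
        index.computeIfAbsent(predicate, k -> new HashSet<>()).add(subject);
    }

    private Set<String> subjectsOf(String predicate) {
        return index.getOrDefault(predicate, Collections.emptySet());
    }

    // Client-side join: copy both full result sets out of the store, then intersect.
    Set<String> clientSideJoin(String p1, String p2) {
        Set<String> left = new HashSet<>(subjectsOf(p1));
        Set<String> right = new HashSet<>(subjectsOf(p2));
        transferred += left.size() + right.size();
        left.retainAll(right);
        return left;
    }

    // Store-side join: intersect next to the data and hand over only the answer.
    Set<String> storeSideJoin(String p1, String p2) {
        Set<String> result = new HashSet<>();
        for (String s : subjectsOf(p1)) {
            if (subjectsOf(p2).contains(s)) {
                result.add(s);
            }
        }
        transferred += result.size();
        return result;
    }

    public static void main(String[] args) {
        TransferSketch store = new TransferSketch();
        for (int i = 0; i < 1000; i++) {
            store.add("ex:s" + i, "ex:type");
        }
        for (int i = 0; i < 3; i++) {
            store.add("ex:s" + i, "ex:flag");
        }

        store.clientSideJoin("ex:type", "ex:flag");
        System.out.println("client-side join transferred " + store.transferred);

        store.transferred = 0;
        store.storeSideJoin("ex:type", "ex:flag");
        System.out.println("store-side join transferred " + store.transferred);
    }
}
```

Both methods return the same three subjects; in this model the client-side variant moves 1003 items to get there and the store-side variant moves 3.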

Re: How do I do a join between multiple model.listStatments calls?

Posted by Claude Warren <cl...@xenei.com>.
In response to #5: Are there examples of large scale solutions built on
Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference
architectures?

I'm not sure this qualifies, but the Granatum project used SPARQL to query
the data.  It also used the Jena API in several other places, most notably
to track which endpoints were up/down and their response time.  This is a
tacit acknowledgement of your point "Public SPARQL end-points are
notoriously bad."  The project implemented a preprocessor to the Jena
Query Engine that distributed the queries across multiple endpoints to pick
up the necessary data to answer the query.  The entire user/researcher
front end was SPARQL.

https://aran.library.nuigalway.ie/xmlui/bitstream/handle/10379/4845/Linked_Biomedical_Dataspace_-_Lessons_Learned_integrating_Data_for_Drug_Discovery_%28Final%29.pdf?sequence=1

Claude
-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

RE: How do I do a join between multiple model.listStatments calls?

Posted by Niels Andersen <ni...@thinkiq.com>.
Claude and Martynas,



Thank you for your quick response.



We are aware that SPARQL provides join and filtering capabilities; it is, however, important to be reminded that it exists and not to get stuck on a single implementation track. Thanks for reminding us.

I apologize that this email is longer than I intended and may seem like a rant against SPARQL. I don’t want to anger anyone with this email; it is a summary of what I have observed and believe to be facts. I want to make it very clear: “Nothing would make me happier than to be proven wrong; I would love to see SPARQL work in large web-scale applications”. To people who choose to reply to this email: keep in mind that I only care about scientific and provable facts; I do not care about opinions.

We started with SPARQL as our core language. In the beginning it looked very promising. For the following reasons, we now look at SPARQL as an add-on capability, not the core capability:

1. The performance was poor for advanced queries.
   a. It made us question whether Jena SPARQL is only viable for simple queries.
   b. In particular, queries that returned a large dataset in JSON would take a long time to return data.

2. SPARQL does not appear to be a mature language:
   a. Compared to SQL: There is no concept of views, functions or procedures. This is particularly a problem as triple stores have weak schema capabilities and the schema must be enforced in the application that interacts with the data.
   b. Poor subquery capabilities and performance. No procedural multi-statement capabilities. For instance, it is not possible to do the equivalent of SQL selecting into a temporary table in one statement and using this temporary table in a subsequent query.
   c. How do I take the result set of one query and pass it to the next query? Do I have to use CONSTRUCT to insert this relationship into the model and re-use it?

3. All SPARQL examples used in documentation are very simple.
   a. Again, this made us question whether SPARQL was fit for more advanced queries.

4. Jena ARQ is limited by the capabilities of the underlying technology.
   a. If the underlying technology is incapable of doing an effective join, then a system put on top of it will be equally ineffective at doing the same. The fact that the SPARQL language provides join capabilities does not mean that Jena provides an effective implementation of this language.

5. All large-scale Jena implementations seem to use the Jena API instead of Jena ARQ.
   a. Again, this made us question the capabilities and maturity of both SPARQL and Jena ARQ.
   b. SPARQL seemed to be a dead end, only suitable for small solutions and demonstrations.
   c. In particular, Jena Fuseki is referred to as only fit for smaller solutions.

6. There seemed to be a lack of good query optimizers.
   a. Even simple things such as changing the order of triples in the WHERE clause would lead to significantly different performance.

7. Public SPARQL end-points are notoriously bad.
   a. They are constantly down.
   b. Queries are slow.
   c. Queries are often limited to simple triple sets.
   d. Some queries would not return and would even crash or overload the server.

8. Poor SPARQL documentation:
   a. The W3C documentation is hard to read and hard to understand. Combine this with the W3C RDF, OWL and OWL2 documentation and you will see a real issue.
   b. The more accessible documentation is shallow and incomplete. Only simple SPARQL queries are shown.
   c. There are no really good sources of best practice and application examples. Some of them even contradict each other.
   d. It seems like there are a lot of good intentions when people start using SPARQL, but they all end up being dead ends.
   e. A lot of the documentation seems to be “old”, written in 2008/2009 and not updated since.
   f. The biggest red flag is the number of broken links to SPARQL, RDF and OWL documentation on the web.

9. SPARQL can only return rectangular data:
   a. This is the same limitation as SQL, but in SQL I can create a procedure that will return multiple datasets with common keys.
   b. Rectangular datasets cause duplicate data and loss of structure.

10. Building SPARQL strings to send to the server is not an effective way to deal with queries.
   a. This is probably more of an opinion than a fact. Excuse me for putting it in the list.

11. The lack of adoption of RDF stores compared to other data stores:
   a. Not sure how scientific this chart is, but even if it is off by a factor of 10 it shows a big difference: http://db-engines.com/en/ranking_trend/system/Jena%3BMicrosoft+SQL+Server%3BMongoDB%3BMySQL%3BNeo4j
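
To make point 6a concrete, here is a toy model of why pattern order can matter when two patterns are evaluated strictly left to right. It is a sketch under assumptions, in plain Java with hypothetical names (PatternOrderSketch, joinInOrder); it is not ARQ or TDB code and says nothing about what their optimizers actually do. The first pattern is scanned and the second is probed once per intermediate binding, so starting with the more selective pattern means fewer probes for the same answer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model, not ARQ or TDB code: shows how the order of two patterns changes the work.
class PatternOrderSketch {

    // predicate -> subjects carrying that predicate; stands in for an index
    private final Map<String, Set<String>> index = new HashMap<>();

    // number of index probes made by the most recent join
    int probes = 0;

    void add(String subject, String predicate) {
        index.computeIfAbsent(predicate, k -> new HashSet<>()).add(subject);
    }

    // Evaluates "?s <first> ?x . ?s <second> ?y" strictly left to right:
    // scan the matches of the first pattern, then probe the second pattern
    // once per intermediate binding of ?s.
    Set<String> joinInOrder(String first, String second) {
        probes = 0;
        Set<String> result = new HashSet<>();
        Set<String> secondMatches = index.getOrDefault(second, Collections.emptySet());
        for (String s : index.getOrDefault(first, Collections.emptySet())) {
            probes++;
            if (secondMatches.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PatternOrderSketch store = new PatternOrderSketch();
        for (int i = 0; i < 1000; i++) {
            store.add("ex:s" + i, "ex:type"); // unselective pattern: 1000 matches
        }
        for (int i = 0; i < 3; i++) {
            store.add("ex:s" + i, "ex:flag"); // selective pattern: 3 matches
        }

        store.joinInOrder("ex:type", "ex:flag");
        System.out.println("unselective pattern first: " + store.probes + " probes");

        store.joinInOrder("ex:flag", "ex:type");
        System.out.println("selective pattern first: " + store.probes + " probes");
    }
}
```

In this model both orders return the same three subjects, with 1000 probes in one order and 3 in the other.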



We did not give up, and dug into the problems to find solutions. We observed that some of the query complexity could be reduced by using SPARQL CONSTRUCT statements or Jena inference rules to pre-create relationships that users might want to query on. This provided much faster queries, but made the underlying model murkier, with significant duplication of data. The proliferation of the vocabulary (predicates/properties) became a concern. Having to use CONSTRUCT statements and rules to “pre-answer” complex questions also contradicts the primary reason to use a triple store in the first place: “we wanted a data store that could answer the questions that no one had thought about”.



While SPARQL seems to promise to do what we want, the reality is that we have been unable to apply it in a way that delivers what we want. I am aware that this might be a failure of understanding how to use SPARQL.



So, please help us understand the following:

1. Are our observations correct? Please prove or disprove each point; it would make me happy to see that I am wrong.

2. Are these issues resolved in the latest Jena and Jena Fuseki implementations? I see that there are comments about faster SPARQL queries in the latest release. Is there any documentation showing what was done to improve it?

3. Are we using SPARQL incorrectly? How should we use it?

4. Is there documentation available that we do not know about? Please point us to the really good documentation. (We have read the positively rated books on the subjects as well as every website that refers to Jena, SPARQL, RDF, OWL or the Semantic Web within the first page of a Google search.)

5. Are there examples of large-scale solutions built on Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference architectures?

6. How can it be that the Jena API cannot do an effective join? Is SPARQL based on this API? Is there another API available to effectively get to the data?

7. How is ARQ implemented? Does it use the indexed data in Jena TDB? How does it handle indexes in subqueries?

Looking forward to hearing from you again.



Best regards,

Niels










Re: How do I do a join between multiple model.listStatments calls?

Posted by Claude Warren <cl...@xenei.com>.
Niels,

SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to
join the triples of different statements and can be called from within your
Java code (http://jena.apache.org/documentation/query/index.html).

As noted previously, using a filter should do the trick.  There is
documentation on how to write your own filter if you need to, but you may
find that your filter requirements are already met by existing filters.

Claude
