Posted to users@marmotta.apache.org by Nikolaus Rumm <ni...@gmail.com> on 2013/04/10 15:36:38 UTC

Several questions on Marmotta

Hi,

we're currently looking for a scalable cloud-based solution to store and retrieve large quantities of RDF triples. Having met Mr. Günter today he introduced me to Marmotta and I found it quite promising.
Our project (www.fupol.eu) runs under the FP7 umbrella and we have to store social media content using several ontologies (foaf, sioc, dc, ...) in order to analyze and visualize it later.

After reading through your docs I still have some fundamental questions and I'd appreciate any answers to them:
- A single-instance solution won't work for us, for reasons of data protection and scalability (mainly scalability in the number of triples). In order to scale Marmotta, which solutions are recommended? Do you scale just the database behind it, or the whole system? Has anyone ever used Marmotta in the cloud?
- Are there any benchmarks available that we could use to compare Marmotta's triple store performance against other solutions (e.g. Virtuoso)?
- Are there any reports available on the system's performance under heavy load? Is there a known maximum feasible number of triples (i.e. if the system doesn't scale well, response times might increase exponentially)?

Many questions though...

Kind regards
Nikolaus

Sent from my iPad

Re: Several questions on Marmotta

Posted by Sebastian Schaffert <se...@gmail.com>.
Dear Nikolaus,

sorry for the late reply. I am trying to answer your questions below.


2013/4/10 Nikolaus Rumm <ni...@gmail.com>

> Hi,
>
> we're currently looking for a scalable cloud-based solution to store and
> retrieve large quantities of RDF triples. Having met Mr. Günter today he
> introduced me to Marmotta and I found it quite promising.
> Our project (www.fupol.eu) runs under the FP7 umbrella and we have to
> store social media content using several ontologies (foaf, sioc, dc, ...)
> in order to analyze and visualize it later.
>
> After reading through your docs I still have some fundamental questions
> and I'd appreciate any answers to them:
> - A single-instance solution won't work for us, for reasons of data
> protection and scalability (mainly scalability in the number of triples).
> In order to scale Marmotta, which solutions are recommended? Do you scale
> just the database behind it, or the whole system? Has anyone ever used
> Marmotta in the cloud?
>

Marmotta is a Linked Data server, and thus by its very definition "in the
cloud". If you scale the Linked Data way, you can simply store different
resources on different servers. Through Marmotta's transparent Linked Data
caching, remote resources will be available for querying via the Java API
or via the LDPath query language, though unfortunately not via SPARQL,
because SPARQL is not really a Linked Data query language (some of its
operations require full knowledge of all resources).
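
The caching behaviour described above can be pictured with a small sketch. This is a toy model in Python, not Marmotta's actual implementation; the class and function names here are purely illustrative. The idea is simply that a resource hosted on another server is fetched once and answered from the local cache on subsequent accesses:

```python
# Toy model of transparent Linked Data caching: remote resources are
# fetched on first access and served from a local cache afterwards.
# Illustrative sketch only, not Marmotta's real API.

class LinkedDataCache:
    def __init__(self, fetcher):
        self._fetch = fetcher          # callable: uri -> list of triples
        self._cache = {}               # uri -> cached triples
        self.remote_fetches = 0        # count of actual remote round-trips

    def triples_for(self, uri):
        if uri not in self._cache:     # cache miss: go to the remote server
            self._cache[uri] = self._fetch(uri)
            self.remote_fetches += 1
        return self._cache[uri]

# Stub "remote web": two resources hosted on two different servers.
REMOTE = {
    "http://a.example/res/1": [("http://a.example/res/1", "foaf:name", "Alice")],
    "http://b.example/res/2": [("http://b.example/res/2", "foaf:name", "Bob")],
}

cache = LinkedDataCache(REMOTE.__getitem__)
cache.triples_for("http://a.example/res/1")   # first access: remote fetch
cache.triples_for("http://a.example/res/1")   # second access: served locally
print(cache.remote_fetches)                   # -> 1
```

Because each server remains authoritative only for its own resources, adding servers scales the data horizontally while queries through the cache still see a single merged view.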

We have scaled single instances of Marmotta in our tests to up to 150
million triples without problems. More should be possible as well. Note
that, since the current backend is based on a relational database, the
import performance is not very high (importing the 150 million triples
takes us around 2.5 hours). The main reason is that the relational
database needs to ensure transaction isolation, data consistency, and
parallel access. We are working on a fast import option for the next
version. Querying performance should be acceptable, though.
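
For a rough sense of the import rate those figures imply, the arithmetic is straightforward:

```python
# Import throughput implied by the figures above:
# 150 million triples in about 2.5 hours.
triples = 150_000_000
seconds = 2.5 * 3600           # 2.5 hours in seconds
rate = triples / seconds       # triples per second
print(round(rate))             # -> 16667
```

So the relational backend sustains on the order of 17,000 triples per second on import, which is the baseline the planned fast-import option would need to beat.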

Another option (in the next version) will be to configure Marmotta to use
a different storage backend (e.g. BigData or Jena TDB). This might give
higher performance, but with restricted features (i.e. no versioning and
no or different reasoning).



> - Are there any benchmarks available that we could use to compare
> Marmotta's triple store performance against other solutions (e.g. Virtuoso)?
>

No, not yet. What would you be interested in specifically?


> - Are there any reports available on the system's performance under heavy
> load? Is there a known maximum feasible number of triples (i.e. if the
> system doesn't scale well, response times might increase exponentially)?
>


Response times stay more or less constant even under heavy load. I have
done some tests on a workstation (8 cores, 24 GB RAM, SSD). On a dataset
with 150 million triples and 15 million resources (GeoNames), this server
handles about 460 Linked Data requests per second with 40 parallel random
requests. The test ran in a loop for around an hour with constant load
and performance, and all 8 cores were at around 80% load.
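
Those throughput figures also imply a mean per-request latency. By Little's law (L = X * R, so R = L / X), 40 concurrent requests at 460 requests per second work out to:

```python
# Mean response time implied by Little's law: R = L / X,
# where L is concurrency and X is throughput.
concurrency = 40          # parallel random requests
throughput = 460          # Linked Data requests per second
latency_ms = concurrency / throughput * 1000
print(round(latency_ms))  # -> 87 (milliseconds per request, on average)
```

About 87 ms average response time under sustained load, consistent with the "more or less constant" behaviour described above.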

I have not done any SPARQL tests yet. SPARQL performance depends heavily
on the kinds of queries you execute. Since we are currently using the
API-based SPARQL implementation of Sesame, certain queries might be very
slow (because they need to iterate over all triples in memory instead of
being executed on the database). On the other hand, queries with
restrictive triple patterns (e.g. a fixed subject or object) will run very
fast. We have an open task to optimise certain common SPARQL queries,
though.
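
The point about restrictive triple patterns can be illustrated with a toy store (a sketch of the general indexing idea, not of Sesame's or Marmotta's internals): when a term such as the subject is bound, the pattern resolves to an index lookup over a handful of triples, whereas an unrestricted pattern must scan the whole store.

```python
from collections import defaultdict

# Toy triple store with a subject index. A fixed-subject pattern is a
# dictionary lookup; a pattern with no bound subject falls back to a
# full scan over all triples. Illustrative only.
class ToyStore:
    def __init__(self):
        self.triples = []
        self.by_subject = defaultdict(list)

    def add(self, s, p, o):
        self.triples.append((s, p, o))
        self.by_subject[s].append((s, p, o))

    def match(self, s=None, p=None, o=None):
        # Use the subject index when the subject is bound, else scan everything.
        candidates = self.by_subject[s] if s is not None else self.triples
        return [t for t in candidates
                if (p is None or t[1] == p) and (o is None or t[2] == o)]

store = ToyStore()
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:bob", "foaf:name", "Bob")
print(store.match(s="ex:alice"))   # -> [('ex:alice', 'foaf:knows', 'ex:bob')]
```

A real store keeps several such indexes (by subject, predicate, object and combinations), which is why queries that bind at least one term stay fast while fully unbound patterns degrade with the size of the dataset.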

Greetings,

Sebastian