Posted to users@jena.apache.org by Alexandra Kokkinaki <al...@gmail.com> on 2016/04/12 15:39:47 UTC
Re: Fuseki server: many data services or many fuseki installations?
Hi Andy, thanks for your answers. So would it be feasible to add/delete
triples in an existing database?
Thanks,
Alexandra
On Tue, Mar 29, 2016 at 9:58 AM, Andy Seaborne <an...@apache.org> wrote:
> On 21/03/16 13:35, Alexandra Kokkinaki wrote:
>
>> Hi Andy, thanks for your answers.
>>
>>
>> On Fri, Mar 18, 2016 at 11:43 AM, Andy Seaborne <an...@apache.org> wrote:
>>
>> Hi,
>>>
>>> it will depend on usage patterns. 2* 500 million isn't unreasonable but
>>> validating with your expected usage is essential.
>>> The critical factors are the usage patterns and the hardware available.
>>> Number of queries, query complexity, number of updates, all matter. RAM
>>> is
>>> good (which is true for any database) as are SSDs if you do lots of
>>> update
>>> or need fast startup from cold.
>>>
>>> What kind of usage patterns are considered not valid for big triple
>> stores?
>> We are planning to use our Fuseki server to allow machine-to-machine
>> communication and also allow independent users to express mostly spatial
>> queries. We plan to do indexing and have a query timeout too. Is that
>> enough to address performance issues?
>>
>
> They are a good idea. It will protect the server.
>
> It is possible to write SPARQL queries which are fundamentally expensive.
>
> The TDB will need to get updated daily, using the Jena API, since I suppose
>> deleting and inserting everything back would take a long time. I read in (
>>
>> https://lists.w3.org/Archives/Public/public-sparql-dev/2008JulSep/0029.html
>> ) that it takes 5370 seconds for 100M triples to be loaded in TDB, which is
>> good.
>> But here <https://www.w3.org/wiki/LargeTripleStores> it is said that it
>> took 36 hours to load 1.7B triples in TDB
>>
>
> ... in 2008 ... with a spinning disk.
>
> 12k triples/s would be a bit slow nowadays.
>
> At large scale tdbloader2 can be faster than tdbloader. You have to try
> with your data on your hardware - it isn't a simple yes/no question
> unfortunately.
>
> tdbloader2 only loads from empty.
>
> tdbloader does not do anything special when loading a partial database.
>
> , which drives me towards the
>> daily updates rather than daily delete and insert.
>> How long would a 500 triple DB take to be loaded in an empty database?
>>
>
> 500M?
>
> Just run
>
> tdbloader --loc=DB <the_data>
>
> and see what rate you get - I'd be interested in
> seeing the log. Every data set, every hardware set can be different.
> That's why it is hard to make any accurate predictions - just try it.
>
> The pattern of the data makes a difference - LUBM loads very fast as it
> has a high triples-to-nodes ratio, so fewer bytes are being loaded. All
> triple stores report better figures on that data - a factor of x2 faster is
> common - but it's not typical data.
>
> Andy
>
>
> Multiple requests, whether same service or different service, are
>>> competing for the same machine resources. Fuseki runs requests
>>> independently and in parallel. There are per-database transactions
>>> supporting multiple, truly parallel readers.
>>>
>>>
>> Andy
>>>
>>
>>
>> Many thanks,
>>
>> Alexandra
>>
>>
>>>
>>> On 18/03/16 09:35, Alexandra Kokkinaki wrote:
>>>
>>> Hi,
>>>>
>>>> after researching on TDB performance with big data, I would still
>>>> like to know:
>>>> We have one Fuseki server exposing 2 SPARQL endpoints (2 million
>>>> triples each) as data services. We are planning to add one more, but
>>>> with big data (500 million triples):
>>>>
>>>> - For big data, is it better to use many installations of the Fuseki
>>>>   server, or
>>>> - many data services under the same Fuseki server?
>>>>
>>>> Could Fuseki cope with two or more services with more than 500 million
>>>> triples each?
>>>>
>>>>
>>>
>>>
>>> How does Fuseki cope when it has to serve concurrent queries to the
>>>> different data services?
>>>>
>>>>
>>>
>>>
>>> Many thanks,
>>>>
>>>> Alexandra
>>>>
>>>>
>>>>
>>>
>>
>
Re: Chained Properties
Posted by Dave Reynolds <da...@gmail.com>.
On 15/04/16 12:12, Abduladem Eljamel wrote:
> Hi,
> Does the reasoning engine in Jena support the property chain rules of OWL2?
> Thanks,
> Abdul
No, there's no built-in support for OWL2.
Dave
Chained Properties
Posted by Abduladem Eljamel <a_...@yahoo.co.uk.INVALID>.
Hi,
Does the reasoning engine in Jena support the property chain rules of OWL2?
Thanks,
Abdul
Re: Reasoning On Graphs
Posted by Dave Reynolds <da...@gmail.com>.
On 14/04/16 15:17, Abduladem Eljamel wrote:
> Hi,
> I have a TDB dataset containing several named graphs and an ontology file. I would like to apply the same reasoning to all graphs in the dataset.
The Jena reasoner is not dataset-aware; it only works over models.
The reasoner also only works in memory: you can run it over a TDB-backed
model, but the results will be stored in memory and will not persist
unless you explicitly write them out to a persisted model.
> I tried this code to apply reasoning on one named graph.
>
> OntModel ontmodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM, null);
> File filePath = new File(OntologyPath);
> URL filePathURL = filePath.toURI().toURL();
> ontmodel.read(filePathURL.toString());
>
> Dataset dataset = TDBFactory.createDataset(datasetLocation);
> dataset.begin(ReadWrite.WRITE) ;
No need for this, you are not writing to the dataset here.
> Model model = dataset.getNamedModel(graphName);
>
> Reasoner reasoner = ReasonerRegistry.getOWLMicroReasoner();
> reasoner= reasoner.bindSchema(ontmodel);
> InfModel infmodel = ModelFactory.createInfModel(reasoner, model);
This will work but, as noted above, the infmodel is in memory and it has
to do quite a bit of work to load the data from TDB. It can be more
efficient to load the TDB model into an in-memory model first and
create the infmodel over that.
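Dave's suggestion can be sketched roughly as follows. This is a sketch, not Dave's code: it assumes the org.apache.jena package layout of Jena 3.x and the TDBFactory API used elsewhere in this thread, and the class and method names are made up for illustration.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.Reasoner;
import org.apache.jena.reasoner.ReasonerRegistry;
import org.apache.jena.tdb.TDBFactory;

public class InMemoryInference {
    // Copy the TDB-backed named graph into a plain in-memory model first,
    // then bind the reasoner over the copy, so inference does not
    // repeatedly page data from disk.
    public static InfModel inferOver(String location, String graphName, Model schema) {
        Dataset dataset = TDBFactory.createDataset(location);
        dataset.begin(ReadWrite.READ);   // reading only, so a read transaction is enough
        Model memModel = ModelFactory.createDefaultModel();
        try {
            memModel.add(dataset.getNamedModel(graphName));
        } finally {
            dataset.end();
        }
        Reasoner reasoner = ReasonerRegistry.getOWLMicroReasoner().bindSchema(schema);
        // The inference closure is held in memory and is not persisted.
        return ModelFactory.createInfModel(reasoner, memModel);
    }
}
```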
> Should I apply the reasoning on the graphs one by one? Is there any method to apply the reasoning on all graphs and they stay separate as they are? Is there any method which can be used as for the default graphs below?
>
> Model model = dataset.getDefaultModel();
Not sure I follow the question.
If you want to be able to access each of the inference closures of each
of the source graphs separately then you will need an InfModel for each.
If you want to treat the set of graphs as one big graph, and create a
single inference closure over the union model, then do that. You can
either configure TDB to make the default graph be the union of all the
named graphs, or you can use the TDB-specific name for the union graph,
urn:x-arq:UnionGraph.
Dave
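The union-default-graph option Dave mentions can be switched on in a TDB assembler description. A minimal fragment might look like this (a sketch assuming the standard tdb: assembler vocabulary; the dataset name and location are placeholders):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .

# Make the dataset's default graph the union of all named graphs,
# so queries over the default model see every graph at once.
<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    tdb:unionDefaultGraph true .
```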
Reasoning On Graphs
Posted by Abduladem Eljamel <a_...@yahoo.co.uk.INVALID>.
Hi,
I have a TDB dataset containing several named graphs and an ontology file. I would like to apply the same reasoning to all graphs in the dataset.
I tried this code to apply reasoning on one named graph.
OntModel ontmodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM, null);
File filePath = new File(OntologyPath);
URL filePathURL = filePath.toURI().toURL();
ontmodel.read(filePathURL.toString());
Dataset dataset = TDBFactory.createDataset(datasetLocation);
dataset.begin(ReadWrite.WRITE) ;
Model model = dataset.getNamedModel(graphName);
Reasoner reasoner = ReasonerRegistry.getOWLMicroReasoner();
reasoner= reasoner.bindSchema(ontmodel);
InfModel infmodel = ModelFactory.createInfModel(reasoner, model);
Should I apply the reasoning on the graphs one by one? Is there any method to apply the reasoning on all graphs and they stay separate as they are? Is there any method which can be used as for the default graphs below?
Model model = dataset.getDefaultModel();
Thanks
Abdul
Re: Fuseki server: many data services or many fuseki installations?
Posted by Andy Seaborne <an...@apache.org>.
On 14/04/16 13:43, Alexandra Kokkinaki wrote:
> Thanks Andy, so you are suggesting to break the updates (deletions and
> insertions) into smaller requests to avoid any memory issues. I suppose that
> we will make daily updates so that our triple store is up to date, which
> will probably result in some hundreds of triples.
Hundreds - no worries.
Just keep the batch size of a single request below a few million (and
tweak the heap up a bit if you are doing million+ updates at one time).
Do not increase the heap beyond that, regardless: TDB uses non-heap
space, so more heap is slower.
> I have seen the commit command but not the rollback. Is there any safety
> net if something goes wrong during the update procedure to rollback?
> Are there any nice examples about "TDB transactions" that I could start
> looking at?
In Fuseki, one HTTP request = one transaction. It aborts if the update
fails (e.g. bad request, syntax error).
What there isn't yet is begin-request-request-request-commit.
If you are dealing with hundreds of triples, the way to approach it is to
build up the set of all the changes you want, build one SPARQL update, and do
it all at once. It will all happen, or none of it will.
You cannot update a database with the TDB tools while Fuseki is using it.
Andy
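Andy's advice - batch the changes so that no single request (= one transaction) grows too large - can be sketched in plain Java. The helper below only assembles the request strings; `buildBatches` is a hypothetical helper (not a Jena API), and the idea of POSTing each string to the dataset's update endpoint is an assumption on top of the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateBatcher {

    // Build one "INSERT DATA { ... }" request per batch of triples.
    // Each input triple is expected to be an N-Triples-style line
    // without the trailing dot, e.g. "<urn:s> <urn:p> <urn:o>".
    public static List<String> buildBatches(List<String> triples, int batchSize) {
        List<String> requests = new ArrayList<>();
        for (int i = 0; i < triples.size(); i += batchSize) {
            StringBuilder sb = new StringBuilder("INSERT DATA {\n");
            for (String t : triples.subList(i, Math.min(i + batchSize, triples.size()))) {
                sb.append("  ").append(t).append(" .\n");
            }
            sb.append("}");
            requests.add(sb.toString());
        }
        return requests;
    }
}
```

Each returned string would then be sent as its own HTTP request, so each batch is its own transaction and a failure aborts only that batch, not the whole day's update.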
Re: Fuseki server: many data services or many fuseki installations?
Posted by Alexandra Kokkinaki <al...@gmail.com>.
Thanks Andy, so you are suggesting to break the updates (deletions and
insertions) into smaller requests to avoid any memory issues. I suppose that
we will make daily updates so that our triple store is up to date, which
will probably result in some hundreds of triples.
I have seen the commit command but not the rollback. Is there any safety
net if something goes wrong during the update procedure to rollback?
Are there any nice examples about "TDB transactions" that I could start
looking at?
Many thanks
Alexandra
On Thu, Apr 14, 2016 at 12:32 PM, Andy Seaborne <an...@apache.org> wrote:
> On 12/04/16 14:39, Alexandra Kokkinaki wrote:
>
>> Hi Andy, thanks for your answers. So would it be feasible to add/delete
>> triples in an existing database?
>>
>
> Updates are supported.
>
> However, changing large amounts (deleting or adding or a mix) - tens of
> millions of triples - in a single transaction (single HTTP request) will
> consume too much memory. Such a large change would need to be broken up
> into multiple requests.
>
> Andy
Re: Fuseki server: many data services or many fuseki installations?
Posted by Andy Seaborne <an...@apache.org>.
On 12/04/16 14:39, Alexandra Kokkinaki wrote:
> Hi Andy, thanks for your answers. So would it be feasible to add/delete
> triples in an existing database?
Updates are supported.
However, changing large amounts (deleting or adding or a mix) - tens of
millions of triples - in a single transaction (single HTTP request) will
consume too much memory. Such a large change would need to be broken up
into multiple requests.
Andy