Posted to users@jena.apache.org by "Dimov, Stefan" <st...@sap.com> on 2017/12/22 20:37:14 UTC

Operational issues with TDB

Hi all,

We have a project that we’re trying to productize, and we’re facing certain operational issues with large files, especially with copying and maintaining them on the production cloud hardware (application nodes).

Did anybody have similar issues? How did you resolve them?

I would appreciate it if someone could share their experience/problems/solutions.

Regards,
Stefan

Re: performance measures

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
Thank you for the good advice!

The argument is to show that triple stores are fast enough for 
linguistic applications. Five years ago a comparison was published in 
which a proprietary data structure excelled; I would like to show that 
triple stores are fast enough today. I can perhaps get the same dataset 
and the same queries (at the application level), but I have no idea how 
caching was accounted for; it seems that results differed between runs.

I guess I could use some warmup queries, similar in kind to the 
application queries used for the test, then run the test queries and 
compare with the previously published response times. If the response 
time is of the same order of magnitude as before, that would show that 
a triple store is fast enough.

Does this sound "good enough"?
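
As a rough illustration of that warmup-then-measure approach, here is a 
minimal Java sketch against a Fuseki SPARQL endpoint; the service URL and 
the query string are placeholders, and the real runs would use the 
application-level test queries:

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;

    public class WarmupBenchmark {
        public static void main(String[] args) {
            String service = "http://localhost:3030/ds/sparql";          // hypothetical endpoint
            String query = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"; // placeholder test query

            // Warm-up runs: fill OS and TDB caches; results are discarded.
            for (int i = 0; i < 5; i++) {
                try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, query)) {
                    ResultSetFormatter.consume(qe.execSelect());
                }
            }
            // Timed runs: these are the numbers to compare with the earlier study.
            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();
                try (QueryExecution qe = QueryExecutionFactory.sparqlService(service, query)) {
                    ResultSetFormatter.consume(qe.execSelect());
                }
                System.out.printf("run %d: %.1f ms%n", i, (System.nanoTime() - start) / 1e6);
            }
        }
    }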


On 12/24/2017 01:24 PM, ajs6f wrote:
>> Any measurements would be unreliable at best and probably worthless.
>> 1/ Different data gives different answers to queries.
>> 2/ Caching matters a lot for databases and a different setup will cache differently.
> This is so true, and it's not even a complete list. It might be better to approach the problem from the application layer. Are you able to put together a good suite of test data, queries, and updates, accompanied by a good understanding of the kinds of load the triplestore will experience in production?
>
> Adam Soroka
>
>> On Dec 24, 2017, at 1:21 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>> On 24/12/17 14:11, Andrew U. Frank wrote:
>>> thank you for the information; i take that using the indexes  a one-variable query would be (close to) linear in the amount of triples found. i saw that TDB does build indexes and assumed they use hashes.
>>> i have still the following questions:
>>> 1. is performance different for a named or the default graph?
>> Query performance is approximately the same for GRAPH.
>> Update is slower.
>>
>>> 2. can i simplify measurements with putting pieces of the dataset in different graphs and then add more or less of these graphs to take a measure? say i have 5 named graphs, each with 10 million triples, do queries over 2, 3, 4 and 5 graphs give the same (or very similar) results than when i would load 20, 30, 40 and 50 million triples in a single named graph?
>> Any measurements would be unreliable at best and probably worthless.
>>
>> 1/ Different data gives different answers to queries.
>>
>> 2/ Caching matters a lot for databases and a different setup will cache differently.
>>
>>     Andy
>>
>>> thank you for help!
>>> andrew
>>> On 12/23/2017 06:20 AM, ajs6f wrote:
>>>> For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on quads to ensure that all one-variable queries (i.e. for triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node directory that TDB does.
>>>>
>>>> There are lots of other ways to play that out, according to the balance of times costs and storage costs desired and the expected types of queries.
>>>>
>>>> Adam
>>>>
>>>>> On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>>>
>>>>>
>>>>> On 23.12.2017 00:47, Andrew U. Frank wrote:
>>>>>> are there some rules which queries are linear in the amount of data in
>>>>>> the graph? is it correct to assume that searching for a triples based
>>>>>> on a single condition (?p a X) is logarithmic in the size of the data
>>>>>> collection?
>>>>> Why should it be logarithmic? The complexity of matching a single BGP
>>>>> depends on the implementation. I could search for matches by doing a
>>>>> scan on the whole dataset - that would for sure be not logarithmic but
>>>>> linear. Usually, if exists, a triple store would use the POS index in
>>>>> order to find bindings for variable ?p.
>>>>>
>>>>> Cheers,
>>>>> Lorenz

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil


Re: performance measures

Posted by ajs6f <aj...@apache.org>.
> Any measurements would be unreliable at best and probably worthless.
> 1/ Different data gives different answers to queries.
> 2/ Caching matters a lot for databases and a different setup will cache differently.

This is so true, and it's not even a complete list. It might be better to approach the problem from the application layer. Are you able to put together a good suite of test data, queries, and updates, accompanied by a good understanding of the kinds of load the triplestore will experience in production?

Adam Soroka

> On Dec 24, 2017, at 1:21 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> On 24/12/17 14:11, Andrew U. Frank wrote:
>> thank you for the information; i take that using the indexes  a one-variable query would be (close to) linear in the amount of triples found. i saw that TDB does build indexes and assumed they use hashes.
>> i have still the following questions:
>> 1. is performance different for a named or the default graph?
> 
> Query performance is approximately the same for GRAPH.
> Update is slower.
> 
>> 2. can i simplify measurements with putting pieces of the dataset in different graphs and then add more or less of these graphs to take a measure? say i have 5 named graphs, each with 10 million triples, do queries over 2, 3, 4 and 5 graphs give the same (or very similar) results than when i would load 20, 30, 40 and 50 million triples in a single named graph?
> 
> Any measurements would be unreliable at best and probably worthless.
> 
> 1/ Different data gives different answers to queries.
> 
> 2/ Caching matters a lot for databases and a different setup will cache differently.
> 
>    Andy
> 
>> thank you for help!
>> andrew
>> On 12/23/2017 06:20 AM, ajs6f wrote:
>>> For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on quads to ensure that all one-variable queries (i.e. for triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node directory that TDB does.
>>> 
>>> There are lots of other ways to play that out, according to the balance of times costs and storage costs desired and the expected types of queries.
>>> 
>>> Adam
>>> 
>>>> On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>>> 
>>>> 
>>>> On 23.12.2017 00:47, Andrew U. Frank wrote:
>>>>> are there some rules which queries are linear in the amount of data in
>>>>> the graph? is it correct to assume that searching for a triples based
>>>>> on a single condition (?p a X) is logarithmic in the size of the data
>>>>> collection?
>>>> Why should it be logarithmic? The complexity of matching a single BGP
>>>> depends on the implementation. I could search for matches by doing a
>>>> scan on the whole dataset - that would for sure be not logarithmic but
>>>> linear. Usually, if exists, a triple store would use the POS index in
>>>> order to find bindings for variable ?p.
>>>> 
>>>> Cheers,
>>>> Lorenz


Re: performance measures

Posted by Andy Seaborne <an...@apache.org>.
On 24/12/17 14:11, Andrew U. Frank wrote:
> thank you for the information; i take that using the indexes  a 
> one-variable query would be (close to) linear in the amount of triples 
> found. i saw that TDB does build indexes and assumed they use hashes.
> 
> i have still the following questions:
> 
> 1. is performance different for a named or the default graph?

Query performance is approximately the same for GRAPH.
Update is slower.
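
As an illustration, the two cases can be compared directly on a local TDB 
dataset; the database directory and the graph IRI below are placeholders, 
and the absolute numbers only make sense on your own data:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.tdb.TDBFactory;

    public class GraphVsDefault {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/path/to/tdb");   // hypothetical database location
            String q1 = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
            String q2 = "SELECT (COUNT(*) AS ?n) WHERE { GRAPH <http://example.org/g1> { ?s ?p ?o } }";
            ds.begin(ReadWrite.READ);
            try {
                for (String q : new String[] { q1, q2 }) {
                    long start = System.nanoTime();
                    try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                        ResultSetFormatter.consume(qe.execSelect());
                    }
                    System.out.printf("%.1f ms : %s%n", (System.nanoTime() - start) / 1e6, q);
                }
            } finally {
                ds.end();
            }
        }
    }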

> 
> 2. can i simplify measurements with putting pieces of the dataset in 
> different graphs and then add more or less of these graphs to take a 
> measure? say i have 5 named graphs, each with 10 million triples, do 
> queries over 2, 3, 4 and 5 graphs give the same (or very similar) 
> results than when i would load 20, 30, 40 and 50 million triples in a 
> single named graph?

Any measurements would be unreliable at best and probably worthless.

1/ Different data gives different answers to queries.

2/ Caching matters a lot for databases and a different setup will cache 
differently.

     Andy

> 
> thank you for help!
> 
> andrew
> 
> 
> On 12/23/2017 06:20 AM, ajs6f wrote:
>> For example, the TIM in-memory dataset impl uses 3 indexes on triples 
>> and 6 on quads to ensure that all one-variable queries (i.e. for 
>> triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as 
>> possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, 
>> Set<Node>>>) and don't use the kind of node directory that TDB does.
>>
>> There are lots of other ways to play that out, according to the 
>> balance of times costs and storage costs desired and the expected 
>> types of queries.
>>
>> Adam
>>
>>> On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann 
>>> <bu...@informatik.uni-leipzig.de> wrote:
>>>
>>>
>>> On 23.12.2017 00:47, Andrew U. Frank wrote:
>>>> are there some rules which queries are linear in the amount of data in
>>>> the graph? is it correct to assume that searching for a triples based
>>>> on a single condition (?p a X) is logarithmic in the size of the data
>>>> collection?
>>> Why should it be logarithmic? The complexity of matching a single BGP
>>> depends on the implementation. I could search for matches by doing a
>>> scan on the whole dataset - that would for sure be not logarithmic but
>>> linear. Usually, if exists, a triple store would use the POS index in
>>> order to find bindings for variable ?p.
>>>
>>> Cheers,
>>> Lorenz
> 

Re: performance measures

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
Thank you for the information; I take it that, using the indexes, a 
one-variable query would be (close to) linear in the number of triples 
found. I saw that TDB does build indexes and assumed they use hashes.

I still have the following questions:

1. Is performance different for a named graph versus the default graph?

2. Can I simplify measurements by putting pieces of the dataset in 
different graphs and then adding more or fewer of these graphs to take a 
measurement? Say I have 5 named graphs, each with 10 million triples; do 
queries over 2, 3, 4 and 5 graphs give the same (or very similar) 
results as when I load 20, 30, 40 and 50 million triples into a single 
named graph?

Thank you for your help!

andrew


On 12/23/2017 06:20 AM, ajs6f wrote:
> For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on quads to ensure that all one-variable queries (i.e. for triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node directory that TDB does.
>
> There are lots of other ways to play that out, according to the balance of times costs and storage costs desired and the expected types of queries.
>
> Adam
>
>> On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
>>
>>
>> On 23.12.2017 00:47, Andrew U. Frank wrote:
>>> are there some rules which queries are linear in the amount of data in
>>> the graph? is it correct to assume that searching for a triples based
>>> on a single condition (?p a X) is logarithmic in the size of the data
>>> collection?
>> Why should it be logarithmic? The complexity of matching a single BGP
>> depends on the implementation. I could search for matches by doing a
>> scan on the whole dataset - that would for sure be not logarithmic but
>> linear. Usually, if exists, a triple store would use the POS index in
>> order to find bindings for variable ?p.
>>
>> Cheers,
>> Lorenz

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil


Re: performance measures

Posted by ajs6f <aj...@apache.org>.
For example, the TIM in-memory dataset impl uses 3 indexes on triples and 6 on quads to ensure that all one-variable queries (i.e. for triples ?s <p> <o>, <s> ?p <o>, <s> <p> ?o) will be as direct as possible. The indexes are hashmaps (e.g. Map<Node, Map<Node, Set<Node>>>) and don't use the kind of node directory that TDB does. 

There are lots of other ways to play that out, according to the balance of time costs and storage costs desired and the expected types of queries.
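
A toy version of that nested-map idea, just to make the shape concrete 
(this is an illustrative sketch, not the actual TIM implementation):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.apache.jena.graph.Node;

    // One PSO-shaped index: predicate -> subject -> set of objects.
    public class ToyPsoIndex {
        private final Map<Node, Map<Node, Set<Node>>> pso = new HashMap<>();

        public void add(Node s, Node p, Node o) {
            pso.computeIfAbsent(p, k -> new HashMap<>())
               .computeIfAbsent(s, k -> new HashSet<>())
               .add(o);
        }

        // Answering the one-variable pattern <s> <p> ?o is two hash lookups
        // plus an iteration over the final set.
        public Set<Node> find(Node s, Node p) {
            return pso.getOrDefault(p, Collections.emptyMap())
                      .getOrDefault(s, Collections.emptySet());
        }
    }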

Adam

> On Dec 23, 2017, at 2:56 AM, Lorenz Buehmann <bu...@informatik.uni-leipzig.de> wrote:
> 
> 
> On 23.12.2017 00:47, Andrew U. Frank wrote:
>> are there some rules which queries are linear in the amount of data in
>> the graph? is it correct to assume that searching for a triples based
>> on a single condition (?p a X) is logarithmic in the size of the data
>> collection? 
> Why should it be logarithmic? The complexity of matching a single BGP
> depends on the implementation. I could search for matches by doing a
> scan on the whole dataset - that would for sure be not logarithmic but
> linear. Usually, if exists, a triple store would use the POS index in
> order to find bindings for variable ?p.
> 
> Cheers,
> Lorenz


Re: performance measures

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
On 23.12.2017 00:47, Andrew U. Frank wrote:
> are there some rules which queries are linear in the amount of data in
> the graph? is it correct to assume that searching for a triples based
> on a single condition (?p a X) is logarithmic in the size of the data
> collection? 
Why should it be logarithmic? The complexity of matching a single BGP
depends on the implementation. I could search for matches by doing a
scan over the whole dataset - that would certainly be linear, not
logarithmic. Usually, if one exists, a triple store would use the POS
index in order to find bindings for the variable ?p.
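
For reference, that one-variable pattern looks like this at the Jena Graph 
level; an index-backed store can answer the find() with a range scan keyed 
on (predicate, object) rather than a full scan. The class IRI below is a 
placeholder:

    import org.apache.jena.graph.Graph;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.NodeFactory;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.util.iterator.ExtendedIterator;
    import org.apache.jena.vocabulary.RDF;

    public class PatternLookup {
        // Bindings for ?p in the pattern "?p a X": predicate and object are fixed,
        // only the subject varies, which is what a POS index serves directly.
        public static void printInstances(Graph graph) {
            Node clazz = NodeFactory.createURI("http://example.org/X");   // placeholder class IRI
            ExtendedIterator<Triple> it = graph.find(Node.ANY, RDF.type.asNode(), clazz);
            try {
                while (it.hasNext()) {
                    System.out.println(it.next().getSubject());
                }
            } finally {
                it.close();
            }
        }
    }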

Cheers,
Lorenz

performance measures

Posted by "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>.
I need to do some comparison of a Fuseki/TDB-based application with 
others that use a relational DB or proprietary DBs. I use Fuseki with 
TDB stored on an SSD or a hard disk.

Can I simplify measurements by putting pieces of the dataset in 
different graphs and then adding more or fewer of these graphs to take a 
measurement? Say I have 5 named graphs, each with 10 million triples; do 
queries over 2, 3, 4 and 5 graphs give the same (or very similar) 
results as when I load 20, 30, 40 and 50 million triples into a single 
named graph?

Is performance different for a named graph versus the default graph?

Are there some rules for which queries are linear in the amount of data 
in the graph? Is it correct to assume that searching for triples based on 
a single condition (?p a X) is logarithmic in the size of the data 
collection?

Is there a document which gives some insight into the expected 
performance of queries?

Thank you for any information!

andrew



On 12/22/2017 05:16 PM, Dick Murray wrote:
> How big? How many?
>
> On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
>
>> Hi all,
>>
>> We have a project, which we’re trying to productize and we’re facing
>> certain operational issues with big size files. Especially with copying and
>> maintaining them on the productive cloud hardware (application nodes).
>>
>> Did anybody have similar issues? How did you resolve them?
>>
>> I will appreciate if someone shares their experience/problems/solutions.
>>
>> Regards,
>> Stefan
>>

-- 
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                  +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil


Re: Operational issues with TDB

Posted by Andy Seaborne <an...@apache.org>.
Taking a TDB backup is the safer way to back it up.  It is not safe to 
copy the files of a live database, because you may copy different files 
at different points in time.
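
As one hedged sketch of a consistent backup (the database and output paths 
are placeholders): open the dataset, take a read transaction, and stream 
the whole dataset out as compressed N-Quads, which can later be reloaded 
to rebuild a store:

    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class TdbBackupSketch {
        public static void main(String[] args) throws Exception {
            Dataset ds = TDBFactory.createDataset("/data/tdb");     // hypothetical database location
            ds.begin(ReadWrite.READ);                               // consistent view while live
            try (OutputStream out = new GZIPOutputStream(
                    Files.newOutputStream(Paths.get("/backups/ds.nq.gz")))) {  // hypothetical target
                RDFDataMgr.write(out, ds.asDatasetGraph(), Lang.NQUADS);
            } finally {
                ds.end();
            }
        }
    }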

You can build from a backup, and copy files, then start TDB.

Or even capture changes; build from a backup, and replay patches. That
is what rdf-delta (not part of Jena) does.

TDB2 also offers a possibility that might work for you.

It has a compaction step that produces a new, smaller database (TDB2 
databases grow faster than TDB1).  Like TDB1, you can't safely copy a 
live database around, but the old database left behind after compaction 
is not live, so you can sync twice to produce a small copy.
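
A minimal sketch of triggering that compaction from code, assuming a Jena 
release where the TDB2 compaction API is exposed as 
org.apache.jena.tdb2.DatabaseMgr.compact (the database location is a 
placeholder):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.tdb2.DatabaseMgr;
    import org.apache.jena.tdb2.TDB2Factory;

    public class Tdb2CompactSketch {
        public static void main(String[] args) {
            Dataset ds = TDB2Factory.connectDataset("/data/tdb2");   // hypothetical TDB2 location
            // Compaction writes a new, smaller generation under the database directory;
            // the previous generation is no longer live and can be copied or archived.
            DatabaseMgr.compact(ds.asDatasetGraph());
        }
    }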

     Andy

On 05/01/18 06:35, Dimov, Stefan wrote:
> Thanks Lorenz,
> 
> That could work if the DB is already deployed.
> 
> How about initial deployment on multiple nodes? Or vice versa – copying it up from live nodes to some backup/archive storage?
> 
> S.
> 
> On 12/22/17, 11:46 PM, "Lorenz Buehmann" <bu...@informatik.uni-leipzig.de> wrote:
> 
>      Probably I misunderstand the workflow, but isn't it more efficient to
>      push the changeset to all nodes instead of copying the whole TDB? I.e.
>      synchronizing the TDB instances on the triple level instead of files.
>      
>      Cheers,
>      
>      Lorenz
>      
>      
>      On 23.12.2017 01:55, Dimov, Stefan wrote:
>      > Yes, this is high av., deployment and backup.
>      >
>      > We have cloud system with multiple nodes and TDB needs to be copied to/from these nodes. Copied from – when we need to back it up and copied to – when we deploy it. So, the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, downtime - the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) and also sometimes the hardware OS is throwing errors just because there are big files on it (without even copying them).
>      >
>      > Before you say “you need a better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved it (apart from replacing the hardware with faster, bigger and high av. one ( )
>      >
>      > Regards,
>      > Stefan
>      >
>      >
>      > On 12/22/17, 4:24 PM, "ajs6f" <aj...@apache.org> wrote:
>      >
>      >     Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?
>      >
>      >     ajs6f
>      >
>      >     > On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
>      >     >
>      >     > Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
>      >     >
>      >     > On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
>      >     >
>      >     >    How big? How many?
>      >     >
>      >     >    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
>      >     >
>      >     >> Hi all,
>      >     >>
>      >     >> We have a project, which we’re trying to productize and we’re facing
>      >     >> certain operational issues with big size files. Especially with copying and
>      >     >> maintaining them on the productive cloud hardware (application nodes).
>      >     >>
>      >     >> Did anybody have similar issues? How did you resolve them?
>      >     >>
>      >     >> I will appreciate if someone shares their experience/problems/solutions.
>      >     >>
>      >     >> Regards,
>      >     >> Stefan
>      >     >>
>      >     >
>      >     >
>      >
>      >
>      >
>      >
>      
>      
> 

Re: Operational issues with TDB

Posted by "Dimov, Stefan" <st...@sap.com>.
Thanks Lorenz,

That could work if the DB is already deployed.

How about initial deployment on multiple nodes? Or vice versa – copying it up from live nodes to some backup/archive storage?

S.

On 12/22/17, 11:46 PM, "Lorenz Buehmann" <bu...@informatik.uni-leipzig.de> wrote:

    Probably I misunderstand the workflow, but isn't it more efficient to
    push the changeset to all nodes instead of copying the whole TDB? I.e.
    synchronizing the TDB instances on the triple level instead of files.
    
    Cheers,
    
    Lorenz
    
    
    On 23.12.2017 01:55, Dimov, Stefan wrote:
    > Yes, this is high av., deployment and backup. 
    >
    > We have cloud system with multiple nodes and TDB needs to be copied to/from these nodes. Copied from – when we need to back it up and copied to – when we deploy it. So, the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, downtime - the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) and also sometimes the hardware OS is throwing errors just because there are big files on it (without even copying them).
    >
    > Before you say “you need a better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved it (apart from replacing the hardware with faster, bigger and high av. one ( )
    >
    > Regards,
    > Stefan
    >
    >
    > On 12/22/17, 4:24 PM, "ajs6f" <aj...@apache.org> wrote:
    >
    >     Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?
    >     
    >     ajs6f
    >     
    >     > On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
    >     > 
    >     > Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
    >     > 
    >     > On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
    >     > 
    >     >    How big? How many?
    >     > 
    >     >    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
    >     > 
    >     >> Hi all,
    >     >> 
    >     >> We have a project, which we’re trying to productize and we’re facing
    >     >> certain operational issues with big size files. Especially with copying and
    >     >> maintaining them on the productive cloud hardware (application nodes).
    >     >> 
    >     >> Did anybody have similar issues? How did you resolve them?
    >     >> 
    >     >> I will appreciate if someone shares their experience/problems/solutions.
    >     >> 
    >     >> Regards,
    >     >> Stefan
    >     >> 
    >     > 
    >     > 
    >     
    >     
    >
    >
    
    


Re: Operational issues with TDB

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Probably I misunderstand the workflow, but isn't it more efficient to
push the changeset to all nodes instead of copying the whole TDB? I.e.
synchronizing the TDB instances on the triple level instead of files.
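
A minimal sketch of that idea, pushing a small SPARQL Update changeset to 
each node's update endpoint instead of shipping database files; the 
endpoint URLs and the update itself are placeholders:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.jena.update.UpdateExecutionFactory;
    import org.apache.jena.update.UpdateFactory;
    import org.apache.jena.update.UpdateRequest;

    public class PushChangesetSketch {
        public static void main(String[] args) {
            List<String> endpoints = Arrays.asList(          // hypothetical node endpoints
                    "http://node1:3030/ds/update",
                    "http://node2:3030/ds/update");
            UpdateRequest changes = UpdateFactory.create(    // placeholder changeset
                    "INSERT DATA { <http://example.org/s> <http://example.org/p> 'v' }");
            for (String endpoint : endpoints) {
                UpdateExecutionFactory.createRemote(changes, endpoint).execute();
            }
        }
    }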

Cheers,

Lorenz


On 23.12.2017 01:55, Dimov, Stefan wrote:
> Yes, this is high av., deployment and backup. 
>
> We have cloud system with multiple nodes and TDB needs to be copied to/from these nodes. Copied from – when we need to back it up and copied to – when we deploy it. So, the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, downtime - the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) and also sometimes the hardware OS is throwing errors just because there are big files on it (without even copying them).
>
> Before you say “you need a better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved it (apart from replacing the hardware with faster, bigger and high av. one ( )
>
> Regards,
> Stefan
>
>
> On 12/22/17, 4:24 PM, "ajs6f" <aj...@apache.org> wrote:
>
>     Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?
>     
>     ajs6f
>     
>     > On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
>     > 
>     > Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
>     > 
>     > On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
>     > 
>     >    How big? How many?
>     > 
>     >    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
>     > 
>     >> Hi all,
>     >> 
>     >> We have a project, which we’re trying to productize and we’re facing
>     >> certain operational issues with big size files. Especially with copying and
>     >> maintaining them on the productive cloud hardware (application nodes).
>     >> 
>     >> Did anybody have similar issues? How did you resolve them?
>     >> 
>     >> I will appreciate if someone shares their experience/problems/solutions.
>     >> 
>     >> Regards,
>     >> Stefan
>     >> 
>     > 
>     > 
>     
>     
>
>


Re: Operational issues with TDB

Posted by James Anderson <an...@gmail.com>.
good morning;

> On 2017-12-23, at 01:55, Dimov, Stefan <st...@sap.com> wrote:
> 
> Yes, this is high av., deployment and backup. 
> 
> We have cloud system with multiple nodes and TDB needs to be copied to/from these nodes. Copied from – when we need to back it up and copied to – when we deploy it. So, the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, downtime - the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) and also sometimes the hardware OS is throwing errors just because there are big files on it (without even copying them).
> 
> Before you say “you need a better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved it (apart from replacing the hardware with faster, bigger and high av. one ( )

if you are at liberty to be specific, although this is not strictly a “jena” issue, yours would be an interesting account.
that is, how do you do this now?
- which hardware with which os(s) with which file system(s) and what network connectivity?
- which database(s) with which proxies with which applications on which of the hosts?
- which datasets stored in which form(s)?
- which application access patterns?
- which form(s) of backup and/or replication with which temporal pattern(s)?
- which administration processes to manage the above?
what fails and how?

best regards, from berlin,


Re: Operational issues with TDB

Posted by ajs6f <aj...@apache.org>.
> the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) 

Most people doing this would copy the new db to a separate location on the remote machine (while Fuseki continues to run), then quickly take Fuseki down, swap the dbs (possibly using a symlink), and bring Fuseki back up -- it should take a few seconds at most. If your node storage cannot hold more than one copy of the db on a node, that is not a limitation of Jena.
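
One way to script that swap, sketched in Java with java.nio.file and 
hypothetical paths (on POSIX the final rename is atomic, so Fuseki only 
needs to be down for the restart itself):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class SwapDatabaseSymlink {
        public static void main(String[] args) throws Exception {
            Path active  = Paths.get("/srv/fuseki/db");              // symlink Fuseki is configured to use
            Path newDb   = Paths.get("/srv/fuseki/db-2018-01-05");   // freshly copied database directory
            Path tmpLink = Paths.get("/srv/fuseki/db.new");

            // Build the new link beside the old one, then rename it over the active link.
            Files.deleteIfExists(tmpLink);
            Files.createSymbolicLink(tmpLink, newDb);
            Files.move(tmpLink, active, StandardCopyOption.REPLACE_EXISTING,
                       StandardCopyOption.ATOMIC_MOVE);
            // Stop Fuseki just before the move and start it again just after;
            // the downtime is the restart, not the copy.
        }
    }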

Otherwise, as Lorenz pointed out, wholesale copying the db is not the most network-efficient way to do this. Have you considered some delta-distribution system like RDF Delta?

https://afs.github.io/rdf-delta/

Adam

> On Dec 22, 2017, at 7:55 PM, Dimov, Stefan <st...@sap.com> wrote:
> 
> Yes, this is high av., deployment and backup. 
> 
> We have cloud system with multiple nodes and TDB needs to be copied to/from these nodes. Copied from – when we need to back it up and copied to – when we deploy it. So, the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, downtime - the nodes don’t work while we copy the TDB (it’s a limitation that comes from Jena/TDB) and also sometimes the hardware OS is throwing errors just because there are big files on it (without even copying them).
> 
> Before you say “you need a better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved it (apart from replacing the hardware with faster, bigger and high av. one ( )
> 
> Regards,
> Stefan
> 
> 
> On 12/22/17, 4:24 PM, "ajs6f" <aj...@apache.org> wrote:
> 
>    Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?
> 
>    ajs6f
> 
>> On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
>> 
>> Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
>> 
>> On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
>> 
>>   How big? How many?
>> 
>>   On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
>> 
>>> Hi all,
>>> 
>>> We have a project, which we’re trying to productize and we’re facing
>>> certain operational issues with big size files. Especially with copying and
>>> maintaining them on the productive cloud hardware (application nodes).
>>> 
>>> Did anybody have similar issues? How did you resolve them?
>>> 
>>> I will appreciate if someone shares their experience/problems/solutions.
>>> 
>>> Regards,
>>> Stefan
>>> 
>> 
>> 
> 
> 
> 


Re: Operational issues with TDB

Posted by "Dimov, Stefan" <st...@sap.com>.
Yes, this is high availability, deployment and backup.

We have a cloud system with multiple nodes, and the TDB needs to be copied to/from these nodes: copied from – when we need to back it up, and copied to – when we deploy it. So the problems come when we copy from/to the nodes – it takes time, operations sometimes fail, there is downtime (the nodes don’t work while we copy the TDB; it’s a limitation that comes from Jena/TDB), and sometimes the host OS throws errors just because there are big files on it (without even copying them).

Before you say “you need better hardware”, I agree, but that’s what we have right now, so I was just wondering if somebody had similar problems and how they resolved them (apart from replacing the hardware with something faster, bigger and more highly available).

Regards,
Stefan


On 12/22/17, 4:24 PM, "ajs6f" <aj...@apache.org> wrote:

    Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?
    
    ajs6f
    
    > On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
    > 
    > Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
    > 
    > On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
    > 
    >    How big? How many?
    > 
    >    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
    > 
    >> Hi all,
    >> 
    >> We have a project, which we’re trying to productize and we’re facing
    >> certain operational issues with big size files. Especially with copying and
    >> maintaining them on the productive cloud hardware (application nodes).
    >> 
    >> Did anybody have similar issues? How did you resolve them?
    >> 
    >> I will appreciate if someone shares their experience/problems/solutions.
    >> 
    >> Regards,
    >> Stefan
    >> 
    > 
    > 
    
    


Re: Operational issues with TDB

Posted by ajs6f <aj...@apache.org>.
Can you tell us about the workflow that is managing them? You mention copying and maintaining TDB files-- is that for high availability? Are you writing updates to a master and replicating the resulting store, or some other scenario?

ajs6f

> On Dec 22, 2017, at 7:19 PM, Dimov, Stefan <st...@sap.com> wrote:
> 
> Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …
> 
> On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:
> 
>    How big? How many?
> 
>    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
> 
>> Hi all,
>> 
>> We have a project, which we’re trying to productize and we’re facing
>> certain operational issues with big size files. Especially with copying and
>> maintaining them on the productive cloud hardware (application nodes).
>> 
>> Did anybody have similar issues? How did you resolve them?
>> 
>> I will appreciate if someone shares their experience/problems/solutions.
>> 
>> Regards,
>> Stefan
>> 
> 
> 


Re: Operational issues with TDB

Posted by dandh988 <da...@gmail.com>.
That's not huge, but I understand from experience that it takes time to copy.
We employ various methods to maintain our farm of TDB stores:
- Apply each update to more than one TDB at a time (think of it as a loose DTC, i.e. distributed transaction coordination).
- Periodically export the TDB to a streaming flat file and re-import it. The file can be compressed, which costs CPU but saves on the network. If you name your graphs after a period of time, e.g. the month, you can export just the latest month (see the sketch below).
- Our preferred way for HA is to restrict additions to streamed files and timestamp each file, which acts like a journal, so the entire TDB can be rebuilt just by importing the files.
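
A hedged sketch of the per-month export idea, assuming month-named graphs 
and hypothetical paths; writing happens inside a read transaction so the 
dump is consistent:

    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.GZIPOutputStream;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.tdb.TDBFactory;

    public class ExportLatestMonth {
        public static void main(String[] args) throws Exception {
            Dataset ds = TDBFactory.createDataset("/data/tdb");         // hypothetical location
            String monthGraph = "http://example.org/graph/2017-12";     // hypothetical month-named graph
            ds.begin(ReadWrite.READ);
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(
                    Paths.get("/exports/2017-12.ttl.gz")))) {           // hypothetical export path
                RDFDataMgr.write(out, ds.getNamedModel(monthGraph), Lang.TURTLE);
            } finally {
                ds.end();
            }
        }
    }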


Dick
-------- Original message --------From: "Dimov, Stefan" <st...@sap.com> Date: 23/12/2017  00:19  (GMT+00:00) To: users@jena.apache.org Subject: Re: Operational issues with TDB 
Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …

On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:

    How big? How many?
    
    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
    
    > Hi all,
    >
    > We have a project, which we’re trying to productize and we’re facing
    > certain operational issues with big size files. Especially with copying and
    > maintaining them on the productive cloud hardware (application nodes).
    >
    > Did anybody have similar issues? How did you resolve them?
    >
    > I will appreciate if someone shares their experience/problems/solutions.
    >
    > Regards,
    > Stefan
    >
    


Re: Operational issues with TDB

Posted by "Dimov, Stefan" <st...@sap.com>.
Our TDB now is about 32G and I see that some of its files are almost 5G (single file size), but let’s assume it can/will grow 3, 4, 5 times …

On 12/22/17, 2:16 PM, "Dick Murray" <da...@gmail.com> wrote:

    How big? How many?
    
    On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:
    
    > Hi all,
    >
    > We have a project, which we’re trying to productize and we’re facing
    > certain operational issues with big size files. Especially with copying and
    > maintaining them on the productive cloud hardware (application nodes).
    >
    > Did anybody have similar issues? How did you resolve them?
    >
    > I will appreciate if someone shares their experience/problems/solutions.
    >
    > Regards,
    > Stefan
    >
    


Re: Operational issues with TDB

Posted by Dick Murray <da...@gmail.com>.
How big? How many?

On 22 Dec 2017 8:37 pm, "Dimov, Stefan" <st...@sap.com> wrote:

> Hi all,
>
> We have a project, which we’re trying to productize and we’re facing
> certain operational issues with big size files. Especially with copying and
> maintaining them on the productive cloud hardware (application nodes).
>
> Did anybody have similar issues? How did you resolve them?
>
> I will appreciate if someone shares their experience/problems/solutions.
>
> Regards,
> Stefan
>