Posted to users@jena.apache.org by Zhiyun Qian <zh...@umich.edu> on 2013/10/03 16:01:25 UTC

jena TDB scalability

Hi there,

I'm looking for some clues on the scalability of Jena TDB. It looks like
our requirement would be at least 1B - 10B triples. From what I can find
online (which seems to date back to 2008), the largest number ever put into
TDB is 1.7B [1]. I wonder if there are any more recent numbers on this.

I'm also curious whether scalability is primarily measured on the union of
all the graphs or on individual graphs. In other words, does a "Dataset"
(regardless of how many graphs/models are in it) only scale up to a given
number (say 1.7B), or does an individual graph/model scale to that number?
Our data can naturally be divided into different graphs (with limited
relationships across graphs), so most queries can be performed on a single
graph at a time (we would need some extra work to query relationships
across graphs, but I assume that is possible).

My understanding is that if we simply query one graph out of the many in a
dataset, it does not matter much how many triples there are in other
graphs. Is this correct?
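
To make that concrete, this is the kind of single-graph access I have in
mind (a rough sketch against the Jena 2.x API; the graph URI and database
location are invented):

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class SingleGraphQuery {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/data/tdb");
            // Restrict matching to one named graph; triples in the
            // dataset's other graphs are never touched by this pattern.
            String q = "SELECT ?s ?p ?o WHERE {"
                     + "  GRAPH <http://example/graph1> { ?s ?p ?o }"
                     + "} LIMIT 10";
            QueryExecution qe = QueryExecutionFactory.create(q, ds);
            try {
                ResultSetFormatter.out(qe.execSelect());
            } finally {
                qe.close();
            }
        }
    }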

[1]. http://www.w3.org/wiki/LargeTripleStores

Best,
-Zhiyun

Re: jena TDB scalability

Posted by Andy Seaborne <an...@gmail.com>.
The issue is caching.  Whether it's one DB of 10B or 10 DBs of 1B, there
is only X amount of RAM in the machine.  TDB uses memory-mapped files, so
there is a machine-wide cache managed by the OS.

If all the queries go to the same graph for some period of time, the 
caches will be helping that graph and it's much like 1 DB of 1B.

But if query 1 goes to graph 1 and query 2 goes to a different graph 2,
then there is no positive cache effect.
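
A sketch of what I mean (locations invented; nothing here is special to
either layout):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class CacheSharing {
        public static void main(String[] args) {
            // One big DB, or several smaller ones: the memory-mapped
            // files behind each compete for the same OS page cache.
            Dataset all   = TDBFactory.createDataset("/data/tdb-all");
            Dataset part1 = TDBFactory.createDataset("/data/tdb-part1");
            Dataset part2 = TDBFactory.createDataset("/data/tdb-part2");
            // The query traffic, not the partitioning, decides which
            // pages stay hot.
        }
    }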

	Andy

On 04/10/13 15:54, David Jordan wrote:
> A related question to ask is whether the data can be split into separate TDB datasets (different folders, different files) yet still be able to combine the data in a reasonably efficient manner.


RE: jena TDB scalability

Posted by David Jordan <Da...@sas.com>.
A related question to ask is whether the data can be split into separate TDB datasets (different folders, different files) yet still be able to combine the data in a reasonably efficient manner.
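
For example, something along these lines (just a sketch with invented
paths; I don't know how well it performs at scale):

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class CombineDatasets {
        public static void main(String[] args) {
            Model a = TDBFactory.createDataset("/data/tdb-a").getDefaultModel();
            Model b = TDBFactory.createDataset("/data/tdb-b").getDefaultModel();
            // Dynamic union: queries see both stores; no data is copied.
            Model union = ModelFactory.createUnion(a, b);
            QueryExecution qe = QueryExecutionFactory.create(
                "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }", union);
            try {
                ResultSetFormatter.out(qe.execSelect());
            } finally {
                qe.close();
            }
        }
    }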



Re: jena TDB scalability

Posted by Zhiyun Qian <zh...@umich.edu>.
Thanks very much for the explanation, Andy.

I am curious about the case where I divide my data into separate
graphs/models. Let's say I have 10B triples split into 10 graphs, each
with 1B triples. If most of my queries can be restricted to a single
graph, is it practically the same (in terms of scalability and
performance) to organize the 10B in a single DB versus separate DBs (each
with 1B)? The reason I may still need them in one DB is that I have a
small number of queries that need to cross graph boundaries.
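
For example, a cross-graph join of roughly this shape (a sketch; the
URIs, predicates and location are all invented):

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class CrossGraphQuery {
        public static void main(String[] args) {
            Dataset ds = TDBFactory.createDataset("/data/tdb");
            // A join that crosses two named graphs in the same dataset,
            // i.e. the kind of query that keeps everything in one DB.
            String q = "SELECT ?x ?label WHERE {"
                + "  GRAPH <http://example/graph1> { ?x <http://example/linksTo> ?y }"
                + "  GRAPH <http://example/graph2> { ?y <http://example/label> ?label }"
                + "}";
            QueryExecution qe = QueryExecutionFactory.create(q, ds);
            try {
                ResultSetFormatter.out(qe.execSelect());
            } finally {
                qe.close();
            }
        }
    }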

Best,
-Zhiyun



Re: jena TDB scalability

Posted by Andy Seaborne <an...@apache.org>.

There isn't a hard cutoff point whereby it works at X but not at X+1.
There are no particular built-in assumptions like that (the nearest is
that nodes have unique hashes - but the node hash is 128 bits, so you can
do some maths about that: by the birthday bound, a collision only becomes
likely somewhere around 2^64 distinct nodes, and things like undetected
memory corruption are more likely).

10B triples is beyond the practical limits.  1B will need a big machine 
and not too complicated queries.

As the database gets larger, the practical queries that can be executed 
become more limited.  Loading also becomes an issue.

If you are just doing URI->some properties and a bit of filtering on the 
retrieved values, then huge databases are possible.

But as soon as you use general patterns, group-aggregates, or complicated
combinations of patterns, OPTIONALs, UNIONs and NOT EXISTS, it will be
impractically slow.  ARQ/TDB uses an evaluation strategy [*] that uses
temporary RAM only at a few points, so it does not run out of memory
easily.
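
Illustrative query shapes (made-up URIs, not from any benchmark):

    public class QueryShapes {
        // Cheap at scale: start from a known URI, fetch a few
        // properties, filter the retrieved values.
        static final String CHEAP =
            "SELECT ?name WHERE {"
          + "  <http://example/person/123> <http://example/name> ?name"
          + "  FILTER(STRLEN(?name) > 3)"
          + "}";

        // Impractical at 1B+: unbound patterns, UNION, OPTIONAL,
        // NOT EXISTS and aggregation over much of the database.
        static final String COSTLY =
            "SELECT ?p (COUNT(?s) AS ?n) WHERE {"
          + "  { ?s ?p ?o } UNION { ?o ?p ?s }"
          + "  OPTIONAL { ?s <http://example/type> ?t }"
          + "  FILTER NOT EXISTS { ?s <http://example/hidden> true }"
          + "} GROUP BY ?p";
    }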

Loading takes a long time - more hardware, specifically, more RAM, makes 
a big difference.
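
For the initial load, use the command-line bulk loader rather than
inserting through the API, e.g. (location and file invented):

    tdbloader --loc=/data/tdb data.nt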

	Andy

[*] currently, in the released code.