Posted to users@jena.apache.org by Ekaterina Danilova <ka...@gmail.com> on 2019/02/15 12:02:29 UTC

Storing a lot of strings in TDB store

Hello,
I would like to ask how TDB2 and Fuseki manage large amounts of string data
(especially repeating data), and what the best practices are. Do they
optimize it somehow, or is it up to us to make improvements?

For example, we have a TDB2 store which we access via Fuseki, and an example
named graph like this:
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Region, "New York"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Other, "long long string"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#NAME, "JOHN SMITH"]
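Written out in TriG, the graph would look roughly like this (the graph name
is just a placeholder I made up):

```trig
@prefix vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> .

<http://people/graphs/JohnSmith> {
    <http://people/JohnSmith> vcard:Region "New York" ;
                              vcard:Other  "long long string" ;
                              vcard:NAME   "JOHN SMITH" .
}
```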

So, we have a JohnSmith person with two properties, "Region" and "Other". One
of them is the short string "New York"; the other is a long string.
Assume we have 100 000 more people, and many of them have the same "Region"
and "Other" properties. What would be the best approach to storing such
data?

I created 10 000 more named graphs of people with different names but the
same other properties and tested the performance.
First I measured 10 000 reads of graphs like this; the average time was
around 4.4 ms (no matter how long the strings are).

The other option I considered is making "New York" a resource, storing it in
a "cities" named graph, and doing the same with "long long string", so that
the actual string is stored only once. I tested reading the graphs again on
10 000 cases and didn't notice any change in performance: the average load
time was still 4.4 ms when we had resource URIs instead of "New York" and
"long long string".
However, to get the full data we need to add the actual resources to our
original JohnSmith graph, which adds overhead since we have to fetch two more
named graphs. So it causes a quite predictable drop in performance.
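For clarity, the alternative I tested looks roughly like this in TriG (all
graph and resource names here are made-up placeholders):

```trig
@prefix vcard: <http://www.w3.org/2001/vcard-rdf/3.0#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

<http://graphs/cities> {
    <http://cities/NewYork> rdfs:label "New York" .
}

<http://people/graphs/JohnSmith> {
    <http://people/JohnSmith> vcard:Region <http://cities/NewYork> .
}
```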

So, according to my tests the first case (the one described in the graph
example) performed best, but it feels like we are storing too much extra
information. I still wanted to ask for your opinions on such an approach,
and to learn whether the TDB store does some internal optimization of the
data.

Re: Storing a lot of strings in TDB store

Posted by Andy Seaborne <an...@apache.org>.
At the level of that description, they are much the same.

TDB2 differs in the actual inline encoding of literals (it keeps the datatype).

TDB2 B+Trees are "copy on-write" (MVCC) and TDB2 has a different 
transaction mechanism, with the result that arbitrarily large transaction 
changes are supported.

TDB2 bulkloader is much faster (although it could be backported to TDB1; 
it is not fundamental to the TDB2 disk layout).

     Andy

On 06/03/2019 12:38, Siddhesh Rane wrote:
> It's for TDB 1 right? Is there a document for TDB 2? I couldn't find one
> 
> Regards
> Siddhesh

Re: Storing a lot of strings in TDB store

Posted by Siddhesh Rane <ki...@gmail.com>.
It's for TDB1, right? Is there a document for TDB2? I couldn't find one.

Regards
Siddhesh


On Fri, 22 Feb 2019, 8:48 pm Rob Vesse, <rv...@dotnetrdf.org> wrote:

> It's here - http://jena.apache.org/documentation/tdb/architecture.html
>
> Rob

Re: Storing a lot of strings in TDB store

Posted by Rob Vesse <rv...@dotnetrdf.org>.
It's here - http://jena.apache.org/documentation/tdb/architecture.html

Rob

On 22/02/2019, 04:03, "Ekaterina Danilova" <ka...@gmail.com> wrote:

    Thank you, it was exactly what I needed. It is still nice to hear what
    others think about my idea of data storage as resources and I think I will
    stick to that option, but TDB storage logic was quite unclear to me. Would
    be great if it was mentioned in official documentation since I couldn't
    find it.
    Thanks again for your help
    
Re: Storing a lot of strings in TDB store

Posted by ajs6f <aj...@apache.org>.
TDB's design is given in official documentation here:

https://jena.apache.org/documentation/tdb/architecture.html

ajs6f

> On Feb 22, 2019, at 5:02 AM, Ekaterina Danilova <ka...@gmail.com> wrote:
> 
> Thank you, it was exactly what I needed. It is still nice to hear what
> others think about my idea of data storage as resources and I think I will
> stick to that option, but TDB storage logic was quite unclear to me. Would
> be great if it was mentioned in official documentation since I couldn't
> find it.
> Thanks again for your help

Re: Storing a lot of strings in TDB store

Posted by Ekaterina Danilova <ka...@gmail.com>.
Thank you, it was exactly what I needed. It is still nice to hear what
others think about my idea of storing data as resources, and I think I will
stick to that option, but the TDB storage logic was quite unclear to me. It
would be great if it were mentioned in the official documentation, since I
couldn't find it.
Thanks again for your help.

On Tue, 19 Feb 2019 at 20:40, Rob Vesse <rv...@dotnetrdf.org> wrote:

> Since I don't think anyone answered your specific original question
>
> TDB and TDB2 both use dictionary encoding (and in fact most RDF stores use
> some variation on this).  Basically they map each unique RDF term (whether
> URI, string, blank node etc) to a consistent internal identifier and use
> this to refer to the term.  Therefore most data structures internally are
> implemented in terms of these internal identifiers (which are typically
> very compact, TDB/TDB2 use 64 bit identifiers) and the system only
> translates between the internal identifier and the full RDF term when
> explicitly needed e.g. when presenting results
>
> Rob

Re: Storing a lot of strings in TDB store

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Since I don't think anyone answered your specific original question

TDB and TDB2 both use dictionary encoding (and in fact most RDF stores use some variation on this). Basically, they map each unique RDF term (whether URI, string, blank node, etc.) to a consistent internal identifier and use this to refer to the term. Therefore most internal data structures are implemented in terms of these internal identifiers (which are typically very compact; TDB/TDB2 use 64-bit identifiers), and the system only translates between the internal identifier and the full RDF term when explicitly needed, e.g. when presenting results.

Rob

On 15/02/2019, 06:03, "Ekaterina Danilova" <ka...@gmail.com> wrote:

    i would like to ask how TDB2 and Fuseki manages big amounts of string data
    (especially repeating data) and what it the best practices. Does it
    optimize it somehow?





Re: Storing a lot of strings in TDB store

Posted by Andy Seaborne <an...@apache.org>.

On 15/02/2019 13:56, Ekaterina Danilova wrote:
> I have a dataset describing IT infrastructure. It consists of many
> lightweight named graphs (about 15 statements each) describing different
> components.
> I understand that there is little sense in using RDF if store is used
> simply as key-value database, but I have 2 reasons for RDF :
> 1) It is nice and easy to visualize and see the connections (for this part
> URI approach is definitely correct)
> 2) I am interested in inference and use GenericRuleReasoner with rules to
> make different conclusions based on data
> So, 2 mostly used parts are Graph Store protocol to access the named graphs
> and reasoner to reason over data. The graphs are not supposed to be read
> very often, so some loss of performance is acceptable.
> I hope this added some clarity.
> Right now I am actually using the URI approach but I wanted to find out if
> it is the right way. Looks like it is.

The only thing I have to add is that adding triples one by one, or a few 
at a time, over HTTP to Fuseki and TDB2 is going to incur a lot of 
overhead. Doing all the additions in one single operation is going to 
be significantly faster per triple.
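For example, one SPARQL Update request that inserts everything in a single
operation (the names here are just placeholders), rather than one request
per triple:

```sparql
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>

INSERT DATA {
  GRAPH <http://people/graphs/JohnSmith> {
    <http://people/JohnSmith> vcard:Region "New York" ;
                              vcard:NAME   "JOHN SMITH" .
  }
  GRAPH <http://people/graphs/JaneDoe> {
    <http://people/JaneDoe> vcard:Region "New York" .
  }
}
```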

     Andy


Re: Storing a lot of strings in TDB store

Posted by Ekaterina Danilova <ka...@gmail.com>.
I have a dataset describing IT infrastructure. It consists of many
lightweight named graphs (about 15 statements each) describing different
components.
I understand that there is little sense in using RDF if the store is used
simply as a key-value database, but I have two reasons for RDF:
1) It is nice and easy to visualize and see the connections (for this part
the URI approach is definitely correct).
2) I am interested in inference, and use GenericRuleReasoner with rules to
draw different conclusions from the data.
So the two most-used parts are the Graph Store Protocol to access the named
graphs and the reasoner to reason over the data. The graphs are not supposed
to be read very often, so some loss of performance is acceptable.
I hope this added some clarity.
Right now I am actually using the URI approach, but I wanted to find out if
it is the right way. Looks like it is.
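For instance, the rules I mean are of roughly this shape (the property URIs
here are invented just for the example):

[sameRegion:
  (?a <http://www.w3.org/2001/vcard-rdf/3.0#Region> ?r)
  (?b <http://www.w3.org/2001/vcard-rdf/3.0#Region> ?r)
  notEqual(?a, ?b)
  ->
  (?a <http://example/sameRegionAs> ?b)
]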


On Fri, 15 Feb 2019 at 15:29, ajs6f <aj...@apache.org> wrote:

> You are conflating several things here. Jean-Marc is quite right to advise
> you to use identifiers and not labels for the entities in your data, up to
> some limit that will depend on your resourcing and purposes. If you don't
> do that, there is no purpose to using Jena (or RDF at all), because in that
> case, you are using it as a kind of very low-performance key-value store.
> On the other hand, if you have specific questions about performance, it
> would be wise to tell us a great deal more about what you are doing and how
> you are doing it.
>
> What is your data like? What pieces of Jena are you using and how? What
> queries are you running and how? There are lots of opportunities for
> optimization when using a complex framework like Jena.
>
> ajs6f

Re: Storing a lot of strings in TDB store

Posted by ajs6f <aj...@apache.org>.
You are conflating several things here. Jean-Marc is quite right to advise you to use identifiers and not labels for the entities in your data, up to some limit that will depend on your resourcing and purposes. If you don't do that, there is no purpose to using Jena (or RDF at all), because in that case, you are using it as a kind of very low-performance key-value store. On the other hand, if you have specific questions about performance, it would be wise to tell us a great deal more about what you are doing and how you are doing it.

What is your data like? What pieces of Jena are you using and how? What queries are you running and how? There are lots of opportunities for optimization when using a complex framework like Jena. 

ajs6f

> On Feb 15, 2019, at 8:06 AM, Ekaterina Danilova <ka...@gmail.com> wrote:
> 
>> 
>> No , both better in performance, and in the spirit of Sem Web
> 
> Hm, the performance when using value as string or URI to resource was quite
> same. On 10 000 examples it was 4.46ms vs 4.44ms. I didn't notice any
> difference even when I tested string of 1000 characters length.
> 
> But I understood your idea, my issue with performance is just caused by
> retrieving more named graphs than one and reasoning over it in order to get
> the actual string value.
> So, in the end it is following the Semantic web logic but the extra actions
> cost almost double drop in speed unless I come up with some better idea of
> organizing the dataset.
> 
> So, to make it clear - the preferred way is replacing the repeatable value
> with resource URIs and avoiding the strings?
> 
> Thanks for the advice


Re: Storing a lot of strings in TDB store

Posted by Ekaterina Danilova <ka...@gmail.com>.
>
> No , both better in performance, and in the spirit of Sem Web

Hm, the performance when using the value as a string or as a URI to a
resource was much the same. On 10 000 examples it was 4.46 ms vs 4.44 ms. I
didn't notice any difference even when I tested a string 1000 characters long.

But I understood your idea; my performance issue is just caused by
retrieving more named graphs than one and reasoning over them in order to
get the actual string value.
So, in the end it follows the Semantic Web logic, but the extra actions
cost almost a twofold drop in speed, unless I come up with some better way
of organizing the dataset.
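To be concrete, fetching the person and the city together in one query
looks like this (graph names are placeholders from my test setup):

```sparql
SELECT ?person ?cityName WHERE {
  GRAPH <http://people/graphs/JohnSmith> {
    ?person <http://www.w3.org/2001/vcard-rdf/3.0#Region> ?city .
  }
  GRAPH <http://graphs/cities> {
    ?city <http://www.w3.org/2000/01/rdf-schema#label> ?cityName .
  }
}
```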

So, to make it clear: the preferred way is replacing the repeated value
with resource URIs and avoiding the strings?

Thanks for the advice


Re: Storing a lot of strings in TDB store

Posted by Jean-Marc Vanel <je...@gmail.com>.
On Fri, 15 Feb 2019 at 13:45, Ekaterina Danilova <
katja.danilova94@gmail.com> wrote:

> > I understand that you have a database of Vcard stuff, but one must keep
> > in mind that Semantic Web is all about creating links, filling strings
> > is secondary.
>
> So, does it mean that creating resource is the better attitude in the
> sense of Semantic web but worse in the sense of performance?

No: both better in performance, and in the spirit of the Semantic Web.

Re: Storing a lot of strings in TDB store

Posted by Ekaterina Danilova <ka...@gmail.com>.
Thanks for pointing out the issue with New York. However, this is just
test data which I made up as an example; vCard was just an easy choice. My
actual database is not about vCard and consists of self-made properties
created with something like this:
public PropertyImpl( String uri )

The idea of my application is storing data and using the reasoning features
over it. It will have a lot of smaller graphs, which might have quite a lot
of repeating string data, so I am really interested in what might be good
for performance.

> I understand that you have a database of Vcard stuff, but one must keep in
> mind that Semantic Web is all about creating links, filling strings is
> secondary.
>
So, does it mean that creating a resource is the better attitude in the
sense of the Semantic Web but worse in the sense of performance?

And is there any information on how TDB2 actually keeps such string data?
Might it be that it actually saves it only once?



Re: Storing a lot of strings in TDB store

Posted by Jean-Marc Vanel <je...@gmail.com>.
First, this is bad practice:

http://people/JohnSmith http://www.w3.org/2001/vcard-rdf/3.0#Region "New
York" .

You should instead do

http://people/JohnSmith http://www.w3.org/2001/vcard-rdf/3.0#Region
dbpedia:NewYork .

that is,
http://dbpedia.org/resource/New_York

possibly with another object property, like
http://xmlns.com/foaf/0.1/based_near

I understand that you have a database of vCard stuff, but one must keep in
mind that the Semantic Web is all about creating links; filling in strings
is secondary.
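Once everyone points at the same city resource, "all people based near New
York" becomes a simple query (a sketch, using the FOAF property above):

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person WHERE {
  ?person foaf:based_near <http://dbpedia.org/resource/New_York> .
}
```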



And then there is no trouble with strings at all :) .

Jean-Marc Vanel
<http://163.172.179.125:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
 Chroniques jardin
<http://semantic-forms.cc:1952/backlinks?q=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>

