Posted to users@marmotta.apache.org by Fabian Cretton <Fa...@hevs.ch> on 2014/09/19 14:26:54 UTC
Data import from URL - LD cache operating behind the scene ?
Hi,
I imported data from a file's URL, and it seems that LD cache tries to update this file regularly. Is that right? If so, is this documented somewhere? It differs from what I have read about LD cache.
For instance, I imported:
http://sws.geonames.org/2658434/about.rdf
I imported the file into its own context, which I also called "http://sws.geonames.org/2658434/about.rdf"
And over the next days, I found this in the logs:
09:09:27.587 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieving resource data for http://sws.geonames.org/2658434/about.rdf from 'Linked Data' endpoint, request URI is <http://sws.geonames.org/2658434/about.rdf>
09:09:27.939 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieved 7 triples for resource http://sws.geonames.org/2658434/about.rdf; expiry date: Fri Sep 19 09:09:27 CEST 2014
(Did I do something other than just an 'import' that triggered this functionality? I don't think so, but I might have.)
I am very surprised by this functionality (and pleased :-)). But I am not sure it works correctly, so here are a few questions:
- When importing from a file into a context, does Marmotta automatically keep information about that import and its provenance (URL), and then regularly try to update the content even if not asked to do so?
- If so, is it standard behavior for the 'import'-'URL' functionality? Or even for 'import'-'file' (i.e. when a local file is updated on disk, is it uploaded to the store again)?
- If the file is not loaded in its own context, but mixed with other triples in an existing context, how can Marmotta handle this update (knowing which existing triples to remove if they were removed from the source, etc.)?
- Is it possible to activate/deactivate this functionality?
- Why 'retrieved 7 triples' when the context that contains that file has 141 triples? Is this a bug? Or does the algorithm try to retrieve only the 'modified' triples from the file?
- Is it already possible to achieve the same functionality with RDFa (for instance, or any data that can be retrieved by an LDClient)? That is, pointing to a web page that contains RDFa, retrieving its RDF content, and updating it on a regular basis when the original page's content changes?
Thank you
Fabian
Rép. : Re: Rép. : Re: Data import from URL - LD cache operating behind the scene ?
Posted by Fabian Cretton <Fa...@hevs.ch>.
Thank you Sergio,
>> Another critical aspect about LDCache I'd like (as others I guess) to
>> understand:
>> How does LDCache handle the 'update' of those information ? How does it
>> know that a triple came from a web resource (and should be updated),
>> and how does it know that a triple was added afterwards (and should be
>> kept for instance) ?
>> Let say I have a bunch of triples with
>> <http://www.websemantique.ch/people/fabiancretton> as a subject. Those
>> triples are loaded from the web. LDCache will do some updates, thus
>> deleting all triples where
>> <http://www.websemantique.ch/people/fabiancretton> is subject, and the
>> reimporting all triples where
>> <http://www.websemantique.ch/people/fabiancretton> is subject, is that
>> correct ?
>> in the mean while, I did add a new triple with
>> <http://www.websemantique.ch/people/fabiancretton> as a subject -> will
>> it be deleted by LD Cache ?
> Not sure. That's an inherited problem of the RDF data model and the Open
> World assumption. But actually that's a good point; I'll also discuss it
> with Jakob on Monday.
About this last point, I think I can guess the answer:
- LDCache only deals with triples in the 'cache' context (and doesn't touch the other ones)
- in the 'cache' context, all triples with the same subject come from the same source
and thus LDCache can handle updates easily
-> and this is one point where it differs significantly from our OverLOD use case
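The guess above can be made concrete with a small sketch. It only illustrates the hypothesis stated in this message, which the thread does not confirm; the function name `refresh_subject`, the context name "cache" and the sample predicates are all hypothetical and are not taken from Marmotta.

```python
from typing import Iterable, Set, Tuple

Quad = Tuple[str, str, str, str]    # (subject, predicate, object, context)
Triple = Tuple[str, str, str]       # (subject, predicate, object)

def refresh_subject(store: Set[Quad], subject: str,
                    fresh: Iterable[Triple],
                    cache_context: str = "cache") -> Set[Quad]:
    """Replace the cached statements about `subject`, leaving other contexts alone."""
    # Drop only the quads that are about `subject` AND live in the cache context.
    kept = {q for q in store
            if not (q[0] == subject and q[3] == cache_context)}
    # Re-add the freshly retrieved statements about `subject` to the cache context.
    kept.update((s, p, o, cache_context) for (s, p, o) in fresh if s == subject)
    return kept

ME = "http://www.websemantique.ch/people/fabiancretton"
store = {
    (ME, "foaf:name", "old value", "cache"),       # came from the web earlier
    (ME, "ex:note", "added by hand", "manual"),    # added locally afterwards
}
store = refresh_subject(store, ME, [(ME, "foaf:name", "new value")])
```

Under this assumption the hand-added triple survives the refresh because it lives outside the cache context, while the stale cached value is replaced.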
thanks again
Fabian
Re: Rép. : Re: Data import from URL - LD cache operating behind the scene ?
Posted by Sergio Fernández <se...@salzburgresearch.at>.
Hi Fabian,
On 19/09/14 16:12, Fabian Cretton wrote:
> Thank you Sergio, things become clearer, I start understanding why you
> said that our overLOD functionality really looks like LDCache.
That's good, things start to become clearer.
> But there is still a behavior I don't understand.
>
> I did load "http://sws.geonames.org/2658434/about.rdf" not being
> interested in the file's URI
> <http://sws.geonames.org/2658434/about.rdf>
> and its description's triples, but being obviously interested in
> <http://sws.geonames.org/2658434/> and its triples.
>
> As this .rdf file contains the two resources, I guess LDCache does try
> to update both of them, and:
> - for <http://sws.geonames.org/2658434/about.rdf> it is able to find
> the file, download it, and extract only the part where
> <http://sws.geonames.org/2658434/about.rdf> is a subject
> - for <http://sws.geonames.org/2658434/>, LDCache, without further
> configuration, is not able to handle the content negotiation that
> happens when opening this URL, and so there is no mention about any
> update of <http://sws.geonames.org/2658434/> in the log files.
>
> Is that correct ?
Not really. Geonames supports Linked Data:
$ curl -I http://sws.geonames.org/2658434/ -H "Accept: application/rdf+xml"
HTTP/1.1 303 See Other
Date: Fri, 19 Sep 2014 14:26:25 GMT
Server: Apache/2.2.17 (Linux/SUSE)
Location: http://sws.geonames.org/2658434/about.rdf
Content-Type: text/html; charset=iso-8859-1
So LDCache should update it too. We'd need to debug what's happening
there. Could you please verify that there are no traces of such an update
in the logs?
> But then, there is something still strange to me:
> I did also upload my personal foaf file:
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf
>
> This file doesn't contain any information about the resource
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf>, but only about
> myself: <http://www.websemantique.ch/people/fabiancretton>
>
> Nevertheless, LDCache does have the same behavior as for the geonames
> resource:
> 09:09:27.023 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieving
> resource data for
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf from 'Linked
> Data' endpoint, request URI is
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf>
> 09:09:27.352 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieved 0
> triples for resource
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf; expiry date:
> Fri Sep 19 09:09:27 CEST 2014
> So where did LDCache find
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf> ?
I might be wrong, and the imported URI could be registered somewhere
else for later updates. Probably Jakob would know it.
BTW, your personal URI is not retrievable according to the Linked Data
principles and the httpRange-14 resolution, because it returns 300
instead of the normative 303.
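For readers who want to repeat this check from a script rather than with curl, here is a minimal sketch. The function names `head_status` and `interpret` are made up for illustration; the labels follow the three cases named in the httpRange-14 resolution (2xx, 303, 4xx) and treat everything else, including 300, as not covered.

```python
import http.client
from urllib.parse import urlsplit

def head_status(url, accept="application/rdf+xml"):
    """Send one HEAD request WITHOUT following redirects.

    Returns (status_code, Location header or None). http.client never
    follows redirects on its own, so a 303 stays visible to the caller.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target, headers={"Accept": accept})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def interpret(status):
    """Map a status code to the cases of the httpRange-14 resolution."""
    if 200 <= status < 300:
        return "information resource"
    if status == 303:
        return "any resource (follow Location for a description)"
    if 400 <= status < 500:
        return "nature of the resource unknown"
    return "not covered by the resolution"
```

With the curl output shown earlier in this message, `interpret(303)` falls into the second case, whereas a 300 response ends up in the last one.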
> I did name the context
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf> -> does
> LDCache also take into account the contexts names ?
No, LDCache is not aware of contexts (named graphs).
> Another critical aspect about LDCache I'd like (as others I guess) to
> understand:
> How does LDCache handle the 'update' of those information ? How does it
> know that a triple came from a web resource (and should be updated),
> and how does it know that a triple was added afterwards (and should be
> kept for instance) ?
> Let say I have a bunch of triples with
> <http://www.websemantique.ch/people/fabiancretton> as a subject. Those
> triples are loaded from the web. LDCache will do some updates, thus
> deleting all triples where
> <http://www.websemantique.ch/people/fabiancretton> is subject, and the
> reimporting all triples where
> <http://www.websemantique.ch/people/fabiancretton> is subject, is that
> correct ?
> in the mean while, I did add a new triple with
> <http://www.websemantique.ch/people/fabiancretton> as a subject -> will
> it be deleted by LD Cache ?
Not sure. That's an inherited problem of the RDF data model and the Open
World assumption. But actually that's a good point; I'll also discuss it
with Jakob on Monday.
Thanks for your feedback.
Cheers,
--
Sergio Fernández
Senior Researcher
Knowledge and Media Technologies
Salzburg Research Forschungsgesellschaft mbH
Jakob-Haringer-Straße 5/3 | 5020 Salzburg, Austria
T: +43 662 2288 318 | M: +43 660 2747 925
sergio.fernandez@salzburgresearch.at
http://www.salzburgresearch.at
Rép. : Re: Data import from URL - LD cache operating behind the scene ?
Posted by Fabian Cretton <Fa...@hevs.ch>.
Thank you Sergio, things are becoming clearer. I am starting to understand
why you said that our overLOD functionality really looks like LDCache.
But there is still a behavior I don't understand.
I loaded "http://sws.geonames.org/2658434/about.rdf" not being
interested in the file's URI
<http://sws.geonames.org/2658434/about.rdf>
and its description's triples, but being obviously interested in
<http://sws.geonames.org/2658434/> and its triples.
As this .rdf file contains the two resources, I guess LDCache does try
to update both of them, and:
- for <http://sws.geonames.org/2658434/about.rdf> it is able to find
the file, download it, and extract only the part where
<http://sws.geonames.org/2658434/about.rdf> is a subject
- for <http://sws.geonames.org/2658434/>, LDCache, without further
configuration, is not able to handle the content negotiation that
happens when opening this URL, and so there is no mention of any
update of <http://sws.geonames.org/2658434/> in the log files.
Is that correct ?
But then, there is something still strange to me:
I also uploaded my personal FOAF file:
http://www.websemantique.ch/people/rdf/fabiancretton.rdf
This file doesn't contain any information about the resource
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf>, but only about
myself: <http://www.websemantique.ch/people/fabiancretton>
Nevertheless, LDCache does have the same behavior as for the geonames
resource:
09:09:27.023 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieving
resource data for
http://www.websemantique.ch/people/rdf/fabiancretton.rdf from 'Linked
Data' endpoint, request URI is
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf>
09:09:27.352 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieved 0
triples for resource
http://www.websemantique.ch/people/rdf/fabiancretton.rdf; expiry date:
Fri Sep 19 09:09:27 CEST 2014
So where did LDCache find
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf> ?
I did name the context
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf> -> does
LDCache also take into account the contexts names ?
Another critical aspect of LDCache I'd like (as others would, I guess) to
understand:
How does LDCache handle the 'update' of this information? How does it
know that a triple came from a web resource (and should be updated),
and how does it know that a triple was added afterwards (and should be
kept, for instance)?
Let's say I have a bunch of triples with
<http://www.websemantique.ch/people/fabiancretton> as a subject. Those
triples are loaded from the web. LDCache will do some updates, thus
deleting all triples where
<http://www.websemantique.ch/people/fabiancretton> is the subject, and then
reimporting all triples where
<http://www.websemantique.ch/people/fabiancretton> is the subject. Is that
correct?
In the meantime, I added a new triple with
<http://www.websemantique.ch/people/fabiancretton> as a subject -> will
it be deleted by LDCache?
Thank you
Fabian
>>> Sergio Fernández<wi...@apache.org> 19.09.2014 15:21 >>>
On 19/09/14 15:01, Sergio Fernández wrote:
>> - why 'retrieved 7 triples'...whereas the context that contains that
>> file does have 141 triples ? is this a bug ? or does the algorithm try
>> to retrieve only 'modified' triples with the file ?
>
> That's strange, yes. Internally LDCache would be using something like:
> https://gist.github.com/wikier/728e234bb998158bf9ec
>
> I've just included as a test:
> https://github.com/apache/marmotta/blob/b24553cdc877e5f39361c4dd7f0994b46b3ad707/libraries/ldclient/ldclient-provider-rdf/src/test/java/org/apache/marmotta/ldclient/test/rdf/TestLinkedDataProvider.java#L72
>
>
> And it actually retrieves 7 triples. I'd need to debug why.
Well, because in that document there are only 7 triples actually talking
about the resource <http://sws.geonames.org/2658434/about.rdf>:
<http://sws.geonames.org/2658434/about.rdf> a foaf:Document ;
foaf:primaryTopic <http://sws.geonames.org/2658434/> ;
cc:license <http://creativecommons.org/licenses/by/3.0/> ;
cc:attributionURL <http://sws.geonames.org/2658434/> ;
cc:attributionName "GeoNames"^^xsd:string ;
dcterms:created "2006-01-15"^^xsd:date ;
dcterms:modified "2012-02-24"^^xsd:date .
As I said, LDCache and LDClient work at the resource level, not at the
document level.
Hope that clarifies the issue.
Cheers,
--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernandez@redlink.co
w: http://redlink.co
Re: Data import from URL - LD cache operating behind the scene ?
Posted by Sergio Fernández <wi...@apache.org>.
On 19/09/14 15:01, Sergio Fernández wrote:
>> - why 'retrieved 7 triples'...whereas the context that contains that
>> file does have 141 triples ? is this a bug ? or does the algorithm try
>> to retrieve only 'modified' triples with the file ?
>
> That's strange, yes. Internally LDCache would be using something like:
> https://gist.github.com/wikier/728e234bb998158bf9ec
>
> I've just included as a test:
> https://github.com/apache/marmotta/blob/b24553cdc877e5f39361c4dd7f0994b46b3ad707/libraries/ldclient/ldclient-provider-rdf/src/test/java/org/apache/marmotta/ldclient/test/rdf/TestLinkedDataProvider.java#L72
>
>
> And it actually retrieves 7 triples. I'd need to debug why.
Well, because in that document there are only 7 triples actually talking
about the resource <http://sws.geonames.org/2658434/about.rdf>:
<http://sws.geonames.org/2658434/about.rdf> a foaf:Document ;
foaf:primaryTopic <http://sws.geonames.org/2658434/> ;
cc:license <http://creativecommons.org/licenses/by/3.0/> ;
cc:attributionURL <http://sws.geonames.org/2658434/> ;
cc:attributionName "GeoNames"^^xsd:string ;
dcterms:created "2006-01-15"^^xsd:date ;
dcterms:modified "2012-02-24"^^xsd:date .
As I said, LDCache and LDClient work at the resource level, not at the
document level.
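To make the resource-level view concrete, here is a small sketch of the filtering idea: out of everything a document contains, keep only the statements whose subject is the requested resource. This is an illustration only, not LDClient code; `triples_about` is a hypothetical helper, and the two statements about the place itself are placeholders standing in for the rest of the real file.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def triples_about(resource: str, document: Iterable[Triple]) -> List[Triple]:
    """Keep only the triples whose subject is the requested resource."""
    return [t for t in document if t[0] == resource]

DOC = "http://sws.geonames.org/2658434/about.rdf"
PLACE = "http://sws.geonames.org/2658434/"

document = [
    # The 7 statements about the document, as listed above.
    (DOC, "rdf:type", "foaf:Document"),
    (DOC, "foaf:primaryTopic", PLACE),
    (DOC, "cc:license", "http://creativecommons.org/licenses/by/3.0/"),
    (DOC, "cc:attributionURL", PLACE),
    (DOC, "cc:attributionName", "GeoNames"),
    (DOC, "dcterms:created", "2006-01-15"),
    (DOC, "dcterms:modified", "2012-02-24"),
    # Placeholder statements about the place (the real file has many more).
    (PLACE, "ex:placeholder1", "..."),
    (PLACE, "ex:placeholder2", "..."),
]

print(len(triples_about(DOC, document)))  # 7
```

Asking for the document URI yields only those 7 statements, no matter how many other statements the same file carries about the place.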
Hope that clarifies the issue.
Cheers,
--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernandez@redlink.co
w: http://redlink.co
Re: Data import from URL - LD cache operating behind the scene ?
Posted by Sergio Fernández <wi...@apache.org>.
Hi Fabian,
On 19/09/14 14:26, Fabian Cretton wrote:
> I did 'import' data from a file's URL, and it seems that LD cache does try to update this file regularly, is that right ? if yes, is it something described somewhere as it is, to me, different from what I read about LD cache.
That's the expected behavior of LDCache, right. The cache transparently
updates the resources according to the configured endpoints (the
raw Linked Data endpoint is usually the fallback).
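As a rough illustration of what "transparently updates" can mean, here is a generic expiry-based cache. It is not Marmotta's implementation: `ExpiringCache`, `fetch` and `ttl` are hypothetical names, and the only link to this thread is the "expiry date" field visible in the quoted log lines.

```python
from datetime import datetime, timedelta

class ExpiringCache:
    """Cache per-resource data and refetch it once its expiry date has passed."""

    def __init__(self, fetch, ttl):
        self._fetch = fetch    # callable: uri -> list of triples
        self._ttl = ttl        # how long a retrieved entry stays fresh
        self._entries = {}     # uri -> (triples, expiry datetime)

    def get(self, uri, now):
        entry = self._entries.get(uri)
        if entry is None or now >= entry[1]:
            # Missing or expired: retrieve again and store a new expiry date.
            entry = (self._fetch(uri), now + self._ttl)
            self._entries[uri] = entry
        return entry[0]

fetched = []  # records every retrieval, to make refreshes visible

def fake_fetch(uri):
    fetched.append(uri)
    return [(uri, "rdf:type", "foaf:Document")]

cache = ExpiringCache(fake_fetch, ttl=timedelta(hours=1))
t0 = datetime(2014, 9, 19, 9, 9, 27)
cache.get("http://example.org/r", t0)                         # first retrieval
cache.get("http://example.org/r", t0 + timedelta(minutes=30)) # still fresh
cache.get("http://example.org/r", t0 + timedelta(hours=2))    # expired, refetched
```

The caller never asks for a refresh explicitly; it happens as a side effect of reading an expired entry.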
> For instance, I did import:
>
> http://sws.geonames.org/2658434/about.rdf
>
> I did import the file in its own context, which I also called "http://sws.geonames.org/2658434/about.rdf"
>
> And the next days, I find in the logs:
> 09:09:27.587 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieving resource data for http://sws.geonames.org/2658434/about.rdf from 'Linked Data' endpoint, request URI is <http://sws.geonames.org/2658434/about.rdf>
> 09:09:27.939 INFO o.a.m.l.s.p.AbstractHttpProvider - retrieved 7 triples for resource http://sws.geonames.org/2658434/about.rdf; expiry date: Fri Sep 19 09:09:27 CEST 2014
>
> (Did I do something else than just an 'import', that did trigger that functionality ? I don't think so, but I might have)
>
> I am very surprised of this functionality (and pleased :-)).But I am not sure it works correctly, and here are a few questions:
>
> - When importing from a file into a context, do Marmotta automatically keep information about that import and its provenance (url), and then regularly try to update the content even if not asked to do that ?
Not yet, but planned: https://issues.apache.org/jira/browse/MARMOTTA-146
> - If so, is it a standard functionality for the 'import'-'URL' functionality ? or even for the 'import'-'file' (i.e. when a local file is updated on disk, it is uploaded in the store' ?
The cache just works at the resource level; it does not matter how the
data initially came in. So it wouldn't try to fetch the file itself again,
but the resources described by the file, regardless of whether the file
came by URL or was uploaded locally.
> - If the file is not loaded in it own context, but mixed with other triples in an existing context, then how can Marmotta handle this update ? (knowing which existing triples to remove if they were removed from the source, etc.)
> - is it possible to activate/deactivate this functionality
Yes, by defining a blacklist endpoint for LDCache. Further details at
http://marmotta.apache.org/platform/ldcache-module.html
> - why 'retrieved 7 triples'...whereas the context that contains that file does have 141 triples ? is this a bug ? or does the algorithm try to retrieve only 'modified' triples with the file ?
That's strange, yes. Internally LDCache would be using something like:
https://gist.github.com/wikier/728e234bb998158bf9ec
I've just included as a test:
https://github.com/apache/marmotta/blob/b24553cdc877e5f39361c4dd7f0994b46b3ad707/libraries/ldclient/ldclient-provider-rdf/src/test/java/org/apache/marmotta/ldclient/test/rdf/TestLinkedDataProvider.java#L72
And it actually retrieves 7 triples. I'd need to debug why.
> - is it already implemented to achieve the same functionality with RDFa (for instance, or any data that can be retrieved by a LDClient) ? Pointing to a web page that contains RDFa, retrieving its RDF content, and update it on a regular basis when the original page's content changes ?
Yes, you just need to register an LDCache endpoint using the RDFa data
provider.
Hope that helps.
Cheers,
--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernandez@redlink.co
w: http://redlink.co