Posted to users@marmotta.apache.org by Fabian Cretton <Fa...@hevs.ch> on 2014/09/19 14:26:54 UTC

Data import from URL - LD cache operating behind the scene ?

Hi,
 
I imported data from a file's URL, and it seems that LDCache tries to update this file regularly. Is that right? If so, is this behavior described somewhere? It seems different from what I have read about LDCache.
 
For instance, I imported:

http://sws.geonames.org/2658434/about.rdf
 
I imported the file into its own context, which I also named "http://sws.geonames.org/2658434/about.rdf"
 
And on the following days, I found in the logs:
09:09:27.587 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieving resource data for http://sws.geonames.org/2658434/about.rdf from 'Linked Data' endpoint, request URI is <http://sws.geonames.org/2658434/about.rdf>
09:09:27.939 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieved 7 triples for resource http://sws.geonames.org/2658434/about.rdf; expiry date: Fri Sep 19 09:09:27 CEST 2014
(Did I do something other than just an 'import' that triggered this functionality? I don't think so, but I might have.)
 
I am very surprised by this functionality (and pleased :-)). But I am not sure it works correctly, so here are a few questions:
- When importing from a file into a context, does Marmotta automatically keep information about that import and its provenance (URL), and then regularly try to update the content even if not asked to do so?
- If so, is it standard behavior for the 'import'-'URL' functionality? Or even for 'import'-'file' (i.e. when a local file is updated on disk, is it re-uploaded into the store)?
- If the file is not loaded into its own context, but mixed with other triples in an existing context, how can Marmotta handle this update (knowing which existing triples to remove if they were removed from the source, etc.)?
- Is it possible to activate/deactivate this functionality?
- Why 'retrieved 7 triples', whereas the context that contains that file has 141 triples? Is this a bug? Or does the algorithm try to retrieve only the 'modified' triples from the file?
- Is it already possible to achieve the same functionality with RDFa (for instance, or any data that can be retrieved by an LDClient)? That is, pointing to a web page that contains RDFa, retrieving its RDF content, and updating it on a regular basis when the original page's content changes?
 
Thank you
Fabian

Rép. : Re: Rép. : Re: Data import from URL - LD cache operating behind the scene ?

Posted by Fabian Cretton <Fa...@hevs.ch>.

Thank you Sergio,

>> Another critical aspect of LDCache that I'd like (as others would, I guess) to
>> understand:
>> How does LDCache handle the 'update' of this information? How does it
>> know that a triple came from a web resource (and should be updated),
>> and how does it know that a triple was added afterwards (and should be
>> kept, for instance)?
>> Let's say I have a bunch of triples with
>> <http://www.websemantique.ch/people/fabiancretton> as the subject. Those
>> triples are loaded from the web. LDCache will do some updates, thus
>> deleting all triples where
>> <http://www.websemantique.ch/people/fabiancretton> is the subject, and then
>> reimporting all triples where
>> <http://www.websemantique.ch/people/fabiancretton> is the subject. Is that
>> correct?
>> In the meantime, I added a new triple with
>> <http://www.websemantique.ch/people/fabiancretton> as the subject -> will
>> it be deleted by LDCache?

> Not sure. That's an inherited problem of the RDF data model and the Open 
> World assumption. But actually that's a good point; I'll also discuss it 
> with Jakob on Monday.

About this last point, I think I can guess the answer:
- LDCache only deals with triples in the 'cache' context (and doesn't touch the other ones)
- in the 'cache' context, all triples with the same subject come from the same source
and thus LDCache can handle updates easily

-> and this is one point very different from our OverLOD use case
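
The update behavior guessed above can be sketched as follows. This is only a toy model in Python, not LDCache's actual implementation: the quad store is a plain set of tuples, the example values are made up, and the names CACHE_CONTEXT and refresh_resource are hypothetical.

```python
# Toy model of a subject-scoped cache refresh. This is NOT Marmotta code;
# it only illustrates the guess above. All names and values are hypothetical.

CACHE_CONTEXT = "urn:example:cache"  # assumed name for the cache context


def refresh_resource(store, subject, fresh_triples):
    """Replace the cached triples about `subject` with freshly retrieved ones.

    `store` is a set of (subject, predicate, object, context) tuples.
    Only quads in CACHE_CONTEXT are touched; quads in other contexts survive.
    """
    stale = {q for q in store if q[0] == subject and q[3] == CACHE_CONTEXT}
    store -= stale  # in-place removal of the old cached description
    store |= {(s, p, o, CACHE_CONTEXT) for (s, p, o) in fresh_triples if s == subject}
    return store


me = "http://www.websemantique.ch/people/fabiancretton"
store = {
    (me, "ex:p1", "cached value", CACHE_CONTEXT),           # cached from the web
    (me, "ex:p2", "removed at the source", CACHE_CONTEXT),  # cached, now gone remotely
    (me, "ex:note", "added by hand", "urn:example:manual"), # added afterwards
}
refresh_resource(store, me, [(me, "ex:p1", "cached value")])
```

Under these assumptions the manually added triple survives the refresh because it lives outside the cache context, while the cached triple that disappeared from the source is dropped.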

thanks again
Fabian

Re: Rép. : Re: Data import from URL - LD cache operating behind the scene ?

Posted by Sergio Fernández <se...@salzburgresearch.at>.

Hi Fabian,

On 19/09/14 16:12, Fabian Cretton wrote:
> Thank you Sergio, things are becoming clearer; I am starting to understand why you
> said that our overLOD functionality really looks like LDCache.

That's good, things start to become clearer.

> But there is still a behavior I don't understand.
>
> I loaded "http://sws.geonames.org/2658434/about.rdf", not being
> interested in the file's URI
> <http://sws.geonames.org/2658434/about.rdf>
> and the triples describing it, but obviously being interested in
> <http://sws.geonames.org/2658434/> and its triples.
>
> As this .rdf file contains the two resources, I guess LDCache tries
> to update both of them, and:
> - for <http://sws.geonames.org/2658434/about.rdf> it is able to find
> the file, download it, and extract only the part where
> <http://sws.geonames.org/2658434/about.rdf> is the subject
> - for <http://sws.geonames.org/2658434/>, LDCache, without further
> configuration, is not able to handle the content negotiation that
> happens when opening this URL, and so there is no mention of any
> update of <http://sws.geonames.org/2658434/> in the log files.
>
> Is that correct?

Not really. Geonames supports Linked Data:

$ curl -I http://sws.geonames.org/2658434/ -H "Accept: application/rdf+xml"
HTTP/1.1 303 See Other
Date: Fri, 19 Sep 2014 14:26:25 GMT
Server: Apache/2.2.17 (Linux/SUSE)
Location: http://sws.geonames.org/2658434/about.rdf
Content-Type: text/html; charset=iso-8859-1

So LDCache should update it too. We'd need to debug what's happening 
there. Could you please verify that there are no traces of such an update in the 
logs?

> But then, there is something that is still strange to me:
> I also uploaded my personal FOAF file:
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf
>
> This file doesn't contain any information about the resource
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf>, but only about
> myself: <http://www.websemantique.ch/people/fabiancretton>
>
> Nevertheless, LDCache shows the same behavior as for the geonames
> resource:
> 09:09:27.023 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieving
> resource data for
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf from 'Linked
> Data' endpoint, request URI is
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf>
> 09:09:27.352 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieved 0
> triples for resource
> http://www.websemantique.ch/people/rdf/fabiancretton.rdf; expiry date:
> Fri Sep 19 09:09:27 CEST 2014
> So where did LDCache find
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf> ?

I might be wrong, and the imported URI could be registered somewhere 
else for later updates. Jakob would probably know.

BTW, your personal URI is not retrievable according to the Linked Data 
Principles and the httpRange-14 resolution, because it returns 300 
instead of the normative 303.
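
As a rough illustration of the point above: the httpRange-14 resolution distinguishes three cases based on the status code returned for a GET on the URI (2xx, 303 and 4xx). The Python helper below is only a sketch of that reading; the function name is hypothetical and it is not part of Marmotta.

```python
def httprange14_hint(status):
    """Rough reading of an HTTP status code under the httpRange-14 resolution."""
    if 200 <= status < 300:
        # 2xx: the URI identifies an information resource (a document)
        return "information resource"
    if status == 303:
        # 303 See Other: the URI could identify any resource;
        # the Location header points to a description of it
        return "any resource, see Location"
    if 400 <= status < 500:
        # 4xx: the nature of the resource is unknown
        return "unknown"
    # Other codes, e.g. 300 Multiple Choices, are not covered by the resolution
    return "not covered by the resolution"


print(httprange14_hint(303))
print(httprange14_hint(300))
```

A 300 response therefore falls outside the three cases, which is why a client that follows the resolution strictly may not treat it as a pointer to a description.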

> I named the context
> <http://www.websemantique.ch/people/rdf/fabiancretton.rdf> -> does
> LDCache also take the context names into account?

No, LDCache is not aware of contexts (named graphs).

> Another critical aspect of LDCache that I'd like (as others would, I guess) to
> understand:
> How does LDCache handle the 'update' of this information? How does it
> know that a triple came from a web resource (and should be updated),
> and how does it know that a triple was added afterwards (and should be
> kept, for instance)?
> Let's say I have a bunch of triples with
> <http://www.websemantique.ch/people/fabiancretton> as the subject. Those
> triples are loaded from the web. LDCache will do some updates, thus
> deleting all triples where
> <http://www.websemantique.ch/people/fabiancretton> is the subject, and then
> reimporting all triples where
> <http://www.websemantique.ch/people/fabiancretton> is the subject. Is that
> correct?
> In the meantime, I added a new triple with
> <http://www.websemantique.ch/people/fabiancretton> as the subject -> will
> it be deleted by LDCache?

Not sure. That's an inherited problem of the RDF data model and the Open 
World assumption. But actually that's a good point; I'll also discuss it 
with Jakob on Monday.

Thanks for your feedback.

Cheers,

-- 
Sergio Fernández
Senior Researcher
Knowledge and Media Technologies
Salzburg Research Forschungsgesellschaft mbH
Jakob-Haringer-Straße 5/3 | 5020 Salzburg, Austria
T: +43 662 2288 318 | M: +43 660 2747 925
sergio.fernandez@salzburgresearch.at
http://www.salzburgresearch.at

Rép. : Re: Data import from URL - LD cache operating behind the scene ?

Posted by Fabian Cretton <Fa...@hevs.ch>.

Thank you Sergio, things are becoming clearer; I am starting to understand why you
said that our overLOD functionality really looks like LDCache.

But there is still a behavior I don't understand.

I loaded "http://sws.geonames.org/2658434/about.rdf", not being
interested in the file's URI
<http://sws.geonames.org/2658434/about.rdf>
and the triples describing it, but obviously being interested in
<http://sws.geonames.org/2658434/> and its triples.

As this .rdf file contains the two resources, I guess LDCache tries
to update both of them, and:
- for <http://sws.geonames.org/2658434/about.rdf> it is able to find
the file, download it, and extract only the part where
<http://sws.geonames.org/2658434/about.rdf> is the subject
- for <http://sws.geonames.org/2658434/>, LDCache, without further
configuration, is not able to handle the content negotiation that
happens when opening this URL, and so there is no mention of any
update of <http://sws.geonames.org/2658434/> in the log files.

Is that correct?

But then, there is something that is still strange to me:
I also uploaded my personal FOAF file:
http://www.websemantique.ch/people/rdf/fabiancretton.rdf

This file doesn't contain any information about the resource
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf>, but only about
myself: <http://www.websemantique.ch/people/fabiancretton>

Nevertheless, LDCache shows the same behavior as for the geonames
resource:
09:09:27.023 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieving
resource data for
http://www.websemantique.ch/people/rdf/fabiancretton.rdf from 'Linked
Data' endpoint, request URI is
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf>
09:09:27.352 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieved 0
triples for resource
http://www.websemantique.ch/people/rdf/fabiancretton.rdf; expiry date:
Fri Sep 19 09:09:27 CEST 2014
So where did LDCache find
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf> ?
I named the context
<http://www.websemantique.ch/people/rdf/fabiancretton.rdf> -> does
LDCache also take the context names into account?

Another critical aspect of LDCache that I'd like (as others would, I guess) to
understand:
How does LDCache handle the 'update' of this information? How does it
know that a triple came from a web resource (and should be updated),
and how does it know that a triple was added afterwards (and should be
kept, for instance)?
Let's say I have a bunch of triples with
<http://www.websemantique.ch/people/fabiancretton> as the subject. Those
triples are loaded from the web. LDCache will do some updates, thus
deleting all triples where
<http://www.websemantique.ch/people/fabiancretton> is the subject, and then
reimporting all triples where
<http://www.websemantique.ch/people/fabiancretton> is the subject. Is that
correct?
In the meantime, I added a new triple with
<http://www.websemantique.ch/people/fabiancretton> as the subject -> will
it be deleted by LDCache?

Thank you
Fabian

Re: Data import from URL - LD cache operating behind the scene ?

Posted by Sergio Fernández <wi...@apache.org>.

On 19/09/14 15:01, Sergio Fernández wrote:
>> - Why 'retrieved 7 triples', whereas the context that contains that
>> file has 141 triples? Is this a bug? Or does the algorithm try
>> to retrieve only the 'modified' triples from the file?
>
> That's strange, yes. Internally LDCache would be using something like:
> https://gist.github.com/wikier/728e234bb998158bf9ec
>
> I've just included it as a test:
> https://github.com/apache/marmotta/blob/b24553cdc877e5f39361c4dd7f0994b46b3ad707/libraries/ldclient/ldclient-provider-rdf/src/test/java/org/apache/marmotta/ldclient/test/rdf/TestLinkedDataProvider.java#L72
>
>
> And it actually retrieves 7 triples. I'd need to debug why.

Well, because in that document there are only 7 triples actually talking 
about the resource <http://sws.geonames.org/2658434/about.rdf>:

<http://sws.geonames.org/2658434/about.rdf> a foaf:Document ;
   foaf:primaryTopic <http://sws.geonames.org/2658434/> ;
   cc:license <http://creativecommons.org/licenses/by/3.0/> ;
   cc:attributionURL <http://sws.geonames.org/2658434/> ;
   cc:attributionName "GeoNames"^^xsd:string ;
   dcterms:created "2006-01-15"^^xsd:date ;
   dcterms:modified "2012-02-24"^^xsd:date .

As I said, LDCache and LDClient work at the resource level, not at the 
document level.
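
To illustrate the resource-level view with a sketch: if a document is modelled as a list of (subject, predicate, object) tuples, only the triples whose subject is the requested resource are kept. This toy Python example uses a shortened excerpt with placeholder values rather than the real 141-triple file, and triples_about is a hypothetical helper, not an LDClient function.

```python
DOC = "http://sws.geonames.org/2658434/about.rdf"
PLACE = "http://sws.geonames.org/2658434/"

# Shortened, illustrative excerpt of a document describing two resources.
# The predicates and objects for PLACE are placeholders, not real data.
document = [
    (DOC, "rdf:type", "foaf:Document"),
    (DOC, "foaf:primaryTopic", PLACE),
    (PLACE, "ex:p1", "placeholder 1"),
    (PLACE, "ex:p2", "placeholder 2"),
    (PLACE, "ex:p3", "placeholder 3"),
]


def triples_about(resource, triples):
    """Keep only the triples whose subject is `resource`."""
    return [t for t in triples if t[0] == resource]


print(len(triples_about(DOC, document)))    # triples about the document URI
print(len(triples_about(PLACE, document)))  # triples about the place itself
```

Requesting the document URI yields only the triples about the document, even though the same file also describes the place; that is the same effect as getting 7 triples out of a file with 141.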

Hope that clarifies the issue.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernandez@redlink.co
w: http://redlink.co

Re: Data import from URL - LD cache operating behind the scene ?

Posted by Sergio Fernández <wi...@apache.org>.

Hi Fabian,

On 19/09/14 14:26, Fabian Cretton wrote:
> I imported data from a file's URL, and it seems that LDCache tries to update this file regularly. Is that right? If so, is this behavior described somewhere? It seems different from what I have read about LDCache.

That's the expected behavior of LDCache, right. The cache transparently 
updates the resources according to the configured endpoints (the 
raw Linked Data endpoint is usually the fallback).
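
A minimal sketch of what such a transparent refresh could look like, assuming each cached resource carries an expiry date (the log lines quoted in this thread do print one). This is an illustration in Python, not LDCache's implementation; CacheEntry, needs_refresh and the one-day lifetime are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheEntry:
    resource: str          # URI of the cached resource
    expires_at: datetime   # from this instant on, the entry is considered stale


def needs_refresh(entry, now):
    """An entry is refreshed once its expiry date has been reached."""
    return now >= entry.expires_at


retrieved = datetime(2014, 9, 19, 9, 9, 27)
entry = CacheEntry(
    resource="http://sws.geonames.org/2658434/about.rdf",
    expires_at=retrieved + timedelta(days=1),  # assumed lifetime, illustration only
)
print(needs_refresh(entry, retrieved))                      # False
print(needs_refresh(entry, retrieved + timedelta(days=2)))  # True
```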

> For instance, I imported:
>
> http://sws.geonames.org/2658434/about.rdf
>
> I imported the file into its own context, which I also named "http://sws.geonames.org/2658434/about.rdf"
>
> And on the following days, I found in the logs:
> 09:09:27.587 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieving resource data for http://sws.geonames.org/2658434/about.rdf from 'Linked Data' endpoint, request URI is <http://sws.geonames.org/2658434/about.rdf>
> 09:09:27.939 INFO  o.a.m.l.s.p.AbstractHttpProvider - retrieved 7 triples for resource http://sws.geonames.org/2658434/about.rdf; expiry date: Fri Sep 19 09:09:27 CEST 2014
>
> (Did I do something other than just an 'import' that triggered this functionality? I don't think so, but I might have.)
>
> I am very surprised by this functionality (and pleased :-)). But I am not sure it works correctly, so here are a few questions:
>
> - When importing from a file into a context, does Marmotta automatically keep information about that import and its provenance (URL), and then regularly try to update the content even if not asked to do so?

Not yet, but planned: https://issues.apache.org/jira/browse/MARMOTTA-146

> - If so, is it standard behavior for the 'import'-'URL' functionality? Or even for 'import'-'file' (i.e. when a local file is updated on disk, is it re-uploaded into the store)?

The cache just works at the resource level; it doesn't matter how the data 
initially came in. So it wouldn't try to fetch the file itself again, but 
rather the resources described by the file, no matter whether the file came 
from a URL or was uploaded locally.

> - If the file is not loaded into its own context, but mixed with other triples in an existing context, how can Marmotta handle this update (knowing which existing triples to remove if they were removed from the source, etc.)?
> - Is it possible to activate/deactivate this functionality?

Yes, by defining a blacklist endpoint for LDCache. Further details at 
http://marmotta.apache.org/platform/ldcache-module.html
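
Conceptually, a blacklist endpoint tells the cache not to retrieve resources whose URI matches a given pattern. The Python sketch below only illustrates that idea, assuming the endpoint is selected by a URI pattern; it does not reflect Marmotta's actual configuration format (see the page linked above), and BLACKLIST and is_blacklisted are hypothetical names.

```python
import re

# Hypothetical blacklist: URI patterns the cache should never try to refresh.
BLACKLIST = [re.compile(r"^http://www\.websemantique\.ch/people/rdf/.*")]


def is_blacklisted(uri):
    """Return True if `uri` matches any blacklisted pattern."""
    return any(p.match(uri) for p in BLACKLIST)


print(is_blacklisted("http://www.websemantique.ch/people/rdf/fabiancretton.rdf"))
print(is_blacklisted("http://sws.geonames.org/2658434/"))
```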

> - Why 'retrieved 7 triples', whereas the context that contains that file has 141 triples? Is this a bug? Or does the algorithm try to retrieve only the 'modified' triples from the file?

That's strange, yes. Internally LDCache would be using something like: 
https://gist.github.com/wikier/728e234bb998158bf9ec

I've just included it as a test: 
https://github.com/apache/marmotta/blob/b24553cdc877e5f39361c4dd7f0994b46b3ad707/libraries/ldclient/ldclient-provider-rdf/src/test/java/org/apache/marmotta/ldclient/test/rdf/TestLinkedDataProvider.java#L72

And it actually retrieves 7 triples. I'd need to debug why.

> - Is it already possible to achieve the same functionality with RDFa (for instance, or any data that can be retrieved by an LDClient)? That is, pointing to a web page that contains RDFa, retrieving its RDF content, and updating it on a regular basis when the original page's content changes?

Yes, you just need to register an LDCache endpoint using the RDFa data 
provider.

Hope that helps.

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 660 2747 925
e: sergio.fernandez@redlink.co
w: http://redlink.co