Posted to users@jena.apache.org by Jeremy Debattista <de...@iai.uni-bonn.de> on 2015/04/09 13:48:20 UTC

Parsing snapshots of Semantic resources

Hi All,

I am trying to figure out how to check whether a given resource has some semantic content. To give more context: given the resource http://imf.270a.info/data/imf.observations.ttl, a method should return true without having to parse the whole file/resource (files such as this one can be very large, > 400MB). On the other hand, the same method should return false if the passed resource is http://www.google.com .

My initial idea (code available in [1]) was to parse the resource using RDFDataMgr into a piped RDF stream, then wait on iterator.hasNext(). If the iterator yields a result, I close the RDF stream and return true. The problem is that when a “non-semantic” resource is passed, execution blocks indefinitely on iterator.hasNext(). To work around this, I deliberately provoke exceptions, for example by closing the RDF stream. This hack seems to work well, but it does not seem right to me. My feeling is that RDFDataMgr.parse(…) should close the stream and iterator when it throws a RIOTException because a resource could not be parsed, but I might be wrong here.
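The blocking problem described above can be sketched without any Jena classes. The following is only a minimal illustration, not the code in [1]: `FirstItemProbe`, `emitsAnything` and the `producer` parameter are hypothetical names, and the producer stands in for whatever parser thread feeds the piped stream. The idea is that the producer always queues an end marker in a `finally` block, and the consumer waits with a timeout, so a parse failure can no longer leave the consumer waiting forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class FirstItemProbe {
    // Sentinel queued when the producer stops, whether it finished or failed.
    private static final Object END = new Object();

    /**
     * Returns true if the (hypothetical) producer emits at least one item
     * before it finishes, fails, or the timeout elapses.
     */
    static boolean emitsAnything(Consumer<Consumer<Object>> producer, long timeoutMillis) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);
        Thread worker = new Thread(() -> {
            try {
                producer.accept(item -> {
                    try {
                        queue.put(item); // blocks when full, wakes up on interrupt
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("probe cancelled", e);
                    }
                });
            } catch (RuntimeException e) {
                // Parse failure or cancellation: fall through to the sentinel.
            } finally {
                queue.offer(END); // never blocks; dropped if the queue is full
            }
        });
        worker.setDaemon(true); // a stuck producer must not keep the JVM alive
        worker.start();
        try {
            // Wait for the first item, the end marker, or the timeout.
            Object first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return first != null && first != END;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            worker.interrupt(); // ask the producer to stop; we have our answer
        }
    }
}
```

A call such as `FirstItemProbe.emitsAnything(sink -> sink.accept("triple"), 2000)` returns true, while a producer that throws immediately yields false. Note that the interrupt only stops a producer that is blocked on the queue; a producer stuck in network I/O would need its own timeouts.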

Would you have any other idea how this “snapshot” checking/parsing can be done?

Best Regards,
Jeremy

[1] https://raw.githubusercontent.com/diachron/quality/b45832d3111f28e7cc78799f4a074c6c88a6a51a/lod-qualitymetrics/lod-qualitymetrics-accessibility/src/main/java/eu/diachron/qualitymetrics/accessibility/availability/helper/ModelParser.java

Re: Parsing snapshots of Semantic resources

Posted by Jeremy Debattista <de...@iai.uni-bonn.de>.
> How about looking at the HTTP response headers?  They tell you what the content type is (well, sort of - not perfectly reliable).
> 
> http://www.google.com is HTML.
> 

Response headers are not always the best solution. I am also checking whether the Content-Type a resource reports actually matches the content of the file served.

> If you want semantic extraction, have you considered using Apache Any23?

I will check that out. Thanks!

Jeremy

Re: Parsing snapshots of Semantic resources

Posted by Andy Seaborne <an...@apache.org>.
On 09/04/15 12:48, Jeremy Debattista wrote:
> Hi All,
>
> I am trying to figure out how to check if a given resource has some
> semantic content or not. To give you more context on this issue,
> imagine I have the resource:
> http://imf.270a.info/data/imf.observations.ttl, a method should
> return true, without having to parse the whole file/resource (because
> files such as this one could be very large > 400MB). On the other
> hand the same method returns false if the passed resource is
> http://www.google.com .
>
> My initial idea (code available in [1]) was that I parse the
> resources using the RDFDataMgr into a piped rdf stream, then waiting
> on an iterator.hasNext(). If the iterator gives some result, then
> close the rdf stream and return true. The problem was that when a
> “non-semantic” resource is passed, the execution would loop
> indefinitely on the iterator.hasNext(). For this, I was deliberately
> trying to invoke exceptions like closing the rdf stream. This hack
> seems to work well, but it does not seem right to me. My feeling is
> that RDFDataMgr.parse(…) should close the stream and iterator when
> throwing a RIOTException that a resource could not be parsed, but I
> might be wrong here.
>
> Would you have any other idea how this “snapshot” checking/parsing
> can be done?

How about looking at the HTTP response headers?  They tell you what the 
content type is (well, sort of - not perfectly reliable).

http://www.google.com is HTML.
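
As a rough illustration of the header check suggested above, the sketch below issues a HEAD request and compares the reported Content-Type against a list of media types commonly used for RDF serializations. The class and method names (`ContentTypeCheck`, `looksLikeRdf`, `fetchContentType`), the Accept header value and the timeouts are all assumptions made for the example, and as noted the header is not perfectly reliable.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;
import java.util.Set;

final class ContentTypeCheck {
    // Media types commonly used for RDF serializations (not exhaustive).
    private static final Set<String> RDF_TYPES = Set.of(
        "text/turtle", "application/rdf+xml", "application/n-triples",
        "application/n-quads", "application/ld+json", "application/trig", "text/n3");

    /** True if the header value, ignoring parameters such as charset, names an RDF media type. */
    static boolean looksLikeRdf(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return RDF_TYPES.contains(mediaType);
    }

    /** Sends a HEAD request and returns the Content-Type header, or null if absent. */
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body is downloaded
        conn.setRequestProperty("Accept", "text/turtle, application/rdf+xml;q=0.9, */*;q=0.1");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

With this sketch, `ContentTypeCheck.looksLikeRdf("text/html; charset=ISO-8859-1")` is false and `ContentTypeCheck.looksLikeRdf("text/turtle; charset=utf-8")` is true; a server that mislabels its content would still fool it.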

If you want semantic extraction, have you considered using Apache Any23?

	Andy

>
> Best Regards, Jeremy
>
> [1]
> https://raw.githubusercontent.com/diachron/quality/b45832d3111f28e7cc78799f4a074c6c88a6a51a/lod-qualitymetrics/lod-qualitymetrics-accessibility/src/main/java/eu/diachron/qualitymetrics/accessibility/availability/helper/ModelParser.java
>