Posted to dev@clerezza.apache.org by Alessandro Adamou <ad...@cs.unibo.it> on 2012/08/14 13:53:42 UTC

Setting a read limit when parsing a Graph

Hi,

I need to write a function that performs lookahead of the OWL ontology 
ID for a Graph, therefore it has to scan the content up to a certain 
point to see if it has found an ontology IRI / version IRI pair.

I thought that setting mark() on a BufferedInputStream would do the trick, 
something like:

MGraph graph = new SimpleMGraph();
BufferedInputStream bIn = new BufferedInputStream(content);
bIn.mark(1240); // Read up to 1k
parser.parse(graph, bIn, SupportedFormat.RDF_XML);

(parser has a Jena parser provider registered)

But apparently this is not working. Even for streams much longer than 1 
kiB, with the interesting triples right at the very end, these triples 
are always found.

Does the Clerezza parser override the marks on a buffered stream, or maybe 
Jena is doing so? Or even better, am I doing this wrong?

Best,
-- Alessandro

-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, just don't demand anything."
(Ettore Petrolini, 1917)

Not sent from my iSnobTechDevice

Re: Setting a read limit when parsing a Graph

Posted by Andy Seaborne <an...@apache.org>.

On 14/08/12 13:09, Reto Bachmann-Gmür wrote:
> Hi Alessandro,
>
> Two things:
>
> - the mark method doesn't truncate the stream after the indicated number of
> bytes, but makes sure that within the indicated number of bytes one can
> reset the stream back to that position. If one reads more than the
> indicated number of bytes the mark becomes invalid (i.e. reset won't work)
> but otherwise the stream behaves as normal.
>
> - I'm not sure how the Jena parser works and whether you get the triples read
> so far if your RDF/XML is truncated. You might want to truncate N-Triples
> after a dot.

The parser streams and emits triples as they are completed.  All the 
parsers do.

Which Jena parser framework does Clerezza use?  The old one or the new 
one?  The new one, RIOT, has an interface for receiving a stream of 
triples.  (You can do this in the old one but you need to write a small 
graph implementation.)

Reto - we're moving towards switching from the old jena-core readers to 
a new reader framework which does better content negotiation, uses the 
(much) faster parsers from RIOT and is properly extensible.  If you 
have any input, please send it to dev@jena.a.o

Alessandro - if you can open the file twice, something like this might work:

Read in 1240 bytes and parse just that.  Then close and reopen the file 
to read the whole thing.
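
A minimal sketch of the "read a fixed number of bytes, parse only those" idea, 
using nothing but java.io. The class and method names (PrefixReader, readPrefix) 
are made up for illustration. The returned stream would be handed to 
parser.parse(...) in place of the original one. Note the assumption: a cut-off 
RDF/XML document is not well-formed, so the parser is expected to report an error 
at the end of the prefix, and whether the triples completed before that point 
stay in the graph depends on how the parser provider handles the error.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class PrefixReader {

    /** Reads at most maxBytes from source and returns them as an in-memory stream. */
    static ByteArrayInputStream readPrefix(InputStream source, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        // A single read() may return fewer bytes than requested, so loop
        // until the buffer is full or the source is exhausted.
        while (total < maxBytes) {
            int n = source.read(buffer, total, maxBytes - total);
            if (n == -1) {
                break; // source ended before the limit was reached
            }
            total += n;
        }
        return new ByteArrayInputStream(buffer, 0, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream content = new ByteArrayInputStream(new byte[5000]);
        InputStream prefix = readPrefix(content, 1024);
        // Hypothetical next step, as in the original post:
        // parser.parse(graph, prefix, SupportedFormat.RDF_XML);
        System.out.println(prefix.available()); // 1024
    }
}
```

Unlike mark(), this really does cap what the parser can see, at the cost of 
holding the prefix in memory.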

	Andy


Re: Setting a read limit when parsing a Graph

Posted by Alessandro Adamou <ad...@cs.unibo.it>.

Thanks Reto,

based on what you said I decided to implement the lookahead method 
with a limit set on triples instead of bytes. It should still have a 
pretty decent memory footprint and take a reasonable time. 
It is now a Stanbol utility in commons.owl.
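
A rough sketch of what a limit on triples can look like. This is not the Stanbol 
code: it uses a plain HashSet as a stand-in for the graph, and all names 
(TripleLimitSketch, BoundedSet, LimitReachedException, collectUpTo) are 
hypothetical. In Clerezza the same idea would be applied to an MGraph, and it 
assumes the parser lets the exception from add() reach the caller, which is not 
verified here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class TripleLimitSketch {

    /** Signals that the configured number of elements has been reached. */
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("limit of " + limit + " elements reached");
        }
    }

    /** A set that refuses new elements once it holds 'limit' distinct ones. */
    static final class BoundedSet<T> extends HashSet<T> {
        private final int limit;

        BoundedSet(int limit) {
            this.limit = limit;
        }

        @Override
        public boolean add(T element) {
            // Re-adding an element that is already present is harmless, so
            // only a genuinely new element beyond the limit triggers the stop.
            if (size() >= limit && !contains(element)) {
                throw new LimitReachedException(limit);
            }
            return super.add(element);
        }
    }

    /** Stand-in for the parser: pushes items into the set until it refuses. */
    static <T> BoundedSet<T> collectUpTo(Iterable<T> items, int limit) {
        BoundedSet<T> seen = new BoundedSet<>(limit);
        try {
            for (T item : items) {
                seen.add(item);
            }
        } catch (LimitReachedException e) {
            // Expected: the lookahead window is full, stop reading.
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> triples = Arrays.asList("t1", "t2", "t3", "t4", "t5");
        System.out.println(collectUpTo(triples, 3).size()); // 3
    }
}
```

After the parse stops, the collected elements can be searched for the ontology 
IRI / version IRI pair.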

Alessandro



Re: Setting a read limit when parsing a Graph

Posted by Reto Bachmann-Gmür <re...@apache.org>.

Hi Alessandro,

Two things:

- the mark method doesn't truncate the stream after the indicated number of
bytes, but makes sure that within the indicated number of bytes one can
reset the stream back to that position. If one reads more than the
indicated number of bytes the mark becomes invalid (i.e. reset won't work)
but otherwise the stream behaves as normal.
- I'm not sure how the Jena parser works and whether you get the triples read
so far if your RDF/XML is truncated. You might want to truncate N-Triples
after a dot.
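
The first point can be checked with a small standalone program. This is only an 
illustration with made-up names (MarkDemo, countAfterMark, firstByteAfterReset); 
it uses java.io alone and does not involve Clerezza or Jena.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class MarkDemo {

    /** Calls mark(readLimit), then counts every byte that can still be read. */
    static int countAfterMark(InputStream source, int readLimit) throws IOException {
        BufferedInputStream in = new BufferedInputStream(source);
        in.mark(readLimit); // remembers the current position, does not cap reads
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    /** Reads two bytes inside the limit, resets, and returns the next byte. */
    static int firstByteAfterReset(InputStream source, int readLimit) throws IOException {
        BufferedInputStream in = new BufferedInputStream(source);
        in.mark(readLimit);
        in.read();
        in.read();
        in.reset(); // valid because fewer than readLimit bytes were consumed
        return in.read();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[5000];
        data[0] = 42;
        // All 5000 bytes are delivered even though the mark limit is 1024.
        System.out.println(countAfterMark(new ByteArrayInputStream(data), 1024)); // 5000
        // After reset() the stream is back at the marked position.
        System.out.println(firstByteAfterReset(new ByteArrayInputStream(data), 1024)); // 42
    }
}
```

So a parser that is handed the marked stream simply reads it to the end, which 
matches what Alessandro observed.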

Cheers,
Reto





Re: Setting a read limit when parsing a Graph

Posted by Alessandro Adamou <ad...@cs.unibo.it>.

Clearly I meant 1024 for 1 kiB (but anyway, once this works, I would 
like to set it to 100k).

I was thinking that perhaps marks are being overridden, otherwise 
XML-based parsers would fail as they could not encounter the closing tag 
(e.g. </rdf:RDF>).

So I was thinking, perhaps I should subclass SimpleMGraph and set a 
limit on the triples instead?

Thank you

Alessandro

