Posted to users@jena.apache.org by Henry Story <he...@bblfish.net> on 2012/01/29 22:40:28 UTC

Support for Non Blocking Parsers

Hi, 

   [ I just opened a bug report for this, but it was suggested that a wider 
discussion on how to do it would be useful on this list. ]

  In a Linked Data environment servers have to fetch data off the web. The 
speed at which such data is served can be very slow, so one wants to avoid 
using up one thread per connection (1 thread ≈ 0.5 to 1 MB of stack). 
This is why Java NIO was developed, why servers such as Netty are so 
popular, why HTTP client libraries such as 
https://github.com/sonatype/async-http-client are more and more numerous, 
and why actor frameworks such as http://akka.io/, which support relatively 
lightweight actors (around 500 bytes per actor), are growing more visible. 

Unless I am mistaken, the only way to parse content is via methods that take an 
InputStream, such as this: 

    val m = ModelFactory.createDefaultModel()
    m.getReader(lang.jenaLang).read(m, in, base.toString)

That read call *blocks*: the thread that calls it will spend all its
time reading in the information, however slowly it is sent. Would it be
possible to have an API that allows one to parse a document in chunks
as they arrive from the input?

Without that, each request for a remote resource ties up a minimum of 0.5-1 MB,
plus the swapping costs of threads (which are known to be very high). So if you
fetch 500 remote resources, you use up 500 MB before you even get started, while
slowing your machine down dramatically due to swapping. With Akka actors you
would instead use 500 bytes × 500 = 250,000 bytes ≈ 1/4 MB, plus perhaps a few
threads. With plain NIO you have the same or even less: one NIO thread can read
as much input as it can handle, and you probably need just a few worker threads
if the parsing is more work than the reading. Just like that we can save a lot
of memory.

   Having said that, what is the best way to do this?

   An (ugly?) solution that would work is just to have a method
    
    reader.write(byteArray)

   So instead of having a thread do the reading, this makes it possible
for the IO layer to pass blocks of characters straight to the model as those
blocks come along.

   It would be better, of course, if the structure passed were immutable; better
still if it could use NIO byte buffers, as that reduces even the need to copy
data. But I guess the Jena parsers were not written with that in mind.
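For concreteness, here is a sketch of the kind of push-style interface I mean. Everything here is made up for illustration - none of it is Jena API - and the "parser" merely splits newline-terminated statements, but the shape is the point: the IO layer calls feed() with each chunk as it arrives, and end() once the input is complete.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical push-style "parser": feed() accepts whatever bytes the IO
// layer has, end() flushes the tail. A real one would emit triples, not lines.
class PushLineParser {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final List<String> statements = new ArrayList<>();

    // Accept a chunk; emit every complete newline-terminated statement,
    // keep the incomplete remainder for the next chunk.
    void feed(byte[] chunk) {
        for (byte b : chunk) {
            if (b == '\n') {
                statements.add(new String(pending.toByteArray(), StandardCharsets.UTF_8));
                pending.reset();
            } else {
                pending.write(b);
            }
        }
    }

    // End of input: flush any trailing partial statement.
    List<String> end() {
        if (pending.size() > 0) {
            statements.add(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
        return statements;
    }
}

public class PushDemo {
    public static void main(String[] args) {
        PushLineParser p = new PushLineParser();
        // Chunk boundaries fall anywhere, just as network reads do:
        p.feed("<a> <b> <c> .\n<d> <e".getBytes(StandardCharsets.UTF_8));
        p.feed("> <f> .\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(p.end().size()); // prints 2
    }
}
```

No thread ever blocks waiting for input here: whichever NIO thread received the bytes calls feed() and moves on.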

   I did open issue JENA-203 so that when we agree on a solution we can send in
some patches:

 https://issues.apache.org/jira/browse/JENA-203

	Henry


Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
On 29 Jan 2012, at 23:25, Andy Seaborne wrote:

> On 29/01/12 21:40, Henry Story wrote:
>>    It would be better, of course, if the structure passed were immutable; better
>> still if it could use NIO byte buffers, as that reduces even the need to copy
>> data. But I guess the Jena parsers were not written with that in mind.
> 
> This bit, I didn't follow.

I just discovered this, which you should find very interesting:
   http://akka.io/docs/akka/2.0-M3/scala/io.html


> 
> Parsing, in general, needs a char stream and, for Turtle, one-char look ahead.
> 
> The parsers work from InputStreams.  The RIOT parsers work from Tokenizers, which normally work from InputStreams, but that's changeable as it's Jena code.
> 
> An InputStream is just an interface and a bit of machinery (AKA a trait) - it can be implemented over NIO buffers, so a zero-copy design is quite possible.
> 
> RIOT has PeekInputStream, which could be adapted to get bytes from an NIO buffer.
> 
> My experience is that accessing an NIO buffer byte-by-byte needs a little care - it may not be very cheap, as several checks are always done and, while the JIT is good, the per-byte cost can be significant. It might be better to read out chunks (RIOT's InputStreamBuffered).  It would still be zero-copy overall - no complete copy of the source taken.
> 
> Copying is not always bad - I have tried to do faster-than-standard-Java conversion of UTF-8 bytes to chars in pure Java code, with no copy, but the built-in decoder (which is probably native code) is still a few percent better despite the fact that it introduces a copy.  CharsetDecoders work on ByteBuffers.  I don't think it's possible in Java to avoid a copy at the point of bytes->chars.
> 
> 	Andy

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Andy Seaborne <an...@apache.org>.
On 29/01/12 21:40, Henry Story wrote:
>     It would be better, of course, if the structure passed were immutable; better
> still if it could use NIO byte buffers, as that reduces even the need to copy
> data. But I guess the Jena parsers were not written with that in mind.

This bit, I didn't follow.

Parsing, in general, needs a char stream and, for Turtle, one-char look 
ahead.

The parsers work from InputStreams.  The RIOT parsers work from 
Tokenizers, which normally work from InputStreams, but that's changeable 
as it's Jena code.

An InputStream is just an interface and a bit of machinery (AKA a trait) 
- it can be implemented over NIO buffers, so a zero-copy design is quite 
possible.

RIOT has PeekInputStream, which could be adapted to get bytes from an NIO 
buffer.
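By way of illustration, such an adapter can be tiny. This is a sketch (not actual RIOT code) of an InputStream view over a ByteBuffer; nothing is copied - a read just advances the buffer's position:

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

// InputStream view over an NIO ByteBuffer - zero-copy: reads advance
// the buffer's position rather than copying the source.
class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer buf) { this.buf = buf; }

    @Override
    public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);  // bulk get: one bounds check per chunk, not per byte
        return n;
    }
}
```

The bulk read is where the per-byte-cost point bites: a byte-at-a-time get() pays the checks on every call, the bulk get pays them once per chunk.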

My experience is that accessing an NIO buffer byte-by-byte needs a 
little care - it may not be very cheap, as several checks are always done 
and, while the JIT is good, the per-byte cost can be significant. 
It might be better to read out chunks (RIOT's InputStreamBuffered).  It 
would still be zero-copy overall - no complete copy of the source taken.

Copying is not always bad - I have tried to do faster-than-standard-Java 
conversion of UTF-8 bytes to chars in pure Java code, with no copy, but the 
built-in decoder (which is probably native code) is still a few percent 
better despite the fact that it introduces a copy.  CharsetDecoders work on 
ByteBuffers.  I don't think it's possible in Java to avoid a copy at the 
point of bytes->chars.
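To make that concrete: the JDK decoder reads from a ByteBuffer and writes into a separate CharBuffer, and that second buffer is exactly the copy in question.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bytes = ByteBuffer.wrap("café".getBytes(StandardCharsets.UTF_8));
        CharBuffer chars = CharBuffer.allocate(16);  // the unavoidable second buffer
        dec.decode(bytes, chars, true);              // true: this is all the input
        dec.flush(chars);
        chars.flip();
        System.out.println(chars);                   // prints café
    }
}
```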

	Andy

Re: Support for Non Blocking Parsers

Posted by Chris Dollin <ch...@epimorphics.com>.
Andy said:

> The rest: Turtle parsers are quite easy to write.  In fact, the actual 
> parser isn't really the bulk of the work.

I'm part-way (in my Copious Free Time) through writing a Turtle parser
in Go. I agree with Andy: the parser isn't the tricky bit. (The lexer is
what I found tripped me up several times ...)

Chris
 
-- 
"The wizard seemed quite willing when I talked to him."  /Howl's Moving Castle/

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)

Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
On 1 Feb 2012, at 17:53, Andy Seaborne wrote:

>> 
>> PS. I wonder why I can't find this thread in the archive
>>   http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/browser
> 
> http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/%3C54563B60-702E-4748-B19E-9C3A0EDFBB1D%40bblfish.net%3E

Ah sorry, I forgot to scroll to the next page. :-/ (Probably because I had not eaten yet)

Thanks.

Henry

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Andy Seaborne <an...@apache.org>.
On 01/02/12 16:44, Henry Story wrote:
> Ok, I got the asynchronous parser to work for RDF/XML. Details are on the bug report:
>
>    https://issues.apache.org/jira/browse/JENA-203
>
> Henry
>
> PS. I wonder why I can't find this thread in the archive
>    http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/browser

http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/%3C54563B60-702E-4748-B19E-9C3A0EDFBB1D%40bblfish.net%3E



Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
Ok, I got the asynchronous parser to work for RDF/XML. Details are on the bug report:

  https://issues.apache.org/jira/browse/JENA-203

Henry

PS. I wonder why I can't find this thread in the archive
  http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201201.mbox/browser


On 30 Jan 2012, at 22:36, Henry Story wrote:

> 
> On 30 Jan 2012, at 14:23, Andy Seaborne wrote:
> 
>> Yes, but :-) that's without writing any kind of adaptor code.
>> 
>> I was looking for a way to reuse the existing parser code.  If you want to start from scratch then it's a different ball game.
>> 
>> There are two cases:
>> 
>> RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a suitable XML parser - the SAX interface means it might be possible to adapt it to being a pipeline-based process.
> 
> Well yes, the good thing about RDF/XML is that I think nobody cares about it anymore. :-)
> I am told this Apache-licensed parser is very good:
> 
>  https://github.com/FasterXML/aalto-xml
> 
> How difficult would it be to use that?
> 
>> 
>> The rest: Turtle parsers are quite easy to write.  In fact, the actual parser isn't really the bulk of the work.
>> 
>> The purest actor-style implementation needs to split out the parsing phases: bytes to chars, chars to tokens, tokens to triples.  Each of those steps is a small state machine, but it looks a whole lot easier to write them as separate FSMs.  Even UTF-8 chars can be split across byte buffer boundaries.
> 
> I'm doing some research there.
> 
>> 
>> Practical points:
>> 
>> 1/ For all the small documents (say, less than 50K) it might be simpler to gather the bytes together and parse whole documents.  Then devote a thread to large documents - this assumes you get Content-Length.  This isn't as ideal as a complete rewrite, but it's less work.  Isn't thread stack size the key determinant of space used?
> 
> Yes, that's an ugly band-aid, but I'll use it in the meantime, as I would like to get more familiar with actor-based programming.
> 
>> 2/ Have X threads (where X ~ # cores) and use an executor pool to batch requests together.  The far end will start sending and it will be buffered at the low levels.  There aren't any extra CPU cycles to go round, so while it's batch-y, it isn't going to go faster with more active parsers.
> 
> I think without getting the parsers to be non blocking, everything else is just going to be ugly and inefficient. Getting the parsers to be non blocking will make everything else just clean and seamless. 
> 
> For example one could easily create a proxy that could proxy 1 GB of RDF files and only use up a few kbytes of memory, by simply reading in triples and spitting them out in another format on the other end, before even the first document had finished parsing.
> 
> 
> 
>> 
>> I am interested in the question on rendezvous still by the way - how does the app want to be notified parsing has finished and does it not touch the model during this time?
>> 
>> 	Andy
>> 
>> On 30/01/12 12:57, Henry Story wrote:
>>> So I wrote out a gist that shows how one should be able to use Jena Parsers
>>> It is here:
>>> 
>>>   https://gist.github.com/1704255
>>> 
>>> But I get the exception
>>> 
>>> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
>>> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
>>> 	at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
>>> 	at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
>>> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
>>> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
>>> 	at o
>>> 
>>> As expected, because one cannot pass partial documents to the reader.
>>> 
>>> Henry
>>> 
>>> 
>>> On 29 Jan 2012, at 23:52, Henry Story wrote:
>>> 
>>>> 
>>>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>>> 
>>>>> 
>>>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>>> 
>>>>>> Hi Henry,
>>>>>> 
>>>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>>>>> discussion on how to do it would be useful on this list. ]
>>>>>> 
>>>>>> The thread of interest is:
>>>>>> 
>>>>>> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
>>>>>> 
>>>>>>> Unless I am mistaken the only way to parse some content is using methods that use an
>>>>>>> InputStream such as this:
>>>>>>> 
>>>>>>>  val m = ModelFactory.createDefaultModel()
>>>>>>>   m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>>> 
>>>>>> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.
>>>>> 
>>>>> Mhh, can I call at time t1
>>>>> 
>>>>> reader.read( model, inputStream, base);
>>>>> 
>>>>> with an inputStream that only contains a chunk of the data? And then call it again with
>>>>> another chunk of the data later with a newly filled input stream that contains the next segment
>>>>> of the data?
>>>>> 
>>>>> reader.read( model, inputStream2, base);
>>>>> 
>>>>> It says nothing about that in the documentation, so I just assumed it does not work...
>>>> 
>>>> Well I did look at the code (but perhaps not deeply enough, and only the released
>>>> version of Jena). From that I got the feeling that one has to send one whole RDF
>>>> document down an input stream at a time.
>>>> 
>>>> If one cannot send chunks to the reader, then essentially the thread that calls the
>>>> read(...) method above will block until the whole document is read in. Even if an
>>>> actor calls that method, the actor will block the thread it is executing
>>>> in until it is finished. So actors don't help (unless there is some magic I don't
>>>> know about). Now if the server serving the document is serving it at 56 baud, really
>>>> slowly, then one thread could be used up even though it is producing very, very
>>>> little work.
>>>> 
>>>> If, on the other hand, I could send partial pieces of XML documents down different
>>>> input streams at different times, then the NIO thread could call the reader
>>>> every time it received some data. For example, in the code I was writing here using the
>>>> http-async-client https://gist.github.com/1701141
>>>> 
>>>> The method I have now on line 39-42
>>>> 
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   bodyPart.writeTo(out)
>>>>   STATE.CONTINUE
>>>> }
>>>> 
>>>> 
>>>> could be changed to
>>>> 
>>>> def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>>   reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>>>   STATE.CONTINUE
>>>> }
>>>> 
>>>> and so the body part would be consumed by the read in chunks.
>>>> 
>>>>> 
>>>>>> 
>>>>>> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return? That works in an actor style.
>>>>>> 
>>>>>> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is a good level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Your Akka is a good example of one possible choice.
>>>>>> 
>>>>>>> I did open the issue-203 so that when we agree on a solution we could send in
>>>>>>> some patches.
>>>>>> 
>>>>>> Look forward to seeing this,
>>>>>> 
>>>>>> 	Andy
>>>>> 
>>>>> Social Web Architect
>>>>> http://bblfish.net/
>>>>> 
>>>> 
>>>> Social Web Architect
>>>> http://bblfish.net/
>>>> 
>>> 
>>> Social Web Architect
>>> http://bblfish.net/
>>> 
>> 
> 
> Social Web Architect
> http://bblfish.net/
> 

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
On 30 Jan 2012, at 14:23, Andy Seaborne wrote:

> Yes, but :-) that's without writing any kind of adaptor code.
> 
> I was looking for a way to reuse the existing parser code.  If you want to start from scratch then it's a different ball game.
> 
> There are two cases:
> 
> RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a suitable XML parser - the SAX interface means it might be possible to adapt it to being a pipeline-based process.

Well yes, the good thing about RDF/XML is that I think nobody cares about it anymore. :-)
I am told this Apache-licensed parser is very good:

  https://github.com/FasterXML/aalto-xml

How difficult would it be to use that?

> 
> The rest: Turtle parsers are quite easy to write.  In fact, the actual parser isn't really the bulk of the work.
> 
> The purest actor-style implementation needs to split out the parsing phases: bytes to chars, chars to tokens, tokens to triples.  Each of those steps is a small state machine, but it looks a whole lot easier to write them as separate FSMs.  Even UTF-8 chars can be split across byte buffer boundaries.

I'm doing some research there.

> 
> Practical points:
> 
> 1/ For all the small documents (say, less than 50K) it might be simpler to gather the bytes together and parse whole documents.  Then devote a thread to large documents - this assumes you get Content-Length.  This isn't as ideal as a complete rewrite, but it's less work.  Isn't thread stack size the key determinant of space used?

Yes, that's an ugly band-aid, but I'll use it in the meantime, as I would like to get more familiar with actor-based programming.

> 2/ Have X threads (where X ~ # cores) and use an executor pool to batch requests together.  The far end will start sending and it will be buffered at the low levels.  There aren't any extra CPU cycles to go round, so while it's batch-y, it isn't going to go faster with more active parsers.

I think without getting the parsers to be non blocking, everything else is just going to be ugly and inefficient. Getting the parsers to be non blocking will make everything else just clean and seamless. 

For example one could easily create a proxy that could proxy 1 GB of RDF files and only use up a few kbytes of memory, by simply reading in triples and spitting them out in another format on the other end, before even the first document had finished parsing.



> 
> I am interested in the question on rendezvous still by the way - how does the app want to be notified parsing has finished and does it not touch the model during this time?
> 
> 	Andy
> 
> On 30/01/12 12:57, Henry Story wrote:
>> So I wrote out a gist that shows how one should be able to use Jena Parsers
>> It is here:
>> 
>>    https://gist.github.com/1704255
>> 
>> But I get the exception
>> 
>> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
>> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
>> 	at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
>> 	at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
>> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
>> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
>> 	at o
>> 
>> As expected, because one cannot pass partial documents to the reader.
>> 
>> Henry
>> 
>> 
>> On 29 Jan 2012, at 23:52, Henry Story wrote:
>> 
>>> 
>>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>> 
>>>> 
>>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>> 
>>>>> Hi Henry,
>>>>> 
>>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>>>> discussion on how to do it would be useful on this list. ]
>>>>> 
>>>>> The thread of interest is:
>>>>> 
>>>>> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
>>>>> 
>>>>>> Unless I am mistaken the only way to parse some content is using methods that use an
>>>>>> InputStream such as this:
>>>>>> 
>>>>>>   val m = ModelFactory.createDefaultModel()
>>>>>>    m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>> 
>>>>> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.
>>>> 
>>>> Mhh, can I call at time t1
>>>> 
>>>>  reader.read( model, inputStream, base);
>>>> 
>>>> with an inputStream that only contains a chunk of the data? And then call it again with
>>>> another chunk of the data later with a newly filled input stream that contains the next segment
>>>> of the data?
>>>> 
>>>>  reader.read( model, inputStream2, base);
>>>> 
>>>> It says nothing about that in the documentation, so I just assumed it does not work...
>>> 
>>> Well I did look at the code (but perhaps not deeply enough, and only the released
>>> version of Jena). From that I got the feeling that one has to send one whole RDF
>>> document down an input stream at a time.
>>> 
>>> If one cannot send chunks to the reader, then essentially the thread that calls the
>>> read(...) method above will block until the whole document is read in. Even if an
>>> actor calls that method, the actor will block the thread it is executing
>>> in until it is finished. So actors don't help (unless there is some magic I don't
>>> know about). Now if the server serving the document is serving it at 56 baud, really
>>> slowly, then one thread could be used up even though it is producing very, very
>>> little work.
>>> 
>>> If, on the other hand, I could send partial pieces of XML documents down different
>>> input streams at different times, then the NIO thread could call the reader
>>> every time it received some data. For example, in the code I was writing here using the
>>> http-async-client https://gist.github.com/1701141
>>> 
>>> The method I have now on line 39-42
>>> 
>>>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>    bodyPart.writeTo(out)
>>>    STATE.CONTINUE
>>>  }
>>> 
>>> 
>>>  could be changed to
>>> 
>>>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>>    reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>>    STATE.CONTINUE
>>>  }
>>> 
>>>  and so the body part would be consumed by the read in chunks.
>>> 
>>>> 
>>>>> 
>>>>> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return? That works in an actor style.
>>>>> 
>>>>> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is a good level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Your Akka is a good example of one possible choice.
>>>>> 
>>>>>> I did open the issue-203 so that when we agree on a solution we could send in
>>>>>> some patches.
>>>>> 
>>>>> Look forward to seeing this,
>>>>> 
>>>>> 	Andy
>>>> 
>>>> Social Web Architect
>>>> http://bblfish.net/
>>>> 
>>> 
>>> Social Web Architect
>>> http://bblfish.net/
>>> 
>> 
>> Social Web Architect
>> http://bblfish.net/
>> 
> 

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Andy Seaborne <an...@apache.org>.
Yes, but :-) that's without writing any kind of adaptor code.

I was looking for a way to reuse the existing parser code.  If you want 
to start from scratch then it's a different ball game.

There are two cases:

RDF/XML: (Yuk) Jena uses an XML parser - the first point is finding a 
suitable XML parser - the SAX interface means it might be possible to 
adapt it to being a pipeline-based process.

The rest: Turtle parsers are quite easy to write.  In fact, the actual 
parser isn't really the bulk of the work.

The purest actor-style implementation needs to split out the parsing 
phases: bytes to chars, chars to tokens, tokens to triples.  Each of 
those steps is a small state machine, but it looks a whole lot easier to 
write them as separate FSMs.  Even UTF-8 chars can be split across byte 
buffer boundaries.
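The bytes-to-chars stage of such a pipeline can lean on CharsetDecoder for the split-sequence problem. A sketch (illustrative, not Jena code; error handling elided) that carries the incomplete tail bytes from one chunk to the next:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Incremental bytes->chars FSM stage: a UTF-8 sequence split across two
// chunks is carried over and completed when the next chunk arrives.
class IncrementalUtf8 {
    private final CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
    private final ByteBuffer carry = ByteBuffer.allocate(8); // incomplete tail bytes

    String feed(byte[] chunk, boolean last) {
        // Prepend whatever was left over from the previous chunk.
        ByteBuffer in = ByteBuffer.allocate(carry.position() + chunk.length);
        carry.flip();
        in.put(carry).put(chunk);
        in.flip();
        // UTF-8 never yields more UTF-16 units than input bytes, so this fits.
        CharBuffer out = CharBuffer.allocate(in.remaining() + 1);
        dec.decode(in, out, last);  // endOfInput=false while mid-stream
        if (last) dec.flush(out);
        carry.clear();
        carry.put(in);              // keep whatever the decoder couldn't finish
        out.flip();
        return out.toString();
    }
}
```

Feeding {'c','a','f',0xC3} and then {0xA9} yields "caf" and then "é": the two-byte sequence split across the boundary comes out whole.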

Practical points:

1/ For all the small documents (say, less than 50K) it might be simpler 
to gather the bytes together and parse whole documents.  Then devote a 
thread to large documents - this assumes you get Content-Length.  This 
isn't as ideal as a complete rewrite, but it's less work.  Isn't thread 
stack size the key determinant of space used?

2/ Have X threads (where X ~ # cores) and use an executor pool to batch 
requests together.  The far end will start sending and it will be buffered 
at the low levels.  There aren't any extra CPU cycles to go round, so while 
it's batch-y, it isn't going to go faster with more active parsers.

I am interested in the question on rendezvous still by the way - how 
does the app want to be notified parsing has finished and does it not 
touch the model during this time?
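One possible shape for that rendezvous - purely a sketch, nothing Jena-specific: run the parse on a worker, count down a latch when it completes, and have the application touch the model only after await() returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RendezvousDemo {
    // Simulated "parse into the model" on a worker thread; the caller only
    // reads the model after the latch says parsing has finished.
    static String parseWithLatch() {
        StringBuilder model = new StringBuilder();  // stand-in for a Model
        CountDownLatch done = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            model.append("parsed");                 // parsing would happen here
            done.countDown();                       // notify: parsing finished
        });
        try {
            done.await();                           // the rendezvous point
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return model.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseWithLatch());       // prints parsed
    }
}
```

The countDown()/await() pair also gives the happens-before edge, so the completed model is safely visible to the caller without further synchronisation.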

	Andy

On 30/01/12 12:57, Henry Story wrote:
> So I wrote out a gist that shows how one should be able to use Jena Parsers
> It is here:
>
>     https://gist.github.com/1704255
>
> But I get the exception
>
> ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
> com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
> 	at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
> 	at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
> 	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
> 	at o
>
> As expected, because one cannot pass partial documents to the reader.
>
> Henry
>
>
> On 29 Jan 2012, at 23:52, Henry Story wrote:
>
>>
>> On 29 Jan 2012, at 23:28, Henry Story wrote:
>>
>>>
>>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>>>
>>>> Hi Henry,
>>>>
>>>> On 29/01/12 21:40, Henry Story wrote:
>>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>>> discussion on how to do it would be useful on this list. ]
>>>>
>>>> The thread of interest is:
>>>>
>>>> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
>>>>
>>>>> Unless I am mistaken the only way to parse some content is using methods that use an
>>>>> InputStream such as this:
>>>>>
>>>>>    val m = ModelFactory.createDefaultModel()
>>>>>     m.getReader(lang.jenaLang).read(m, in, base.toString)
>>>>
>>>> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.
>>>
>>> Mhh, can I call at time t1
>>>
>>>   reader.read( model, inputStream, base);
>>>
>>> with an inputStream that only contains a chunk of the data? And then call it again with
>>> another chunk of the data later with a newly filled input stream that contains the next segment
>>> of the data?
>>>
>>>   reader.read( model, inputStream2, base);
>>>
>>> It says nothing about that in the documentation, so I just assumed it does not work...
>>
>> Well I did look at the code (but perhaps not deeply enough, and only the released
>> version of Jena). From that I got the feeling that one has to send one whole RDF
>> document down an input stream at a time.
>>
>> If one cannot send chunks to the reader, then essentially the thread that calls the
>> read(...) method above will block until the whole document is read in. Even if an
>> actor calls that method, the actor will block the thread it is executing
>> in until it is finished. So actors don't help (unless there is some magic I don't
>> know about). Now if the server serving the document is serving it at 56 baud, really
>> slowly, then one thread could be used up even though it is producing very, very
>> little work.
>>
>> If, on the other hand, I could send partial pieces of XML documents down different
>> input streams at different times, then the NIO thread could call the reader
>> every time it received some data. For example, in the code I was writing here using the
>> http-async-client https://gist.github.com/1701141
>>
>> The method I have now on line 39-42
>>
>>   def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>     bodyPart.writeTo(out)
>>     STATE.CONTINUE
>>   }
>>
>>
>>   could be changed to
>>
>>   def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>>     reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>>     STATE.CONTINUE
>>   }
>>
>>   and so the body part would be consumed by the read in chunks.
>>
>>>
>>>>
>>>> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return? That works in an actor style.
>>>>
>>>> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is a good level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Your Akka is a good example of one possible choice.
>>>>
>>>>> I did open the issue-203 so that when we agree on a solution we could send in
>>>>> some patches.
>>>>
>>>> Look forward to seeing this,
>>>>
>>>> 	Andy
>>>
>>> Social Web Architect
>>> http://bblfish.net/
>>>
>>
>> Social Web Architect
>> http://bblfish.net/
>>
>
> Social Web Architect
> http://bblfish.net/
>


Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
So I wrote out a gist that shows how one should be able to use Jena Parsers
It is here:

   https://gist.github.com/1704255

But I get the exception 

ERROR (WebFetcher.scala:59) : org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
com.hp.hpl.jena.shared.JenaException: org.xml.sax.SAXParseException; systemId: http://bblfish.net/people/henry/card.rdf; lineNumber: 134; columnNumber: 44; XML document structures must start and end within the same entity.
	at com.hp.hpl.jena.rdf.model.impl.RDFDefaultErrorHandler.fatalError(RDFDefaultErrorHandler.java:60)
	at com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:51)
	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:211)
	at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:241)
	at o

As expected, because one cannot pass partial documents to the reader.

Henry


On 29 Jan 2012, at 23:52, Henry Story wrote:

> 
> On 29 Jan 2012, at 23:28, Henry Story wrote:
> 
>> 
>> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
>> 
>>> Hi Henry,
>>> 
>>> On 29/01/12 21:40, Henry Story wrote:
>>>> [ I just opened a bug report for this, but it was suggested that a wider
>>>> discussion on how to do it would be useful on this list. ]
>>> 
>>> The thread of interest is:
>>> 
>>> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
>>> 
>>>> Unless I am mistaken the only way to parse some content is using methods that use an
>>>> InputStream such as this:
>>>> 
>>>>   val m = ModelFactory.createDefaultModel()
>>>>    m.getReader(lang.jenaLang).read(m, in, base.toString)
>>> 
>>> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.
>> 
>> Mhh, can I call at time t1
>> 
>>  reader.read( model, inputStream, base);
>> 
>> with an inputStream that only contains a chunk of the data? And then call it again with
>> another chunk of the data later with a newly filled input stream that contains the next segment
>> of the data?
>> 
>>  reader.read( model, inputStream2, base);
>> 
>> It says nothing about that in the documentation, so I just assumed it does not work...
> 
> Well I did look at the code (but perhaps not deeply enough, and only the released 
> version of Jena). From that I got the feeling that one has to send one whole RDF 
> document down an input stream at a time.
> 
> If one cannot send chunks to the reader then essentially the thread that calls the
> read(...) method above will block until the whole document is read in. Even if an
> actor calls that method, the actor will block the thread it is executing in
> until it is finished. So actors don't help (unless there is some magic I don't
> know about). Now if the server is serving the document at 56 baud, really
> slowly, then one thread is tied up even though it is doing very
> little work.
> 
> If, on the other hand, I could send partial pieces of XML documents down different 
> input streams at different times, then the NIO thread could call the reader 
> every time it received some data. For example in the code I was writing here using the
> http-async-client https://gist.github.com/1701141
> 
> The method I have now on lines 39-42
> 
>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>    bodyPart.writeTo(out)
>    STATE.CONTINUE
>  }
> 
> 
>  could be changed to 
> 
>  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
>    reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
>    STATE.CONTINUE
>  }
> 
>  and so the body part would be consumed by the read in chunks.
> 
>> 
>>> 
>>> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return -- that works in an actor style.
>>> 
>>> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is the right level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Akka is a good example of one possible choice.
>>> 
>>>> I did open the issue-203 so that when we agree on a solution we could send in
>>>> some patches.
>>> 
>>> Look forward to seeing this,
>>> 
>>> 	Andy
>> 

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
On 29 Jan 2012, at 23:28, Henry Story wrote:

> 
> On 29 Jan 2012, at 23:04, Andy Seaborne wrote:
> 
>> Hi Henry,
>> 
>> On 29/01/12 21:40, Henry Story wrote:
>>>  [ I just opened a bug report for this, but it was suggested that a wider
>>> discussion on how to do it would be useful on this list. ]
>> 
>> The thread of interest is:
>> 
>> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
>> 
>>> Unless I am mistaken the only way to parse some content is using methods that use an
>>> InputStream such as this:
>>> 
>>>    val m = ModelFactory.createDefaultModel()
>>>     m.getReader(lang.jenaLang).read(m, in, base.toString)
>> 
>> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.
> 
> Mhh, can I call at time t1
> 
>   reader.read( model, inputStream, base);
> 
> with an inputStream that only contains a chunk of the data? And then call it again with
> another chunk of the data later with a newly filled input stream that contains the next segment
> of the data?
> 
>   reader.read( model, inputStream2, base);
> 
> It says nothing about that in the documentation, so I just assumed it does not work...

Well I did look at the code (but perhaps not deeply enough, and only the released 
version of Jena). From that I got the feeling that one has to send one whole RDF 
document down an input stream at a time.

If one cannot send chunks to the reader then essentially the thread that calls the
read(...) method above will block until the whole document is read in. Even if an
actor calls that method, the actor will block the thread it is executing in
until it is finished. So actors don't help (unless there is some magic I don't
know about). Now if the server is serving the document at 56 baud, really
slowly, then one thread is tied up even though it is doing very
little work.

If, on the other hand, I could send partial pieces of XML documents down different 
input streams at different times, then the NIO thread could call the reader 
every time it received some data. For example in the code I was writing here using the
http-async-client https://gist.github.com/1701141

The method I have now on lines 39-42

  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
    bodyPart.writeTo(out)
    STATE.CONTINUE
  }


  could be changed to 

  def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
    reader.read(model, new ByteArrayInputStream(bodyPart.getBodyPartBytes()), base)
    STATE.CONTINUE
  }
   
  and so the body part would be consumed by the read in chunks.
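Until something like that exists, the usual workaround is a pipe: the async handler writes each body part into a PipedOutputStream while one worker thread blocks in reader.read(...) on the matching PipedInputStream. That still costs a thread per parse, so it does not fix the scaling problem above, but it does keep the NIO thread free. A JDK-only sketch (the worker just echoes bytes where the real Jena reader would be parsing):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipedChunks {
    // Feed chunks into a pipe from the "network" side while a worker thread
    // blocks on the reading end, as reader.read(model, in, base) would.
    static String pump(String[] chunks) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 8192);
        StringBuilder consumed = new StringBuilder();
        Thread parser = new Thread(() -> {
            try {
                int b;
                while ((b = in.read()) != -1) consumed.append((char) b);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        parser.start();
        // Stand-in for onBodyPartReceived: write each chunk as it "arrives".
        for (String chunk : chunks) out.write(chunk.getBytes(StandardCharsets.UTF_8));
        out.close();   // end of the HTTP body unblocks the final read()
        parser.join();
        return consumed.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pump(new String[] {"<rdf:RDF>", "<rdf:Description/>", "</rdf:RDF>"}));
    }
}
```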

> 
>> 
>> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return -- that works in an actor style.
>> 
>> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is the right level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Akka is a good example of one possible choice.
>> 
>>>  I did open the issue-203 so that when we agree on a solution we could send in
>>> some patches.
>> 
>> Look forward to seeing this,
>> 
>> 	Andy
> 

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Henry Story <he...@bblfish.net>.
On 29 Jan 2012, at 23:04, Andy Seaborne wrote:

> Hi Henry,
> 
> On 29/01/12 21:40, Henry Story wrote:
>>   [ I just opened a bug report for this, but it was suggested that a wider
>> discussion on how to do it would be useful on this list. ]
> 
> The thread of interest is:
> 
> http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html
> 
>> Unless I am mistaken the only way to parse some content is using methods that use an
>> InputStream such as this:
>> 
>>     val m = ModelFactory.createDefaultModel()
>>      m.getReader(lang.jenaLang).read(m, in, base.toString)
> 
> As already commented on the thread, passing the reader to an actor allows async reading.  Readers are configurable - you can have anything you like.  No reason why the RDFReader can't be using async NIO.

Mhh, can I call at time t1

   reader.read( model, inputStream, base);

with an inputStream that only contains a chunk of the data? And then call it again with
another chunk of the data later with a newly filled input stream that contains the next segment
of the data?

   reader.read( model, inputStream2, base);

It says nothing about that in the documentation, so I just assumed it does not work...

> 
> There is also RIOT - have you looked at passing the read request to a parser in an actor, then catching the Sink<Triple> interface for the return -- that works in an actor style.
> 
> The key question is what Jena can enable, so that possibilities can be built on top.  I don't think Jena is the right level at which to pick one approach over another, as it is in danger of clashing with other choices in the application.  Akka is a good example of one possible choice.
> 
>>   I did open the issue-203 so that when we agree on a solution we could send in
>> some patches.
> 
> Look forward to seeing this,
> 
> 	Andy

Social Web Architect
http://bblfish.net/


Re: Support for Non Blocking Parsers

Posted by Andy Seaborne <an...@apache.org>.
Hi Henry,

On 29/01/12 21:40, Henry Story wrote:
>    [ I just opened a bug report for this, but it was suggested that a wider
> discussion on how to do it would be useful on this list. ]

The thread of interest is:

http://www.mail-archive.com/jena-users@incubator.apache.org/msg02451.html

> Unless I am mistaken the only way to parse some content is using methods that use an
> InputStream such as this:
>
>      val m = ModelFactory.createDefaultModel()
>       m.getReader(lang.jenaLang).read(m, in, base.toString)

As already commented on the thread, passing the reader to an actor 
allows async reading.  Readers are configurable - you can have anything 
you like.  No reason why the RDFReader can't be using async NIO.

There is also RIOT - have you looked at passing the read request to a 
parser in an actor, then catching the Sink<Triple> interface for the 
return -- that works in an actor style.
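For illustration, the Sink side of that can be decoupled from the parsing thread with a queue: the parser calls send() as it produces triples, and an actor-style consumer drains them on another thread. A self-contained sketch - the Sink interface is written out by hand here to mirror RIOT's send/flush/close shape, and plain strings stand in for Triple:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SinkQueue {
    // Hand-written stand-in for RIOT's Sink<T> (send/flush/close shape).
    interface Sink<T> {
        void send(T item);
        void flush();
        void close();
    }

    // A sink that hands items to a consumer via a queue, so the parser
    // thread never waits on downstream processing.
    static class QueueSink<T> implements Sink<T> {
        final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
        public void send(T item) { queue.add(item); }
        public void flush() {}
        public void close() {}
    }

    public static void main(String[] args) throws Exception {
        QueueSink<String> sink = new QueueSink<>();
        // Parser side: pushes "triples" (strings here) as it produces them.
        sink.send("<s> <p> <o1> .");
        sink.send("<s> <p> <o2> .");
        sink.close();
        // Consumer side: an actor or worker thread drains the queue.
        System.out.println(sink.queue.take());
        System.out.println(sink.queue.take());
    }
}
```

Note this only makes the *output* side asynchronous; the parser itself still blocks on its InputStream, which is the problem discussed above.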

The key question is what Jena can enable, so that possibilities 
can be built on top.  I don't think Jena is the right level at which 
to pick one approach over another, as it is in danger of clashing 
with other choices in the application.  Akka is a good example of 
one possible choice.

>    I did open the issue-203 so that when we agree on a solution we could send in
> some patches.

Look forward to seeing this,

	Andy