Posted to users@jena.apache.org by Iain Ritchie <ia...@gmail.com> on 2013/08/02 22:22:55 UTC

Jena SPARQL Insert - Fuseki Best Practice

Hello,

I am trying to insert a few tens of thousands of triples into a Fuseki
server running in memory. After ten thousand or so inserts I start to
get sporadic exceptions, which become more frequent as the number of
inserts increases:

org.apache.jena.atlas.AtlasException: java.net.SocketException: No
buffer space available (maximum connections reached?): connect
	at org.apache.jena.atlas.io.IO.exception(IO.java:199)
	at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:299)
	at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:239)
	at com.hp.hpl.jena.sparql.modify.UpdateProcessRemote.execute(UpdateProcessRemote.java:61)

The server is running with default memory (I believe it to be 1GB),
but I have also tried setting it to 2GB, with no change. After the
inserts have completed, the process consumes only a few hundred
megabytes, having started at around 50MB with an empty data set:

java -jar -Xmx2048m  fuseki-server.jar --update --mem /ds

I would like to make sure that my code is correct, and that I should
not be doing any cleanup/connection closing following the execution of
the query:

UpdateRequest q = UpdateFactory.create((String) queryArrayList.get(i));
UpdateProcessor u = UpdateExecutionFactory.createRemote(q,
SPARQL_UPDATE_END_POINT);
u.execute();

The SPARQL query for each update is a single insert statement
consisting of INSERT DATA { }

I'd welcome thoughts from the group.

Many Thanks.

Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Andy Seaborne <an...@apache.org>.
On 08/08/13 11:41, Andy Seaborne wrote:
> More information ...
>
> (a bit of a "doh!" moment and actually reading the documentation (=code)
> for Apache HttpClient ...)
>
> If connection pooling is enabled, then the system does not run out of
> network resources and will happily run continuously without needing a
> pause.
>
> This fix will need a change to Jena to reuse a connection manager and not
> create it afresh (unless you want to mess around with the low level HTTP
> code and not use UpdateProcessRemote.execute :-)
>
> A semi-temporary fix has been applied to SVN which will work in all normal
> cases (it sets the Java system property "http.keepAlive" globally if not
> already set - that's pretty ugly).  A complete fix without messing with
> system properties to come later.
>
> Fix applied to SVN, it will be in tonight's development build.

Not in the build - it seems to lead to instability elsewhere, so this 
needs more investigation.  Getting closer, but not there yet.



Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Andy Seaborne <an...@apache.org>.
More information ...

(a bit of a "doh!" moment and actually reading the documentation (=code) 
for Apache HttpClient ...)

If connection pooling is enabled, then the system does not run out of 
network resources and will happily run continuously without needing a 
pause.

This fix will need a change to Jena to reuse a connection manager and not 
create it afresh (unless you want to mess around with the low level HTTP 
code and not use UpdateProcessRemote.execute :-)

A semi-temporary fix has been applied to SVN which will work in all normal 
cases (it sets the Java system property "http.keepAlive" globally if not 
already set - that's pretty ugly).  A complete fix without messing with 
system properties to come later.

Fix applied to SVN, it will be in tonight's development build.

	Andy
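On the client side, the same interim workaround can be applied without waiting for a build. Below is a minimal sketch of the system-property approach described above; the property name is the one named in the message, while the class and method are purely illustrative:

```java
public class KeepAliveWorkaround {
    // Enable HTTP keep-alive globally unless the caller has already chosen
    // a value, mirroring the interim fix described above. With keep-alive
    // on, the JVM's default HTTP handling reuses connections instead of
    // opening a fresh one for every update.
    public static void enableKeepAlive() {
        if (System.getProperty("http.keepAlive") == null) {
            System.setProperty("http.keepAlive", "true");
        }
    }

    public static void main(String[] args) {
        enableKeepAlive();
        System.out.println("http.keepAlive = " + System.getProperty("http.keepAlive"));
    }
}
```

Calling this once at startup, before any update is sent, would match the effect of the SVN change without depending on a development build.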



Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Andy Seaborne <an...@apache.org>.
On 05/08/13 17:04, Iain Ritchie wrote:
> Hi,
>
> In answer to your questions:
>
> - Fuseki build 0.2.7
> - Yes the stack trace was from the client, no errors visible from the server
> - OS is Windows, with client and Fuseki running on the same machine.
>
> I worked around this issue by introducing a small sleep between
> inserts as you also suggested.
>
> Many Thanks for your support.

A better workaround would be to batch updates into a single request - 
say 1000 items.

 > queryArrayList.get(i)

Even as simple as joining with ";"

INSERT DATA { ... } ; INSERT DATA { ... } ; INSERT DATA { ... } ;

but
INSERT DATA {
    triple1
    triple2
    triple3
...
}

is even better,

	Andy
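The batching Andy suggests can be done with plain string handling before anything touches the network. A minimal sketch follows, where buildBatchedInsert is a hypothetical helper and the triples are assumed to already be serialized as N-Triples-style strings:

```java
import java.util.List;
import java.util.StringJoiner;

public class BatchInsert {
    // Build one SPARQL update that inserts every triple in a single HTTP
    // request, instead of one POST per triple as in the original loop.
    static String buildBatchedInsert(List<String> triples) {
        StringJoiner body = new StringJoiner("\n    ", "INSERT DATA {\n    ", "\n}");
        for (String triple : triples) {
            body.add(triple);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String update = buildBatchedInsert(List.of(
                "<http://example/s1> <http://example/p> \"a\" .",
                "<http://example/s2> <http://example/p> \"b\" ."));
        System.out.println(update);
        // The batched string would then be sent once, e.g.:
        //   UpdateRequest q = UpdateFactory.create(update);
        //   UpdateExecutionFactory.createRemote(q, SPARQL_UPDATE_END_POINT).execute();
    }
}
```

One request of a thousand triples costs one connection instead of a thousand, which sidesteps the socket exhaustion entirely.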



Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Iain Ritchie <ia...@gmail.com>.
Hi,

In answer to your questions:

- Fuseki build 0.2.7
- Yes the stack trace was from the client, no errors visible from the server
- OS is Windows, with client and Fuseki running on the same machine.

I worked around this issue by introducing a small sleep between
inserts as you also suggested.

Many Thanks for your support.



Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Andy Seaborne <an...@apache.org>.

Are the server and the client running on the same machine?
And what hardware and OS are you using?  This seems to be significant.

It seems to be effectively a DoS attack on the server that manifests 
itself in the OS running out of network system resources.  That then 
breaks the client.

I can sort-of replicate your example (I get a different IO exception; 
and I got to 37945 and 55901 iterations before an error) by code that 
sits in a tight loop and hammers the server with a one-line SPARQL 
INSERT DATA.

+ The number of iterations before failure is not the same each time

+ If I add explicit closing code, the iterations-to-failure goes down 
(!)  Even complete closedown of the client side HttpClient code does not 
change things. Seems the client side resource management is OK.

+ If I add a slight pause in the loop every so often, it works (I 
stopped the test program at 1.1 million).  Maybe this gives the kernel a 
chance to catch up.

+ Forcing Java GC in the client makes no visible difference

My current explanation is that the OS takes time to clear up network 
connections asynchronously with the client.  It is effectively a DoS 
attack on the server - having the server on the same machine means both 
it and the client are competing for OS resources.  It's probably more on 
the server side than the client - Jetty may not be instantly releasing 
network resources in the hope of reuse.

It is better to write updates in blocks anyway - not one triple at a time.

On the client side, Jena HttpOp could manage a pool of connections and 
reuse them (HTTP connection caching). That mitigates the problem, but 
if, as I guess, it is the server/OS lagging on network resource 
recycling, then it isn't a real fix.

	Andy
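The pause experiment described above can be sketched as a loop that yields every so often. BATCH, PAUSE_MS, and the point where the update would be sent are all illustrative stand-ins, not Jena API:

```java
public class PacedInserts {
    static final int BATCH = 1000;    // illustrative: iterations between pauses
    static final long PAUSE_MS = 50;  // illustrative: length of each pause

    // Run `iterations` one-triple updates, pausing briefly every BATCH
    // iterations so the OS gets a chance to recycle network resources
    // instead of being hammered in a tight loop.
    static int run(int iterations) throws InterruptedException {
        int sent = 0;
        for (int i = 0; i < iterations; i++) {
            // the update POST, e.g. UpdateProcessRemote.execute(), would go here
            sent++;
            if (sent % BATCH == 0) {
                Thread.sleep(PAUSE_MS);
            }
        }
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(2500)); // prints 2500 after two short pauses
    }
}
```

This is a mitigation, not a fix: batching the triples into fewer, larger requests remains the better answer.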


Re: Jena SPARQL Insert - Fuseki Best Practice

Posted by Andy Seaborne <an...@apache.org>.
On 02/08/13 21:22, Iain Ritchie wrote:
> I would like to make sure that my code is correct, and that I should
> not be doing any cleanup/connection closing following the execution of
> the query:
>
> UpdateRequest q = UpdateFactory.create((String) queryArrayList.get(i));
> UpdateProcessor u = UpdateExecutionFactory.createRemote(q,
> SPARQL_UPDATE_END_POINT);
> u.execute();
>
> The SPARQL query for each update is a single insert statement
> consisting of INSERT DATA { }

Unrelated:
You can use one single INSERT DATA { } of all the data


Which version of Fuseki is this?  It looks like a recent one but which?

Is this:
http://answers.semanticweb.com/questions/23955/javalangoutofmemoryerror-java-heap-space-error-in-fuseki

Presumably, the Fuseki log does not show anything?  The stack trace 
appears to be from the client sending side (please confirm), so anything 
on the Fuseki server side is going to make no difference.

I use Fuseki running for long periods of time, so I don't think it's the 
server per se - it's more likely to be the client code.

From a code inspection, and looking at Stack Overflow, I can see we may 
not be using HttpClient correctly.

Recorded as:
https://issues.apache.org/jira/browse/JENA-498

	Andy