Posted to users@jena.apache.org by Iain Ritchie <ia...@gmail.com> on 2013/08/02 22:22:55 UTC
Jena SPARQL Insert - Fuseki Best Practice
Hello,
I am trying to insert a few tens of thousands of triples into a Fuseki
server running in memory. After 10 thousand or so inserts I start to
get sporadic exceptions, which become more frequent as the number of
inserts increases:
org.apache.jena.atlas.AtlasException: java.net.SocketException: No
buffer space available (maximum connections reached?): connect
at org.apache.jena.atlas.io.IO.exception(IO.java:199)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:299)
at org.apache.jena.riot.web.HttpOp.execHttpPost(HttpOp.java:239)
at com.hp.hpl.jena.sparql.modify.UpdateProcessRemote.execute(UpdateProcessRemote.java:61)
The server is running with default memory (I believe it to be 1GB),
but I have also tried to set it to 2GB without any change. After the
inserts have completed the process only consumes a few hundred
megabytes, having started with an empty data set consuming around
50MB:
java -jar -Xmx2048m fuseki-server.jar --update --mem /ds
I would like to make sure that my code is correct, and that I should
not be doing any cleanup/connection closing following the execution of
the query:
UpdateRequest q = UpdateFactory.create((String) queryArrayList.get(i));
UpdateProcessor u = UpdateExecutionFactory.createRemote(q,
SPARQL_UPDATE_END_POINT);
u.execute();
The SPARQL query for each update is a single insert statement
consisting of INSERT DATA { }
Welcome thoughts from the group.
Many Thanks.
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Andy Seaborne <an...@apache.org>.
On 08/08/13 11:41, Andy Seaborne wrote:
> Fix applied to SVN, it will be in tonight's development build.
Not in the build - it seems to lead to instability elsewhere so this
needs more investigation. Getting closer, but not there yet.
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Andy Seaborne <an...@apache.org>.
More information ...
(a bit of a "doh!" moment and actually reading the documentation (=code)
for Apache HttpClient ...)
If connection pooling is enabled, then the system does not run out of
network resources and will happily run continuously without needing a
pause.
This fix will need a change to Jena to reuse a connection manager rather
than create it afresh (unless you want to mess around with the low-level
HTTP code and not use UpdateProcessRemote.execute :-)
A semi-temporary fix has been applied to SVN which will work in all
normal cases (it sets the Java system property "http.keepAlive" globally
if not already set - that's pretty ugly). A complete fix without messing
with system properties is to come later.
Fix applied to SVN, it will be in tonight's development build.
Andy
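The interim fix described above amounts to setting a guarded default for the system property. A minimal standalone sketch of that pattern (illustrative class and method names, not Jena's actual code):

```java
public class KeepAliveDefault {
    // Enable HTTP keep-alive globally, but only if the user has not
    // already chosen a value for the property themselves - overriding
    // a user-set value would be surprising, hence the null check.
    static void ensureKeepAlive() {
        if (System.getProperty("http.keepAlive") == null) {
            System.setProperty("http.keepAlive", "true");
        }
    }

    public static void main(String[] args) {
        ensureKeepAlive();
        System.out.println(System.getProperty("http.keepAlive"));
    }
}
```

The "pretty ugly" part is that the property is global to the whole JVM, so a library setting it affects every HTTP client in the process.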
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Andy Seaborne <an...@apache.org>.
On 05/08/13 17:04, Iain Ritchie wrote:
> I worked around this issue by introducing a small sleep between
> inserts as you also suggested.
A better workaround would be to batch updates into a single request -
say 1000 items.
> queryArrayList.get(i)
Even as simple as joining with ";"
INSERT DATA { ... } ; INSERT DATA { ... } ; INSERT DATA { ... } ;
but
INSERT DATA {
triple1
triple2
triple3
...
}
is even better.
Andy
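The ";" joining suggested above can be sketched as a small helper; the class and method names here are illustrative, and the resulting string would go through the same UpdateFactory.create / UpdateExecutionFactory.createRemote path shown in the original post:

```java
import java.util.List;

public class UpdateBatcher {
    // Combine several single-operation SPARQL updates into one
    // multi-operation update string, so a batch of N inserts costs
    // one HTTP request instead of N.
    static String joinBatch(List<String> updates) {
        return String.join(" ; ", updates);
    }

    public static void main(String[] args) {
        List<String> updates = List.of(
                "INSERT DATA { <urn:ex:s1> <urn:ex:p> <urn:ex:o> }",
                "INSERT DATA { <urn:ex:s2> <urn:ex:p> <urn:ex:o> }");
        System.out.println(joinBatch(updates));
    }
}
```

In the loop from the original post, this would mean slicing queryArrayList into chunks of, say, 1000 and executing one joined request per chunk.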
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Iain Ritchie <ia...@gmail.com>.
Hi,
In answer to your questions:
- Fuseki build 0.2.7
- Yes the stack trace was from the client, no errors visible from the server
- OS is Windows, with client and Fuseki running on the same machine.
I worked around this issue by introducing a small sleep between
inserts as you also suggested.
Many Thanks for your support.
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Andy Seaborne <an...@apache.org>.
On 03/08/13 20:50, Andy Seaborne wrote:
Are the server and the client running on the same machine?
And what hardware and OS are you using? This seems to be significant.
It seems to be effectively a DoS attack on the server that manifests
itself in the OS running out of network system resources. That then
breaks the client.
I can sort-of replicate your example (I get a different IO exception;
and I got to 37945 and 55901 iterations before an error) by code that
sits in a tight loop and hammers the server with a one-line SPARQL
INSERT DATA.
+ The number of iterations before failure is not the same each time
+ If I add explicit closing code, the iterations-to-failure goes down
(!) Even complete closedown of the client side HttpClient code does not
change things. Seems the client side resource management is OK.
+ If I add a slight pause in the loop every so often, it works (I
stopped the test program at 1.1 million). Maybe this gives the kernel a
chance to catch up.
+ Forcing Java GC in the client makes no visible difference
My current explanation is that the OS is taking time to clean up the
network connections asynchronously from the client. It is effectively a
DoS attack on the server - having the server on the same machine means
both it and the client are competing for OS resources. It's probably
more on the server side than the client - Jetty may not be instantly
releasing network resources in the hope of reuse.
It is better to write updates in blocks anyway - not a triple at a time.
On the client side, Jena HttpOp could manage a pool of connections and
reuse them (HTTP connection caching). That mitigates the problem, but
if it is, as I guess, the server/OS lagging on network resource
recycling, then it isn't a real fix.
Andy
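The pause-every-so-often workaround described above can be sketched as follows; the batch size and sleep time are illustrative, not values from the thread, and the actual update call is omitted:

```java
public class ThrottledLoop {
    // Run `total` updates, pausing `sleepMillis` after every `batch`
    // requests so the OS has time to finish recycling closed
    // connections before the client opens more.
    static int runLoop(int total, int batch, long sleepMillis)
            throws InterruptedException {
        int executed = 0;
        for (int i = 0; i < total; i++) {
            // sendOneUpdate(i);  // one small SPARQL update (omitted here)
            executed++;
            if (executed % batch == 0) {
                Thread.sleep(sleepMillis);
            }
        }
        return executed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runLoop(5000, 1000, 50));
    }
}
```

As noted later in the thread, batching the updates themselves is the better fix; throttling only papers over the socket exhaustion.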
Re: Jena SPARQL Insert - Fuseki Best Practice
Posted by Andy Seaborne <an...@apache.org>.
On 02/08/13 21:22, Iain Ritchie wrote:
> The SPARQL query for each update is a single insert statement
> consisting of INSERT DATA { }
Unrelated:
You can use one single INSERT DATA { } of all the data
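A sketch of building that single INSERT DATA request (hypothetical helper; the triples are shown as pre-formatted strings in SPARQL's N-Triples-style syntax):

```java
import java.util.List;

public class SingleInsertBuilder {
    // Accumulate all the triples into one INSERT DATA block so the
    // whole load is a single update request rather than one per triple.
    static String buildInsertData(List<String> triples) {
        StringBuilder sb = new StringBuilder("INSERT DATA {\n");
        for (String t : triples) {
            sb.append("  ").append(t).append(" .\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildInsertData(List.of(
                "<urn:ex:s1> <urn:ex:p> <urn:ex:o>",
                "<urn:ex:s2> <urn:ex:p> <urn:ex:o>")));
    }
}
```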
Which version of Fuseki is this? It looks like a recent one but which?
Is this:
http://answers.semanticweb.com/questions/23955/javalangoutofmemoryerror-java-heap-space-error-in-fuseki
Presumably, the Fuseki log does not show anything? The stack trace
appears to be from the client sending side (please confirm), so anything
on the Fuseki server side is going to make no difference.
I use Fuseki running for long periods of time so I don't think it's the
server per se - it's more likely to be the client code.
From a code inspection, and looking at Stack Overflow, I can see we may
not be using HttpClient correctly.
Recorded as:
https://issues.apache.org/jira/browse/JENA-498
Andy