Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2015/08/06 17:45:15 UTC

Re: Errors DELETEing large-ish Graph from Fuseki

Hi Ric,

How did you get on?

Did batch sizes > 50k help?
and/or tdb:transactionJournalWriteBlockMode

	Andy


On 17/07/15 09:37, Ric Roberts wrote:
> Thanks - I’ll look into that, Stephen.
>
> I’d been trying it in 50k-triple chunks but the performance isn’t great: it takes about 1 min per request.
>
>
>> On 16 Jul 2015, at 20:37, Stephen Allen <sa...@apache.org> wrote:
>>
>> Hi Ric,
>>
>> You could try setting two properties for your dataset:
>>
>> <#yourdatasetname> rdf:type tdb:DatasetTDB ;
>>    ja:context [ ja:cxtName "tdb:transactionJournalWriteBlockMode" ;
>>                 ja:cxtValue "mapped" ] ;
>>    ja:context [ ja:cxtName "arq:spillToDiskThreshold" ;
>>                 ja:cxtValue 10000 ] .
>>
>> The first one will use a temporary memory mapped file for storing
>> uncommitted TDB blocks.  By default the blocks are stored in heap memory.
>> The second option will cause the update engine to store temporary bindings
>> generated during the delete operation to be written out to a temporary file
>> on disk after the specified threshold is passed.  For the first option, you
>> can also use "direct", which will use process heap instead of JVM heap.
>>
>> Both of these options should reduce the heap usage and maybe get you to
>> what you are looking for.  Try just the first option (memory mapped blocks)
>> first and see if that does it, since the second option will likely reduce
>> performance a bit.
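For the "direct" variant Stephen mentions, presumably only the context value changes. A minimal sketch, reusing the same placeholder dataset name from the snippet above:

```turtle
<#yourdatasetname> rdf:type tdb:DatasetTDB ;
   ja:context [ ja:cxtName "tdb:transactionJournalWriteBlockMode" ;
                ja:cxtValue "direct" ] .
```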
>>
>> But Andy's suggestion of breaking up the query with limited subselects
>> should be working for you, since it also limits the size of the heap.
>>
>> -Stephen
>>
>>
>> On Wed, Jul 15, 2015 at 7:04 AM, Ric Roberts <ri...@swirrl.com> wrote:
>>
>>> Hello again. I’ve tried upgrading to Fuseki 1.1.2, and it now gives a heap
>>> space error after only 6 minutes (instead of 60-odd).
>>>
>>> There are a few hundred graphs in the database, but most of them small
>>> (apart from the one I’m trying to delete).
>>>
>>> Does this mean that using the Graph Protocol is a no-go? I’ll try the
>>> batched deletes...
>>>
>>>
>>>> On 13 Jul 2015, at 22:51, Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>> On 13/07/15 21:31, Andy Seaborne wrote:
>>>>> Hi Ric,
>>>>>
>>>>> Could you please try Fuseki 1.1.2 or Fuseki 2.0.0?
>>>>>
>>>>> How many datasets does the server host?
>>>>>
>>>>> 1.0.1 was Jan 2014 and IIRC this area has changed, especially DELETE of
>>>>> a graph with the Graph Store Protocol.  However, if this is just due to
>>>>> transaction overheads (it's not immediately clear it is or is not), then
>>>>> DELETE {} WHERE { SELECT {...} LIMIT } is the way to go for an immediate
>>>>> solution.
>>>>>
>>>>> TDB1 (i.e. the Jena code) is a bit memory hungry for transactions.
>>>>>
>>>>> TDB2 is not memory bound but it isn't in the Jena codebase.  It has been
>>>>> tested with 100 million triple loads in a single Fuseki2 upload.
>>>>>
>>>>> See
>>>>>   http://www.sparql.org/validate/update
>>>>
>>>> That's the service endpoint.
>>>>
>>>> http://www.sparql.org/update-validator.html
>>>>
>>>> is the HTML form.
>>>>
>>>>> for checking syntax.
>>>>>
>>>>>     Andy
>>>>>
>>>>> On 13/07/15 18:59, Ric Roberts wrote:
>>>>>> Hi. I’m having problems deleting a moderately large graph from a
>>>>>> jena-fuseki-1.0.1 database.
>>>>>>
>>>>>> The graph contains approximately 60 million triples, and the database
>>>>>> contains about 70 million triples in total.
>>>>>>
>>>>>> I’ve started Fuseki with a 16G heap (JVM_ARGS=${JVM_ARGS:--Xmx16000M}).
>>>>>> The server has 32G RAM.
>>>>>>
>>>>>> When I issue the DELETE command over http, I see this in the fuseki
>>> log:
>>>>>>
>>>>>> 16:12:03 INFO  [24] DELETE
>>>>>> http://127.0.0.1:3030/stagingdb/data?graph=http://example.com/graph
>>>>>> 17:10:40 WARN  [24] RC = 500 : Java heap space
>>>>>> 17:10:40 INFO  [24] 500 Java heap space (3,517.614 s)
>>>>>>
>>>>>> i.e. it takes about an hour, and then 500s with an error about heap
>>>>>> space.
>>>>>>
>>>>>> I’ve also tried DROP and CLEAR SPARQL update statements, but they
>>>>>> time out against our default endpoint timeout of 30s.
>>>>>>
>>>>>> I’ve also tried deleting 1000 triples at a time from the graph, by
>>>>>> issuing a SPARQL update statement like this:
>>>>>>
>>>>>> DELETE {
>>>>>> GRAPH <http://example.com/graph>
>>>>>>    { ?s ?p ?o }
>>>>>> }
>>>>>> WHERE {
>>>>>>   GRAPH <http://example.com/graph>
>>>>>>     { ?s ?p ?o }
>>>>>> }
>>>>>> LIMIT 1000
>>>>>>
>>>>>> … but this times out too (which surprised me, as I only asked it to
>>>>>> find and DELETE 1000 triples).
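For comparison, SPARQL 1.1 Update does not allow LIMIT directly on a DELETE/WHERE, which is why the statement above does not behave as hoped; the sub-SELECT form Andy suggests would look roughly like this (illustrative graph URI and batch size):

```sparql
DELETE { GRAPH <http://example.com/graph> { ?s ?p ?o } }
WHERE {
  SELECT ?s ?p ?o
  WHERE { GRAPH <http://example.com/graph> { ?s ?p ?o } }
  LIMIT 1000
}
```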
>>>>>>
>>>>>> What is the recommended way to delete this graph - I need to replace
>>>>>> its contents fairly urgently on a production system. We loaded it by
>>>>>> loading 10,000 triples at a time, which worked fine, but I’m having
>>>>>> trouble deleting its current contents first.
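A batched-delete driver along the lines discussed in this thread might be sketched as below. The graph URI and update endpoint are placeholders (the service name is made up); the update is built separately from the HTTP call so the construction can be checked on its own.

```python
import urllib.request

# Hypothetical values: substitute your own graph URI and Fuseki update endpoint.
GRAPH_URI = "http://example.com/graph"
UPDATE_ENDPOINT = "http://127.0.0.1:3030/stagingdb/update"

def batched_delete_update(graph_uri, batch_size):
    """Build one SPARQL Update that removes at most batch_size triples
    from graph_uri.  The LIMIT sits in a sub-SELECT, since SPARQL 1.1
    does not allow LIMIT directly on a DELETE/WHERE."""
    return (
        "DELETE { GRAPH <%s> { ?s ?p ?o } }\n"
        "WHERE { SELECT ?s ?p ?o\n"
        "        WHERE { GRAPH <%s> { ?s ?p ?o } }\n"
        "        LIMIT %d }" % (graph_uri, graph_uri, batch_size)
    )

def post_update(update_text):
    """POST a single update to the endpoint; returns the HTTP status code."""
    req = urllib.request.Request(
        UPDATE_ENDPOINT,
        data=update_text.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

One would call `post_update(batched_delete_update(GRAPH_URI, 50000))` in a loop; since the update response does not report how many triples were removed, an ASK query against the graph is one way to decide when to stop.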
>>>>>>
>>>>>> Any pointers appreciated.
>>>>>> Thanks, Ric.
>>>>>
>>>>
>>>
>>>
>


Re: Errors DELETEing large-ish Graph from Fuseki

Posted by Andy Seaborne <an...@apache.org>.
Hi Ric,

It would be very helpful to know how this went (for better or worse!).

	Andy

On 06/08/15 16:45, Andy Seaborne wrote:
> Hi Ric,
>
> How did you get on?
>
> Did batch sizes > 50k help?
> and/or tdb:transactionJournalWriteBlockMode
>
>      Andy