Posted to users@jena.apache.org by Amit Kumar <as...@gmail.com> on 2019/01/25 20:05:14 UTC

Updating a large TDB database on a live Fuseki server

My team has a big knowledge graph that we want to serve via a SPARQL
endpoint. We are looking into using Apache Fuseki for this. I have some
questions and was hoping someone here can guide me.

Right now, I'm working on a dataset of 175 million triples, which
translates to a TDB2 database of around 250 GB when loaded with
tdb2.tdbloader.
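
For reference, the load step looks roughly like this (the database
location and dump file name here are illustrative):

> tdb2.tdbloader --loc /data/kg-tdb2 kg-dump.nt.gz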

The entire knowledge base is regenerated once a day, and by our rough
count approximately 14 million triples (1.6 GB uncompressed, about 8% of
the data) change every day, counting both additions and deletions.

What is the best way to update a live Fuseki dataset when you have to
change such a large number of triples?

We have tried doing something like this

> curl -X POST -d @update.txt --header "Content-type: application/sparql-update" -v http://localhost:9999/my/update
>
>
Where update.txt file looks something like

> DELETE DATA {
> <sub1> <pred1> <obj1> .
> <sub2> <pred2> <obj2> .
> ...
> };
> INSERT DATA {
> <sub1> <pred1> <obj11> .
> <sub2> <pred2> <obj22> .
> ....
> }


It takes around 15-20 minutes on our beefy machine. I have some questions
regarding this approach:

   - Does making a curl request like this wrap the entire call in a
   transaction?
   - Is there a size limit on how big a call I can make?
   - My understanding is that the Fuseki server will have to download the
   full file on its side and then apply the changes. Is that correct? Also,
   will it affect any ongoing read requests running in parallel?
   - Is there a better way to update the database?

Thanks for your help.

Regards
Amit

Re: Updating a large TDB database on a live Fuseki server

Posted by Andy Seaborne <an...@apache.org>.

On 26/01/2019 21:42, Andy Seaborne wrote:
> 
> 
> On 25/01/2019 20:05, Amit Kumar wrote:
>> My team has a big knowledge graph that we want to serve via a SPARQL
>> endpoint. We are looking into using Apache Fuseki for this. I have some
>> questions and was hoping someone here can guide me.
>>
>> Right now, I'm working on a dataset of 175 million triples, which
>> translates to a TDB2 database of around 250 GB when loaded with
>> tdb2.tdbloader.
>>
>> The entire knowledge base is regenerated once a day, and by our rough
>> count approximately 14 million triples (1.6 GB uncompressed, about 8% of
>> the data) change every day, counting both additions and deletions.
>>
>> What is the best way to update a live Fuseki dataset when you have to
>> change such a large number of triples?
> 
> The method you describe below with INSERT DATA / DELETE DATA is fine for 
> TDB2. It is handled specially by the parser and the execution engine, and 
> the operations are executed as a stream straight onto the database.
> 
> There is another way that might be useful to you.
> 
>>
>> We have tried doing something like this
>>
>>> curl -X POST -d @update.txt --header "Content-type: 
>>> application/sparql-update" -v http://localhost:9999/my/update
> 
> --data-binary is a bit better.
> 
> (-d alters the file as it reads it - carriage returns and newlines are 
> stripped - while --data-binary sends it byte-for-byte.)
> 
>>>
>>>
>> Where update.txt file looks something like
>>
>>> DELETE DATA {
>>> <sub1> <pred1> <obj1> .
>>> <sub2> <pred2> <obj2> .
>>> ...
>>> };
>>> INSERT DATA {
>>> <sub1> <pred1> <obj11> .
>>> <sub2> <pred2> <obj22> .
>>> ....
>>> }
> 
> Good.
> 
> DELETE DATA and INSERT DATA are the way to go.
> 
>> It takes around 15-20 minutes on our beefy machine.

With an SSD?

Changes to the data involve a lot of random access, so an SSD can help - 
updating a live server is not as fast as the bulk loader, though.

>> I have some questions
>> regarding this approach:
>>
>>     - Does making a curl request like this wrap the entire call in a
>>     transaction?
> 
> Yes.
> 
>>     - Is there a size limit on how big a call I can make?
> 
> No, not on the Fuseki side.
> 
>>     - My understanding is that the Fuseki server will have to download
>>     the full file on its side and then apply the changes. Is that correct?
> 
> No - not in the setup using TDB2 - it should

it should process the incoming request as a stream inside a transaction, 
not buffer it first.

> 
>> Also,
>>     will it affect any ongoing read requests running in parallel?
> 
> No.
> 
> One writer (W) transaction and any number of reader (R) transactions 
> run in true parallel.
> 
>>     - Is there a better way to update the database?
> 
> For bulk changes, it is better to send the data in this way than to send 
> complex DELETE/INSERT ... WHERE updates.
> 
> -----------------
> 
> An alternative is provided by RDF Delta.
> https://afs.github.io/rdf-delta/
> 
> It is open source under the Apache License and, if the PMC accepts it, 
> will migrate to Jena. <disclosure : I'm "afs" on github>
> 
> This is a patch format similar to DELETE DATA/INSERT DATA, except that it 
> can be generated as a stream of changes as they happen, and it handles 
> blank nodes.
> 
> There is a Fuseki server with a built-in patch handler:
> http://central.maven.org/maven2/org/seaborne/rdf-delta/rdf-delta-dist/
> 
> Sending updates to a live server is one use case (we have that in 
> production with a customer as part of my day-job).
> 
> From that, it can be used to keep several servers in step (high 
> availability) from one used for updates (actually, in the general case 
> it can be a cluster of Fuseki servers behind a load balancer, and any 
> server can receive and execute an update).
> 
> This setup is also deployed in a different customer's cloud 
> infrastructure.
> 
> Just ask if you want to know more.
> 
>      Andy
> 
>>
>> Thanks for your help.
>>
>> Regards
>> Amit
>>

Re: Updating a large TDB database on a live Fuseki server

Posted by Andy Seaborne <an...@apache.org>.

On 25/01/2019 20:05, Amit Kumar wrote:
> My team has a big knowledge graph that we want to serve via a SPARQL
> endpoint. We are looking into using Apache Fuseki for this. I have some
> questions and was hoping someone here can guide me.
> 
> Right now, I'm working on a dataset of 175 million triples, which
> translates to a TDB2 database of around 250 GB when loaded with
> tdb2.tdbloader.
> 
> The entire knowledge base is regenerated once a day, and by our rough
> count approximately 14 million triples (1.6 GB uncompressed, about 8% of
> the data) change every day, counting both additions and deletions.
> 
> What is the best way to update a live Fuseki dataset when you have to
> change such a large number of triples?

The method you describe below with INSERT DATA / DELETE DATA is fine for 
TDB2. It is handled specially by the parser and the execution engine, and 
the operations are executed as a stream straight onto the database.

There is another way that might be useful to you.

> 
> We have tried doing something like this
> 
>> curl -X POST -d @update.txt --header "Content-type: application/sparql-update" -v http://localhost:9999/my/update

--data-binary is a bit better.

(-d alters the file as it reads it - carriage returns and newlines are 
stripped - while --data-binary sends it byte-for-byte.)
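
For example, with the same endpoint as above:

    curl -X POST --data-binary @update.txt \
         --header "Content-Type: application/sparql-update" \
         http://localhost:9999/my/update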

>>
>>
> Where update.txt file looks something like
> 
>> DELETE DATA {
>> <sub1> <pred1> <obj1> .
>> <sub2> <pred2> <obj2> .
>> ...
>> };
>> INSERT DATA {
>> <sub1> <pred1> <obj11> .
>> <sub2> <pred2> <obj22> .
>> ....
>> }

Good.

DELETE DATA and INSERT DATA are the way to go.

> It takes around 15-20 minutes on our beefy machine. I have some questions
> regarding this approach:
> 
>     - Does making a curl request like this wrap the entire call in a
>     transaction?

Yes.

>     - Is there a size limit on how big a call I can make?

No, not on the Fuseki side.

>     - My understanding is that the Fuseki server will have to download the
>     full file on its side and then apply the changes. Is that correct?

No - not in the setup using TDB2 - it should process the incoming request 
as a stream inside a transaction, not buffer it first.

> Also,
>     will it affect any ongoing read requests running in parallel?

No.

One writer (W) transaction and any number of reader (R) transactions run 
in true parallel.

>     - Is there a better way to update the database?

For bulk changes, it is better to send the data in this way than to send 
complex DELETE/INSERT ... WHERE updates.
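
That is, prefer the enumerated DELETE DATA/INSERT DATA form above to a 
pattern-based rewrite along these lines (an illustrative sketch, reusing 
the placeholder names from the example):

    DELETE { ?s <pred1> ?old }
    INSERT { ?s <pred1> <obj11> }
    WHERE  { ?s <pred1> ?old .
             FILTER ( ?old != <obj11> ) }

The WHERE pattern has to be matched against the whole database before any 
change can be applied, so an update like this cannot be streamed the way 
DELETE DATA and INSERT DATA can.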

-----------------

An alternative is provided by RDF Delta.
https://afs.github.io/rdf-delta/

It is open source under the Apache License and, if the PMC accepts it, 
will migrate to Jena. <disclosure : I'm "afs" on github>

This is a patch format similar to DELETE DATA/INSERT DATA, except that it 
can be generated as a stream of changes as they happen, and it handles 
blank nodes.
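
As a sketch, a patch covering the kind of change above might look roughly 
like this (directive syntax as described in the RDF Delta documentation: 
TX/TC bracket a transaction, D deletes a triple, A adds one; the IRIs are 
the same placeholders as before):

    TX .
    D <sub1> <pred1> <obj1> .
    A <sub1> <pred1> <obj11> .
    TC .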

There is a Fuseki server with a built-in patch handler:
http://central.maven.org/maven2/org/seaborne/rdf-delta/rdf-delta-dist/

Sending updates to a live server is one use case (we have that in 
production with a customer as part of my day-job).

From that, it can be used to keep several servers in step (high 
availability) from one used for updates (actually, in the general case 
it can be a cluster of Fuseki servers behind a load balancer, and any 
server can receive and execute an update).

This setup is also deployed in a different customer's cloud 
infrastructure.

Just ask if you want to know more.

     Andy

> 
> Thanks for your help.
> 
> Regards
> Amit
>