Posted to dev@jena.apache.org by "Simon Helsen (JIRA)" <ji...@apache.org> on 2012/09/18 22:25:09 UTC

[jira] [Created] (JENA-327) TDB Tx transaction lock to permit backups

Simon Helsen created JENA-327:
---------------------------------

             Summary: TDB Tx transaction lock to permit backups
                 Key: JENA-327
                 URL: https://issues.apache.org/jira/browse/JENA-327
             Project: Apache Jena
          Issue Type: Improvement
          Components: TDB
    Affects Versions: TDB 0.9.4
            Reporter: Simon Helsen


With large repositories, it is important to be able to create backups periodically, because recreating an RDF store with millions of triples can be prohibitively expensive. Moreover, it should be possible to take those backups while still allowing read activity on the store, since in many cases a complete shutdown is not possible. Before the introduction of TDB Tx, it was relatively straightforward to take the right locks on the client side and safely suspend all disk activity for long enough to back up the index.

However, since the introduction of TDB Tx, things have become more complicated because TDB Tx touches the disk at times other than when performing write/sync activities. Right now, with some understanding of how TDB Tx is implemented, clients can still avoid disk activity while a backup is taken, but depending on TDB Tx implementation details in this way is fragile. Moreover, we anticipate that in the future the merging of the journal into the main index may become entirely asynchronous for performance reasons. The moment that happens, clients no longer have any control over when the disk is touched.

For this reason, we are requesting the following feature: a "backup" lock (for lack of a better name). Its semantics are that while the lock is held, TDB Tx guarantees that no disk activity takes place, pausing activities if necessary. In other words, no write transaction should be able to complete, and read transactions will not attempt to merge the journal. The idea is that regular read activity can still continue. The API could be as simple as something like this:

try {
    // BACKUP is the mode proposed here; it is not part of the current ReadWrite enum
    dataset.begin(ReadWrite.BACKUP) ;

    // <do whatever is necessary to back up the index>

} finally {
    dataset.end() ;
}

As for the implementation, we suspect you already have locks in place that could be used to guarantee this behavior. For example, would txn.getBaseDataset().getLock().enterCriticalSection(Lock.WRITE) be sufficient?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Simon Helsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463125#comment-13463125 ] 

Simon Helsen commented on JENA-327:
-----------------------------------

"what counts as 'extremely large stores' in triple/quad count?" 

50 million+, i.e. many gigabytes of disk space

"A READ action does not depend on internal characteristics so it is the most stable option"

OK, so let me make sure I understand: you are saying that the best way to perform online backups is to do a quad dump of the dataset inside a READ transaction, and that this is guaranteed to remain safe over time?

I am not following "There is no need to flush the journal - just back it up like everything else". Is this a separate step that is required when backing up from the regular dataset in the READ transaction, or was this comment referring to a separate option?


                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460462#comment-13460462 ] 

Andy Seaborne commented on JENA-327:
------------------------------------

(what counts as 'extremely large stores' in triple/quad count?)

There can be no absolute guarantees about manipulating the database files directly, because such guarantees would constrain future features that are not yet known.

A READ action does not depend on internal characteristics so it is the most stable option.

There is no need to flush the journal - just back it up like everything else.  The requirement is that nothing is changing the files, and holding a WRITE lock ensures that.  I can't see that changing but, theoretically, it could.

The 3rd option is that you manage TDB activity yourself and hold everything up (and perhaps flush the journal manually: that is not currently necessary, but it would likely work with all future writeback schemes).

What you can't have is detailed low-level guarantees and also evolution of the system in the future.

A transaction type of EXCLUSIVE might be useful to add as a general feature, but it's not currently necessary.  Defining it and supporting it in future systems could turn out to be a burden, so to my mind adding it only when needed is the better way forward.
                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459519#comment-13459519 ] 

Andy Seaborne commented on JENA-327:
------------------------------------

You can start a READ transaction and then work with the dataset (or DatasetGraph). This is the most stable approach because it uses the current API contract.  Fuseki uses it to take dataset backups (it writes N-Quads).  Operations can continue while the backup is being taken.

Or you can start a WRITE transaction, whereby you are guaranteed nothing else will be changing the files on disk.  Even if async writeback is introduced, an open write transaction is going to hold back writeback.

Finally, you could manage the request flow in the client by holding everything up, flushing the journal manually, then backing up the files.

The READ mechanism is the safest long-term and does not block other threads (readers or writers).
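
To make the shape of the READ-transaction backup concrete, here is a minimal sketch of the begin / dump / end pattern. It deliberately does not use Jena's classes: `TxnStore`, `beginRead`, `writeQuads` and `backup` are hypothetical stand-ins invented for this illustration. In Jena itself the corresponding calls would be `dataset.begin(ReadWrite.READ)` and `dataset.end()`, with an N-Quads writer doing the dump.

```java
import java.io.PrintWriter;

final class ReadTxnBackup {

    /** Hypothetical stand-in for a transactional store; not a Jena interface. */
    interface TxnStore {
        void beginRead();                 // start a read transaction
        void end();                       // finish the transaction
        void writeQuads(PrintWriter out); // dump the snapshot seen by this transaction
    }

    /**
     * Dumps the store inside a read transaction. The finally block makes sure
     * the transaction is ended even if the dump fails part-way through.
     */
    static void backup(TxnStore store, PrintWriter out) {
        store.beginRead();
        try {
            store.writeQuads(out);
            out.flush();
        } finally {
            store.end();
        }
    }
}
```

The point of the try/finally is that a long-running dump that fails should not leave a read transaction open, since an open reader is what holds back the journal merge.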

                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Simon Helsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464023#comment-13464023 ] 

Simon Helsen commented on JENA-327:
-----------------------------------

OK, so if we write out N-Quads in a read transaction and that takes a long time, TDB/Tx will keep serving queries and permitting updates, but it won't merge the journal during that time (to honor the transactional semantics). What about memory usage? I am trying to assess whether this approach could cause other problems for a running system. And of course, a restore in such cases will take as long as it takes to import the quads. I did some quick calculations, and a store with 50 million quads would take 15 minutes on my machine, probably less on decent server hardware.
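
For reference, the figure quoted above (50 million quads in 15 minutes) implies a rate that can be worked out directly. The sketch below does that arithmetic and extrapolates to another store size. The class and method names are made up for this example, and the extrapolation assumes the rate stays constant as the store grows, which the thread does not claim.

```java
final class RestoreEstimate {

    /** Quads per second implied by processing the given number of quads in the given minutes. */
    static double quadsPerSecond(long quads, double minutes) {
        return quads / (minutes * 60.0);
    }

    /** Minutes needed for the given number of quads at a fixed rate (linear scaling assumed). */
    static double minutesToLoad(long quads, double quadsPerSecond) {
        return quads / quadsPerSecond / 60.0;
    }

    public static void main(String[] args) {
        // Figures from the comment: 50 million quads in 15 minutes.
        double rate = quadsPerSecond(50_000_000L, 15.0);
        System.out.printf("implied rate: %.0f quads/s%n", rate);
        // Hypothetical larger store, assuming the same rate holds.
        System.out.printf("200 million quads: %.1f minutes%n", minutesToLoad(200_000_000L, rate));
    }
}
```

The implied rate is roughly 55,000 quads per second, so under the linear-scaling assumption each additional 50 million quads adds about another 15 minutes.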
                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464008#comment-13464008 ] 

Andy Seaborne commented on JENA-327:
------------------------------------

Re: READ transaction backup.

This is the most stable option, and it means the disk format, or even the implementing system, can be changed, because the backup is in a standard format, N-Quads.

And the system can keep running at the same time.

It's not as fast as a disk copy, but a disk copy requires locking out writers and also requires stopping the system from doing any write-back. A WRITE transaction currently does that, but relying on it means depending on internal details.
                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Simon Helsen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13459725#comment-13459725 ] 

Simon Helsen commented on JENA-327:
-----------------------------------

First of all, we currently do a backup by copying files in the file system. Our initial impression was that this would be faster than any other approach, especially given that the restore has to be fast as well. I have no numbers right now, but for now we want to stick with this unless we have evidence that the N-Quads dump is fast enough on extremely large stores.

So if I understand you correctly, you are saying that a WRITE transaction will guarantee that nothing changes on disk even if async writeback is introduced? That sounds good and would serve our purpose. Why would using the READ mechanism be safer long-term?

The 3rd option is not clear to me. Are you saying there is an API (not internal interfaces) we can use to a) hold everything up and b) manually flush the journal?
                

[jira] [Commented] (JENA-327) TDB Tx transaction lock to permit backups

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464002#comment-13464002 ] 

Andy Seaborne commented on JENA-327:
------------------------------------

Re: "no need to flush the journal"

Unrelated to a READ backup.

If you know the disk is quiescent, you can copy the files out for backup (you can't copy files back in underneath a running system).

Just copy the disk state. If the journal is non-empty, it will be processed when the system starts up again, so there is no need to flush it before the backup.
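
A file-level copy of this kind can be sketched with the standard Java NIO API. This is only an illustration, not something TDB provides: `QuiescentCopy` and `copyTree` are names invented here, and the code assumes the caller has already made sure nothing is writing to the database directory. It copies every file as it finds it, journal included, without trying to interpret or flush anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

final class QuiescentCopy {

    /**
     * Copies the whole directory tree under source into target and returns the
     * number of regular files copied. Assumes the source directory is quiescent.
     */
    static int copyTree(Path source, Path target) {
        try (Stream<Path> paths = Files.walk(source)) {
            int copied = 0;
            // Files.walk visits a directory before its contents, so parent
            // directories always exist by the time a file is copied.
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(p, dest,
                            StandardCopyOption.REPLACE_EXISTING,
                            StandardCopyOption.COPY_ATTRIBUTES);
                    copied++;
                }
            }
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the journal is copied along with everything else, a restored copy would replay any pending entries on its next start-up, as described above. The copy must go into a separate location, never over the files of a running system.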



                