Posted to dev@jackrabbit.apache.org by "Shashank Gupta (JIRA)" <ji...@apache.org> on 2014/03/12 05:46:43 UTC

[jira] [Comment Edited] (JCR-3733) Asynchronous upload file to S3

    [ https://issues.apache.org/jira/browse/JCR-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906853#comment-13906853 ] 

Shashank Gupta edited comment on JCR-3733 at 3/12/14 4:45 AM:
--------------------------------------------------------------

h2. Specification 
h3. S3DataStore Asynchronous Upload to S3
Currently, adding a file record to S3DataStore first adds the file to the local cache and then uploads it to S3, all in a single synchronous step. This feature splits that logic into a synchronous add to the local cache followed by an asynchronous upload of the file to S3. Until the asynchronous upload completes, all data (input stream, length and lastModified) for that file record is fetched from the local cache.
The AWS SDK provides [upload progress listeners|http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/event/ProgressListener.html] which offer callbacks on the status of an in-progress upload.

h3. Flag to turn it off
The parameter 'asyncUploadLimit' caps the number of concurrent asynchronous uploads to S3. Once this limit is reached, subsequent uploads to S3 are synchronous until one of the asynchronous uploads completes. To disable this feature, set the asyncUploadLimit parameter to 0 in repository.xml; the default is 100.
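For illustration, the parameter would be set on the DataStore element in repository.xml roughly as below (the class name follows the usual jackrabbit-aws-ext configuration; treat the surrounding structure as a sketch, not an authoritative config):

```xml
<DataStore class="org.apache.jackrabbit.aws.ext.ds.S3DataStore">
    <!-- 0 disables asynchronous uploads; the default is 100 -->
    <param name="asyncUploadLimit" value="0"/>
</DataStore>
```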

h3. Caution
# This feature should not be used in a clustered active-active Jackrabbit deployment: a file may not be fully uploaded to S3 before it is accessed on another node. In active-passive clustered mode, incomplete asynchronous uploads must be uploaded to S3 manually after a failover.
# When using this feature, it is strongly recommended NOT to delete any file from the local cache manually, as the local cache may contain files whose uploads to S3 have not yet completed.

h3. Asynchronous Upload Cache
S3DataStore keeps an AsyncUploadCache which tracks in-progress asynchronous uploads. This class contains two data structures: a \{@link Map<String, Long>\} from file path to lastModified for in-progress asynchronous uploads, and a \{@link Set<String>\} of in-progress uploads that were marked for delete while their upload was still running. When an asynchronous upload is initiated, an entry is added to this cache; when the upload completes, the corresponding entry is removed. Any modification to this cache is immediately serialized to the filesystem in a synchronized code block.
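The two structures and their synchronized updates could be sketched roughly as follows (class, field and method names here are illustrative, not the actual Jackrabbit API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the AsyncUploadCache described above.
class AsyncUploadCacheSketch {
    // file path -> lastModified for in-progress asynchronous uploads
    private final Map<String, Long> inProgress = new HashMap<>();
    // uploads marked for delete while still in progress
    private final Set<String> toBeDeleted = new HashSet<>();

    synchronized void add(String path, long lastModified) {
        inProgress.put(path, lastModified);
        persist(); // every modification is serialized to the filesystem
    }

    // called from the upload-complete callback; returns true if the
    // record was marked for delete while the upload was running
    synchronized boolean remove(String path) {
        inProgress.remove(path);
        boolean markedForDelete = toBeDeleted.remove(path);
        persist();
        return markedForDelete;
    }

    synchronized void markForDelete(String path) {
        if (inProgress.containsKey(path)) {
            toBeDeleted.add(path);
            persist();
        }
    }

    synchronized boolean isInProgress(String path) {
        return inProgress.containsKey(path);
    }

    private void persist() {
        // in the real implementation both structures are serialized to a
        // file here; omitted in this sketch
    }
}
```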

h3. Semantics of various DataStore and DataRecord APIs w.r.t. AsyncUploadCache
Before this feature, S3 was the single source of truth. For example, DataStore#getRecordIfStored(DataIdentifier) returned a DataRecord only if the dataIdentifier existed in S3, and null otherwise; whether the dataIdentifier existed in the local cache did not matter. With this feature, S3 remains the source of truth for completed uploads, while AsyncUploadCache is the source of truth for in-progress asynchronous uploads.

h4. DataRecord DataStore#addRecord(InputStream)
Checks whether an asynchronous upload can be started for the input stream, based on asyncUploadLimit and the current local cache size. If the local cache advises proceeding asynchronously, this method adds an entry to AsyncUploadCache and starts the asynchronous upload; otherwise it uploads synchronously to S3. If an asynchronous upload is already in progress for that dataIdentifier, it only updates lastModified in AsyncUploadCache. Once the asynchronous upload completes, the callback removes its entry from AsyncUploadCache.
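The async-vs-sync decision could be sketched as below (the counter and method names are assumptions for illustration; the real implementation also consults the local cache size):

```java
// Illustrative decision logic for addRecord as described above.
class AddRecordSketch {
    final int asyncUploadLimit;
    int activeAsyncUploads;

    AddRecordSketch(int asyncUploadLimit) {
        this.asyncUploadLimit = asyncUploadLimit;
    }

    // returns "async" or "sync" depending on whether another
    // asynchronous upload may be started
    String chooseUploadMode() {
        if (asyncUploadLimit > 0 && activeAsyncUploads < asyncUploadLimit) {
            activeAsyncUploads++;   // entry added to AsyncUploadCache
            return "async";
        }
        return "sync";              // limit reached, or feature disabled
    }

    void uploadCompleted() {        // callback removes the cache entry
        activeAsyncUploads--;
    }
}
```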
 
h4. DataRecord  DataStore#getRecordIfStored(DataIdentifier)
Returns a DataRecord if an in-progress asynchronous upload exists in AsyncUploadCache or a record exists in S3 for the dataIdentifier; otherwise returns null. If minModified > 0, the timestamp is updated in AsyncUploadCache and S3.
 
h4. MultiDataStoreAware#deleteRecord(DataIdentifier)
For in-progress uploads, this method adds the identifier to the "toBeDeleted" set in AsyncUploadCache. When the asynchronous upload completes and its callback is invoked, the callback checks whether the in-progress upload was marked for delete; if so, it invokes deleteRecord to actually delete the record.
 
h4. DataStore#deleteAllOlderThan(long min)
Deletes records older than min from S3. As AsyncUploadCache maintains a map of in-progress asynchronous uploads to lastModified, it marks for delete those asynchronous uploads whose lastModified < min. When an asynchronous upload completes and its callback is invoked, the callback checks whether the in-progress upload was marked for delete; if so, it invokes deleteRecord to actually delete the record.
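The marking step could be sketched as below (names are illustrative): in-progress uploads older than min are only marked, and actually deleted later by the completion callback:

```java
import java.util.*;

// Sketch of the deleteAllOlderThan marking described above.
class DeleteOlderThanSketch {
    static Set<String> markOlderThan(Map<String, Long> inProgress, long min) {
        Set<String> marked = new HashSet<>();
        for (Map.Entry<String, Long> e : inProgress.entrySet()) {
            if (e.getValue() < min) {
                marked.add(e.getKey()); // deleted later by the callback
            }
        }
        return marked;
    }
}
```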
 
h4. Iterator<DataIdentifier> DataStore#getAllIdentifiers()
Returns all identifiers in S3, plus in-progress upload identifiers from AsyncUploadCache, minus identifiers from the "toBeDeleted" set in AsyncUploadCache.
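As set arithmetic, this is simply (names here are illustrative, not the actual API):

```java
import java.util.*;

// Sketch of getAllIdentifiers as described above: S3 identifiers plus
// in-progress uploads, minus identifiers marked for delete.
class GetAllIdentifiersSketch {
    static Set<String> allIdentifiers(Set<String> s3Ids,
                                      Set<String> inProgressIds,
                                      Set<String> toBeDeleted) {
        Set<String> result = new HashSet<>(s3Ids);
        result.addAll(inProgressIds);
        result.removeAll(toBeDeleted);
        return result;
    }
}
```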

h4. long DataRecord#getLength()
If the file exists in the local cache, the length is retrieved from it. Otherwise the length is retrieved from S3 and cached in the local cache.

h4. DataRecord#getLastModified()
If the record is an in-progress upload, lastModified is retrieved from AsyncUploadCache; otherwise it is retrieved from S3 and cached in the local cache.
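The read-path fallback shared by both methods could be sketched as below: serve metadata locally when present, otherwise fetch from S3 and cache it (all names are illustrative, and the S3 lookup is stubbed with a map):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the local-first metadata lookup described above.
class MetadataFallbackSketch {
    final Map<String, Long> localCache = new HashMap<>();

    long getLength(String id, Map<String, Long> s3) {
        Long cached = localCache.get(id);
        if (cached != null) {
            return cached;              // served from local cache
        }
        long length = s3.get(id);       // fetched from S3 ...
        localCache.put(id, length);     // ... and cached locally
        return length;
    }
}
```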

h3. Behavior of Local cache Purge
The local cache has a size limit; when the current size of the cache exceeds it, the cache auto-purges older entries to reclaim space. During purging, the local cache makes sure that it does not delete any file with an in-progress asynchronous upload.

h3. DataStore initialization behavior w.r.t. AsyncUploadCache 
It is possible that asynchronous uploads are still in progress when the server shuts down. Since every asynchronous upload added to AsyncUploadCache is immediately persisted to a file on the filesystem, S3DataStore checks during initialization for any incomplete asynchronous uploads and uploads them concurrently in multiple threads. It throws a RepositoryException if the file for an asynchronous upload is not found in the local cache; as far as the code is concerned, this can only happen when somebody has removed files from the local cache manually. To proceed despite such inconsistencies, set the parameter contOnAsyncUploadFailure to true in repository.xml; this ignores all missing files and resets AsyncUploadCache.
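The recovery decision at startup could be sketched as below (names are illustrative; the sketch throws IllegalStateException where the real implementation throws RepositoryException):

```java
import java.util.*;

// Sketch of the initialization behaviour described above: pending
// uploads whose backing file is missing either abort initialization or,
// with contOnAsyncUploadFailure, are skipped and the cache reset.
class InitRecoverySketch {
    static List<String> resumeUploads(Set<String> pending,
                                      Set<String> filesInLocalCache,
                                      boolean contOnAsyncUploadFailure) {
        List<String> toResume = new ArrayList<>();
        for (String path : pending) {
            if (filesInLocalCache.contains(path)) {
                toResume.add(path);     // re-uploaded concurrently
            } else if (!contOnAsyncUploadFailure) {
                // real implementation throws RepositoryException here
                throw new IllegalStateException("missing file: " + path);
            }
            // else: ignore the missing file; AsyncUploadCache is reset
        }
        return toResume;
    }
}
```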




> Asynchronous upload file to S3
> ------------------------------
>
>                 Key: JCR-3733
>                 URL: https://issues.apache.org/jira/browse/JCR-3733
>             Project: Jackrabbit Content Repository
>          Issue Type: Sub-task
>          Components: jackrabbit-core
>            Reporter: Shashank Gupta
>             Fix For: 2.7.5
>
>
> S3DataStore Asynchronous Upload to S3
> The current logic to add a file record to S3DataStore is first add the file in local cache and then upload that file to S3 in a single synchronous step. This feature contemplates to break the current logic with synchronous adding to local cache and asynchronous uploading of the file to S3. Till asynchronous upload completes, all data (inputstream, length and lastModified) for that file record is fetched from local cache. 
> AWS SDK provides upload progress listeners which provides various callbacks on the status of in-progress upload.
> As of now customer reported that write performance of EBS based Datastore is 3x  better than S3 DataStore. 
> With this feature, the objective is to have comparable write performance of S3 DataStore.



--
This message was sent by Atlassian JIRA
(v6.2#6252)