Posted to commits@jackrabbit.apache.org by Apache Wiki <wi...@apache.org> on 2016/06/01 04:37:50 UTC

[Jackrabbit Wiki] Update of "JCR Binary Usecase" by ChetanMehrotra

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "JCR Binary Usecase" page has been changed by ChetanMehrotra:
https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase?action=diff&rev1=4&rev2=5

    a. The files are stored in a directory structure like /xx/yy/zz/<contenthash>, where xx, yy, zz are the first few letters of the hex-encoded content hash
    a. Upon writing, the stream is first written to a temporary file and then renamed. In the case of an NFS-based DataStore this essentially means the file is written twice! This design problem was solved with FileBlobStore in Oak, but that is not used in production, so it is something we need to live with
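The layout and write path described above can be sketched as follows. This is a minimal illustration, not the actual FileDataStore internals: the digest algorithm (SHA-256 here), temp-file prefix, and class name are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;

public class DataStoreLayout {

    // Hex-encode a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Build the relative record path /xx/yy/zz/<contenthash> from the
    // first three pairs of hex characters of the content hash.
    static String recordPath(String contentHash) {
        return contentHash.substring(0, 2) + "/"
                + contentHash.substring(2, 4) + "/"
                + contentHash.substring(4, 6) + "/"
                + contentHash;
    }

    // Write via temp file + rename, mirroring the write path described
    // above; on NFS the rename effectively costs a second write.
    static Path store(Path root, byte[] content) throws Exception {
        String hash = hex(MessageDigest.getInstance("SHA-256").digest(content));
        Path target = root.resolve(recordPath(hash));
        Files.createDirectories(target.getParent());
        Path tmp = Files.createTempFile(root, "upload-", ".tmp"); // first write
        Files.write(tmp, content);
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING); // rename
        return target;
    }

    public static void main(String[] args) throws Exception {
        Path root = Files.createTempDirectory("ds");
        Path stored = store(root, "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(root.relativize(stored));
    }
}
```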
  
- Currently the JCR Binary interface only allows InputStream based access to the binary content. In certain cases where the deployment is using certain type of BlobStore like FileDataStore or S3DataStore its desirable that an optimal path can be leveraged if possible
  
  <<Anchor(UC1)>>
- === UC1 - Image Rendition generation ===
+ === UC1 - Processing a binary in JCR with a native library that only has access to the file system ===
  
  ''Need access to the absolute path of the file which backs a JCR Binary when using FileDataStore, for processing by a native program''
  
- DataStore - FileDataStore
- 
  There are deployments where lots of images get uploaded to the repository and some conversions (rendition generation) are performed by OS-specific native executables. Such programs work directly on a file handle.
  
- Without this change currently we need to first spool the file content into some temporary location and then pass that to the other program. This add unnecessary overhead and something which can be avoided in case there is a FileDataStore being used where we can provide a direct access to the file
+ Without this change we currently need to first spool the file content into a temporary location and then pass that to the other program. This adds unnecessary overhead, which could be avoided when the DataStore already keeps the binary content as a file on the file system, as FileDataStore does
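The spooling workaround described above can be sketched as follows. The `convert` tool name, class name, and file suffix are placeholders for whatever native executable the deployment actually uses:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NativeRendition {

    // Spool a JCR Binary stream into a temporary file so a native tool
    // can read it. With direct file access (UC1) this extra copy would be
    // unnecessary when the DataStore already stores the binary as a file.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("rendition-", ".bin");
        try (InputStream is = in) {
            Files.copy(is, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    // Invoke the (hypothetical) native converter on the spooled file.
    static int runConverter(String tool, Path input, Path output)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(tool, input.toString(), output.toString())
                .start();
        return p.waitFor();
    }
}
```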
  
  <<Anchor(UC2)>>
  === UC2 - Efficient replication across regions in S3 ===
  
- ''For binary less replication in non shared DataStore across multiple regions need access to S3Object ID backing the blob such that it can be efficient copied to a bucket in different region via S3 Copy Command''
+ ''For binary-less replication between multiple Sling instances in a non-shared DataStore setup across multiple regions, we need access to the S3 object ID backing the blob so that it can be efficiently copied to a bucket in a different region via the S3 copy command''
  
  DataStore - S3DataStore
  
- This for setup which is running on Oak with S3DataStore. There we have global deployment where author instance is running in 1 region and binary content is to be distributed to publish instances running in different regions. The DataStore size is huge say 100TB and for efficient operation we need to use Binary less replication. In most cases only a very small subset of binary content would need to be present in other regions. Current way (via shared DataStore) to support that would involve synchronizing the S3 bucket across all such regions which would increase the storage cost considerably. 
+ This is for a setup running on Oak with S3DataStore. Here we have a global deployment where a Sling-based app (running on Oak with S3DataStore) runs in one region and binary content must be distributed to publish instances running in other regions. The DataStore is huge, say 100TB, and for efficient operation we need to use binary-less replication. In most cases only a very small subset of the binary content needs to be present in the other regions. The current way to support that (via a shared DataStore) would involve synchronizing the S3 bucket across all such regions, which would increase the storage cost considerably.
  
  Instead, the plan is to replicate the specific assets via the S3 copy operation. This ensures that big assets can be copied efficiently at the S3 level
+ 
+ Note that such a case can also arise with other DataStore implementations, where binary content can be retrieved from a source DataStore and added to a target DataStore in an optimal way
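The replication step can be sketched roughly as below. Both the key-mapping scheme and the AWS SDK call shown in the comment are assumptions for illustration, not the actual S3DataStore internals:

```java
public class S3CopyReplication {

    // Map a blob content identifier to an S3 object key. Assumed scheme
    // (first four characters, dash, remainder) for illustration only.
    static String s3ObjectKey(String blobId) {
        return blobId.substring(0, 4) + "-" + blobId.substring(4);
    }

    // Sketch of the per-asset replication step. A real implementation
    // would use the AWS SDK's server-side copy, e.g.:
    //   s3Client.copyObject(new CopyObjectRequest(srcBucket, key, dstBucket, key));
    // so the binary never streams through the application and the rest of
    // the (say 100TB) store is never touched.
    static String describeCopy(String srcBucket, String dstBucket, String blobId) {
        String key = s3ObjectKey(blobId);
        return "COPY s3://" + srcBucket + "/" + key
                + " -> s3://" + dstBucket + "/" + key;
    }
}
```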
  
  <<Anchor(UC3)>>
  === UC3 - Text Extraction without temporary File with Tika ===
@@ -76, +75 @@

  The problem though: how to efficiently get them into the S3DS, ideally without moving them
  
  <<Anchor(UC7)>>
- === UC7 - Editing large files ===
+ === UC7 - Random write access in binaries ===
  
  Think of a video file exposed to the desktop via WebDAV. Desktop tools would do random writes in that file. How can we cover this use case without up-/downloading the whole large file? (Essentially: random write access in binaries.)
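What the desired capability would look like if the binary were addressable as a plain file, sketched here with `RandomAccessFile`. This is hypothetical: the current JCR Binary API offers no such in-place operation, so the whole stream must be rewritten instead.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class RandomWriteBinary {

    // Patch a byte range in place, the kind of operation a WebDAV client
    // doing partial writes to a large video file would need, without
    // re-uploading the rest of the file.
    static void patch(Path file, long offset, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(offset);   // jump to the edited region
            raf.write(data);    // overwrite only those bytes
        }
    }
}
```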
  
@@ -85, +84 @@

  
  [[https://tn123.org/mod_xsendfile/|X-SendFile]] is an Apache module which enables spooling file content from the OS using Apache internals, including optimizations like caching headers and sendfile or mmap if configured. So if the file is on a filesystem which Apache can access, it can be spooled much more efficiently, avoiding added load on the JVM. 
  
- For this feature to work the web layer like Sling needs to know the path to the binary. Note that path is not disclosed to the client.
+ For this feature to work, the web layer (e.g. Sling) needs to know the path to the binary. Note that the path is not disclosed to the client. 
  
+ To an extent this feature is similar to UC1, but here the scope is broader 
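A minimal mod_xsendfile configuration sketch (the paths are assumptions). The web layer would then set the `X-Sendfile` response header to the binary's absolute path instead of streaming the content itself, and Apache serves the file:

```apache
# Apache vhost fragment (hypothetical DataStore location)
XSendFile on
# whitelist the DataStore root so Apache may serve files from it
XSendFilePath /var/oak/repository/datastore
```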
+