Posted to commits@jackrabbit.apache.org by Apache Wiki <wi...@apache.org> on 2016/09/15 08:53:32 UTC

[Jackrabbit Wiki] Update of "JCR Binary Usecase" by IanBoston

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "JCR Binary Usecase" page has been changed by IanBoston:
https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase?action=diff&rev1=15&rev2=16

Comment:
Added comments to some of the use cases.

  
  For this feature to work, the web layer, such as Sling, needs to know the path to the binary. Note that the path is not disclosed to the client. 
  
- To an extent this feature is similar to UC1 however here the scope is more broader 
+ To an extent this feature is similar to UC1; however, the scope here is broader.
+ 
+ NB Although mod_xsendfile in Apache needs a path on the file system local to the Apache instance to get the binary, that path is just a pointer. It does not need to be the same path that Oak uses internally, provided it can be presented as a path on the Apache file system. Other variants of X-Sendfile (https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/) allow that pointer to be resolved to an HTTP location for streaming. 
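The pointer translation described above can be sketched as a small mapping: the Oak-internal datastore path is rewritten to the path under which the same files are mounted on the Apache host. This is a minimal illustration, not Oak API; the class, the prefix-swap scheme, and the mount points are assumptions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class XSendfileMapper {
    private final Path dataStoreRoot;     // where Oak's FileDataStore lives (assumed)
    private final Path apacheVisibleRoot; // same files as mounted on the Apache host (assumed)

    public XSendfileMapper(Path dataStoreRoot, Path apacheVisibleRoot) {
        this.dataStoreRoot = dataStoreRoot;
        this.apacheVisibleRoot = apacheVisibleRoot;
    }

    /** Translate the internal file location into the pointer sent in X-Sendfile. */
    public String headerValue(Path internalFile) {
        Path relative = dataStoreRoot.relativize(internalFile);
        return apacheVisibleRoot.resolve(relative).toString();
    }

    public static void main(String[] args) {
        XSendfileMapper m = new XSendfileMapper(
                Paths.get("/var/oak/datastore"), Paths.get("/mnt/datastore"));
        // The header line a servlet would emit instead of streaming the body itself
        System.out.println("X-Sendfile: "
                + m.headerValue(Paths.get("/var/oak/datastore/00/ab/00ab12")));
    }
}
```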
  
  === UC9 - S3 datastore in a cluster ===
  
@@ -120, +122 @@

  
  How to ensure Oak GC doesn't delete the binary too early. One solution is that if the native library reads the file (or knows it will need to read the file soon), it updates the last modified time. This should already work. Another solution might be to add the file name to a text file (read log), but it would probably be more complicated, and probably wouldn't improve performance much.
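The first solution Thomas describes, updating the last modified time so GC treats the binary as recently used, can be sketched with standard NIO file attributes. The helper name is illustrative, not Oak API; a native library would call the equivalent of this before (or instead of) reading.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class GcTouch {
    /** Touch the blob file's mtime to "now" so blob GC won't collect it early. */
    static void markInUse(Path blobFile) throws IOException {
        Files.setLastModifiedTime(blobFile, FileTime.fromMillis(System.currentTimeMillis()));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("blob", ".bin");
        Files.setLastModifiedTime(tmp, FileTime.fromMillis(0)); // simulate an old blob
        markInUse(tmp);
        System.out.println(Files.getLastModifiedTime(tmp).toMillis() > 0);
        Files.delete(tmp);
    }
}
```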
  
+ '''Ian'''
+ This use case needs to cover file systems and other storage mechanisms such as S3. Controlling access to the underlying storage is outside Oak's scope and depends on the deployment teams.
+ 
  === UC3 - Text Extraction without temporary File with Tika ===
  
  '''Thomas'''
@@ -130, +135 @@

  
  '''Thomas'''
  Similar to Tika, we could extend the binary and add the missing / required features (for example get a `ByteBuffer`, which might be backed by a memory mapped file).
+ 
+ '''Ian'''
+ Transfers should drill down to the underlying stream to see if it supports NIO, and use it if present. For example, the file system DS has NIO, as does S3, and the Jetty stream also supports NIO, so it should be possible for a servlet to get hold of both streams and connect the channels. This requires that the streams are available directly, which in turn requires the rest of the implementation to be efficient enough not to need local caching and copies of the files. There are a number of issues in Sling that need addressing first, some of which are being worked on: streaming uploads, streamed downloads to IE11, etc. I don't think adding NIO capabilities to streams that don't natively support NIO is the right solution; it will only hide a more fundamental issue. The biggest issue (imvho) is that JCR Binary doesn't provide an OutputStream and an InputStream directly connected to the raw underlying storage, blocking the client from performing the zero-cost transfers available in most other stacks. 
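Connecting the channels, as described above, can be sketched with `FileChannel.transferTo`, which lets the kernel move the bytes instead of looping over a `byte[]` in user space. The method and test harness below are illustrative; a real servlet would pass the response's output stream rather than a `ByteArrayOutputStream`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelTransfer {
    /** Stream a file to an OutputStream via NIO channels; returns bytes moved. */
    static long stream(Path source, OutputStream response) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             WritableByteChannel out = Channels.newChannel(response)) {
            long transferred = 0, size = in.size();
            while (transferred < size) {
                // transferTo may move fewer bytes than requested, so loop
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("binary", ".bin");
        Files.write(tmp, "hello".getBytes());
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(stream(tmp, sink));
        Files.delete(tmp);
    }
}
```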
  
  === UC5 - Transferring the file to `FileDataStore` with minimal overhead ===
  
@@ -148, +156 @@

  
  The Oak `BlobStore` chunks binaries, so that chunks could be shared among binaries. Random writes could then copy just the references to the chunks if possible. That would make random writes relatively efficient, but binaries would still be immutable. We would need to add the required API. Please note this only works when using the `BlobStore`, not the current `FileDataStore` and `S3DataStore` as is (at least a wrapper around the `FileDataStore` / `S3DataStore` would be needed). This includes efficiently cutting away some bytes in the middle of a binary, or inserting some bytes. Typical file systems don't support this case efficiently; however, with the `BlobStore` it is possible.
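The chunk-reference idea above can be sketched as copy-on-write over a list of chunk ids: a "random write" copies the reference list and swaps only the affected chunk, so every other chunk stays shared between versions. The class and names are illustrative assumptions, not the `BlobStore` API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkedBinary {
    final List<String> chunkIds; // references into a shared chunk store (assumed)

    ChunkedBinary(List<String> chunkIds) { this.chunkIds = chunkIds; }

    /** Copy-on-write: the new binary shares every chunk except the replaced one. */
    ChunkedBinary withChunk(int index, String newChunkId) {
        List<String> copy = new ArrayList<>(chunkIds);
        copy.set(index, newChunkId);
        return new ChunkedBinary(copy);
    }

    public static void main(String[] args) {
        ChunkedBinary v1 = new ChunkedBinary(Arrays.asList("c0", "c1", "c2"));
        ChunkedBinary v2 = v1.withChunk(1, "c9"); // v1 is untouched; both share c0, c2
        System.out.println(v1.chunkIds + " " + v2.chunkIds);
    }
}
```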
  
+ === UC8 - X-SendFile ===
+ 
+ '''Ian'''
+ The aim of X-Sendfile is to offload the streaming of large binaries from an expensive server capable of performing complex authorization to a less expensive farm of servers capable of streaming data to a large number of clients. The X-Sendfile header provides a location where the upstream proxy can find the response. That location has to be resolvable by the upstream server; it may contain authZ for the response, and it should not divulge the structure of the store or of neighbouring resources. What can be achieved depends on the implementation of the upstream server's X-Sendfile capability. The Apache mod_xsendfile module only supports mapping the location to the filesystem, so the DS would have to be mounted. Other X-Sendfile implementations, like nginx's X-Accel-Redirect, which created the concept, support mapping the location through to any URI, including HTTP locations. This would allow the X-Accel-Redirect location to be mapped through to an HTTP location capable of serving more than C10K requests, all streaming. In AWS, CloudFront supports signed URLs, so if an S3 store needed to be exposed, the X-Accel-Redirect location could be an S3 bucket location fronted by a distribution configured to only allow access to signed requests conforming to an ACL policy, that policy including token expiry. Other variants of this are possible, including requiring signed URLs and hosting the content behind an elastic farm of Node.js/Jetty or any C10K-capable server, each one validating the signature and token on every request from the nginx front end. To achieve this, Oak or Sling would need to expose the pointer to the binary and document a signing structure giving access to that binary. If the identifier of the Binary is already exposed via JCR properties, this may already be possible, with knowledge of the DS, without any changes to Oak.
+ 
Documentation on AWS CloudFront signed URLs is here: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/private-content-signed-urls.html
Documentation on nginx's original concept is here: https://www.nginx.com/resources/wiki/start/topics/examples/xsendfile/ 
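The "signing structure" mentioned above could, for illustration, be an HMAC over the blob location plus an expiry, which each streaming front end verifies before serving. This is a hedged sketch: the parameter names, payload layout, and query-string format are assumptions, not a documented Oak/Sling format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SignedUrl {
    /** Produce location?expires=...&sig=... signed with a shared secret. */
    static String sign(String location, long expiresEpochSec, byte[] key) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        String payload = location + "\n" + expiresEpochSec;
        String sig = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        return location + "?expires=" + expiresEpochSec + "&sig=" + sig;
    }

    /** Front-end check: token not expired and signature matches. */
    static boolean verify(String location, long expiresEpochSec, String sig,
                          byte[] key, long nowEpochSec) throws Exception {
        if (nowEpochSec > expiresEpochSec) return false; // token expiry
        // Recompute and compare; production code should use a constant-time comparison.
        return sign(location, expiresEpochSec, key).endsWith("&sig=" + sig);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "shared-secret".getBytes(StandardCharsets.UTF_8);
        String url = sign("/datastore/00ab12", 1700000000L, key);
        String sig = url.substring(url.indexOf("&sig=") + 5);
        System.out.println(verify("/datastore/00ab12", 1700000000L, sig, key, 1699999999L));
    }
}
```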
+ 
+ 
  === UC9 - S3 datastore in a cluster ===
  
  '''Thomas'''
  Possible solutions are: (1) disable async upload; (2) use a shared local cache (NFS for example), though it is not clear that works correctly. Other solutions (which would probably need more work) could be to send / request the binary using the broadcasting cache mechanism.
  
+ '''Ian'''
+ This has been partially addressed with streaming uploads in Sling Engine 2.4.6 and Sling Servlets Post 2.3.14. When the async cache is disabled, session.save() connects the request InputStream for the upload directly to the S3 OutputStream, performing the transfer with no local disk IO, using a byte[] buffer. As noted under UC4, this should be done over NIO wherever possible. Downloads of the binary also need to be streamed in a similar way. Local disk IO is reported to be an expensive commodity by those deploying Sling/AEM at scale.
+
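The direct connection described above, pumping the request InputStream into the datastore OutputStream through a byte[] with no intermediate file, can be sketched as follows. The method name is illustrative; in Sling the streams would be the servlet request body and the S3 upload stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DirectUpload {
    /** Copy the request body straight to the datastore stream; returns bytes moved. */
    static long pump(InputStream request, OutputStream dataStore) throws IOException {
        byte[] buf = new byte[8192]; // the byte[] the transfer goes through
        long total = 0;
        for (int n; (n = request.read(buf)) != -1; ) {
            dataStore.write(buf, 0, n);
            total += n;
        }
        dataStore.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Simulated request body and datastore sink; no local disk is touched.
        ByteArrayInputStream in = new ByteArrayInputStream(new byte[1234]);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(pump(in, out));
    }
}
```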