Posted to commits@jackrabbit.apache.org by Apache Wiki <wi...@apache.org> on 2016/06/09 09:11:37 UTC

[Jackrabbit Wiki] Update of "JCR Binary Usecase" by ThomasMueller

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "JCR Binary Usecase" page has been changed by ThomasMueller:
https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase?action=diff&rev1=9&rev2=10

  
  Without this change we currently need to first spool the file content into some temporary location and then pass that to the other program. This adds unnecessary overhead, which can be avoided when the DataStore already stores the binary content as a file on the file system, as the FileDataStore does.
  
+ I assume the native library only requires read-only access and will not delete or move the file; could we somehow enforce this? Maybe using a symbolic link to the real file in a read-only directory?
+ 
+ How do we ensure Oak GC doesn't delete the binary too early? One solution is that if the native library reads the file (or knows it will need to read the file soon), it updates the last-modified time; this should already work. Another solution might be to append the file name to a text file (a read log), but that would probably be more complicated, and probably wouldn't improve performance much.
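A minimal sketch of both ideas, using only java.nio; the helper names are hypothetical and not part of the Oak API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

public class ReadOnlyBinaryAccess {

    // Expose a DataStore file to a native library via a symbolic link placed
    // in a separate directory, so the real file layout is not revealed and the
    // directory containing the link can be made read-only.
    public static Path exposeReadOnly(Path dataStoreFile, Path linkDir, String name)
            throws IOException {
        Files.createDirectories(linkDir);
        Path link = linkDir.resolve(name);
        Files.deleteIfExists(link);
        return Files.createSymbolicLink(link, dataStoreFile);
    }

    // "Touch" the file so DataStore garbage collection, which keys off the
    // last-modified time, does not delete the binary while it is still in use.
    public static void touch(Path dataStoreFile) throws IOException {
        Files.setLastModifiedTime(dataStoreFile,
                FileTime.fromMillis(System.currentTimeMillis()));
    }
}
```

Note that the symbolic link only hides the layout; true read-only enforcement would still rely on file-system permissions.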
+ 
  <<Anchor(UC2)>>
  === UC2 - Efficient replication across regions in S3 ===
  
@@ -33, +37 @@

  
  Instead, the plan is to replicate the specific assets via the S3 copy operation. This would ensure that big assets can be copied efficiently at the S3 level.
  
- Note that such a case can also be present in other DataStore where a binary content can be retrieved from source DataStore and added to target DataStore in optimal way
+ Note that such a case can also arise with other DataStores, where binary content can be retrieved from the source DataStore and added to the target DataStore in an optimal way (copying the binary from one repository to another repository).
  
  <<Anchor(UC3)>>
  === UC3 - Text Extraction without temporary File with Tika ===
@@ -43, +47 @@

  While performing text extraction, Tika in many cases creates a temporary file, as many parsers need random access to the binary. So when using a BlobStore whose implementation stores the binary as a file, we can use a TikaInputStream backed by that file, which avoids creating such a temporary file and thus speeds up text extraction.
  
  Going forward, if we need to make use of [[https://issues.apache.org/jira/browse/TIKA-416|Out of Process Text Extraction]], then this aspect would be useful there as well.
+ 
+ We could add random access features to the binary, and possibly change Tika so it doesn't require a "java.io.File" but instead accepts a "java.nio.channels.SeekableByteChannel", or just a FileChannel, ByteBuffer, or similar. We might still need to write a wrapper for Tika, but maybe not.
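A sketch of the kind of random access such a binary could expose, assuming a java.nio.channels.SeekableByteChannel is available; the Tika integration itself is not shown, and the helper name is hypothetical:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

public class RandomAccessBinary {

    // Read 'len' bytes at an absolute offset: the kind of random access a
    // parser needs, e.g. to jump to a ZIP central directory near the file end,
    // without first spooling the whole stream to a temporary file.
    public static byte[] readAt(SeekableByteChannel ch, long pos, int len)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        ch.position(pos);
        while (buf.hasRemaining() && ch.read(buf) >= 0) {
            // keep reading until the buffer is full or EOF is reached
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }
}
```

When the binary is file-backed, `java.nio.file.Files.newByteChannel(path)` already provides such a channel, so no temporary copy is needed.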
  
  <<Anchor(UC4)>>
  === UC4 - Spooling the binary content to socket output via NIO ===
@@ -55, +61 @@

  
  The key aspect here is that, where possible, we should be able to avoid IO. Also have a look at the [[https://kafka.apache.org/08/design.html#maximizingefficiency|Kafka design]], which tries to make use of the OS cache as much as possible and avoid IO through the JVM where it can, thus providing much better throughput.
  
+ Similar to Tika, we could extend the binary and add the missing / required features (for example get a ByteBuffer, which might be backed by a memory mapped file).
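A sketch of the NIO approach, assuming the binary is available as a file: FileChannel.transferTo can use the OS sendfile mechanism, so bytes need not pass through JVM heap buffers, which is the Kafka-style optimization mentioned above. The target is a generic WritableByteChannel here; in this use case it would be a SocketChannel (the class name is illustrative):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySpool {

    // Spool a DataStore file to a target channel (e.g. a SocketChannel).
    // transferTo may not move everything in one call, so loop until done.
    public static long spool(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = src.size();
            long sent = 0;
            while (sent < size) {
                sent += src.transferTo(sent, size - sent, target);
            }
            return sent;
        }
    }
}
```

Exposing a FileChannel (or a ByteBuffer over a memory-mapped file) from the binary would allow the servlet layer to do exactly this.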
+ 
  <<Anchor(UC5)>>
  === UC5 - Transferring the file to FileDataStore with minimal overhead ===
  
@@ -63, +71 @@

  DataStore - FileDataStore
  
  In some deployments a customer would typically upload lots of files to an FTP folder, from where the files are transferred to Oak. As mentioned in 2b above, with NAS-based storage this would result in the file being copied twice. So to avoid the extra overhead it would be helpful if one could create a file directly on the NFS following the FileDataStore structure (content hash -> split 3 levels) and then add the binary via the ReferenceBinary approach.
+ 
+ Or provide a way to create a JCR binary from a temp file. Oak might then move (File.renameTo or otherwise) the file or copy the content if needed. That way we don't expose the implementation details (hash algorithm, file name format).
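For illustration, a sketch of the "content hash -> split 3 level" layout, assuming SHA-1 and two-character directory names as in Jackrabbit's FileDataStore; the exact algorithm should be checked against the implementation, and hiding this detail is precisely what the temp-file API above would buy us:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DataStoreLayout {

    // Hex-encoded content hash of a file (FileDataStore has used SHA-1).
    public static String contentHash(Path file)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) >= 0) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Three directory levels derived from the hash: ab/cd/ef/abcdef...
    public static Path recordPath(Path root, String hash) {
        return root.resolve(hash.substring(0, 2))
                   .resolve(hash.substring(2, 4))
                   .resolve(hash.substring(4, 6))
                   .resolve(hash);
    }
}
```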
  
  <<Anchor(UC6)>>
  === UC6 - S3 import ===
@@ -73, +83 @@

  
  The problem though: how to efficiently get them into the S3DS, ideally without moving them
  
+ Or provide a way to create a JCR binary from an S3 binary. Moving from S3 to S3 might not be a problem.
+ 
  <<Anchor(UC7)>>
  === UC7 - Random write access in binaries ===
  
  Think: a video file exposed onto the desktop via WebDAV. Desktop tools would do random writes in that file. How can we cover this use case without up-/downloading the large file? (Essentially: random write access in binaries.)
+ 
+ The Oak BlobStore chunks binaries, so chunks could be shared among binaries. Random writes could then, where possible, copy just the references to the unchanged chunks. That would make random writes relatively efficient, while binaries would still be immutable. We would need to add the required API. Please note this only works when using the BlobStore, not the current FileDataStore and S3DataStore as is (at least a wrapper around the FileDataStore / S3DataStore would be needed).
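A toy sketch of such copy-on-write chunking; the chunk size is tiny for illustration, the chunk id is a stand-in for a real content hash, and all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedBinary {

    static final int CHUNK_SIZE = 4; // real stores use KB-sized chunks

    // Content-addressed chunk store: identical chunks are stored once and
    // shared between binaries by reference.
    final Map<String, byte[]> chunks = new HashMap<>();

    String store(byte[] data) {
        String id = Arrays.toString(data); // stand-in for a content hash
        chunks.put(id, data);
        return id;
    }

    // A "binary" is just an ordered list of chunk references.
    List<String> fromBytes(byte[] data) {
        List<String> refs = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK_SIZE) {
            refs.add(store(Arrays.copyOfRange(data, off,
                    Math.min(off + CHUNK_SIZE, data.length))));
        }
        return refs;
    }

    // "Random write" producing a NEW immutable binary: untouched chunks are
    // shared by reference; only chunks overlapping the write are copied.
    List<String> writeAt(List<String> refs, int pos, byte[] patch) {
        List<String> out = new ArrayList<>(refs);
        for (int i = 0; i < patch.length; ) {
            int chunkIdx = (pos + i) / CHUNK_SIZE;
            byte[] copy = chunks.get(refs.get(chunkIdx)).clone();
            int offInChunk = (pos + i) % CHUNK_SIZE;
            int n = Math.min(copy.length - offInChunk, patch.length - i);
            System.arraycopy(patch, i, copy, offInChunk, n);
            out.set(chunkIdx, store(copy));
            i += n;
        }
        return out;
    }

    byte[] toBytes(List<String> refs) {
        int len = refs.stream().mapToInt(r -> chunks.get(r).length).sum();
        byte[] out = new byte[len];
        int off = 0;
        for (String r : refs) {
            byte[] c = chunks.get(r);
            System.arraycopy(c, 0, out, off, c.length);
            off += c.length;
        }
        return out;
    }
}
```

The old binary stays valid and immutable after a write; garbage collection would later drop chunks no longer referenced by any binary.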
  
  <<Anchor(UC8)>>
  === UC8 - X-SendFile ===
@@ -87, +101 @@

  
  To an extent this feature is similar to UC1; however, here the scope is broader.
  
+ X-SendFile seems to imply that headers are in the file itself, which is typically not the case with JCR. So we would need to know more details on the possible options we have on the Apache side.
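As a sketch of the backend side, assuming a front end such as Apache httpd with mod_xsendfile: the application sets the response headers itself (from JCR properties) and passes only the DataStore file path in the X-Sendfile header, so the body never flows through the JVM. The JDK's built-in com.sun.net.httpserver stands in for the real servlet layer, and the path is illustrative:

```java
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

public class XSendFileHandler {

    // Backend response: headers come from JCR properties, the body is served
    // by the front end, which reads the path from the X-Sendfile header.
    static void handle(HttpExchange ex, String dataStorePath, String mimeType)
            throws IOException {
        ex.getResponseHeaders().set("Content-Type", mimeType);
        ex.getResponseHeaders().set("X-Sendfile", dataStorePath);
        ex.sendResponseHeaders(200, -1); // -1: no response body from the JVM
        ex.close();
    }
}
```

This addresses the header concern above: with JCR, headers live in node properties, not in the file, so the backend would always have to emit them even when the front end streams the bytes.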
+ 
+ === UC9 - S3 datastore in a cluster ===
+ 
+ Currently, each cluster node connects to S3 and has a "local cache" on the file system. Binaries are uploaded asynchronously to S3, that is, they are first written to the local cache. So if async upload is enabled and a binary is added on one cluster node, it is not immediately available on S3 to be read from another cluster node.
+ 
+ Possible solutions are: (1) disable async upload; (2) use a shared local cache (NFS for example), although it is not clear whether that works correctly. Another solution (which would probably need more work) could be to send / request the binary using the broadcasting cache mechanism.
+