Posted to dev@sling.apache.org by Clint Goudie-Nice <go...@adobe.com> on 2016/07/29 17:16:33 UTC

[Binary Uploads] Do binary uploads go to disk before their next hop into the repository?

Hello all,

Do binary uploads (assets, etc) get written to a temp location before being put into the repository, or are they streamed end-to-end for these 4 transfer types:


1)      Mime Multipart uploads / form uploads

2)      Transfer-Encoding: chunked uploads

3)      Plain binary uploads with a specified length header

4)      Plain binary uploads with no specified length header

There are pros and cons to each approach. Obviously, if you stream end to end and the client is uploading a large stream of data, you have to maintain a session over a long period, possibly hours.

If it is being streamed to a temporary location first, and then to the repository, you require an additional write and an additional read of IO, but potentially less session time.

I would like to better understand the requirements on the system imposed by these different upload types.

Clint

Re: [Binary Uploads] Do binary uploads go to disk before their next hop into the repository?

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

BTW: IIRC there is a 32-bit problem (2 GB files) with HTTP via some proxies
that can be avoided by using chunked transfer encoding, as each chunk
doesn't need to be large; hence a PUT with chunked encoding will stream.

More inline.




On 29 July 2016 at 18:16, Clint Goudie-Nice <go...@adobe.com> wrote:

> Hello all,
>
> Do binary uploads (assets, etc) get written to a temp location before
> being put into the repository, or are they streamed end-to-end for these 4
> transfer types:
>
>
> 1)      Mime Multipart uploads / form uploads
>

Multipart uploads > 256000 bytes are written to disk, using the Commons
FileUpload ServletFileUpload [1] in [2], which produces FileItems that are
then read. I think the InputStream from the FileItem is connected to the
OutputStream of a jcr:data property and the data pumped between the two in
blocks.

I can't find any evidence of Sling using the FileUpload streaming API for
multipart POSTs [3].
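
For reference, the streaming API [3] looks roughly like this (a minimal
sketch, not what Sling does today; error handling omitted):

    import java.io.InputStream;
    import javax.servlet.http.HttpServletRequest;
    import org.apache.commons.fileupload.FileItemIterator;
    import org.apache.commons.fileupload.FileItemStream;
    import org.apache.commons.fileupload.servlet.ServletFileUpload;

    // Read the multipart items directly off the request; unlike the
    // FileItem based API above, nothing is buffered to disk.
    void handleStreaming(HttpServletRequest request) throws Exception {
        ServletFileUpload upload = new ServletFileUpload(); // no DiskFileItemFactory
        FileItemIterator iter = upload.getItemIterator(request);
        while (iter.hasNext()) {
            FileItemStream item = iter.next();
            try (InputStream in = item.openStream()) {
                // pump 'in' straight into the jcr:data binary here
            }
        }
    }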


>
> 2)      Transfer-Encoding: chunked uploads
>


This is a lower-level transfer encoding handled by Jetty; chunked encoding
does not surface in the Servlet API (IIRC). It allows both response output
and request bodies to stream without knowing the content length, so Jetty
uses it, producing one chunk on every flush. I would expect a modern
browser to use chunked encoding for uploads.
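
On the client side you can force chunked encoding with the stock
HttpURLConnection; a minimal sketch (chunk size and buffer size are
illustrative):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // PUT with Transfer-Encoding: chunked. No Content-Length is sent, so
    // the body size is unbounded and the 2GB proxy issue above is avoided.
    void putChunked(URL url, InputStream data) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(0); // <= 0 means use the default chunk size
        try (OutputStream out = conn.getOutputStream()) {
            byte[] buf = new byte[8192];
            for (int n; (n = data.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        conn.getResponseCode(); // wait for and check the server's response
    }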


> 3)      Plain binary uploads with a specified length header
>

PUT operations are handled by the Jackrabbit WebDAV bundle. I am not
familiar with the code, but I do remember sending large volumes of data
through it in 2007 and not seeing heap or local file IO; [4] backs that up,
I think.
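
For reference, the JCR API can consume the request body with no intermediate
file at all; a minimal sketch (illustrative, not the actual WebDAV servlet
code):

    import javax.jcr.Binary;
    import javax.jcr.Node;
    import javax.jcr.Session;
    import javax.servlet.http.HttpServletRequest;

    // Stream the request body straight into a jcr:data binary. With a
    // FileDataStore the bytes flow to the blob store as they are read.
    void storeBinary(Session session, Node parent, String name,
            HttpServletRequest request) throws Exception {
        Node file = parent.addNode(name, "nt:file");
        Node content = file.addNode("jcr:content", "nt:resource");
        Binary bin = session.getValueFactory().createBinary(request.getInputStream());
        content.setProperty("jcr:data", bin);
        content.setProperty("jcr:mimeType", request.getContentType());
        content.setProperty("jcr:lastModified", java.util.Calendar.getInstance());
        session.save(); // a small commit: a few nodes plus a DataStore reference
        bin.dispose();
    }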


>
> 4)      Plain binary uploads with no specified length header
>

If the content length is missing and it's not chunked encoding, Jetty will
read until the socket is closed. There is no difference from a server point
of view in how the request is handled.




>
> There are pros and cons to each approach. Obviously, if you stream end to
> end and the client is uploading a large stream of data, you have to
> maintain a session over a long period, possibly hours.
>

I assume you mean JCR session, not HTTP session.
The request will be authenticated before streaming starts, so the session
will be validated at the start of the request and closed when the session is
logged out, i.e. at the end of the request (IIRC).


>
> If it is being streamed to a temporary location first, and then to the
> repository, you require an additional write and an additional read of IO,
> but potentially less session time.
>

The session time is the same regardless, but the temporary-location approach
requires more IO, so the operation takes longer, and there is no
interleaving between the request and the stream to the underlying DS. If the
networks are the same speed, the upload takes 2x the time. Since the session
is created before the upload starts, and before Commons FileUpload processes
the request, the session is open for the entire request.

There is no load on the underlying repository from a file upload, other
than the metadata, which is minimal. I mean in the sense that there won't be
1000s of Oak Documents being created during the upload, only a pointer to
the DataStore and a handful of nodes. Since that's a small commit it won't
generate a branch.
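
To make that concrete, a typical upload ends up as roughly this
(illustrative names and values):

    /content/myasset.bin          [nt:file]
      jcr:content                 [nt:resource]
        jcr:data         = <binary>   -> reference into the DataStore
        jcr:mimeType     = application/octet-stream
        jcr:lastModified = ...

i.e. two nodes and a few properties, however large the binary is.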

Obviously if you are using a MongoDB DS it will generate lots of blobs,
which will impact replication and other things.
An S3 DS will not start sending the data until a second copy of the data is
made into the S3 async upload cache (assuming that's enabled); otherwise I
think it will stream directly to the S3 API.
FS DS is, well, FS.


>
> I would like to better understand the requirements on the system imposed
> by these different upload types.
>
> Clint
>

HTH
Best Regards
Ian

[1] https://commons.apache.org/proper/commons-fileupload/using.html
[2] org.apache.sling.engine.impl.parameters.ParameterSupport#parseMultiPartPost
[3] https://commons.apache.org/proper/commons-fileupload/streaming.html
[4] https://github.com/apache/jackrabbit/blob/b252e505e34b03638207e959aaafce7c480ebaaa/jackrabbit-webdav/src/main/java/org/apache/jackrabbit/webdav/server/AbstractWebdavServlet.java#L629