Posted to dev@sling.apache.org by Shashank Gupta <sh...@adobe.com> on 2013/02/21 17:34:19 UTC

[POST] Servlet resolution for non existing resource

Hi, 

With regard to the implementation of the chunked upload feature SLING-2707 [0] and [1], we need to hit the chunk upload servlet (registered on the selector "chunk") for a non-existing resource. When I send the POST request shown in [2], the request resolves to the default SlingPostServlet [3].
Is there a way to resolve the request to the chunk upload servlet instead?

Regards,
Shashank

[0]
https://issues.apache.org/jira/browse/SLING-2707

[1]
https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support#ChunkedFileUploadSupport-UploadchunkusingPOST
[2]
POST /temp/dam/Desert.jpg.chunk.1.res HTTP/1.1
Host: localhost:4502
If-Match: "e0023aa4e"

[3] 
0 (2013-02-21 21:40:18) TIMER_END{0,ResourceResolution} URI=/temp/dam/Desert.jpg.chunk.1.res resolves to Resource=, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res]
      0 (2013-02-21 21:40:18) LOG Resource Path Info: SlingRequestPathInfo: path='/temp/dam/Desert.jpg.chunk.1.res', selectorString='jpg.chunk.1', extension='res', suffix='null'
      0 (2013-02-21 21:40:18) TIMER_START{ServletResolution}
      0 (2013-02-21 21:40:18) TIMER_START{resolveServlet(, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res])}
      0 (2013-02-21 21:40:18) LOG {0}: no servlet found
      0 (2013-02-21 21:40:18) TIMER_END{0,resolveServlet(, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res])} Using servlet org.apache.sling.servlets.post.impl.SlingPostServlet
      0 (2013-02-21 21:40:18) TIMER_END{0,ServletResolution} URI=/temp/dam/Desert.jpg.chunk.1.res handled by Servlet=org.apache.sling.servlets.post.impl.SlingPostServlet
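For reference, servlets can be registered for the non-existing resource type. A hedged sketch of what such a registration might look like (class name, annotations and logic are illustrative only, not the actual SLING-2707 code):

```java
// Illustrative sketch only - not the actual SLING-2707 code.
// Registers a POST servlet for non-existing resources, so it can be
// preferred over the default SlingPostServlet for chunk uploads.
import java.io.IOException;
import javax.servlet.Servlet;
import javax.servlet.ServletException;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;

@Component
@Service(Servlet.class)
@Property(name = "sling.servlet.resourceTypes", value = "sling/nonexisting")
public class ChunkUploadServlet extends SlingAllMethodsServlet {

    @Override
    protected void doPost(SlingHttpServletRequest request,
                          SlingHttpServletResponse response)
            throws ServletException, IOException {
        // check for the "chunk" selector in the request path info
        // and handle the uploaded chunk here
    }
}
```

Note that in the trace above the selector string is "jpg.chunk.1", so matching the "chunk" selector inside the servlet (rather than via a registered selector) may be necessary; treat this as a starting point only.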

Re: [POST] Servlet resolution for non existing resource

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 27.02.2013, at 09:39, Shashank Gupta <sh...@adobe.com> wrote:

> Answers--
>> The problem of chunks is that they are
>> - not self-describing (what size?)
> When you parse the multipart request, you get the chunk size. In a single-part request, Content-Length gives the chunk size.
>> - must all have the same length
> Not mandatory. Why do you think so? In the merge you append chunks serially until you find the last chunk.

If you don't include the start byte index, then you have to know all the previously uploaded chunks and add their lengths together. This requires always having those chunks available for subsequent chunks and is thus a restriction. Or you are forced to keep the chunk size fixed so you can calculate the offset as chunk_size * (chunk_number - 1), but the last chunk is then forced to have a different size (unless the file size is exactly divisible by the desired chunk size), so you'd also need to send along the desired_chunk_size with the last request.
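A tiny sketch of that offset arithmetic (illustrative names, not from any patch):

```java
// Illustrative: with fixed-size chunks, the write offset of chunk n
// (1-based) is derivable - but the last chunk breaks the invariant.
public class ChunkOffsets {

    static long offset(long chunkSize, long chunkNumber) {
        return chunkSize * (chunkNumber - 1);
    }

    public static void main(String[] args) {
        long chunkSize = 1024 * 1024;              // 1 MB chunks
        System.out.println(offset(chunkSize, 1));  // 0
        System.out.println(offset(chunkSize, 3));  // 2097152
        // A 2.5 MB file yields a final chunk of only 0.5 MB, so the
        // fixed chunk size must still be sent with the last request.
    }
}
```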

Now if chunk requests include start and length as a solution to that, it's basically the same as byte ranges, and the chunk index becomes repetitive and useless information. (Note that the total file length needs to be transferred in the first or all requests.)

>> - introduce an arbitrary numbering scheme that you cannot break out of
> Explain it more.

See above.

>> - if problems arise for one chunk, the actual order might become very different, so "last chunk" is not a fixed thing
> I don't see it as an issue. The "last chunk" is not required to be a fixed thing, as the client explicitly marks the last chunk request.
>> - as noted above, not in line with existing HTTP concepts for this (even if they currently only apply to GETs)
> Afaics, AWS S3 uses a chunk-number approach, and we believe AWS is doing well in terms of scalability and concurrency. That is one of the primary reasons we insist on the chunk-number approach.

Please don't take AWS APIs as good examples for REST. They simply are not :-) This is separate from the fact that their backend is great, just the API part is not optimal.

>> Do you mean the file would be deferred or the other parameters? I'd say only the file, because you probably only sent the other ones with the first file snippet (and the client is free to choose when to send them along), and secondly making all params defer is going to be very complex.
> Sling cannot create an nt:file without jcr:data because it would throw a node type constraint violation exception. So node creation and processing of the other parameters has to be done in the last request. The client is required to send the parameters in the last request, but is free to send or not send them in first and intermediate requests. Sling ignores other parameters sent in first and intermediate requests.

Right, but I still think it's simpler to have a placeholder nt:file than to keep the request; or simply create the nt:file already, even if the binary is incomplete (as described for the streaming use case). This makes it really simple.

Delaying it for the last request is tremendously complex. And we don't know if that's really needed (from a use case perspective), so why not start simple.

>> - have a configurable size limit for those partial files (e.g. 50 GB)
>> - once the limit is hit, clean up expired files; if not enough, clean up the oldest files
> Imo, a size check upon each chunk upload request is not required. It adds complexity. We already have a scheduled job which can be configured to do the necessary clean up.

I would opt for the solution to write directly to the repository, then we spare all this complexity... In the end, JCR is the persistence layer, and that covers temporary resources as well. Persistence is none of Sling's business.

>> Then a protocol question arises: What to respond when a client uploads the "final" byte range, but the previously uploaded ones are no longer present on the server?
> On the first chunk upload, Sling sends a "Location" header in the response. Subsequent upload requests use this header as the uploadId.
> Sling sends 404 when no resumable upload is found.

Please no uploadId. They all address the same resource, just like multiple normal update requests to the sling post servlet.

And again, this is solved if we write to the JCR and nt:file directly.

Cheers,
Alex

RE: [POST] Servlet resolution for non existing resource

Posted by Shashank Gupta <sh...@adobe.com>.
Answers--
>The problem of chunks is that they are
>- not self-describing (what size?)
When you parse the multipart request, you get the chunk size. In a single-part request, Content-Length gives the chunk size.
>- must all have the same length
Not mandatory. Why do you think so? In the merge you append chunks serially until you find the last chunk.
> - introduce an arbitrary numbering scheme that you cannot break out of
Explain it more.
>- if problems arise for one chunk, the actual order might become very different, so "last chunk" is not a fixed thing
I don't see it as an issue. The "last chunk" is not required to be a fixed thing, as the client explicitly marks the last chunk request.
>- as noted above, not in line with existing HTTP concepts for this (even if they currently only apply to GETs)
Afaics, AWS S3 uses a chunk-number approach, and we believe AWS is doing well in terms of scalability and concurrency. That is one of the primary reasons we insist on the chunk-number approach.

> Do you mean the file would be deferred or the other parameters? I'd say only the file, because you probably only sent the other ones with the first file snippet (and the client is free to choose when to send them along), and secondly making all params defer is going to be very complex.
Sling cannot create an nt:file without jcr:data because it would throw a node type constraint violation exception. So node creation and processing of the other parameters has to be done in the last request. The client is required to send the parameters in the last request, but is free to send or not send them in first and intermediate requests. Sling ignores other parameters sent in first and intermediate requests.

> - have a configurable size limit for those partial files (e.g. 50 GB)
>- once the limit is hit, clean up expired files; if not enough, clean up the oldest files
Imo, a size check upon each chunk upload request is not required. It adds complexity. We already have a scheduled job which can be configured to do the necessary clean up.

> Then a protocol question arises: What to respond when a client uploads the "final" byte range, but the previously uploaded ones are no longer present on the server?
On the first chunk upload, Sling sends a "Location" header in the response. Subsequent upload requests use this header as the uploadId.
Sling sends 404 when no resumable upload is found.


-----Original Message-----
From: Alexander Klimetschek [mailto:aklimets@adobe.com] 
Sent: 26 February 2013 21:35
To: dev@sling.apache.org
Subject: Re: [POST] Servlet resolution for non existing resource

Beware ;-) XXL mail with multiple proposals, the more interesting ones coming later...

On 25.02.2013, at 14:48, Shashank Gupta <sh...@adobe.com> wrote:
> This would make it simpler to switch to a Range-header based upload if it might be standardized around HTTP in the future.
> 
> [SG] Range based upload is not a standard and declined by Julian.

Yes :-) What I mean is if it's going to be standardized in a new HTTP version or extension, it will very very likely be byte-range based as well.

> Introduction of the new type "sling:partialFile" complicates things. How does it solve the "modify binary" use case? I will not take this approach unless there is consensus for it.

Avoiding nt:file?:

Discussed that with Felix, and he pointed out that we'd need to avoid having something that looks like an nt:file (either is one or extends from it), to avoid jcr events being thrown and generic application code trying to read the file before it is finished. Any specific marker we introduce (properties or a node type such as sling:PartialFile < nt:file) would need to be handled in all the existing code, which is not feasible.

So if we go this route, sling:PartialFile must not extend from nt:file.

Clean up of temp files:

But since we need to work around the data store issue (especially since this feature is targeted at large files), it's probably better to start with storing the partial chunks in the file system. The difficult part here is the clean up, mostly because the expiry time for those files needs to be quite long: imagine a user starting to upload a big file, then going home, upload fails over night/weekend, and during the next day he says "resume upload"... this will give typical expiry time of at least one day (ignoring automatic resumes here).

Felix and I discussed this:
- store partial files, including metadata: jcr path + total length
- once full file range is covered, create nt:file in jcr (and clean up partial files)
- have a configurable size limit for those partial files (e.g. 50 GB)
- have a configurable expiry time (e.g. 2 days)
- once the limit is hit, clean up expired files; if not enough, clean up the oldest files
- run cleanup periodically as well (expiry time or 1/2 expiry time or ...)
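A rough sketch of that cleanup policy (all names, thresholds and the two-pass order are illustrative assumptions, not the actual implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

public class PartialFileCleanup {

    static class Partial {
        final String jcrPath;    // metadata: target jcr path
        final long size;         // bytes on disk
        final long lastModified; // for expiry
        Partial(String jcrPath, long size, long lastModified) {
            this.jcrPath = jcrPath; this.size = size; this.lastModified = lastModified;
        }
    }

    /**
     * Pick the partial files to delete: expired ones first, then the
     * oldest ones, until the total size fits under the limit.
     */
    static List<Partial> toDelete(List<Partial> partials, long now,
                                  long sizeLimit, long expiryMs) {
        List<Partial> remaining = new ArrayList<>(partials);
        remaining.sort(Comparator.comparingLong(p -> p.lastModified)); // oldest first
        long total = remaining.stream().mapToLong(p -> p.size).sum();
        List<Partial> delete = new ArrayList<>();
        // pass 1: always drop expired files (this is also the periodic run)
        for (Iterator<Partial> it = remaining.iterator(); it.hasNext(); ) {
            Partial p = it.next();
            if (now - p.lastModified > expiryMs) {
                delete.add(p); total -= p.size; it.remove();
            }
        }
        // pass 2: still over the limit -> drop the oldest regardless of expiry
        for (Iterator<Partial> it = remaining.iterator();
             it.hasNext() && total > sizeLimit; ) {
            Partial p = it.next();
            delete.add(p); total -= p.size; it.remove();
        }
        return delete;
    }
}
```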

Then a protocol question arises: what to respond when a client uploads the "final" byte range, but the previously uploaded ones are no longer present on the server? Do we need an additional acknowledgement that the file was successfully uploaded (a HEAD request on the file returning 200?)?

> What if the second chunk failed and you want to repeat it, while all the others including "lastChunk" were successful? I think chunks where the server doesn't know the real byte coordinates for every single request won't work. You need to specify exactly what is uploaded - and byte ranges + full length are IMHO more generic than saying chunk X out of N, size of chunk = M.
> 
> [SG] I think we are not discussing parallel chunk upload here, which invalidates your point. The spec and impl are for simple, resumable, serial chunk upload.
> Querying the chunk upload provides the chunk number and the bytes uploaded up to the failure point. The client will resume from the next chunk number and byte offset.

The problem of chunks is that they are
- not self-describing (what size?)
- must all have the same length
- introduce an arbitrary numbering scheme that you cannot break out of
- if problems arise for one chunk, the actual order might become very different, so "last chunk" is not a fixed thing
- as noted above, not in line with existing HTTP concepts for this (even if they currently only apply to GETs)

Hence my -1 on indexed chunks.

> [SG] too complex.  We have to live with current datastore implementation at least for the time being.

Data store & Oak:

I had a chat with Thomas Müller (works on the Jackrabbit & Oak team). He said that
a) the data store in oak will be improved and share binaries on smaller 2MB patches (instead of entire files)
b) for the existing JR2 FileDataStore, we should not care about the additional space overhead, the garbage collector will take care of it (just a matter of enough disk space and gc configuration)

This means we could put the structure into the repository right away (e.g. /tmp) and then combine the chunks into the final file. This would happen via some kind of SequenceInputStream that combines multiple input streams in a fixed sequence into one stream (this actually exists already). Doing so now would mean we would basically duplicate the binary in the data store (all chunks in /tmp + final file), that shouldn't be an issue.
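As a sketch, java.io.SequenceInputStream can already express exactly this fixed-sequence combination (chunk contents here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ChunkMerge {

    /** Combine the ordered chunk streams into a single stream. */
    static InputStream combine(List<InputStream> chunks) {
        return new SequenceInputStream(Collections.enumeration(chunks));
    }

    public static void main(String[] args) throws IOException {
        List<InputStream> chunks = Arrays.asList(
                new ByteArrayInputStream("Hello, ".getBytes()),
                new ByteArrayInputStream("chunked ".getBytes()),
                new ByteArrayInputStream("world".getBytes()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        combine(chunks).transferTo(out); // reads all chunks in order
        System.out.println(out);         // Hello, chunked world
    }
}
```

In the repository case, the ByteArrayInputStreams would be the binary streams of the chunk nodes under /tmp.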

Later, Oak with its updated data store could optimize here: we replace the input stream with a SequenceBinaryInputStream that gives a single input stream for the input streams of multiple binaries. It would hold the list of binaries and it would be part of the jackrabbit/oak API. The Oak implementation could detect that and instead of reading the input stream (and thus copying everything, taking time), resolve the binaries and use their internal representation to aggregate them into the new one.

This way the Sling solution is more or less the same (only importing a different API class later), and the underlying persistence layer improves by itself.

When putting things into /tmp and with Jackrabbit 2 we'd need a similar cleanup mechanism as with the file system, but at least it would count as normal repository content. With Oak, this would even be less of an issue since the partial binaries and the final ones would share their data store snippets - so it's only a matter of cleaning up /tmp in JCR for the sake of structure cleanup, not space savings.

Streaming end-to-end:

Finally, there would actually be a nice use case for updating a nt:file in place (no /tmp): streaming formats such as video or audio. Imagine an encoder streams a real time video feed into JCR using the partial upload feature - and different clients on the other side streaming that video from the JCR, using the existing HTTP GET Range support in Sling (which is used e.g. by most modern video streaming solutions).

In this case the file could really be nt:file like, have a marker to say it's not finished and applications are somewhat forced to know it. And for events: if they get one on every modification, they would simply handle it, fail if the file is unreadable because it is not complete yet, and then try again on the next modification and basically succeed with the final one. Such apps could easily be changed to handle the flag to fail fast - as those apps are also most likely the ones that ask for the resumable upload in the first place.


> This would apply to files in the form request only, so that is already handled specifically in the sling post servlet (i.e. updating a binary property). If the request contains other sling post servlet params, they would be processed normally.
> [SG] Yes they would be *but* it would be deferred until sling receives the last chunk.

Do you mean the file would be deferred or the other parameters? I'd say only the file, because you probably only sent the other ones with the first file snippet (and the client is free to choose when to send them along), and secondly making all params defer is going to be very complex.

> The question (also for 2.+3.) is: does Sling have to know when everything was uploaded?
> - if it's temporarily stored on the file system, then yes (to move it to jcr)
> - if it's stored in a special content structure in jcr, then yes (to convert it to a nt:file)
> - if it's stored in nt:file in jcr, then no
> 
> [SG] You break the modify binary use case if you append in place. The binary would be corrupted unless all chunks have arrived.

I guess you refer to the last point (nt:file): of course you'd update the files properly (get content, update the appropriate byte range) - not just naively append...

> The last two would probably require some investigation in data store enhancements to avoid wasting space.
> [SG] afaik, additive "CUD" is very fundamental to tar persistence and the datastore, so we have to live with it at least till oak.

See above.

Cheers,
Alex

Re: [POST] Servlet resolution for non existing resource

Posted by Alexander Klimetschek <ak...@adobe.com>.
Beware ;-) XXL mail with multiple proposals, the more interesting ones coming later...

On 25.02.2013, at 14:48, Shashank Gupta <sh...@adobe.com> wrote:
> This would make it simpler to switch to a Range-header based upload if it might be standardized around HTTP in the future.
> 
> [SG] Range based upload is not a standard and declined by Julian.

Yes :-) What I mean is if it's going to be standardized in a new HTTP version or extension, it will very very likely be byte-range based as well.

> Introduction of the new type "sling:partialFile" complicates things. How does it solve the "modify binary" use case? I will not take this approach unless there is consensus for it.

Avoiding nt:file?:

Discussed that with Felix, and he pointed out that we'd need to avoid having something that looks like an nt:file (either is one or extends from it), to avoid jcr events being thrown and generic application code trying to read the file before it is finished. Any specific marker we introduce (properties or a node type such as sling:PartialFile < nt:file) would need to be handled in all the existing code, which is not feasible.

So if we go this route, sling:PartialFile must not extend from nt:file.

Clean up of temp files:

But since we need to work around the data store issue (especially since this feature is targeted at large files), it's probably better to start with storing the partial chunks in the file system. The difficult part here is the clean up, mostly because the expiry time for those files needs to be quite long: imagine a user starting to upload a big file, then going home, upload fails over night/weekend, and during the next day he says "resume upload"... this will give typical expiry time of at least one day (ignoring automatic resumes here).

Felix and I discussed this:
- store partial files, including metadata: jcr path + total length
- once full file range is covered, create nt:file in jcr (and clean up partial files)
- have a configurable size limit for those partial files (e.g. 50 GB)
- have a configurable expiry time (e.g. 2 days)
- once the limit is hit, clean up expired files; if not enough, clean up the oldest files
- run cleanup periodically as well (expiry time or 1/2 expiry time or ...)

Then a protocol question arises: what to respond when a client uploads the "final" byte range, but the previously uploaded ones are no longer present on the server? Do we need an additional acknowledgement that the file was successfully uploaded (a HEAD request on the file returning 200?)?

> What if the second chunk failed and you want to repeat it, while all the others including "lastChunk" were successful? I think chunks where the server doesn't know the real byte coordinates for every single request won't work. You need to specify exactly what is uploaded - and byte ranges + full length are IMHO more generic than saying chunk X out of N, size of chunk = M.
> 
> [SG] I think we are not discussing parallel chunk upload here, which invalidates your point. The spec and impl are for simple, resumable, serial chunk upload.
> Querying the chunk upload provides the chunk number and the bytes uploaded up to the failure point. The client will resume from the next chunk number and byte offset.

The problem of chunks is that they are
- not self-describing (what size?)
- must all have the same length
- introduce an arbitrary numbering scheme that you cannot break out of
- if problems arise for one chunk, the actual order might become very different, so "last chunk" is not a fixed thing
- as noted above, not in line with existing HTTP concepts for this (even if they currently only apply to GETs)

Hence my -1 on indexed chunks.

> [SG] too complex.  We have to live with current datastore implementation at least for the time being.

Data store & Oak:

I had a chat with Thomas Müller (works on the Jackrabbit & Oak team). He said that
a) the data store in oak will be improved and share binaries on smaller 2MB patches (instead of entire files)
b) for the existing JR2 FileDataStore, we should not care about the additional space overhead, the garbage collector will take care of it (just a matter of enough disk space and gc configuration)

This means we could put the structure into the repository right away (e.g. /tmp) and then combine the chunks into the final file. This would happen via some kind of SequenceInputStream that combines multiple input streams in a fixed sequence into one stream (this actually exists already). Doing so now would mean we would basically duplicate the binary in the data store (all chunks in /tmp + final file), that shouldn't be an issue.

Later, Oak with its updated data store could optimize here: we replace the input stream with a SequenceBinaryInputStream that gives a single input stream for the input streams of multiple binaries. It would hold the list of binaries and it would be part of the jackrabbit/oak API. The Oak implementation could detect that and instead of reading the input stream (and thus copying everything, taking time), resolve the binaries and use their internal representation to aggregate them into the new one.

This way the Sling solution is more or less the same (only importing a different API class later), and the underlying persistence layer improves by itself.

When putting things into /tmp and with Jackrabbit 2 we'd need a similar cleanup mechanism as with the file system, but at least it would count as normal repository content. With Oak, this would even be less of an issue since the partial binaries and the final ones would share their data store snippets - so it's only a matter of cleaning up /tmp in JCR for the sake of structure cleanup, not space savings.

Streaming end-to-end:

Finally, there would actually be a nice use case for updating a nt:file in place (no /tmp): streaming formats such as video or audio. Imagine an encoder streams a real time video feed into JCR using the partial upload feature - and different clients on the other side streaming that video from the JCR, using the existing HTTP GET Range support in Sling (which is used e.g. by most modern video streaming solutions).

In this case the file could really be nt:file like, have a marker to say it's not finished and applications are somewhat forced to know it. And for events: if they get one on every modification, they would simply handle it, fail if the file is unreadable because it is not complete yet, and then try again on the next modification and basically succeed with the final one. Such apps could easily be changed to handle the flag to fail fast - as those apps are also most likely the ones that ask for the resumable upload in the first place.


> This would apply to files in the form request only, so that is already handled specifically in the sling post servlet (i.e. updating a binary property). If the request contains other sling post servlet params, they would be processed normally.
> [SG] Yes they would be *but* it would be deferred until sling receives the last chunk.

Do you mean the file would be deferred or the other parameters? I'd say only the file, because you probably only sent the other ones with the first file snippet (and the client is free to choose when to send them along), and secondly making all params defer is going to be very complex.

> The question (also for 2.+3.) is: does Sling have to know when everything was uploaded?
> - if it's temporarily stored on the file system, then yes (to move it to jcr)
> - if it's stored in a special content structure in jcr, then yes (to convert it to a nt:file)
> - if it's stored in nt:file in jcr, then no
> 
> [SG] You break the modify binary use case if you append in place. The binary would be corrupted unless all chunks have arrived.

I guess you refer to the last point (nt:file): of course you'd update the files properly (get content, update the appropriate byte range) - not just naively append...

> The last two would probably require some investigation in data store enhancements to avoid wasting space.
> [SG] afaik, additive "CUD" is very fundamental to tar persistence and the datastore, so we have to live with it at least till oak.

See above.

Cheers,
Alex

RE: [POST] Servlet resolution for non existing resource

Posted by Shashank Gupta <sh...@adobe.com>.
Replies inline, marked with [SG].

-----Original Message-----
From: Alexander Klimetschek [mailto:aklimets@adobe.com] 
Sent: 25 February 2013 18:36
To: dev@sling.apache.org
Subject: Re: [POST] Servlet resolution for non existing resource

On 25.02.2013, at 10:38, Shashank Gupta <sh...@adobe.com> wrote:

> Here are the salient points wrt resumable upload implementation/integration in SlingPostServlet[1].  
> 
> 1.	Resumable upload is supported in the "modify" operation (i.e. the default operation). No new operation is introduced for it.

Yes.

> 2.	The request parameter ":chunkNumber" distinguishes between a partial and a 'single shot' upload. Better than the "content range" parameter approach, as it avoids ambiguity in overlapping ranges like 100-199, 100-299 etc.

What's wrong with overlapping ranges? And client errors are to be expected, for example you will always have the possible case that a client never uploads chunk X or range A-B. And either some garbage collection kicks in (if stored in the file system, but adds complexity) or it's just left in the repository (i.e. with a "sling:partialFile" node type or simply a normal nt:file that is constantly updated).

The newest upload wins, i.e. existing ranges would always be overwritten.

This would make it simpler to switch to a Range-header based upload if it might be standardized around HTTP in the future.

[SG] Range based upload is not a standard and was declined by Julian. Introduction of the new type "sling:partialFile" complicates things. How does it solve the "modify binary" use case? I will not take this approach unless there is consensus for it.

> 3.	The request parameter ":lastChunk=true" distinguishes between intermediate and last upload chunks.

What if the second chunk failed and you want to repeat it, while all the others including "lastChunk" were successful? I think chunks where the server doesn't know the real byte coordinates for every single request won't work. You need to specify exactly what is uploaded - and byte ranges + full length are IMHO more generic than saying chunk X out of N, size of chunk = M.

[SG] I think we are not discussing parallel chunk upload here, which invalidates your point. The spec and impl are for simple, resumable, serial chunk upload.
Querying the chunk upload provides the chunk number and the bytes uploaded up to the failure point. The client will resume from the next chunk number and byte offset.

> 4.	Chunk Storage:
> 	*	In-place append: will not work in the modify use case. In create, if the chunk size is x and there are n chunks, O(n²·x) space is consumed in the datastore [x + 2x + 3x + ... + nx = O(n²·x)].
> 	*	Chunks saved in JCR: if the total file size is s, 2s space is consumed in the datastore (chunks + final file). Chunks are stored at a temp location in /var/chunks/<uploadid>/<chunkNumber>

Right, good point. But that's really an issue of the data store, but not of the sling/jcr API. Ideally for this case the datastore would allow update of a binary (initially @FileLength filled with zeros and first range), while adapting the hash along the way (but moving the actual file). It would require real-time reference tracking in the data store though...

[SG] too complex.  We have to live with current datastore implementation at least for the time being.  
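The quadratic blow-up of in-place append in an immutable data store (point 4 above) can be sketched numerically. This is a toy model, not the actual datastore code:

```java
public class AppendCost {

    // Append-in-place rewrites the whole binary on every chunk upload,
    // so for n chunks of size x the data store receives
    // x + 2x + ... + nx = n(n+1)/2 * x bytes in total.
    static long bytesWritten(long chunkSize, int chunks) {
        long total = 0, fileSize = 0;
        for (int i = 0; i < chunks; i++) {
            fileSize += chunkSize; // file grows by one chunk
            total += fileSize;     // immutable store writes the full file again
        }
        return total;
    }

    public static void main(String[] args) {
        // 10 chunks of 1 unit: 1 + 2 + ... + 10 = 55 units written,
        // versus only 2 * 10 = 20 units when chunks are stored
        // separately and merged once at the end.
        System.out.println(bytesWritten(1, 10)); // 55
    }
}
```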

> 5.	Chunk upload response:
> 	*	First/intermediate chunks: 200 OK. The response body will *not* contain the list of changes. A Location header contains the temp location "/var/chunks/<uploadid>". The client can use it to retrieve upload information and hole information.
> 	*	Last chunk: 201/200 in case of creation/modification. The response body contains the list of changes in json or html format.

Response body would be the normal sling post servlet response (html or json IIRC) and would always be 200 if successful.
[SG] ok.

> 6.	Chunk upload processing
> 	*	First/intermediate chunks: Only the chunk is saved in jcr. Ignores all upload semantics (@TypeHint, etc.) and request parameters.

This would apply to files in the form request only, so that is already handled specifically in the sling post servlet (i.e. updating a binary property). If the request contains other sling post servlet params, they would be processed normally.
[SG] Yes they would be *but* it would be deferred until sling receives the last chunk.

> 	*	Last chunk: Stitches all chunks together. Processes all upload semantics and request parameters and creates the jcr node structure.

The question (also for 2.+3.) is: does Sling have to know when everything was uploaded?
- if it's temporarily stored on the file system, then yes (to move it to jcr)
- if it's stored in a special content structure in jcr, then yes (to convert it to a nt:file)
- if it's stored in nt:file in jcr, then no

[SG] You break the modify binary use case if you append in place. The binary would be corrupted unless all chunks have arrived.

The last two would probably require some investigation in data store enhancements to avoid wasting space.
[SG] afaik, additive "CUD" is very fundamental to tar persistence and the datastore, so we have to live with it at least till oak.

Cheers,
Alex

Re: [POST] Servlet resolution for non existing resource

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 25.02.2013, at 10:38, Shashank Gupta <sh...@adobe.com> wrote:

> Here are the salient points wrt resumable upload implementation/integration in SlingPostServlet[1].  
> 
> 1.	Resumable upload is supported in the "modify" operation (i.e. the default operation). No new operation is introduced for it.

Yes.

> 2.	The request parameter ":chunkNumber" distinguishes between a partial and a 'single shot' upload. Better than the "content range" parameter approach, as it avoids ambiguity in overlapping ranges like 100-199, 100-299 etc.

What's wrong with overlapping ranges? And client errors are to be expected, for example you will always have the possible case that a client never uploads chunk X or range A-B. And either some garbage collection kicks in (if stored in the file system, but adds complexity) or it's just left in the repository (i.e. with a "sling:partialFile" node type or simply a normal nt:file that is constantly updated).

The newest upload wins, i.e. existing ranges would always be overwritten.

This would make it simpler to switch to a Range-header based upload if it might be standardized around HTTP in the future.

> 3.	The request parameter ":lastChunk=true" distinguishes between intermediate and last upload chunks.

What if the second chunk failed and you want to repeat it, while all the others including "lastChunk" were successful? I think chunks where the server doesn't know the real byte coordinates for every single request won't work. You need to specify exactly what is uploaded - and byte ranges + full length are IMHO more generic than saying chunk X out of N, size of chunk = M.

> 4.	Chunk Storage:
> 	*	In-place append: will not work in the modify use case. In the create case, space consumption in the data store is quadratic in the number of chunks: for n chunks of size x, x + 2x + 3x + ... + nx = O(n²·x).
> 	*	Chunks saved in JCR: roughly twice the file size is consumed in the data store (once for the chunks, once for the stitched file). Chunks are stored at a temporary location /var/chunks/<uploadid>/<chunkNumber>.

Right, good point. But that's really an issue of the data store, but not of the sling/jcr API. Ideally for this case the datastore would allow update of a binary (initially @FileLength filled with zeros and first range), while adapting the hash along the way (but moving the actual file). It would require real-time reference tracking in the data store though...
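To make the space argument in point 4 concrete, a quick back-of-the-envelope in Python. The n(n+1)/2 growth is the whole point; the storage model is the one sketched in this thread (an additive data store that rewrites the binary on each append), not measured behavior of any particular data store.

```python
def inplace_append_bytes(n_chunks, chunk_size):
    """With an additive data store, appending chunk k rewrites the whole
    binary so far, i.e. writes k * chunk_size bytes: x + 2x + ... + nx."""
    return sum(k * chunk_size for k in range(1, n_chunks + 1))

def temp_chunk_bytes(n_chunks, chunk_size):
    """Chunks stored once under /var/chunks, then once more when
    stitched into the final binary: 2 * total file size."""
    return 2 * n_chunks * chunk_size

# 100 chunks of 1 MiB: in-place append writes about 5 GiB into the
# data store for a 100 MiB file, versus 200 MiB with temp chunk storage.
x = 1024 * 1024
append_cost = inplace_append_bytes(100, x)
temp_cost = temp_chunk_bytes(100, x)
```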

> 5.	Chunk upload response:
> 	*	First/intermediate chunks: 200 OK. The response body will *not* contain the list of changes. The Location header contains the temporary location "/var/chunks/<uploadid>"; the client can use it to retrieve upload and hole information.
> 	*	Last chunk: 201 or 200 in the case of creation or modification, respectively. The response body contains the list of changes in JSON or HTML format.

Response body would be the normal sling post servlet response (html or json IIRC) and would always be 200 if successful.

> 6.	Chunk upload processing
> 	*	First/intermediate chunks: only the chunk itself is saved in JCR. All upload semantics (@TypeHint, etc.) and other request parameters are ignored.

This would apply to files in the form request only, so that is already handled specifically in the sling post servlet (i.e. updating a binary property). If the request contains other sling post servlet params, they would be processed normally.

> 	*	Last chunk: stitches all chunks, processes all upload semantics and request parameters, and creates the JCR node structure.

The question (also for 2.+3.) is: does Sling have to know when everything was uploaded?
- if it's temporarily stored on the file system, then yes (to move it to jcr)
- if it's stored in a special content structure in jcr, then yes (to convert it to a nt:file)
- if it's stored in nt:file in jcr, then no

The last two would probably require some investigation in data store enhancements to avoid wasting space.

Cheers,
Alex

RE: [POST] Servlet resolution for non existing resource

Posted by Shashank Gupta <sh...@adobe.com>.
Hi, 
Here are the salient points wrt resumable upload implementation/integration in SlingPostServlet[1].  

1.	Resumable upload is supported in the "modify" operation (i.e. the default operation). No new operation is introduced for it.
2.	The request parameter ":chunkNumber" distinguishes between a partial and a 'single shot' upload. This is better than a "Content-Range" parameter approach, as it avoids the ambiguity of overlapping ranges like 100-199 and 100-299.
3.	The request parameter ":lastChunk=true" distinguishes between an intermediate and the last upload chunk.
4.	Chunk storage:
	*	In-place append: will not work in the modify use case. In the create case, space consumption in the data store is quadratic in the number of chunks: for n chunks of size x, x + 2x + 3x + ... + nx = O(n²·x).
	*	Chunks saved in JCR: roughly twice the file size is consumed in the data store (once for the chunks, once for the stitched file). Chunks are stored at a temporary location /var/chunks/<uploadid>/<chunkNumber>.
5.	Chunk upload response:
	*	First/intermediate chunks: 200 OK. The response body will *not* contain the list of changes. The Location header contains the temporary location "/var/chunks/<uploadid>"; the client can use it to retrieve upload and hole information.
	*	Last chunk: 201 or 200 in the case of creation or modification, respectively. The response body contains the list of changes in JSON or HTML format.
6.	Chunk upload processing:
	*	First/intermediate chunks: only the chunk itself is saved in JCR. All upload semantics (@TypeHint, etc.) and other request parameters are ignored.
	*	Last chunk: stitches all chunks, processes all upload semantics and request parameters, and creates the JCR node structure.
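Under this proposal, an intermediate chunk upload might look roughly like the following. This is an illustrative sketch only: the parameter names come from points 2-3 above, while the target path, boundary, and field layout are hypothetical.

```http
POST /temp/dam/Desert.jpg HTTP/1.1
Host: localhost:4502
Content-Type: multipart/form-data; boundary=----chunk

------chunk
Content-Disposition: form-data; name=":chunkNumber"

2
------chunk
Content-Disposition: form-data; name="*"; filename="Desert.jpg"
Content-Type: image/jpeg

<bytes of chunk 2>
------chunk--
```

The final request would additionally carry a ":lastChunk=true" field, triggering the stitching and normal SlingPostServlet processing described in point 6.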

Regards,
Shashank
[1] http://sling.apache.org/site/manipulating-content-the-slingpostservlet-servletspost.html#ManipulatingContent-TheSlingPostServlet%2528servlets.post%2529-ContentCreationorModification






-----Original Message-----
From: Alexander Klimetschek [mailto:aklimets@adobe.com] 
Sent: 22 February 2013 04:07
To: dev@sling.apache.org
Subject: Re: [POST] Servlet resolution for non existing resource

On 21.02.2013, at 17:34, Shashank Gupta <sh...@adobe.com> wrote:

> With regard to the implementation of the chunked upload feature SLING-2707 [0][1], it requires hitting the chunk-upload servlet (registered on the selector "chunk") for a non-existing resource. But when I POST a request as in [2], the request resolves to the default SlingPostServlet [3].

> Is there a way to resolve servlet to chunkuploadservlet?

That's actually another good reason to not use the "chunk" selector here, but integrate the resumable upload with the sling post servlet right away as proposed in the issue:

https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13580754&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13580754

Cheers,
Alex

Re: [POST] Servlet resolution for non existing resource

Posted by Alexander Klimetschek <ak...@adobe.com>.
On 21.02.2013, at 17:34, Shashank Gupta <sh...@adobe.com> wrote:

> With regard to the implementation of the chunked upload feature SLING-2707 [0][1], it requires hitting the chunk-upload servlet (registered on the selector "chunk") for a non-existing resource. But when I POST a request as in [2], the request resolves to the default SlingPostServlet [3].
> Is there a way to resolve servlet to chunkuploadservlet?

That's actually another good reason to not use the "chunk" selector here, but integrate the resumable upload with the sling post servlet right away as proposed in the issue:

https://issues.apache.org/jira/browse/SLING-2707?focusedCommentId=13580754&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13580754

Cheers,
Alex

Re: [POST] Servlet resolution for non existing resource

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,
Have you tried registering the chunkupload servlet to handle a
sling:resourceType of sling:nonexisting?
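For reference, this suggestion would amount to a servlet registration along these lines, using the standard Sling servlet resolver properties (whether selector matching then behaves as hoped for non-existing resources is exactly what this thread is probing; the log above shows the selector string parsed as "jpg.chunk.1"):

```
sling.servlet.resourceTypes = sling:nonexisting
sling.servlet.methods = POST
sling.servlet.selectors = chunk
```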
Ian

On 22 February 2013 03:34, Shashank Gupta <sh...@adobe.com> wrote:
> Hi,
>
> With regard to the implementation of the chunked upload feature SLING-2707 [0][1], it requires hitting the chunk-upload servlet (registered on the selector "chunk") for a non-existing resource. But when I POST a request as in [2], the request resolves to the default SlingPostServlet [3].
> Is there a way to resolve servlet to chunkuploadservlet?
>
> Regards,
> Shashank
>
> [0]
> https://issues.apache.org/jira/browse/SLING-2707
>
> [1]
> https://cwiki.apache.org/confluence/display/SLING/Chunked+File+Upload+Support#ChunkedFileUploadSupport-UploadchunkusingPOST
> [2]
> POST /temp/dam/Desert.jpg.chunk.1.res HTTP/1.1
> Host: localhost:4502
> If-Match: "e0023aa4e"
>
> [3]
> 0 (2013-02-21 21:40:18) TIMER_END{0,ResourceResolution} URI=/temp/dam/Desert.jpg.chunk.1.res resolves to Resource=, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res]
>       0 (2013-02-21 21:40:18) LOG Resource Path Info: SlingRequestPathInfo: path='/temp/dam/Desert.jpg.chunk.1.res', selectorString='jpg.chunk.1', extension='res', suffix='null'
>       0 (2013-02-21 21:40:18) TIMER_START{ServletResolution}
>       0 (2013-02-21 21:40:18) TIMER_START{resolveServlet(, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res])}
>       0 (2013-02-21 21:40:18) LOG {0}: no servlet found
>       0 (2013-02-21 21:40:18) TIMER_END{0,resolveServlet(, type=sling:nonexisting, path=/temp/dam/Desert.jpg.chunk.1.res, resource=[NonExistingResource, path=/temp/dam/Desert.jpg.chunk.1.res])} Using servlet org.apache.sling.servlets.post.impl.SlingPostServlet
>       0 (2013-02-21 21:40:18) TIMER_END{0,ServletResolution} URI=/temp/dam/Desert.jpg.chunk.1.res handled by Servlet=org.apache.sling.servlets.post.impl.SlingPostServlet