Posted to user@jclouds.apache.org by Veit Guna <ve...@gmx.de> on 2015/10/01 23:28:43 UTC

Re: aws-s3 etag when using multipart

Hi Andrew.

Thanks for the detailed explanation.

I think an option sounds like the way to go, although I've never checked
how expensive the hash calculation is. Maybe I'll run some benchmarks
for that.

Anyway, if jclouds calculated the MD5 of the complete payload itself
where necessary, the contract could be kept, even when using multipart.
That would be in addition to checking the hashes returned by the
providers.
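
To make the idea concrete, here is a rough sketch of the client-side
calculation using only JDK classes. It reads the payload once and
produces both the MD5 of the complete payload and a value in the
<md5-hash>-<number> shape. The class and method names
(MultipartEtagSketch, digest) are made up for illustration and are not
jclouds API. The "MD5 over the concatenated binary part MD5s, plus the
part count" rule is an assumption based on commonly observed S3
behaviour; Andrew describes the token as opaque, so it should not be
treated as a guaranteed contract.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration only; not part of jclouds. */
final class MultipartEtagSketch {

    /** Both digests, computed in a single pass over the payload. */
    static final class Result {
        final String wholeMd5Hex;   // MD5 over the complete payload
        final String multipartEtag; // assumed form: md5(part MD5s) + "-" + part count

        Result(String wholeMd5Hex, String multipartEtag) {
            this.wholeMd5Hex = wholeMd5Hex;
            this.multipartEtag = multipartEtag;
        }
    }

    static Result digest(InputStream in, long partSize)
            throws IOException, NoSuchAlgorithmException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        MessageDigest whole = MessageDigest.getInstance("MD5");
        MessageDigest part = MessageDigest.getInstance("MD5");
        MessageDigest ofParts = MessageDigest.getInstance("MD5");

        byte[] buf = new byte[8192];
        long inPart = 0; // bytes consumed in the current part
        int parts = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            whole.update(buf, 0, n);
            int off = 0;
            while (off < n) {
                // Never feed more than what is left of the current part.
                int take = (int) Math.min(n - off, partSize - inPart);
                part.update(buf, off, take);
                inPart += take;
                off += take;
                if (inPart == partSize) {
                    ofParts.update(part.digest()); // digest() also resets 'part'
                    parts++;
                    inPart = 0;
                }
            }
        }
        if (inPart > 0) { // trailing, shorter last part
            ofParts.update(part.digest());
            parts++;
        }
        // Note: an empty payload yields a "-0" suffix here; a real upload
        // would need to handle that case separately.
        return new Result(hex(whole.digest()), hex(ofParts.digest()) + "-" + parts);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }
}
```

An application could compare wholeMd5Hex against the client-provided
hash, and, if the assumption above holds and partSize matches the part
size actually used for the upload, compare multipartEtag against the
value returned by putBlob().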

Maybe I'll find some time to look into this myself.

Cheers
Veit


On 29.09.2015 at 21:01, Andrew Gaul wrote:
> S3 emits different ETags for single- and multi-part uploads.  You can
> use both types of ETags for future conditional GET and PUT operations
> but only single-part upload returns an MD5 hash.  Multi-part upload
> returns an opaque token which is likely a hash of hashes combined with
> number of parts.
>
> You can ensure data integrity in-transit via comparing the ETag or via
> providing a Content-MD5 for single-part uploads.  Multi-part is more
> complicated; each upload part call can have a Content-MD5 and each call
> returns the MD5 hash.  jclouds supplies the per-part ETag hashes to the
> final complete multi-part upload call but does not provide a way to
> check the results of per-part calls or a way to supply a Content-MD5 for
> each.
>
> Fixing this requires calculating the MD5 in
> BaseBlobStore.putMultipartBlob.  We could either calculate it beforehand
> for repeatable Payloads or compare afterwards for InputStream payloads.
> There is some subtlety to this for providers like Azure which do not
> return an MD5 ETag.  We would likely want to guard this with a property
> since not every caller wants to pay the CPU overhead.  Would you like to
> take a look at this?
>
> If you want a purely application fix, look at calling the BlobStore
> methods initiateMultipartUpload, uploadMultipartPart, and
> completeMultipartUpload.  jclouds internally uses these to implement
> putBlob(container, blob, PutOptions.Builder.multipart()).
>
> On Tue, Sep 22, 2015 at 05:10:18PM +0200, Veit Guna wrote:
>> Hi.
>>  
>> We're using jclouds 1.9.1 with the aws-s3 provider. Until now, we have used the returned etag of blobStore.putBlob() to manually verify
>> against a client provided hash. That worked quite well for us. But since we are hitting the 5GB limit of S3, we switched to the multipart() upload
>> that jclouds offers. But now, putBlob() returns something like <md5-hash>-<number>, e.g. 90644a2d0c7b74483f8d2036f3e29fc5-2, which of course
>> fails our validation.
>>  
>> I guess this is because each chunk is hashed separately and sent to S3, so there is no complete hash over the whole payload that could
>> be returned by putBlob() - is that correct?
>>  
>> During my research I stumbled across this:
>>  
>> https://github.com/jclouds/jclouds/commit/f2d897d9774c2c0225c199c7f2f46971637327d6
>>  
>> Now I'm wondering what the contract of putBlob() is. Should it return only valid ETags/hashes and otherwise return null?
>>  
>> I'm asking because otherwise I would have to start parsing and validating the returned value myself and skip any
>> validation when it isn't a normal MD5 hash. My guess is that this is the hash of the last transferred chunk plus
>> the chunk number?
>>  
>> Maybe someone can shed some light on this :).
>>  
>> Thanks
>> Veit
>>