Posted to user@jclouds.apache.org by Veit Guna <Ve...@gmx.de> on 2015/09/22 17:10:18 UTC

aws-s3 etag when using multipart

Hi.
 
We're using jclouds 1.9.1 with the aws-s3 provider. Until now, we have used the ETag returned by blobStore.putBlob() to manually verify
the upload against a client-provided hash. That worked quite well for us. But since we are hitting the 5GB single-PUT limit of S3, we switched to the
multipart() upload that jclouds offers. Now putBlob() returns something like <md5-hash>-<number>, e.g. 90644a2d0c7b74483f8d2036f3e29fc5-2, which of course
fails our validation.
 
I guess this is because each chunk is hashed separately and sent to S3, so there is no complete hash over the whole payload that putBlob()
could return - is that correct?
 
During my research I stumbled across this:
 
https://github.com/jclouds/jclouds/commit/f2d897d9774c2c0225c199c7f2f46971637327d6
 
Now I'm wondering what the contract of putBlob() is. Should it only return ETags that are valid MD5 hashes, and otherwise return null?
 
I'm asking because otherwise I would have to parse and validate the returned value myself and skip our
validation whenever it isn't a plain MD5 hash. My guess is that this is the hash of the last transferred chunk plus
the chunk number?
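
To make it concrete, the workaround I have in mind would look roughly like this
(just a sketch, class and method names made up by me):

import java.util.regex.Pattern;

public final class EtagCheck {
    // a plain single-part ETag is a 32-character hex MD5; multipart ETags look
    // like "<hex>-<partCount>", e.g. 90644a2d0c7b74483f8d2036f3e29fc5-2
    private static final Pattern PLAIN_MD5 = Pattern.compile("[0-9a-fA-F]{32}");

    /** Returns true if the ETag was verified, false if verification was skipped. */
    public static boolean verifyIfPossible(String returnedEtag, String expectedMd5Hex) {
        // some providers return the ETag wrapped in quotes; strip them defensively
        String etag = returnedEtag.replace("\"", "");
        if (!PLAIN_MD5.matcher(etag).matches()) {
            return false; // multipart-style ETag, nothing we can compare against
        }
        if (!etag.equalsIgnoreCase(expectedMd5Hex)) {
            throw new IllegalStateException("MD5 mismatch: " + etag + " vs " + expectedMd5Hex);
        }
        return true;
    }
}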
 
Maybe someone can shed some light on this :).
 
Thanks
Veit
 

Re: aws-s3 etag when using multipart

Posted by Yury Kats <yu...@yahoo.com>.
In AWS, the ETag for a multipart object is the hash of the hashes of all the parts, followed by a dash and the number of parts.
See: https://forums.aws.amazon.com/thread.jspa?messageID=456442

In general, S3 documents that the ETag may not be a valid MD5 digest in a number of cases, including multipart uploads.
See ETag definition here: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
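
If you know the part size that was used for the upload, you can recompute that
value locally. Rough, untested sketch: MD5 over the concatenated binary MD5
digests of the parts, then a dash and the part count (assumes every part except
the last has the same size):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public final class MultipartEtag {
    public static String compute(File file, long partSize) throws Exception {
        MessageDigest partDigest = MessageDigest.getInstance("MD5");
        MessageDigest etagDigest = MessageDigest.getInstance("MD5");
        int parts = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = new FileInputStream(file)) {
            long remainingInPart = partSize;
            int read;
            while ((read = in.read(buf, 0, (int) Math.min(buf.length, remainingInPart))) != -1) {
                partDigest.update(buf, 0, read);
                remainingInPart -= read;
                if (remainingInPart == 0) {                 // finished a full part
                    etagDigest.update(partDigest.digest()); // digest() also resets partDigest
                    parts++;
                    remainingInPart = partSize;
                }
            }
            if (remainingInPart != partSize) {              // trailing partial part
                etagDigest.update(partDigest.digest());
                parts++;
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : etagDigest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString() + "-" + parts;
    }
}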





Re: aws-s3 etag when using multipart

Posted by Veit Guna <ve...@gmx.de>.
Hi Andrew.

Thanks for the detailed explanation.

I think an option sounds like the way to go, although I've never checked
how expensive the hash calculation is. Maybe I'll run some benchmarks on that.
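
Probably something as crude as this would already give a ballpark number (quick
sketch using the Guava that jclouds pulls in anyway, no proper warmup or anything):

import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import java.util.Random;

public final class Md5Bench {
    public static void main(String[] args) {
        byte[] chunk = new byte[1024 * 1024];   // 1 MB of pseudo-random data
        new Random(42).nextBytes(chunk);
        int totalMb = 1024;                     // hash 1 GB in total
        Hasher hasher = Hashing.md5().newHasher();
        long start = System.nanoTime();
        for (int i = 0; i < totalMb; i++) {
            hasher.putBytes(chunk);
        }
        hasher.hash();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("MD5 over %d MB took %.2f s (%.0f MB/s)%n",
                totalMb, seconds, totalMb / seconds);
    }
}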

Anyway, if jclouds calculated the complete-payload MD5 itself where
necessary, the contract could be kept even when using multipart, in
addition to checking the hashes returned by the providers.

Maybe I'll find some time to look into this myself.

Cheers
Veit





Re: aws-s3 etag when using multipart

Posted by Andrew Gaul <ga...@apache.org>.
S3 emits different ETags for single- and multi-part uploads.  You can
use both types of ETags for future conditional GET and PUT operations
but only single-part upload returns an MD5 hash.  Multi-part upload
returns an opaque token which is likely a hash of hashes combined with
number of parts.

You can ensure data integrity in-transit via comparing the ETag or via
providing a Content-MD5 for single-part uploads.  Multi-part is more
complicated; each upload part call can have a Content-MD5 and each call
returns the MD5 hash.  jclouds supplies the per-part ETag hashes to the
final complete multi-part upload call but does not provide a way to
check the results of per-part calls or a way to supply a Content-MD5 for
each.

Fixing this requires calculating the MD5 in
BaseBlobStore.putMultipartBlob.  We could either calculate it beforehand
for repeatable Payloads or compare afterwards for InputStream payloads.
There is some subtlety to this for providers like Azure which do not
return an MD5 ETag.  We would likely want to guard this with a property
since not every caller wants to pay the CPU overhead.  Would you like to
take a look at this?
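
For the InputStream case that would look roughly like the following. This is
only a sketch of the idea, not actual jclouds code; the StreamConsumer callback
is hypothetical and just stands in for whatever drives the part uploads:

import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Arrays;

public final class StreamingMd5Check {
    /**
     * Wraps the payload stream in a DigestInputStream so the complete-payload
     * MD5 falls out of the upload itself; compare it afterwards.
     */
    public static boolean uploadAndVerify(InputStream payload, byte[] expectedMd5,
                                          StreamConsumer uploadParts) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (DigestInputStream wrapped = new DigestInputStream(payload, md5)) {
            uploadParts.consume(wrapped);  // the multipart upload reads the wrapped stream
        }
        return Arrays.equals(md5.digest(), expectedMd5);
    }

    public interface StreamConsumer {
        void consume(InputStream in) throws Exception;
    }
}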

If you want a purely application fix, look at calling the BlobStore
methods initiateMultipartUpload, uploadMultipartPart, and
completeMultipartUpload.  jclouds internally uses these to implement
putBlob(new PutOptions.multipart()).
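
A rough outline of that approach, verifying each part against a locally
computed MD5, might look like this. Untested, and the MPU method signatures
are from memory, so check them against the jclouds version you actually use:

import com.google.common.hash.Hashing;
import com.google.common.io.ByteSource;
import com.google.common.io.Files;
import org.jclouds.blobstore.BlobStore;
import org.jclouds.blobstore.domain.Blob;
import org.jclouds.blobstore.domain.MultipartPart;
import org.jclouds.blobstore.domain.MultipartUpload;
import org.jclouds.blobstore.options.PutOptions;
import org.jclouds.io.Payload;
import org.jclouds.io.Payloads;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public final class ManualMultipartUpload {
    public static String upload(BlobStore blobStore, String container, String name,
                                File file, long partSize) throws Exception {
        Blob blob = blobStore.blobBuilder(name).build(); // metadata only, no payload
        MultipartUpload mpu = blobStore.initiateMultipartUpload(
                container, blob.getMetadata(), new PutOptions());

        ByteSource source = Files.asByteSource(file);
        List<MultipartPart> parts = new ArrayList<MultipartPart>();
        int partNumber = 1;
        for (long offset = 0; offset < file.length(); offset += partSize, partNumber++) {
            ByteSource slice = source.slice(offset, Math.min(partSize, file.length() - offset));
            Payload payload = Payloads.newByteSourcePayload(slice);
            payload.getContentMetadata().setContentLength(slice.size());

            MultipartPart part = blobStore.uploadMultipartPart(mpu, partNumber, payload);
            String expected = slice.hash(Hashing.md5()).toString();
            String actual = part.partETag().replace("\"", "");
            if (!expected.equalsIgnoreCase(actual)) {
                // could retry this part instead of failing hard
                throw new IllegalStateException("part " + partNumber + " MD5 mismatch");
            }
            parts.add(part);
        }
        // the returned ETag is still the opaque multipart-style value
        return blobStore.completeMultipartUpload(mpu, parts);
    }
}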


-- 
Andrew Gaul
http://gaul.org/