Posted to dev@libcloud.apache.org by mahendra <gi...@git.apache.org> on 2012/12/20 11:02:22 UTC

[dev] libcloud pull request: LIBCLOUD-269 : Multipart upload for amazon S3

GitHub user mahendra opened a pull request:

    https://github.com/apache/libcloud/pull/80

    LIBCLOUD-269 : Multipart upload for amazon S3

    This patch adds support for streaming data uploads using Amazon's multipart upload mechanism, as described in the [S3 docs](http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html).
    
    As per the current behaviour, the ```upload_object_via_stream()``` API collects the entire object in memory and then uploads it to S3. This becomes problematic with large files (think HD videos around 4GB) and takes a huge toll on the performance and memory usage of the Python application.
    
    With this patch, the ```upload_object_via_stream()``` API uses the S3 multipart upload feature to upload data in 5MB chunks, keeping the application's overall memory footprint low.
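    For illustration, here is a minimal usage sketch. The bucket name, file name and the ```file_chunks()``` helper are hypothetical; the driver calls are the existing libcloud API:
    
    ```
    from libcloud.storage.types import Provider
    from libcloud.storage.providers import get_driver
    
    def file_chunks(path, chunk_size=1024 * 1024):
        # Hypothetical helper: yield the file in 1MB pieces so the
        # whole object never sits in memory at once.
        with open(path, 'rb') as fp:
            while True:
                data = fp.read(chunk_size)
                if not data:
                    break
                yield data
    
    driver = get_driver(Provider.S3)('api key', 'api secret')
    container = driver.get_container('my-videos')  # hypothetical bucket
    
    # With this patch, the stream goes to S3 in 5MB multipart chunks
    # instead of being buffered fully in memory first.
    driver.upload_object_via_stream(iterator=file_chunks('movie.mkv'),
                                    container=container,
                                    object_name='movie.mkv')
    ```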
    
    ## Design of this feature:
    * The ```S3StorageDriver()``` is not used just for Amazon S3. It is subclassed for use with other S3-compliant cloud storage providers like Google Storage.
    * Amazon S3 multipart upload is not (or may not be) supported by the other storage providers (which may prefer the chunked upload mechanism)
    
    We can solve this problem in two ways:
    * Create a new subclass of ```S3StorageDriver``` (say ```AmazonS3StorageDriver```) which implements the new multipart upload mechanism; the other storage providers keep subclassing ```S3StorageDriver```. This is the cleaner approach.
    * Introduce an attribute ```supports_s3_multipart_upload``` and, based on its value, control the callback function passed to the ```_put_object()``` API. This makes the code look a bit hacky, but it is better for supporting such features in the future, since we do not have to keep creating subclasses for each feature.
    
    In the current patch, I have implemented the latter approach, though I prefer the former. After discussing with the community and learning their preferences, we can settle on a final approach. A condensed sketch of the implemented approach follows.
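    The gist of the second approach is that ```supports_s3_multipart_upload``` picks the callback handed to ```_put_object()```. The exact ```_put_object()``` signature and the fallback callback name ```_stream_data``` below are assumptions for illustration, not the actual driver code:
    
    ```
    from libcloud.storage.base import StorageDriver
    
    class S3StorageDriver(StorageDriver):
        # Providers without multipart support override this to False in
        # their subclass; no per-feature subclassing is needed.
        supports_s3_multipart_upload = True
    
        def upload_object_via_stream(self, iterator, container, object_name,
                                     extra=None):
            if self.supports_s3_multipart_upload:
                # Stream the data to S3 as 5MB multipart chunks.
                upload_func = self._upload_multipart
            else:
                # Fall back to the plain streamed upload. The name
                # _stream_data is a hypothetical stand-in.
                upload_func = self._stream_data
            return self._put_object(container=container,
                                    object_name=object_name,
                                    upload_func=upload_func,
                                    iterator=iterator, extra=extra)
    ```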
    
    ## Design notes:
    * The implementation has three steps, plus an abort path on failure (a condensed sketch of the exchange follows these notes):
      1. ```POST``` request to ```/container/object_name?uploads```. This returns an XML response with a unique ```uploadId```. This is handled as part of ```_put_object()```; doing it there ensures that all S3-related parameters are set correctly.
      2. Upload each chunk via ```PUT``` to ```/container/object_name?partNumber=X&uploadId=***```. This is done via the callback passed to ```_put_object()```, named ```_upload_multipart()```.
      3. ```POST``` an XML document with the part numbers and the ```ETag``` headers returned for each part to ```/container/object_name?uploadId=***```, implemented via ```_commit_multipart()```.
      4. In case of any failure in steps (2) or (3), the upload is deleted from S3 through a ```DELETE``` request to ```/container/object_name?uploadId=****```, implemented via ```_abort_multipart()```.
    
    * The chunk size for uploads was set to 5MB, the minimum part size allowed by the Amazon S3 docs.
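    For reference, here is a condensed sketch of the REST exchange behind the four steps above. The shape of the ```conn.request()``` calls and the XML helpers ```parse_upload_id()``` / ```make_complete_xml()``` are illustrative, not the actual driver code:
    
    ```
    def multipart_upload(conn, chunks):
        # conn is an open connection exposing request(); chunks yields
        # 5MB pieces of the object. Sketch only, error handling elided.
    
        # 1. Initiate: POST /container/object_name?uploads returns an
        #    XML document carrying a unique uploadId.
        resp = conn.request('/container/object_name?uploads', method='POST')
        upload_id = parse_upload_id(resp.body)  # hypothetical XML helper
    
        # 2. Upload each chunk as a numbered part; S3 answers every
        #    part with an ETag header.
        etags = []
        for part_number, chunk in enumerate(chunks, start=1):
            resp = conn.request('/container/object_name?partNumber=%d&uploadId=%s'
                                % (part_number, upload_id),
                                method='PUT', data=chunk)
            etags.append((part_number, resp.headers['etag']))
    
        # 3. Commit: POST an XML body listing (part number, ETag) pairs.
        conn.request('/container/object_name?uploadId=%s' % upload_id,
                     method='POST', data=make_complete_xml(etags))
    
        # 4. On failure in (2) or (3), abort so S3 discards the parts:
        #    conn.request('/container/object_name?uploadId=%s' % upload_id,
        #                 method='DELETE')
    ```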
    
    ## Other changes:
    * Did some PEP8 cleanup on s3.py
    
    * ```s3.get_container()``` used to iterate through the list of containers to find the requested entry. This can be simplified by making a ```HEAD``` request (sketched below). The only downside is that ```created_time``` is not available for the container. Let me know if this approach is OK or if I should revert it.
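    A sketch of what the ```HEAD```-based lookup looks like, with error handling trimmed (the imports reflect the usual libcloud module layout):
    
    ```
    import httplib
    
    from libcloud.storage.base import Container
    from libcloud.storage.types import ContainerDoesNotExistError
    
    def get_container(self, container_name):
        # A HEAD on /container answers 200 if the bucket exists and 404
        # otherwise, but carries no body, so created_time stays unset.
        response = self.connection.request('/%s' % container_name,
                                           method='HEAD')
        if response.status == httplib.NOT_FOUND:
            raise ContainerDoesNotExistError(value=None, driver=self,
                                             container_name=container_name)
        return Container(name=container_name, extra=None, driver=self)
    ```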
    
    * Introduced the following APIs on ```S3StorageDriver()``` to make some functionality easier (a short usage sketch follows the list):
     * ```get_container_cdn_url()```
     * ```get_object_cdn_url()```
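    Usage sketch (the container and object values are assumed to come from earlier driver calls; the returned URLs are provider-dependent):
    
    ```
    # Both calls take existing libcloud objects; for S3 they resolve to
    # the bucket/object URL rather than a separate CDN endpoint.
    container = driver.get_container('my-videos')      # hypothetical bucket
    obj = driver.get_object('my-videos', 'movie.mkv')  # hypothetical object
    
    container_url = driver.get_container_cdn_url(container)
    object_url = driver.get_object_cdn_url(obj)
    ```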
    
    * In ```libcloud.common.base.Connection```, the ```request()``` method is the basis for all HTTP requests made by libcloud. This method had a limitation that became apparent in the S3 multipart upload implementation. To initialize an upload, the API invoked is ```/container/object_name?uploads```, and the ```uploads``` parameter has to be passed as-is, without any value. If we used the ```params``` argument of the ```request()``` method, it would have been serialized as ```uploads=***```. To prevent this, the ```action``` is set to ```/container/object_name?uploads``` and slight modifications were made to how parameters are appended (sketched below). This also forced a change in ```BaseMockHttpObject._get_method_name()```.
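    A minimal sketch of the parameter-appending change (not the actual ```Connection``` code; URL-encoding is elided):
    
    ```
    def append_params(action, params):
        # A bare token already embedded in the action (e.g. '?uploads')
        # stays untouched instead of being serialized as 'uploads=...'.
        # Real key=value parameters join with '&' when the action already
        # carries a query string, and with '?' otherwise.
        if not params:
            return action
        query = '&'.join('%s=%s' % (k, v) for k, v in params.items())
        separator = '&' if '?' in action else '?'
        return action + separator + query
    
    # append_params('/container/object?uploads', {})
    #     -> '/container/object?uploads'
    # append_params('/container/object?uploads', {'foo': 'bar'})
    #     -> '/container/object?uploads&foo=bar'
    ```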
    
    ## Bug fixes in test framework
    * While working on the test cases, I noticed a small issue. I am not sure if it was a bug or by design: ```MockRawResponse._get_response_if_not_available()``` would return two different values on subsequent invocations.
    ```
         if not self._response:
             ...
         return self  # <-- this was inconsistent
         return self._response
    ```
      While adding test cases for the Amazon S3 functionality, I noticed that instead of getting back a ```MockResponse```, I was getting a ```MockRawResponse``` instance (which does not have methods like ```read()``` or ```parse_body()```). So I fixed this issue; other test cases then started failing because of it and were fixed as well. I am not sure whether this needed fixing or was done on purpose. If someone can shed some light on it, I can work on it further. As of now, all test cases pass. The fix, in sketch form, is below.
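      A sketch of the consistent behaviour (```_build_response()``` is a hypothetical stand-in for the actual construction logic):
    
    ```
    def _get_response_if_not_available(self):
        if not self._response:
            # ... build the mock response (details elided) ...
            self._response = self._build_response()  # hypothetical
        # Always hand back the MockResponse, never this MockRawResponse,
        # so callers reliably get read() and parse_body().
        return self._response
    ```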
    
    * In test_s3.py, the driver was hard-coded everywhere as ```S3StorageDriver```. The same test case is reused for ```GoogleStorageDriver```, where the driver then turned up as ```S3StorageDriver``` instead of ```GoogleStorageDriver```. This was fixed by changing the code to ```driver=self.driver_type``` (sketch below).
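    A sketch of the pattern (the module paths are assumed to follow the libcloud tree; the test body is illustrative):
    
    ```
    import unittest
    
    from libcloud.storage.drivers.s3 import S3StorageDriver
    from libcloud.storage.drivers.google_storage import GoogleStorageDriver
    
    class S3Tests(unittest.TestCase):
        # The class under test; subclasses override it so the same
        # assertions run against every S3-compatible driver.
        driver_type = S3StorageDriver
    
        def test_driver_class(self):
            # Referencing self.driver_type instead of hard-coding
            # S3StorageDriver keeps the check valid in subclasses.
            self.assertTrue(issubclass(self.driver_type, S3StorageDriver))
    
    class GoogleStorageTests(S3Tests):
        driver_type = GoogleStorageDriver
    ```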


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mahendra/libcloud hash

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/libcloud/pull/80.patch

----
commit 94c45ce21938ee8fe974ffbaf3a4da37bc445514
Author: Mahendra M <ma...@gmail.com>
Date:   2012-11-18T04:06:59Z

    LIBCLOUD-261 : Iterator based API for container listing

commit 6f2f6a190d361b9c6886e37b7e1164f033c97e50
Author: Mahendra M <ma...@gmail.com>
Date:   2012-11-21T11:24:43Z

    LIBCLOUD-265 : In local storage, remove empty parent folders

commit aaa139aa39f7f684e058e124b77e207ff3a55486
Author: Mahendra M <ma...@gmail.com>
Date:   2012-11-21T11:31:33Z

    Merge branch 'trunk' into hash
    
    Conflicts:
    	libcloud/storage/drivers/cloudfiles.py
    	libcloud/storage/drivers/nimbus.py

commit 8ee0b23a9880a03a1cfeb3b8cbda757bd0a9d71f
Author: Mahendra M <ma...@gmail.com>
Date:   2012-12-19T13:36:12Z

    Merge branch 'trunk' into hash

commit 6621ecd33853420a28ba80c1ed9a0ec59d3a25fe
Author: Mahendra M <ma...@gmail.com>
Date:   2012-12-20T06:23:08Z

    Merge branch 'trunk' into hash

commit 0514c5c175168ff0f66fd3e4b7ff6096ed0fe8b9
Author: Mahendra M <ma...@gmail.com>
Date:   2012-12-20T09:47:34Z

    LIBCLOUD-269 : Multipart upload for amazon S3

----