Posted to dev@apex.apache.org by Chaitanya Chebolu <ch...@datatorrent.com> on 2016/10/20 14:41:55 UTC

Re: S3 Output Module

Hi All,

I am proposing the new design below for the S3 Output Module, using the
multipart upload feature:

Input to this Module: FileMetadata, FileBlockMetadata, ReaderRecord

Steps for uploading files using the S3 multipart feature:

=============================

   1. Initiate the upload. S3 will return an upload id.

      Mandatory: bucket name, file path

      Note: The upload id is the unique identifier for the multipart upload
      of a file.

   2. Upload each block using the received upload id. S3 will return an ETag
      in the response of each upload.

      Mandatory: block number, upload id

   3. Send the merge request by providing the upload id and the list of ETags.

      Mandatory: upload id, file path, block ETags.

An example of uploading a file using the multipart feature is available here:
<http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html>
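
Below is a minimal sketch of these three steps using the AWS SDK for Java
(v1) AmazonS3Client. The bucket, key, file path, and part size are
placeholders; error handling and retries are omitted:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.*;

public class MultipartUploadSketch
{
  public static void main(String[] args)
  {
    AmazonS3 s3 = new AmazonS3Client(); // credentials from the default chain
    String bucket = "my-bucket";        // placeholder
    String key = "path/to/file";        // placeholder
    File file = new File("/tmp/file");  // placeholder
    long partSize = 5L * 1024 * 1024;   // 5 MB minimum part size (except the last part)

    // Step 1: initiate the upload; S3 returns the upload id.
    InitiateMultipartUploadResult init =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));
    String uploadId = init.getUploadId();

    // Step 2: upload each block with the upload id; collect each part's ETag.
    List<PartETag> eTags = new ArrayList<>();
    long offset = 0;
    for (int partNumber = 1; offset < file.length(); partNumber++) {
      long size = Math.min(partSize, file.length() - offset);
      UploadPartResult part = s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key).withUploadId(uploadId)
          .withPartNumber(partNumber).withFile(file)
          .withFileOffset(offset).withPartSize(size));
      eTags.add(part.getPartETag());
      offset += size;
    }

    // Step 3: send the merge (complete) request with the upload id and ETags.
    s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, uploadId, eTags));
  }
}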


I am proposing the following two approaches for the S3 output module.


(Solution 1)

S3 Output Module consists of the below two operators:

1) BlockWriter: Writes the blocks to HDFS. Once a block is successfully
written to HDFS, this operator emits its BlockMetadata.

2) S3MultiPartUpload: This consists of two parts:

     a) If a file has more than one block, upload the blocks using the
multipart feature. Otherwise, upload the single block using putObject().

     b) Once all the blocks are successfully uploaded, send the merge
complete request.
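
A minimal sketch of the decision in (a), assuming the operator already holds
an AmazonS3 client; everything except the AWS SDK calls is hypothetical:

import java.io.InputStream;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Illustrative upload decision for S3MultiPartUpload; the class and method
// names are hypothetical, only the AWS SDK calls are real.
class UploadDecisionSketch
{
  private final AmazonS3 s3;
  private final String bucket;

  UploadDecisionSketch(AmazonS3 s3, String bucket)
  {
    this.s3 = s3;
    this.bucket = bucket;
  }

  void upload(String filePath, List<InputStream> blocks, long singleBlockLength)
  {
    if (blocks.size() > 1) {
      // More than one block: use the multipart steps shown earlier
      // (initiate, upload each block as a part, complete).
      multipartUpload(filePath, blocks);
    } else {
      // A single block fits in one putObject() call, avoiding the extra
      // initiate/complete round trips of a multipart upload.
      ObjectMetadata md = new ObjectMetadata();
      md.setContentLength(singleBlockLength);
      s3.putObject(bucket, filePath, blocks.get(0), md);
    }
  }

  private void multipartUpload(String filePath, List<InputStream> blocks)
  {
    // Steps 1-3 from the multipart sketch above.
  }
}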


(Solution 2)

The DAG for this solution is as follows:

1) InitiateS3Upload:

Input: FileMetadata

Initiates the upload. This operator emits (fileMetadata, uploadId) to
S3FileMerger and (filePath, uploadId) to S3BlockUpload.

2) S3BlockUpload:

Input: FileBlockMetadata, ReaderRecord

Uploads the blocks to S3. S3 will return an ETag for each uploaded part.
S3BlockUpload emits (path, ETag) to S3FileMerger.

3) S3FileMerger: Sends the file merge request to S3.
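
A minimal populateDAG() sketch of this wiring. The three S3 operator classes
and all port names are illustrative assumptions for the proposed design, not
existing Malhar operators:

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;

// Illustrative DAG for Solution 2; operator classes and port names are
// assumptions for the proposed design.
public class S3OutputApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    InitiateS3Upload initiate = dag.addOperator("InitiateS3Upload", new InitiateS3Upload());
    S3BlockUpload blockUpload = dag.addOperator("S3BlockUpload", new S3BlockUpload());
    S3FileMerger merger = dag.addOperator("S3FileMerger", new S3FileMerger());

    // FileMetadata reaches InitiateS3Upload from the upstream file splitter;
    // FileBlockMetadata/ReaderRecord reach S3BlockUpload from the block reader.
    dag.addStream("UploadIdToBlocks", initiate.uploadIdOutput, blockUpload.uploadIdInput);
    dag.addStream("UploadIdToMerger", initiate.uploadMetadataOutput, merger.uploadMetadataInput);
    dag.addStream("BlockETags", blockUpload.eTagOutput, merger.eTagInput);
  }
}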

Pros:

(1) Supports uploading files of up to 5 TB in size.

(2) Reduces end-to-end latency, because we do not wait for all the blocks of
a file to be written to HDFS before starting the upload.

Please vote and share your thoughts on these approaches.

Regards,
Chaitanya

On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <
chaitanya@datatorrent.com> wrote:

> @ Tushar
>
>   S3 Copy Output Module consists of following operators:
> 1) BlockWriter : Writes the blocks into the HDFS.
> 2) Synchronizer: Sends a trigger to the downstream operator when all the
> blocks of a file have been written to HDFS.
> 3) FileMerger: Merges all the blocks into a file and uploads the merged
> file to the S3 bucket.
>
> @ Ashwin
>
>     Good suggestion. In the first iteration, I will add the proposed
> design.
> Multipart support will be added in the next iteration.
>
> Regards,
> Chaitanya
>
> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
>> +1 regarding the s3 upload functionality.
>>
>> However, I think we should just focus on multipart upload directly, as it
>> comes with various advantages like higher throughput, faster recovery, and
>> not needing to wait for the entire file to be created before uploading each
>> part. See:
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
>>
>> Also, it seems like we can do a multipart upload if the file size is more
>> than 5 MB. They do recommend using multipart if the file size is more than
>> 100 MB. I am not sure if there is a hard lower limit though. See:
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>>
>> This way, it seems like we don't have to wait until a file is completely
>> written to HDFS before performing the upload operation.
>>
>> Regards,
>> Ashwin.
>>
>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <tu...@datatorrent.com>
>> wrote:
>>
>> > +1, we need this functionality.
>> >
>> > Is it going to be a single operator or multiple operators? If multiple
>> > operators, then can you explain what functionality each operator will
>> > provide?
>> >
>> >
>> > Regards,
>> > -Tushar.
>> >
>> >
>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <yogidevendra@apache.org
>> >
>> > wrote:
>> >
>> > > Writing to S3 is a common use-case for applications.
>> > > This module will definitely be helpful.
>> > >
>> > > +1 for adding this module.
>> > >
>> > >
>> > > ~ Yogi
>> > >
>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <
>> chaitanya@datatorrent.com>
>> > > wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > >   I am proposing an S3 output copy module. The primary functionality
>> > > > of this module is uploading files to an S3 bucket using a
>> > > > block-by-block approach.
>> > > >
>> > > >   Below is the JIRA created for this task:
>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
>> > > >
>> > > >   The design of this module is similar to the HDFS copy module, so I
>> > > > will extend the HDFS copy module for S3.
>> > > >
>> > > > Design of this Module:
>> > > > =======================
>> > > > 1) Write the blocks to HDFS.
>> > > > 2) Merge the blocks into a file.
>> > > > 3) Upload the merged file to the S3 bucket using the AmazonS3Client APIs.
>> > > >
>> > > > Steps (1) & (2) are the same as in the HDFS copy module.
>> > > >
>> > > > *Limitation:* Supports files only up to 5 GB. Please refer to the
>> > > > link below about the limitations of uploading objects to S3:
>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>> > > >
>> > > > We can resolve the above limitation by using the S3 multipart feature.
>> > > > I will add multipart support in the next iteration.
>> > > >
>> > > >  Please share your thoughts on this.
>> > > >
>> > > > Regards,
>> > > > Chaitanya
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>>
>> Regards,
>> Ashwin.
>>
>
>

Re: S3 Output Module

Posted by Mohit Jotwani <mo...@datatorrent.com>.
+1 for Solution 2

Regards,
Mohit
On 27 Oct 2016 2:02 p.m., "Sandeep Deshmukh" <sa...@datatorrent.com>
wrote:

> +1
>
> Regards,
> Sandeep

Re: S3 Output Module

Posted by Sandeep Deshmukh <sa...@datatorrent.com>.
+1

Regards,
Sandeep

On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu <
chaitanya@datatorrent.com> wrote:

> Hi All,
>
>   I am planning to implement approach (2) of the S3 Output Module, which I
> proposed in my previous email. Performance would be better compared to
> approach (1) because the blocks are uploaded without first being saved to HDFS.
>
>   Please share your opinions.
>
> Regards,
> Chaitanya

Re: S3 Output Module

Posted by Chaitanya Chebolu <ch...@datatorrent.com>.
Hi All,

  I am planning to implement approach (2) of the S3 Output Module, which I
proposed in my previous email. Performance would be better compared to
approach (1) because the blocks are uploaded without first being saved to HDFS.

  Please share your opinions.

Regards,
Chaitanya
