Posted to dev@apex.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/11/01 08:46:58 UTC

[jira] [Commented] (APEXMALHAR-2303) S3 Line By Line Module

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624810#comment-15624810 ] 

ASF GitHub Bot commented on APEXMALHAR-2303:
--------------------------------------------

GitHub user ajaygit158 opened a pull request:

    https://github.com/apache/apex-malhar/pull/478

    APEXMALHAR-2303 Added S3RecordReaderModule for reading records line by line

    @chaithu14 @yogidevendra Kindly review

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajaygit158/apex-malhar APEXMALHAR-2303

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/478.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #478
    
----
commit b999cbd044b370a271ea8265f2b3e4b7be3935bc
Author: Ajay <aj...@gmail.com>
Date:   2016-10-27T12:57:28Z

    Added S3 Record Reader module

commit 426f8f6efc838ca754ad6070c3d0110537b1f222
Author: Ajay <aj...@gmail.com>
Date:   2016-10-28T13:42:51Z

    Changes to ensure compilation with jdk 1.7

commit a2e7d9892e00784b881c53e2d44cff12ceb6abb1
Author: Ajay <aj...@gmail.com>
Date:   2016-11-01T08:42:27Z

    Few corrections in S3RecordReader

----


> S3 Line By Line Module
> ----------------------
>
>                 Key: APEXMALHAR-2303
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
>             Project: Apache Apex Malhar
>          Issue Type: Bug
>            Reporter: Ajay Gupta
>            Assignee: Ajay Gupta
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This is a new module which will consist of 2 operators
> 1) File Splitter -- Already existing in Malhar library
> 2) S3RecordReader -- Read a file from S3 and output the records (delimited or fixed width) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Commented] (APEXMALHAR-2303) S3 Line By Line Module

Posted by AJAY GUPTA <aj...@gmail.com>.
Hi Apex Dev Community,

For the fixed-width S3 record reader, the input is the block metadata
containing the block offset and the block length.
The block length may not be a multiple of the record length
(e.g., the block length can be 1 MB while the record length is 23 bytes).
Hence, the first byte in the block may belong to a record starting in the
previous block. Similarly, the last record may not have all its bytes in
this block and may spill over into the next block.

Since the records are fixed width, we can optimize the way data is
fetched from S3.
We can adjust the start and end offsets so that the data we fetch from S3
is aligned to record boundaries and no record spans multiple blocks.

While retrieving the block, we will retrieve bytes from X up to Y, where
X is the start byte of the first record whose first byte lies in the current block
Y is the end byte of the last record that starts in the current block

    startOffset = block.startOffset + (recordLength - block.startOffset % recordLength) % recordLength
    endOffset = block.endOffset + (recordLength - block.endOffset % recordLength) % recordLength - 1

This ensures that no record needs multiple GET requests to be fetched in
full, and that no extra bytes are read from S3.
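
To make the arithmetic above concrete, here is a minimal Java sketch of the
two formulas. It is an illustration only, not the code in the pull request.
The class and method names are hypothetical. It assumes that the first
record starts at file offset 0 (no header), so every record starts at a
multiple of recordLength, and that block.endOffset is exclusive, i.e. equal
to the next block's startOffset. It does not clamp the end offset to the
file length.

```java
// Hypothetical helper, not part of Apex Malhar.
final class FixedWidthBlockAligner {
    private FixedWidthBlockAligner() {}

    // Rounds offset up to the nearest multiple of recordLength.
    // An offset that is already a multiple is returned unchanged.
    static long roundUpToRecordBoundary(long offset, long recordLength) {
        if (offset < 0 || recordLength <= 0) {
            throw new IllegalArgumentException("offset must be >= 0 and recordLength > 0");
        }
        return offset + (recordLength - offset % recordLength) % recordLength;
    }

    // First byte to fetch: start of the first record that begins in the block.
    static long alignedStart(long blockStart, long recordLength) {
        return roundUpToRecordBoundary(blockStart, recordLength);
    }

    // Last byte to fetch (inclusive): end of the last record that begins in
    // the block. Assumes blockEndExclusive is the next block's start offset.
    static long alignedEndInclusive(long blockEndExclusive, long recordLength) {
        return roundUpToRecordBoundary(blockEndExclusive, recordLength) - 1;
    }
}
```

With 1 MB blocks (1048576 bytes) and 23-byte records, the block covering
[0, 1048576) would be fetched as bytes 0 to 1048592, and the block covering
[1048576, 2097152) as bytes 1048593 to 2097162, so consecutive ranges meet
exactly on a record boundary. If a block is shorter than a record and no
record starts inside it, the aligned start exceeds the aligned end and
there is nothing to fetch for that block.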

Kindly let me know your views or alternative approaches for the same.


*Regards*
*Ajay*

