Posted to commits@samza.apache.org by "Yan Fang (JIRA)" <ji...@apache.org> on 2014/04/30 02:41:15 UTC

[jira] [Updated] (SAMZA-138) System that places specified file contents onto stream

     [ https://issues.apache.org/jira/browse/SAMZA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yan Fang updated SAMZA-138:
---------------------------

    Attachment: SAMZA-138.patch

RB: https://reviews.apache.org/r/20869/

1) Read multiple files from task.input (e.g. filereader.~/test.txt, filereader.~/test2.txt).
2) Record the offset. Reading resumes from the latest line (already-read lines are skipped).
3) Set the threshold to 10,000, as suggested by [~criccomini].
4) Did not add a unit test for the code, because I suppose users will want to change the code for their own testing, and a unit test would give them a little headache. I did test it in local mode, in YARN mode, and with bootstrap streams.
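
For readers without the patch at hand, points 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in SAMZA-138.patch: the class LineOffsetReader and its methods are hypothetical names, the offset is assumed to be a count of lines already consumed, and no Samza classes are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper for illustration; not part of the SAMZA-138 patch. */
class LineOffsetReader {
    /** Bounded queue capacity, using the 10,000 threshold from point 3. */
    static final int THRESHOLD = 10_000;

    /** Creates a bounded queue so a fast reader cannot outrun a slow consumer. */
    static BlockingQueue<String> newQueue() {
        return new ArrayBlockingQueue<>(THRESHOLD);
    }

    /**
     * Reads the file, skips the first linesAlreadyRead lines, and puts the
     * remaining lines on the queue. Returns the new offset, i.e. the number
     * of lines consumed so far (assumption: offset == line count).
     */
    static long readFrom(Path file, long linesAlreadyRead, BlockingQueue<String> queue)
            throws IOException, InterruptedException {
        long lineNumber = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (lineNumber <= linesAlreadyRead) {
                    continue; // already delivered in an earlier run
                }
                queue.put(line); // blocks while the queue is full
            }
        }
        // Never move the offset backwards, even if the file shrank.
        return Math.max(lineNumber, linesAlreadyRead);
    }
}
```

For multiple files (point 1), a caller could keep one such offset per file path, for example in a map, and call readFrom once per file.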

Other thoughts:
* Not sure how we want to use this system, since SAMZA-235 takes another approach. I guess it's always a good idea to have more examples in hello-samza, especially when we will have a stable release.
* In FilereaderSystemAdmin, I set the oldest offset, newest offset, and upcoming offset to "0". Though the system works, I cannot convince myself why it works.

Thanks.

> System that places specified file contents onto stream
> ------------------------------------------------------
>
>                 Key: SAMZA-138
>                 URL: https://issues.apache.org/jira/browse/SAMZA-138
>             Project: Samza
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>         Environment: RHELinux 2.6.18-371.4.1.el5
>            Reporter: Jonathan Poltak Samosir
>            Priority: Minor
>              Labels: feature, newbie, patch
>         Attachments: FileReaderConsumer.java, FileReaderSystemFactory.java, SAMZA-138.patch
>
>
> A fairly straightforward Samza System that reads from a specified file, and places that file's contents onto a SystemStreamPartition for use as input for a StreamTask.
> Roughly based off how the hello-samza example project's WikipediaSystem works (more the SystemConsumerFactory rather than SystemConsumer class). 
> Probably needs a bit of work, but basic functionality works as intended. Hopefully useful to some, either as a functioning system or as a base for a more robust and functionally-promising system that you wish to implement.
> Some suggested improvements (not yet implemented):
> * handle reading from multiple files ([suggested alternative input specification|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA7465%40ESV4-MBX01.linkedin.biz%3E]- point 2)
> * use of filepos for IncomingMessageEnvelope offset ([more info here|https://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201401.mbox/%3C1B43C7411DB20E47AB0FB62E7262B80179BA749D%40ESV4-MBX01.linkedin.biz%3E])
> * come up with a reasonable bounded queue threshold (the value of 100 was arbitrary, as I was unsure of a reasonable value here) 
> * better handling for the exceptions encountered (I wasn't 100% sure about some of them)
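
The "filepos as offset" suggestion in the description could look something like the sketch below. It is a guess at the idea, not code from the attachments: FilePosReader and Record are hypothetical names, and the offset is assumed to be the byte position at which the next line starts, so that a restart can seek straight to it instead of re-reading and skipping lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the attached files. */
class FilePosReader {
    /** A line plus the byte position where the following line begins. */
    static final class Record {
        final String line;
        final long nextOffset;

        Record(String line, long nextOffset) {
            this.line = line;
            this.nextOffset = nextOffset;
        }
    }

    /**
     * Seeks to startPos and reads every remaining line. Each record carries
     * the file pointer after the line, which a caller could store as the
     * message offset (assumption: offsets are byte positions).
     */
    static List<Record> readFrom(String path, long startPos) throws IOException {
        List<Record> records = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(startPos);
            String line;
            // Note: RandomAccessFile.readLine maps each byte to one char,
            // so this sketch is only safe for single-byte encodings.
            while ((line = raf.readLine()) != null) {
                records.add(new Record(line, raf.getFilePointer()));
            }
        }
        return records;
    }
}
```

Compared with counting lines, a byte position makes resuming cheap on large files, at the cost of being tied to the file's exact byte layout.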



--
This message was sent by Atlassian JIRA
(v6.2#6252)