Posted to dev@beam.apache.org by "Giesin, Peter" <Pe...@fisglobal.com> on 2016/03/14 18:23:56 UTC

[PROPOSAL] MultiLineIO

Hi all!

I am looking to get involved in the project. I have a MultiLineIO file-based source that I think would be useful. I know the project is just spinning up, but can I simply clone the repo and create a PR for the new IO? I also looked over JIRA, and there are some tickets I can help out with.

Best regards,
Peter Giesin
peter.giesin@fisglobal.com


_____________
The information contained in this message is proprietary and/or confidential. If you are not the intended recipient, please: (i) delete the message and all copies; (ii) do not disclose, distribute or use the message in any manner; and (iii) notify the sender immediately. In addition, please be aware that any message addressed to our domain is subject to archiving and review by persons other than the intended recipient. Thank you.

Re: [PROPOSAL] MultiLineIO

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
Have you considered expanding TextIO to support an arbitrary delimiter
instead of defining MultiLineIO?
https://github.com/apache/incubator-beam/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/TextIO.java#L737

TextIO currently splits on '\n', '\r\n', or '\r'. Letting it split on an
arbitrary delimiter seems useful.
Note that even though the name TextIO implies it's used for strings, this is
not strictly required, since a user can use any coder to decode the bytes
between two delimiters.
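
To make the delimiter idea concrete, here is a minimal plain-Java sketch of splitting raw bytes on an arbitrary delimiter and only then decoding each record. It is not TextIO's code: the class DelimiterSplit and its methods are hypothetical names chosen for illustration, and the UTF-8 decoding in main merely stands in for whatever coder a user would supply.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the SDK: splits bytes on any delimiter.
final class DelimiterSplit {

    /** Returns the byte ranges between delimiters; the delimiter is dropped. */
    static List<byte[]> split(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<byte[]> records = new ArrayList<>();
        int start = 0; // first byte of the record being built
        int i = 0;
        while (i + delimiter.length <= data.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(Arrays.copyOfRange(data, start, i));
                i += delimiter.length;
                start = i;
            } else {
                i++;
            }
        }
        // Bytes after the last delimiter form a final record.
        if (start < data.length) {
            records.add(Arrays.copyOfRange(data, start, data.length));
        }
        return records;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] delimiter) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "first##second##third".getBytes(StandardCharsets.UTF_8);
        byte[] delimiter = "##".getBytes(StandardCharsets.UTF_8);
        for (byte[] record : split(data, delimiter)) {
            // Decoding happens per record; any other decoder could be used here.
            System.out.println(new String(record, StandardCharsets.UTF_8));
        }
    }
}
```

Because splitting works on bytes, the decoding step is independent of the delimiter logic, which is the point made above about using any coder.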

On Thu, Mar 17, 2016 at 12:53 AM, Dan Halperin <dh...@google.com.invalid>
wrote:

> Hi Peter,
>
> Echoing Eugene's and JB's thoughts -- we'd love a PR!
>
> I also wanted to say: we've hit you with a lot of recommendations in this
> email thread. If you have any questions, you can ask us here -- but we'll
> of course be happy to answer them during code review as well. Do not feel
> like meeting all these many criteria is a pre-requisite for opening a Pull
> Request -- we just may give you feedback and ask for changes before merging
> :).
>
> Thanks!
> Dan

Re: [PROPOSAL] MultiLineIO

Posted by Dan Halperin <dh...@google.com.INVALID>.
Hi Peter,

Echoing Eugene's and JB's thoughts -- we'd love a PR!

I also wanted to say: we've hit you with a lot of recommendations in this
email thread. If you have any questions, you can ask us here -- but we'll
of course be happy to answer them during code review as well. Please don't
feel that meeting all of these criteria is a prerequisite for opening a pull
request -- we may just give you feedback and ask for changes before merging
:).

Thanks!
Dan

On Mon, Mar 14, 2016 at 12:27 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Yes, you already use the "new style" as you use BoundedSource.
>
> Regards
> JB
>

Re: [PROPOSAL] MultiLineIO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Yes, you already use the "new style" as you use BoundedSource.

Regards
JB

On 03/14/2016 08:08 PM, Giesin, Peter wrote:
> The MultiLineIO is a BoundedSource and an extension of FileBasedSource. Where the FileBasedSource reads a single line at a time the MultiLineIO allows the user to define an arbitrary “message” delimiter. It then reads through the file, removing newlines, until the separator is read, finally returning the character sequence that is built.
>
>
>
> I believe it is already built using the new style but I will compare it to the BigTableIO to confirm that.
>
> Peter
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] MultiLineIO

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Thanks Peter! Please also make sure to use SourceTestUtils to verify that
your FileBasedSource is well-behaved w.r.t. dynamic work rebalancing
(especially the various assertSplitAtFraction methods). For examples, see
XmlSourceTest
<https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/test/java/com/google/cloud/dataflow/sdk/io/XmlSourceTest.java>
.
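
The property those split tests target can be modelled in a few lines of plain Java. The sketch below does not use SourceTestUtils or any SDK class; SplitConsistency and its methods are hypothetical names. It assumes the usual convention for delimited files, namely that a record belongs to the range containing its first byte, and checks that reading [0, split) and [split, end) yields exactly the records of the unsplit read, for every split point. It only models where a split lands, not a split arriving while a reader is mid-record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of split consistency for a delimited source.
final class SplitConsistency {

    /** Returns the records whose starting offset lies in [start, end). */
    static List<String> readRange(String data, char delim, int start, int end) {
        List<String> out = new ArrayList<>();
        int pos;
        if (start == 0) {
            pos = 0;
        } else {
            // Skip the record that began before 'start': find the first
            // delimiter at or after start - 1; the next record follows it.
            int d = data.indexOf(delim, start - 1);
            if (d == -1) {
                return out;
            }
            pos = d + 1;
        }
        // A record that starts inside the range is read in full, even if
        // its bytes run past 'end'.
        while (pos < end && pos < data.length()) {
            int next = data.indexOf(delim, pos);
            int recordEnd = (next == -1) ? data.length() : next;
            out.add(data.substring(pos, recordEnd));
            pos = recordEnd + 1;
        }
        return out;
    }

    /** True if no split point loses or duplicates a record. */
    static boolean consistentAtEverySplit(String data, char delim) {
        List<String> whole = readRange(data, delim, 0, data.length());
        for (int split = 0; split <= data.length(); split++) {
            List<String> combined = new ArrayList<>(readRange(data, delim, 0, split));
            combined.addAll(readRange(data, delim, split, data.length()));
            if (!combined.equals(whole)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(consistentAtEverySplit("aa|bb||cc", '|'));
    }
}
```

A source with a multi-character or user-defined separator has more ways to get this wrong (for example, a split point that lands inside the separator), which is why exhaustive split testing is worth the effort.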

On Mon, Mar 14, 2016 at 12:10 PM Giesin, Peter <Pe...@fisglobal.com>
wrote:

> The MultiLineIO is a BoundedSource and an extension of FileBasedSource.
> Where the FileBasedSource reads a single line at a time the MultiLineIO
> allows the user to define an arbitrary “message” delimiter. It then reads
> through the file, removing newlines, until the separator is read, finally
> returning the character sequence that is built.
>
>
>
> I believe it is already built using the new style but I will compare it to
> the BigTableIO to confirm that.
>
> Peter
>

Re: [PROPOSAL] MultiLineIO

Posted by "Giesin, Peter" <Pe...@fisglobal.com>.
The MultiLineIO is a BoundedSource and an extension of FileBasedSource. Where the FileBasedSource reads a single line at a time, the MultiLineIO lets the user define an arbitrary “message” delimiter. It then reads through the file, removing newlines, until the separator is read, and finally returns the character sequence that was built.
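
The reading loop described above can be sketched in plain Java as follows. This is only an illustration of the described behaviour, not the actual MultiLineIO code, which does not appear in this thread: the class MultiLineRecords and its methods are hypothetical names, and the sketch assumes the separator itself contains no newline characters, since newlines are dropped before matching.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "read until separator, dropping newlines".
final class MultiLineRecords {

    /** Reads every record from the input, splitting on the separator. */
    static List<String> readAll(Reader in, String separator) throws IOException {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n' || c == '\r') {
                continue; // newlines are removed from the record
            }
            current.append((char) c);
            if (endsWith(current, separator)) {
                // Strip the separator and emit the record built so far.
                current.setLength(current.length() - separator.length());
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Text after the last separator is returned as a final record.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int offset = sb.length() - suffix.length();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < suffix.length(); i++) {
            if (sb.charAt(offset + i) != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String input = "MSG1 line1\nline2;;MSG2\r\npart;;tail";
        for (String record : readAll(new StringReader(input), ";;")) {
            System.out.println(record);
        }
    }
}
```

A real FileBasedSource reader would additionally have to track byte offsets so that each record can be attributed to a split range; the sketch leaves that out.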

I believe it is already built using the new style, but I will compare it to BigtableIO to confirm that.

Peter

On 3/14/16, 1:50 PM, "Jean-Baptiste Onofré" <jb...@nanthrax.net> wrote:

>I second Eugene here.
>
>In the past, I developed some IOs using the "old style" (as was done in
>PubSubIO). I'm now refactoring it to use the "new style".
>
>Regards
>JB
>
>On 03/14/2016 06:47 PM, Eugene Kirpichov wrote:
>> Hi Peter,
>> Looking forward to your PR. Please note that source classes are relatively
>> tricky to develop, so would you mind briefly explaining what your source
>> will do here over email, so that we hash out some possible issues early
>> rather than in PR comments?
>> Also note that we now recommend packaging IO connectors as PTransforms,
>> making the PTransform class itself be a builder - while the Source/Sink
>> classes should be kept package-private (rather than exposed to the user).
>> For an example of a connector packaged in this style, see BigtableIO (
>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/bigtable/BigtableIO.java
>> ).
>> The advantage is that this style allows you to restructure the connector or
>> add additional transforms into its implementation if necessary, without
>> changing the call sites. It might seem less important in case of a simple
>> connector like reading lines from file, but it will become much more
>> important with things like SplittableDoFn
>> <https://issues.apache.org/jira/browse/BEAM-65>.
>>
>> On Mon, Mar 14, 2016 at 10:29 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>>> Hi Peter,
>>>
>>> awesome !
>>>
>>> Yes, you can create the PR using the github mirror.
>>>
>>> Does your MultiLineIO use Bounded/Unbounded "new" classes ?
>>>
>>> Regards
>>> JB
>>>
>>> On 03/14/2016 06:23 PM, Giesin, Peter wrote:
>>>> Hi all!
>>>>
>>>> I am looking to get involved in the project. I have a MultiLineIO
>>> file-based source that I think would be useful. I know the project is just
>>> spinning up but can I simply clone the repo and create a PR for the new IO?
>>> Also looked over JIRA and there are some tickets I can help out with.
>>>>
>>>> Best regards,
>>>> Peter Giesin
>>>> peter.giesin@fisglobal.com
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>
>-- 
>Jean-Baptiste Onofré
>jbonofre@apache.org
>http://blog.nanthrax.net
>Talend - http://www.talend.com
>

Re: [PROPOSAL] MultiLineIO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
I second Eugene here.

In the past, I developed some IOs using the "old style" (as was done in
PubSubIO). I'm now refactoring them to use the "new style".

Regards
JB

On 03/14/2016 06:47 PM, Eugene Kirpichov wrote:
> Hi Peter,
> Looking forward to your PR. Please note that source classes are relatively
> tricky to develop, so would you mind briefly explaining what your source
> will do here over email, so that we hash out some possible issues early
> rather than in PR comments?
> Also note that we now recommend packaging IO connectors as PTransforms,
> making the PTransform class itself be a builder - while the Source/Sink
> classes should be kept package-private (rather than exposed to the user).
> For an example of a connector packaged in this style, see BigtableIO (
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/bigtable/BigtableIO.java
> ).
> The advantage is that this style allows you to restructure the connector or
> add additional transforms into its implementation if necessary, without
> changing the call sites. It might seem less important in case of a simple
> connector like reading lines from file, but it will become much more
> important with things like SplittableDoFn
> <https://issues.apache.org/jira/browse/BEAM-65>.
>
> On Mon, Mar 14, 2016 at 10:29 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi Peter,
>>
>> awesome !
>>
>> Yes, you can create the PR using the github mirror.
>>
>> Does your MultiLineIO use Bounded/Unbounded "new" classes ?
>>
>> Regards
>> JB
>>
>> On 03/14/2016 06:23 PM, Giesin, Peter wrote:
>>> Hi all!
>>>
>>> I am looking to get involved in the project. I have a MultiLineIO
>> file-based source that I think would be useful. I know the project is just
>> spinning up but can I simply clone the repo and create a PR for the new IO?
>> Also looked over JIRA and there are some tickets I can help out with.
>>>
>>> Best regards,
>>> Peter Giesin
>>> peter.giesin@fisglobal.com
>>>
>>>
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [PROPOSAL] MultiLineIO

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi Peter,
Looking forward to your PR. Please note that source classes are relatively
tricky to develop, so would you mind briefly explaining what your source
will do here over email, so that we hash out some possible issues early
rather than in PR comments?
Also note that we now recommend packaging IO connectors as PTransforms,
making the PTransform class itself be a builder - while the Source/Sink
classes should be kept package-private (rather than exposed to the user).
For an example of a connector packaged in this style, see BigtableIO (
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/bigtable/BigtableIO.java
).
The advantage is that this style allows you to restructure the connector or
add additional transforms into its implementation if necessary, without
changing the call sites. It might seem less important in case of a simple
connector like reading lines from file, but it will become much more
important with things like SplittableDoFn
<https://issues.apache.org/jira/browse/BEAM-65>.
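
To illustrate the packaging style described above, here is a minimal sketch in plain Java. It does not use the real SDK: SimpleTransform is a hypothetical stand-in for PTransform, and MultiLineIO.read(), fromText(), withDelimiter() and DelimitedSource are made-up names for illustration only. The point is just the shape: the transform class is its own builder, and the source class is package-private, so it can be restructured without changing call sites.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the SDK's PTransform; NOT the real Beam/Dataflow API. */
interface SimpleTransform<OutputT> {
    OutputT apply();
}

/** Hypothetical connector: users only ever touch MultiLineIO.read() and its builder methods. */
final class MultiLineIO {
    private MultiLineIO() {}

    /** Entry point; defaults to an empty input split on newlines. */
    static Read read() {
        return new Read("", "\n");
    }

    /** The transform is its own immutable builder: each setter returns a new copy. */
    static final class Read implements SimpleTransform<List<String>> {
        private final String text;
        private final String delimiter;

        private Read(String text, String delimiter) {
            this.text = text;
            this.delimiter = delimiter;
        }

        /** Assumption for the sketch: input is an in-memory string rather than a file. */
        Read fromText(String text) {
            return new Read(text, delimiter);
        }

        Read withDelimiter(String delimiter) {
            if (delimiter.isEmpty()) {
                throw new IllegalArgumentException("delimiter must not be empty");
            }
            return new Read(text, delimiter);
        }

        @Override
        public List<String> apply() {
            // The source is an implementation detail; it could be replaced or wrapped
            // in further steps here without any change at the call sites.
            return new DelimitedSource(text, delimiter).readAll();
        }
    }
}

/** Package-private on purpose: not part of the user-facing API. */
final class DelimitedSource {
    private final String text;
    private final String delimiter;

    DelimitedSource(String text, String delimiter) {
        this.text = text;
        this.delimiter = delimiter;
    }

    /** Splits the text on the literal delimiter; a record may span several lines. */
    List<String> readAll() {
        List<String> records = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, idx));
            start = idx + delimiter.length();
        }
        if (start < text.length()) {
            // Trailing record that is not terminated by a delimiter.
            records.add(text.substring(start));
        }
        return records;
    }
}
```

A call site under these assumptions would look like MultiLineIO.read().fromText(data).withDelimiter("||").apply(), and it stays the same even if DelimitedSource is later rewritten.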

On Mon, Mar 14, 2016 at 10:29 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Peter,
>
> awesome !
>
> Yes, you can create the PR using the github mirror.
>
> Does your MultiLineIO use Bounded/Unbounded "new" classes ?
>
> Regards
> JB
>
> On 03/14/2016 06:23 PM, Giesin, Peter wrote:
> > Hi all!
> >
> > I am looking to get involved in the project. I have a MultiLineIO
> file-based source that I think would be useful. I know the project is just
> spinning up but can I simply clone the repo and create a PR for the new IO?
> Also looked over JIRA and there are some tickets I can help out with.
> >
> > Best regards,
> > Peter Giesin
> > peter.giesin@fisglobal.com
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [PROPOSAL] MultiLineIO

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Peter,

awesome !

Yes, you can create the PR using the github mirror.

Does your MultiLineIO use Bounded/Unbounded "new" classes ?

Regards
JB

On 03/14/2016 06:23 PM, Giesin, Peter wrote:
> Hi all!
>
> I am looking to get involved in the project. I have a MultiLineIO file-based source that I think would be useful. I know the project is just spinning up but can I simply clone the repo and create a PR for the new IO? Also looked over JIRA and there are some tickets I can help out with.
>
> Best regards,
> Peter Giesin
> peter.giesin@fisglobal.com
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com