Posted to dev@beam.apache.org by Matthias Baetens <ma...@datatonic.com> on 2017/05/28 09:29:22 UTC

Python SDK: BigTableIO

Hey guys,

We have been using Beam for quite a few months now, so we (my colleague
Robert & I) thought it might be cool to contribute a bit as well.

The challenge we want to take up is writing the BigTableIO for the Python
SDK (which is not yet in the works according to the website
<https://github.com/apache/beam-site/blob/asf-site/src/documentation/io/built-in.md>).
I have searched JIRA for a BigTableIO issue and did not find one, so I
suppose creating it is the first step we take.

Any pointers or feedback more than welcome!

Best,

Matthias

Re: Python SDK: BigTableIO

Posted by Davor Bonaci <da...@apache.org>.

(JIRA role added; reassigned.)

On Thu, Jun 1, 2017 at 10:05 AM, Chamikara Jayalath <ch...@apache.org>
wrote:

> Thanks. I added some comments to the doc.
>
> Davor should be able to assign this JIRA to you. Also, Solomon who
> implemented the Java BigTable connector might have more input here.
>
> - Cham

Re: Python SDK: BigTableIO

Posted by Chamikara Jayalath <ch...@apache.org>.

Thanks. I added some comments to the doc.

Davor should be able to assign this JIRA to you. Also, Solomon who
implemented the Java BigTable connector might have more input here.

- Cham

On Thu, Jun 1, 2017 at 2:19 AM Matthias Baetens <
matthias.baetens@datatonic.com> wrote:

> Hi Cham, Stephen,
>
> Thanks a lot for the input, really useful to get started.
>
> We'll probably start with implementing the Source (looks the most
> straightforward).
> I made a working document
> <https://docs.google.com/document/d/1iXeQvIAsGjp9orleDy0o5ExU-eMqWesgvtt231UoaPg/edit?usp=sharing>
> to organise and track our progress a bit, happy to discuss or receive feedback
> there as well. We made a JIRA issue
> <https://issues.apache.org/jira/browse/BEAM-2395> as well; should we get
> assigned to it?
>
> About writing the Sink: are there any examples of how this was done
> previously where we can get some inspiration from? I think it would be good
> to discuss this in more detail once we finish writing the Source.
>
> Matthias

Re: Python SDK: BigTableIO

Posted by Matthias Baetens <ma...@datatonic.com>.

Hi Cham, Stephen,

Thanks a lot for the input, really useful to get started.

We'll probably start with implementing the Source (looks the most
straightforward).
I made a working document
<https://docs.google.com/document/d/1iXeQvIAsGjp9orleDy0o5ExU-eMqWesgvtt231UoaPg/edit?usp=sharing>
to organise and track our progress a bit; happy to discuss or receive feedback
there as well. We made a JIRA issue
<https://issues.apache.org/jira/browse/BEAM-2395> as well; should we get
assigned to it?

About writing the Sink: are there any examples of how this was done
previously that we can draw some inspiration from? I think it would be good
to discuss this in more detail once we finish writing the Source.
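
As a starting point for that Sink discussion, here is a minimal sketch of the
buffering pattern a write transform might use. It is only an illustration under
stated assumptions: the class name BatchingWriter, the Mutation alias and the
write_batch callback are all hypothetical, and nothing here calls Beam or the
Bigtable client. The method names mirror the start_bundle / process /
finish_bundle lifecycle of a Beam DoFn, so the same shape could later be moved
into a DoFn wrapped by a PTransform.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a real row mutation: (row_key, value).
Mutation = Tuple[bytes, bytes]


class BatchingWriter:
    """Buffers mutations and hands them to a callback in batches.

    This class does not import Beam; it only imitates the DoFn lifecycle
    (start_bundle / process / finish_bundle) so the sketch runs standalone.
    """

    def __init__(self,
                 write_batch: Callable[[List[Mutation]], None],
                 max_batch_size: int = 100) -> None:
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        # write_batch is an assumed callback that performs the actual write.
        self._write_batch = write_batch
        self._max_batch_size = max_batch_size
        self._buffer: List[Mutation] = []

    def start_bundle(self) -> None:
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, element: Mutation) -> None:
        # Buffer the element and flush once the batch is full.
        self._buffer.append(element)
        if len(self._buffer) >= self._max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Flush whatever is left so no element is dropped at bundle end.
        self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer = []
```

Flushing in finish_bundle matters: without it, a bundle whose size is not a
multiple of the batch size would silently lose its last few elements.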

Matthias

On Tue, May 30, 2017 at 7:28 PM, Stephen Sisk <si...@google.com.invalid>
wrote:

> Hey Matthias,
>
> to add on to what Chamikara mentioned, we have lots of info in the generic
> IO authoring guide [1], the Python IO authoring guide [2] and the
> PTransform Style Guide[3].  The PTransform style guide doesn't sound like
> it applies, but it has a lot of specific tips from lessons we've learned in
> the past from I/O work.
>
> If you plan on contributing it back to the community, I'd also suggest
> opening up a JIRA issue & updating the beam website (eg [4]) that you're
> working on this (those steps are pretty trivial.)
>
> We've recently been trying out using branches when we add new I/Os since
> the PRs tend to get bigger than we like for a single PR.
>
> Please feel free to email the dev mailing list if you have questions! We
> are excited and happy to help out with thinking about design/etc... (eg, as
> cham hinted at, should you use a Source vs. use regular ParDo transforms?)
>
> S
>
> [1] https://beam.apache.org/documentation/io/authoring-overview/
> [2] https://beam.apache.org/documentation/sdks/python-custom-io/
> [3] https://beam.apache.org/contribute/ptransform-style-guide/
> [4] https://github.com/apache/beam-site/pull/250



-- 


*Matthias Baetens*


*datatonic | data power unleashed*

office +44 203 668 3680  |  mobile +44 74 918 20646

Level24 | 1 Canada Square | Canary Wharf | E14 5AB London


We've been announced
<https://blog.google/topics/google-cloud/investing-vibrant-google-cloud-ecosystem-new-programs-and-partnerships/>
as
one of the top global Google Cloud Machine Learning partners.

Re: Python SDK: BigTableIO

Posted by Stephen Sisk <si...@google.com.INVALID>.

Hey Matthias,

To add on to what Chamikara mentioned, we have lots of info in the generic
IO authoring guide [1], the Python IO authoring guide [2] and the
PTransform Style Guide [3]. The PTransform style guide doesn't sound like
it applies, but it has a lot of specific tips from lessons we've learned in
the past from I/O work.

If you plan on contributing it back to the community, I'd also suggest
opening up a JIRA issue & updating the Beam website (e.g. [4]) to say that
you're working on this (those steps are pretty trivial).

We've recently been trying out using branches when we add new I/Os, since
the PRs tend to get bigger than we like for a single PR.

Please feel free to email the dev mailing list if you have questions! We
are excited and happy to help out with thinking about design etc. (e.g., as
Cham hinted at, should you use a Source or regular ParDo transforms?)

S

[1] https://beam.apache.org/documentation/io/authoring-overview/
[2] https://beam.apache.org/documentation/sdks/python-custom-io/
[3] https://beam.apache.org/contribute/ptransform-style-guide/
[4] https://github.com/apache/beam-site/pull/250

On Sun, May 28, 2017 at 5:32 PM Chamikara Jayalath <ch...@apache.org>
wrote:

> Thanks for offering to help. I would suggest to look into existing Java
> BigTableIO connector and currently available Python client library for
> Cloud BigTable to see how feasible it is to develop an efficient BigTable
> connector at this point. From Python SDK's perspective you can use
> iobase.BoundedSource API (wrapped by a PTransform) to develop a read
> PTransform with support for dynamic/static splitting. Sinks are usually
> developed as PTransforms (iobase.Sink interface is deprecated so I suggest
> not to use that). I would be happy to review any PRs related to this.
>
> Thanks,
> Cham

Re: Python SDK: BigTableIO

Posted by Chamikara Jayalath <ch...@apache.org>.

Thanks for offering to help. I would suggest looking into the existing Java
BigTableIO connector and the currently available Python client library for
Cloud BigTable to see how feasible it is to develop an efficient BigTable
connector at this point. From the Python SDK's perspective, you can use the
iobase.BoundedSource API (wrapped by a PTransform) to develop a read
PTransform with support for dynamic/static splitting. Sinks are usually
developed as PTransforms (the iobase.Sink interface is deprecated, so I
suggest not using it). I would be happy to review any PRs related to this.
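
To make the static splitting part concrete, here is a rough sketch of how a
row-key space could be cut into bundles of roughly a desired size. It is an
illustration only and makes several assumptions: KeyRange, split_by_samples and
the (row_key, cumulative_offset_bytes) sample format are hypothetical names
chosen for this example, and the code calls neither Beam nor the Bigtable
client. In a real connector, logic along these lines would presumably sit
behind the split method of a BoundedSource subclass.

```python
from typing import Iterator, NamedTuple, Sequence, Tuple


class KeyRange(NamedTuple):
    """A half-open row-key range [start, stop); b"" means unbounded."""
    start: bytes
    stop: bytes
    size_bytes: int  # estimated size, derived from the samples only


def split_by_samples(samples: Sequence[Tuple[bytes, int]],
                     desired_bundle_size: int) -> Iterator[KeyRange]:
    """Group consecutive sampled key ranges into bundles.

    `samples` is an assumed input: (row_key, cumulative_offset_bytes) pairs
    sorted by row key. A bundle is closed as soon as it covers at least
    `desired_bundle_size` bytes.
    """
    start_key = b""
    start_offset = 0
    for key, offset in samples:
        if offset - start_offset >= desired_bundle_size:
            yield KeyRange(start_key, key, offset - start_offset)
            start_key, start_offset = key, offset
    # The last bundle is open-ended so that rows after the final sample
    # are still covered; its size is only what the samples account for.
    last_offset = samples[-1][1] if samples else 0
    yield KeyRange(start_key, b"", last_offset - start_offset)
```

The bundles are contiguous (each one starts where the previous one stops), so
together they cover the whole key space exactly once. Dynamic splitting is a
separate concern and is not shown here.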

Thanks,
Cham

On Sun, May 28, 2017 at 2:30 AM Matthias Baetens <
matthias.baetens@datatonic.com> wrote:

> Hey guys,
>
> We have been using Beam for quite a few months now, so we (my colleague
> Robert & I) thought it might be cool to contribute a bit as well.
>
> The challenge we want to take up is writing the BigTableIO for the Python
> SDK (which is not yet in the works according to the website
> <https://github.com/apache/beam-site/blob/asf-site/src/documentation/io/built-in.md>).
> I have searched JIRA for the BigTableIO issue and did not find it, so I
> suppose this is the first step we take.
>
> Any pointers or feedback more than welcome!
>
> Best,
>
> Matthias
>