Posted to dev@beam.apache.org by Stephen Sisk <si...@google.com.INVALID> on 2017/04/04 02:22:32 UTC

IO ITs: Hosting Docker images

Summary:

For IO ITs that use data stores that need custom docker images in order to
run, we can't currently use them in a kubernetes cluster (which is where we
host our data stores.) I have a couple options for how to solve this and am
looking for feedback from folks involved in creating IO ITs/opinions on
kubernetes.


Details:

We've discussed in the past that we'll want to allow developers to submit
just a dockerfile, and then we'll use that when creating the data store on
kubernetes. This is the case for ElasticsearchIO and I assume more data
stores in the future will want to do this. It's also looking like it'll be
necessary to use custom docker images for the HadoopInputFormatIO's
cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
image you can use out of the box.

In either case, in order to retrieve a docker image, kubernetes needs a
container registry - it will read the docker images from there. A simple
private container registry doesn't work because kubernetes config files are
static - this means that if local devs try to use the kubernetes files,
they point at the private container registry and they wouldn't be able to
retrieve the images since they don't have access. They'd have to manually
edit the files, which in theory is an option, but I don't consider that to
be acceptable since it feels pretty unfriendly (it is simple, so if we
really don't like the below options we can revisit it.)

Quick summary of the options

=======================

We can:

* Start using something like k8 helm - this adds more dependencies, adds a
small amount of complexity (this is my recommendation, but only by a little)

* Start pushing images to docker hub - this means they'll be publicly
visible and raises the bar for maintenance of those images

* Host our own public container registry - this means running our own
public service with costs, etc.

Below are detailed discussions of these options. You can skip to the "My
thoughts on this" section if you're not interested in the details.


1. Templated kubernetes images

=========================

Kubernetes (k8) does not currently have built-in support for parameterizing
scripts - there's an issue open for this [1], but it doesn't seem to be
very active.

There are tools like Kubernetes helm that allow users to specify parameters
when running their kubernetes scripts. They also enable a lot more (they're
probably closer to a package manager like apt-get) - see this
description[3] for an overview.

I'm open to other options besides helm, but it seems to be the officially
supported one.

How the world would look using helm:

* When developing an IO IT, someone (either the developer or one of us),
would need to create a chart (the name for the helm script) - it's
basically another set of config files but in theory is as simple as a
couple metadata files plus a templatized version of a regular k8 script.
This should be trivial compared to the task of creating a k8 script.

*  When creating an instance of a data store, the developer (or the beam CI
server) would first build the docker image for the data store and push to
their container registry, then run a command like `helm install -f
mydb.yaml --set imageRepo=1.2.3.4`

* When done running tests/developing/etc., the developer/beam CI server
would run `helm delete -f mydb.yaml`
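
To illustrate the kind of parameterization that `--set imageRepo=...` would
give us, here is a minimal sketch in Python. It is not helm and does not use
helm's template syntax; it uses `string.Template` from the standard library
only to show the idea of a k8 script whose image registry is filled in at run
time. The manifest fragment, the `mydb` name and the `render_manifest` helper
are assumptions made up for illustration.

```python
from string import Template

# Hypothetical manifest fragment; "mydb" and "imageRepo" are illustrative names.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: mydb
spec:
  containers:
  - name: mydb
    image: ${imageRepo}/mydb:latest
""")


def render_manifest(image_repo: str) -> str:
    """Return the manifest text with the registry address filled in."""
    return MANIFEST_TEMPLATE.substitute(imageRepo=image_repo)


if __name__ == "__main__":
    # Each developer (or the CI server) passes their own registry address.
    print(render_manifest("1.2.3.4"))
```

The point is only that the checked-in file no longer hard-codes a registry
that some developers can't reach; a real chart would express the same thing
in helm's own templating.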

Upsides:

* Something like helm is pretty interesting - we talked about it as an
upside and something we wanted to do when we talked about using kubernetes

* We pick up a set of working kubernetes scripts this way. The full list is
at [2], but some ones that stood out: mongodb, memcached, mysql, postgres,
redis, elasticsearch (incubating), kafka (incubating), zookeeper
(incubating) - this could speed development

Downsides:

* Adds an additional dependency to run our ITs (helm or another k8
templating tool)

* Requires people to build their own images and run a container registry if
they don't already have one. It will not surprise you that there's a docker
image for running the registry [0], so it's not crazy. :) I *think* this
will probably just be a simple one/two line command once we have it
scripted.

* Helm in particular is kind of heavyweight for what we really need - it
requires running a service in the k8 cluster and adds additional complexity.

* Adds to the complexity of creating a new kubernetes script. Until I've
tried it, I can't really speak to the complexity, but taking a look at the
instructions [4], it doesn't seem too bad.
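
As a rough sketch of the workflow this option implies (run a local registry,
build and push the image, then install with the registry address), the
following Python only assembles the commands and prints them; it does not run
anything. The registry port 5000, the `mydb` image name and the
`plan_commands` helper are assumptions for illustration. The `helm install`
line is copied from the example earlier in this message and has not been
tried.

```python
import shlex
from typing import List


def plan_commands(registry: str, image: str, dockerfile_dir: str) -> List[List[str]]:
    """Build (but do not run) the commands for the helm-based workflow."""
    ref = f"{registry}/{image}"
    return [
        # Start a local registry from the official "registry" image [0].
        ["docker", "run", "-d", "-p", "5000:5000", "--name", "registry", "registry:2"],
        # Build the data store image and push it to that registry.
        ["docker", "build", "-t", ref, dockerfile_dir],
        ["docker", "push", ref],
        # Command shape taken from this message; imageRepo is a hypothetical chart value.
        ["helm", "install", "-f", "mydb.yaml", "--set", f"imageRepo={registry}"],
    ]


if __name__ == "__main__":
    for cmd in plan_commands("localhost:5000", "mydb", "."):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```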




2. Push images to docker hub

=======================

This requires that users push images that we want to use to docker hub, and
then our IO ITs will rely on that. I think the developer of the dockerfile
should be responsible for the image - having the beam project responsible
for a publicly available artifact (like the docker images) outside of our
core deliverables doesn't seem like the right move.

We would still retain a copy of the source dockerfiles and could regenerate
the images at any time, so I'm not concerned about a scenario where docker
hub went away (it would be pretty simple to switch to another repo - just
change some config files.)

For someone running the k8 scripts (i.e., running the IO ITs), this is pretty
easy - they just run the k8 script like they do today.

For someone creating the k8 scripts (i.e., creating the IO ITs), this is more
complex - either they or we have to push the image to docker hub and make
sure it's up to date, etc.


Upsides:

* No additional complexity for IO IT runners.

Downsides:

* Higher bar for creating the image in the first place - someone has to
maintain the publicly available docker hub image.

* It seems weird to have a custom docker image up on docker hub - maybe
that's common, but if we need specific changes to images for our needs, I'd
prefer it be private.


3. Run our own *public* container registry

==============================================

We would run a beam-specific container registry service - it would be used
by the apache beam CI servers, but it would also be available for use by
anyone running beam IO ITs on their local dev setup.

From an IO IT creator's perspective, this would look pretty similar to how
things are now - they just check in a dockerfile. For someone running the
k8 scripts, they similarly don't need to think about it.

Upsides:

* We're not adding any additional complexity for the end developer

Downsides:

* Have to keep docker registry software up to date

* The service is a single point of failure for any beam devs running IO ITs

* It can incur costs, etc… As an open source project, it doesn't seem great
for us to be running a public service.



My thoughts on this

===============

In spite of the additional complexity, I think using k8 helm is probably
the best option. The general goal behind the IO ITs has been to keep
ourselves self-contained: avoid having centralized infrastructure for those
running the ITs. Helm is a good match for those criteria. I will admit that
I find the additional dependencies/complexity to be worrisome. However, I
really like the idea of picking up additional data store configs for free -
if we were doing this in 5 years, we'd say "we should just use the
ecosystem of helm charts" and go from there.

I do think that pushing images to docker hub is a viable option, and if the
community is more excited to do that/wants to push the images there, I'd
support it. I can see how folks would be hesitant. I would like for the
developer of the docker file to do

Of the 3 options, I would strongly push back against running a public
container registry - I would not want to administer it, and I don't think
we as a project want to be paying for the costs associated with it.

Next steps

=========

Let me know what you think! This is definitely a topic where understanding
what the community of IO devs wants is helpful. As we discuss, I'll
probably spend a little time exploring helm since I want to play around
with it and understand if there are other drawbacks. I ran into this
question while working on getting the HIFIO cassandra cluster running, so I
might prototype with that.

I'll create a JIRA for this in the next day or so.

Stephen



[0] docker registry container - https://hub.docker.com/_/registry/

[1] kubernetes issue open for supporting templates -
https://github.com/kubernetes/kubernetes/issues/23896

[2] set of available charts - https://github.com/kubernetes/charts

[3] kubernetes helm introduction -
https://deis.com/blog/2015/introducing-helm-for-kubernetes/

[4] kubernetes charts instructions -
https://github.com/kubernetes/helm/blob/master/docs/charts.md

Re: IO ITs: Hosting Docker images

Posted by Stephen Sisk <si...@google.com.INVALID>.

Can you expand a bit on why you'd like to do that vs the other options?

One of the downsides of the "we create our own docker images" plan is that
we have to maintain them. How do you want to do that?

To be clear: I'm asking because I'm trying to figure out what the community
wants, and right now I'm just hearing "I want X". It's most helpful for me
to hear "I want X because of reasons A, B, and C".

S

On Mon, Apr 10, 2017 at 9:59 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Agree it's what I said in a previous email.
>
> Regards
> JB
>
> On Apr 10, 2017, 18:58, at 18:58, Ekrem Aksoy <ek...@gmail.com> wrote:
> >Hi Stephen,
> >
> >Can we piggyback on current Apache Docker Hub account? I think images can
> >be held there, too.
> >
> >-E
> >
> >On Mon, Apr 10, 2017 at 5:22 PM, Stephen Sisk <si...@google.com.invalid> wrote:
> >
> >> for 4 - there's a number of logistics involved. How do you propose handling
> >> cost, potential DOS, etc? People in different timezones would need to be
> >> oncall for it since it impacts people's ability to do dev work (or they need
> >> to be okay if it goes out.) Can you give some reasons why you think it's
> >> better than the other options? I put it on the list, but I'm strongly not a
> >> fan.
> >>
> >> S
> >>
> >> On Sat, Apr 8, 2017 at 5:31 AM Ted Yu <yu...@gmail.com> wrote:
> >>
> >> > +1
> >> >
> >> > > On Apr 7, 2017, at 10:46 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> >> > >
> >> > > Hi Stephen,
> >> > >
> >> > > I think we should go to 1 and 4:
> >> > >
> >> > > 1. Try to use existing images providing what we need. If we don't find
> >> > > existing image, we can always ask and help other community to provide so.
> >> > > 4. If we don't find a suitable image, and waiting for this image, we can
> >> > > store the image in our own "IT dockerhub".
> >> > >
> >> > > Regards
> >> > > JB
> >> > >
> >> > >> On 04/08/2017 01:03 AM, Stephen Sisk wrote:
> >> > >> Wanted to see if anyone else had opinions on this/provide a quick update.
> >> > >>
> >> > >> I think for both elasticsearch and HIFIO that we can find existing,
> >> > >> supported images that could serve those purposes - HIFIO is looking like
> >> > >> it'll be able to do so for cassandra, which was proving tricky.
> >> > >>
> >> > >> So to summarize my current proposed solutions: (ordered by my preference)
> >> > >> 1. (new) Strongly urge people to find existing docker images that meet our
> >> > >> image criteria - regularly updated/security checked
> >> > >> 2. Start using helm
> >> > >> 3. Push our docker images to docker hub
> >> > >> 4. Host our own public container registry
> >> > >>
> >> > >> S
> >> > >>
> >> > >>> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <si...@google.com> wrote:
> >> > >>>
> >> > >>> I'd like to hear what direction folks want to go in, and from there look
> >> > >>> at the options. I think for some of these options (like running our own
> >> > >>> public registry), they may be able to and it's something we should look at,
> >> > >>> but I don't assume they have time to work on this type of issue.
> >> > >>>
> >> > >>> S
> >> > >>>
> >> > >>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lcwik@google.com.invalid> wrote:
> >> > >>>
> >> > >>> Is this something that Apache infra could help us with?
> >> > >>>
> >> > >
> >> > > --
> >> > > Jean-Baptiste Onofré
> >> > > jbonofre@apache.org
> >> > > http://blog.nanthrax.net
> >> > > Talend - http://www.talend.com
> >> >
> >>
>

Re: IO ITs: Hosting Docker images

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Agree it's what I said in a previous email.

Regards
JB

On Apr 10, 2017, 18:58, at 18:58, Ekrem Aksoy <ek...@gmail.com> wrote:
>Hi Stephen,
>
>Can we piggyback on current Apache Docker Hub account? I think images can
>be held there, too.
>
>-E
>> > >>>>
>> > >>>> =======================
>> > >>>>
>> > >>>> This requires that users push images that we want to use to
>docker
>> > hub,
>> > >>> and
>> > >>>> then our IO ITs will rely on that. I  think the developer of
>the
>> > >>> dockerfile
>> > >>>> should be responsible for the image - having the beam project
>> > responsible
>> > >>>> for a publicly available artifact (like the docker images)
>outside
>> of
>> > our
>> > >>>> core deliverables doesn't seem like the right move.
>> > >>>>
>> > >>>> We would still retain a copy of the source dockerfiles and
>could
>> > >>> regenerate
>> > >>>> the images at any time, so I'm not concerned about a scenario
>where
>> > >>> docker
>> > >>>> hub went away (it would be pretty simple to switch to another
>repo -
>> > just
>> > >>>> change some config files.)
>> > >>>>
>> > >>>> For someone running the k8 scripts (ie, running the IO ITs),
>this is
>> > >>> pretty
>> > >>>> easy - they just run the k8 script like they do today.
>> > >>>>
>> > >>>> For someone creating the k8 scripts (ie, creating the IO ITs),
>this
>> is
>> > >>> more
>> > >>>> complex - either they or we have to push this to docker hub
>and make
>> > sure
>> > >>>> it's up to date, etc..
>> > >>>>
>> > >>>>
>> > >>>> Upsides:
>> > >>>>
>> > >>>> * No additional complexity for IO IT runners.
>> > >>>>
>> > >>>> Downsides:
>> > >>>>
>> > >>>> * Higher bar for creating the image in the first place -
>someone has
>> > to
>> > >>>> maintain the publicly available docker hub image.
>> > >>>>
>> > >>>> * It seems weird to have a custom docker image up on docker
>hub -
>> > maybe
>> > >>>> that's common, but if we need specific changes to images for
>our
>> > needs,
>> > >>> I'd
>> > >>>> prefer it be private.
>> > >>>>
>> > >>>>
>> > >>>> 3. Run our own *public* container registry
>> > >>>>
>> > >>>> ==============================================
>> > >>>>
>> > >>>> We would run a beam-specific container registry service - it
>would
>> be
>> > >>> used
>> > >>>> by the apache beam CI servers, but it would also be available
>for
>> use
>> > by
>> > >>>> anyone running beam IO ITs on their local dev setup.
>> > >>>>
>> > >>>> From a IO IT creator's perspective, this would look pretty
>similar
>> to
>> > how
>> > >>>> things are now - they just check in a dockerfile. For someone
>> running
>> > the
>> > >>>> k8 scripts, they similarly don't need to think about it.
>> > >>>>
>> > >>>> Upsides:
>> > >>>>
>> > >>>> * we're not adding any additional complexity for end developer
>> > >>>>
>> > >>>> Downsides:
>> > >>>>
>> > >>>> * Have to keep docker registry software up to date
>> > >>>>
>> > >>>> * The service is a single of failure for any beam devs running
>IO
>> ITs
>> > >>>>
>> > >>>> * It can incur costs, etc… As an open source project, it
>doesn't
>> seem
>> > >>> great
>> > >>>> for us to be running a public service.
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> My thoughts on this
>> > >>>>
>> > >>>> ===============
>> > >>>>
>> > >>>> In spite of the additional complexity, I think using k8 helm
>is
>> > probably
>> > >>>> the best option. The general goal behind the IO ITs has been
>to keep
>> > >>>> ourselves self-contained: avoid having centralized
>infrastructure
>> for
>> > >>> those
>> > >>>> running the ITs. Helm is a good match for those criteria. I
>will
>> admit
>> > >>> that
>> > >>>> I find the additional dependencies/complexity to be worrisome.
>> > However, I
>> > >>>> really like the idea of picking up additional data store
>configs for
>> > >>> free -
>> > >>>> if we were doing this in 5 years, we'd say "we should just use
>the
>> > >>>> ecosystem of helm charts" and go from there.
>> > >>>>
>> > >>>> I do think that pushing images to docker hub is a viable
>option, and
>> > if
>> > >>> the
>> > >>>> community is more excited to do that/wants to push the images
>there,
>> > I'd
>> > >>>> support it. I can see how folks would be hesitant. I would
>like for
>> > the
>> > >>>> developer of the docker file to do
>> > >>>>
>> > >>>> Of the 3 options, I would strongly push back against running a
>> public
>> > >>>> container registry - I would not want to administer it, and I
>don't
>> > think
>> > >>>> we as a project want to be paying for the costs associated
>with it.
>> > >>>>
>> > >>>> Next steps
>> > >>>>
>> > >>>> =========
>> > >>>>
>> > >>>> Let me know what you think! This is definitely a topic where
>> > >>> understanding
>> > >>>> what the community of IO devs wants is helpful. As we discuss,
>I'll
>> > >>>> probably spend a little time exploring helm since I want to
>play
>> > around
>> > >>>> with it and understand if there are other drawbacks. I ran
>into this
>> > >>>> question while working on getting the HIFIO cassandra cluster
>> running,
>> > >>> so I
>> > >>>> might prototype with that.
>> > >>>>
>> > >>>> I'll create JIRA for this in the next day or so.
>> > >>>>
>> > >>>> Stephen
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> [0] docker registry container -
>https://hub.docker.com/_/registry/
>> > >>>>
>> > >>>> [1] kubernetes issue open for supporting templates -
>> > >>>> https://github.com/kubernetes/kubernetes/issues/23896
>> > >>>>
>> > >>>> [2] set of available charts -
>https://github.com/kubernetes/charts
>> > >>>>
>> > >>>> [3] kubernetes helm introduction -
>> > >>>> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
>> > >>>> [4] kubernetes charts instructions -
>> > >>>> https://github.com/kubernetes/helm/blob/master/docs/charts.md
>> > >
>> > > --
>> > > Jean-Baptiste Onofré
>> > > jbonofre@apache.org
>> > > http://blog.nanthrax.net
>> > > Talend - http://www.talend.com
>> >
>>

Re: IO ITs: Hosting Docker images

Posted by Ekrem Aksoy <ek...@gmail.com>.

Hi Stephen,

Can we piggyback on the current Apache Docker Hub account? I think the images
can be held there, too.

-E

On Mon, Apr 10, 2017 at 5:22 PM, Stephen Sisk <si...@google.com.invalid>
wrote:

> For 4 - there are a number of logistics involved. How do you propose handling
> cost, potential DoS, etc.? People in different timezones would need to be
> on call for it, since an outage impacts people's ability to do dev work (or
> they need to be okay with it going down). Can you give some reasons why you
> think it's better than the other options? I put it on the list, but I'm
> strongly not a fan.
>
> S
>

Re: IO ITs: Hosting Docker images

Posted by Stephen Sisk <si...@google.com.INVALID>.

For 4 - there are a number of logistics involved. How do you propose handling
cost, potential DoS, etc.? People in different timezones would need to be
on call for it, since an outage impacts people's ability to do dev work (or
they need to be okay with it going down). Can you give some reasons why you
think it's better than the other options? I put it on the list, but I'm
strongly not a fan.

S

On Sat, Apr 8, 2017 at 5:31 AM Ted Yu <yu...@gmail.com> wrote:

> +1
>
> > On Apr 7, 2017, at 10:46 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > Hi Stephen,
> >
> > I think we should go to 1 and 4:
> >
> > 1. Try to use existing images providing what we need. If we don't find
> > existing image, we can always ask and help other community to provide so.
> > 4. If we don't find a suitable image, and waiting for this image, we can
> > store the image in our own "IT dockerhub".
> >
> > Regards
> > JB
> >
> >> On 04/08/2017 01:03 AM, Stephen Sisk wrote:
> >> Wanted to see if anyone else had opinions on this/provide a quick
> >> update.
> >>
> >> I think for both elasticsearch and HIFIO that we can find existing,
> >> supported images that could serve those purposes - HIFIO is looking like
> >> it'll be able to do so for cassandra, which was proving tricky.
> >>
> >> So to summarize my current proposed solutions: (ordered by my
> >> preference)
> >> 1. (new) Strongly urge people to find existing docker images that meet
> >> our image criteria - regularly updated/security checked
> >> 2. Start using helm
> >> 3. Push our docker images to docker hub
> >> 4. Host our own public container registry
> >>
> >> S
> >>
> >>> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <si...@google.com> wrote:
> >>>
> >>> I'd like to hear what direction folks want to go in, and from there
> >>> look at the options. I think for some of these options (like running
> >>> our own public registry), they may be able to and it's something we
> >>> should look at, but I don't assume they have time to work on this type
> >>> of issue.
> >>>
> >>> S
> >>>
> >>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
> >>> wrote:
> >>>
> >>> Is this something that Apache infra could help us with?
> >>>
> >>> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <si...@google.com.invalid>
> >>> wrote:
> >>>
> >>>> Summary:
> >>>>
> >>>> For IO ITs that use data stores that need custom docker images in
> order
> >>> to
> >>>> run, we can't currently use them in a kubernetes cluster (which is
> where
> >>> we
> >>>> host our data stores.) I have a couple options for how to solve this
> and
> >>> am
> >>>> looking for feedback from folks involved in creating IO ITs/opinions
> on
> >>>> kubernetes.
> >>>>
> >>>>
> >>>> Details:
> >>>>
> >>>> We've discussed in the past that we'll want to allow developers to
> submit
> >>>> just a dockerfile, and then we'll use that when creating the data
> store
> >>> on
> >>>> kubernetes. This is the case for ElasticsearchIO and I assume more
> data
> >>>> stores in the future will want to do this. It's also looking like
> it'll
> >>> be
> >>>> necessary to use custom docker images for the HadoopInputFormatIO's
> >>>> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a
> >>> good
> >>>> image you can use out of the box.
> >>>>
> >>>> In either case, in order to retrieve a docker image, kubernetes needs
> a
> >>>> container registry - it will read the docker images from there. A
> simple
> >>>> private container registry doesn't work because kubernetes config
> files
> >>> are
> >>>> static - this means that if local devs try to use the kubernetes
> files,
> >>>> they point at the private container registry and they wouldn't be
> able to
> >>>> retrieve the images since they don't have access. They'd have to
> manually
> >>>> edit the files, which in theory is an option, but I don't consider
> that
> >>> to
> >>>> be acceptable since it feels pretty unfriendly (it is simple, so if we
> >>>> really don't like the below options we can revisit it.)
> >>>>
> >>>> Quick summary of the options
> >>>>
> >>>> =======================
> >>>>
> >>>> We can:
> >>>>
> >>>> * Start using something like k8 helm - this adds more dependencies,
> adds
> >>> a
> >>>> small amount of complexity (this is my recommendation, but only by a
> >>>> little)
> >>>>
> >>>> * Start pushing images to docker hub - this means they'll be publicly
> >>>> visible and raises the bar for maintenance of those images
> >>>>
> >>>> * Host our own public container registry - this means running our own
> >>>> public service with costs, etc..
> >>>>
> >>>> Below are detailed discussions of these options. You can skip to the
> "My
> >>>> thoughts on this" section if you're not interested in the details.
> >>>>
> >>>>
> >>>> 1. Templated kubernetes images
> >>>>
> >>>> =========================
> >>>>
> >>>> Kubernetes (k8) does not currently have built in support for
> >>> parameterizing
> >>>> scripts - there's an issues open for this[1], but it doesn't seem to
> be
> >>>> very active.
> >>>>
> >>>> There are tools like Kubernetes helm that allow users to specify
> >>> parameters
> >>>> when running their kubernetes scripts. They also enable a lot more
> >>> (they're
> >>>> probably closer to a package manager like apt-get) - see this
> >>>> description[3] for an overview.
> >>>>
> >>>> I'm open to other options besides helm, but it seems to be the
> officially
> >>>> supported one.
> >>>>
> >>>> How the world would look using helm:
> >>>>
> >>>> * When developing an IO IT, someone (either the developer or one of
> us),
> >>>> would need to create a chart (the name for the helm script) - it's
> >>>> basically another set of config files but in theory is as simple as a
> >>>> couple metadata files plus a templatized version of a regular k8
> script.
> >>>> This should be trivial compared to the task of creating a k8 script.
> >>>>
> >>>> *  When creating an instance of a data store, the developer (or the
> beam
> >>> CI
> >>>> server) would first build the docker image for the data store and
> push to
> >>>> their container registry, then run a command like `helm install -f
> >>>> mydb.yaml --set imageRepo=1.2.3.4`
> >>>>
> >>>> * when done running tests/developing/etc…  the developer/beam CI
> server
> >>>> would run `helm delete -f mydb.yaml`
> >>>>
> >>>> Upsides:
> >>>>
> >>>> * Something like helm is pretty interesting - we talked about it as an
> >>>> upside and something we wanted to do when we talked about using
> >>> kubernetes
> >>>>
> >>>> * We pick up a set of working kubernetes scripts this way. The full
> list
> >>> is
> >>>> at [2], but some ones that stood out: mongodb, memcached, mysql,
> >>> postgres,
> >>>> redis, elasticsearch (incubating), kafka (incubating), zookeeper
> >>>> (incubating) - this could speed development
> >>>>
> >>>> Downsides:
> >>>>
> >>>> * Adds an additional dependency to run our ITs (helm or another k8
> >>>> templating tool)
> >>>>
> >>>> * Requires people to build their own images run a container registry
> if
> >>>> they don't already have one (it will not surprise you that there's a
> >>> docker
> >>>> image for running the registry [0] - so it's not crazy. :) I *think*
> this
> >>>> will probably just be a simple one/two line command once we have it
> >>>> scripted.
> >>>>
> >>>> * Helm in particular is kind of heavyweight for what we really need -
> it
> >>>> requires running a service in the k8 cluster and adds additional
> >>>> complexity.
> >>>>
> >>>> * Adds to the complexity of creating a new kubernetes script. Until
> I've
> >>>> tried it, I can't really speak to the complexity, but taking a look at
> >>> the
> >>>> instructions [4], it doesn't seem too bad.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2. Push images to docker hub
> >>>>
> >>>> =======================
> >>>>
> >>>> This requires that users push images that we want to use to docker
> hub,
> >>> and
> >>>> then our IO ITs will rely on that. I  think the developer of the
> >>> dockerfile
> >>>> should be responsible for the image - having the beam project
> responsible
> >>>> for a publicly available artifact (like the docker images) outside of
> our
> >>>> core deliverables doesn't seem like the right move.
> >>>>
> >>>> We would still retain a copy of the source dockerfiles and could
> >>> regenerate
> >>>> the images at any time, so I'm not concerned about a scenario where
> >>> docker
> >>>> hub went away (it would be pretty simple to switch to another repo -
> just
> >>>> change some config files.)
> >>>>
> >>>> For someone running the k8 scripts (ie, running the IO ITs), this is
> >>> pretty
> >>>> easy - they just run the k8 script like they do today.
> >>>>
> >>>> For someone creating the k8 scripts (ie, creating the IO ITs), this is
> >>> more
> >>>> complex - either they or we have to push this to docker hub and make
> sure
> >>>> it's up to date, etc..
> >>>>
> >>>>
> >>>> Upsides:
> >>>>
> >>>> * No additional complexity for IO IT runners.
> >>>>
> >>>> Downsides:
> >>>>
> >>>> * Higher bar for creating the image in the first place - someone has
> to
> >>>> maintain the publicly available docker hub image.
> >>>>
> >>>> * It seems weird to have a custom docker image up on docker hub -
> maybe
> >>>> that's common, but if we need specific changes to images for our
> needs,
> >>> I'd
> >>>> prefer it be private.
> >>>>
> >>>>
> >>>> 3. Run our own *public* container registry
> >>>>
> >>>> ==============================================
> >>>>
> >>>> We would run a beam-specific container registry service - it would be
> >>> used
> >>>> by the apache beam CI servers, but it would also be available for use
> by
> >>>> anyone running beam IO ITs on their local dev setup.
> >>>>
> >>>> From a IO IT creator's perspective, this would look pretty similar to
> how
> >>>> things are now - they just check in a dockerfile. For someone running
> the
> >>>> k8 scripts, they similarly don't need to think about it.
> >>>>
> >>>> Upsides:
> >>>>
> >>>> * we're not adding any additional complexity for end developer
> >>>>
> >>>> Downsides:
> >>>>
> >>>> * Have to keep docker registry software up to date
> >>>>
> >>>> * The service is a single of failure for any beam devs running IO ITs
> >>>>
> >>>> * It can incur costs, etc… As an open source project, it doesn't seem
> >>> great
> >>>> for us to be running a public service.
> >>>>
> >>>>
> >>>>
> >>>> My thoughts on this
> >>>>
> >>>> ===============
> >>>>
> >>>> In spite of the additional complexity, I think using k8 helm is
> probably
> >>>> the best option. The general goal behind the IO ITs has been to keep
> >>>> ourselves self-contained: avoid having centralized infrastructure for
> >>> those
> >>>> running the ITs. Helm is a good match for those criteria. I will admit
> >>> that
> >>>> I find the additional dependencies/complexity to be worrisome.
> However, I
> >>>> really like the idea of picking up additional data store configs for
> >>> free -
> >>>> if we were doing this in 5 years, we'd say "we should just use the
> >>>> ecosystem of helm charts" and go from there.
> >>>>
> >>>> I do think that pushing images to docker hub is a viable option, and
> if
> >>> the
> >>>> community is more excited to do that/wants to push the images there,
> I'd
> >>>> support it. I can see how folks would be hesitant. I would like for
> the
> >>>> developer of the docker file to do
> >>>>
> >>>> Of the 3 options, I would strongly push back against running a public
> >>>> container registry - I would not want to administer it, and I don't
> think
> >>>> we as a project want to be paying for the costs associated with it.
> >>>>
> >>>> Next steps
> >>>>
> >>>> =========
> >>>>
> >>>> Let me know what you think! This is definitely a topic where
> >>> understanding
> >>>> what the community of IO devs wants is helpful. As we discuss, I'll
> >>>> probably spend a little time exploring helm since I want to play
> around
> >>>> with it and understand if there are other drawbacks. I ran into this
> >>>> question while working on getting the HIFIO cassandra cluster running,
> >>> so I
> >>>> might prototype with that.
> >>>>
> >>>> I'll create a JIRA issue for this in the next day or so.
> >>>>
> >>>> Stephen
> >>>>
> >>>>
> >>>>
> >>>> [0] docker registry container - https://hub.docker.com/_/registry/
> >>>>
> >>>> [1] kubernetes issue open for supporting templates -
> >>>> https://github.com/kubernetes/kubernetes/issues/23896
> >>>>
> >>>> [2] set of available charts - https://github.com/kubernetes/charts
> >>>>
> >>>> [3] kubernetes helm introduction -
> >>>> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> >>>> [4] kubernetes charts instructions -
> >>>> https://github.com/kubernetes/helm/blob/master/docs/charts.md
> >
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>

Re: IO ITs: Hosting Docker images

Posted by Ted Yu <yu...@gmail.com>.

+1

> On Apr 7, 2017, at 10:46 PM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> 
> Hi Stephen,
> 
> I think we should go with 1 and 4:
> 
> 1. Try to use existing images that provide what we need. If we don't find an existing image, we can always ask another community to provide one, and help them do so.
> 4. If we don't find a suitable image, then while waiting for one we can store the image in our own "IT dockerhub".
> 
> Regards
> JB
> 
>> On 04/08/2017 01:03 AM, Stephen Sisk wrote:
>> Wanted to see if anyone else had opinions on this, and to provide a quick update.
>> 
>> I think for both elasticsearch and HIFIO that we can find existing,
>> supported images that could serve those purposes - HIFIO is looking like
>> it'll be able to do so for cassandra, which was proving tricky.
>> 
>> So to summarize my current proposed solutions: (ordered by my preference)
>> 1. (new) Strongly urge people to find existing docker images that meet our
>> image criteria - regularly updated/security checked
>> 2. Start using helm
>> 3. Push our docker images to docker hub
>> 4. Host our own public container registry
>> 
>> S
>> 
>>> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <si...@google.com> wrote:
>>> 
>>> I'd like to hear what direction folks want to go in, and from there look
>>> at the options. I think for some of these options (like running our own
>>> public registry), Apache infra may be able to help and it's something we should look at,
>>> but I don't assume they have time to work on this type of issue.
>>> 
>>> S
>>> 
>>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
>>> wrote:
>>> 
>>> Is this something that Apache infra could help us with?
>>> 
>>> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <si...@google.com.invalid>
>>> wrote:
>>> 
>>>> Summary:
>>>> 
>>>> For IO ITs that use data stores that need custom docker images in order
>>> to
>>>> run, we can't currently use them in a kubernetes cluster (which is where
>>> we
>>>> host our data stores.) I have a couple options for how to solve this and
>>> am
>>>> looking for feedback from folks involved in creating IO ITs/opinions on
>>>> kubernetes.
>>>> 
>>>> 
>>>> Details:
>>>> 
>>>> We've discussed in the past that we'll want to allow developers to submit
>>>> just a dockerfile, and then we'll use that when creating the data store
>>> on
>>>> kubernetes. This is the case for ElasticsearchIO and I assume more data
>>>> stores in the future will want to do this. It's also looking like it'll
>>> be
>>>> necessary to use custom docker images for the HadoopInputFormatIO's
>>>> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a
>>> good
>>>> image you can use out of the box.
>>>> 
>>>> In either case, in order to retrieve a docker image, kubernetes needs a
>>>> container registry - it will read the docker images from there. A simple
>>>> private container registry doesn't work because kubernetes config files
>>> are
>>>> static - this means that if local devs try to use the kubernetes files,
>>>> they point at the private container registry and they wouldn't be able to
>>>> retrieve the images since they don't have access. They'd have to manually
>>>> edit the files, which in theory is an option, but I don't consider that
>>> to
>>>> be acceptable since it feels pretty unfriendly (it is simple, so if we
>>>> really don't like the below options we can revisit it.)
>>>> 
>>>> Quick summary of the options
>>>> 
>>>> =======================
>>>> 
>>>> We can:
>>>> 
>>>> * Start using something like k8 helm - this adds more dependencies, adds
>>> a
>>>> small amount of complexity (this is my recommendation, but only by a
>>>> little)
>>>> 
>>>> * Start pushing images to docker hub - this means they'll be publicly
>>>> visible and raises the bar for maintenance of those images
>>>> 
>>>> * Host our own public container registry - this means running our own
>>>> public service with costs, etc..
>>>> 
>>>> Below are detailed discussions of these options. You can skip to the "My
>>>> thoughts on this" section if you're not interested in the details.
>>>> 
>>>> 
>>>> 1. Templated kubernetes images
>>>> 
>>>> =========================
>>>> 
>>>> Kubernetes (k8) does not currently have built in support for
>>> parameterizing
>>>> scripts - there's an issue open for this [1], but it doesn't seem to be
>>>> very active.
>>>> 
>>>> There are tools like Kubernetes helm that allow users to specify
>>> parameters
>>>> when running their kubernetes scripts. They also enable a lot more
>>> (they're
>>>> probably closer to a package manager like apt-get) - see this
>>>> description[3] for an overview.
>>>> 
>>>> I'm open to other options besides helm, but it seems to be the officially
>>>> supported one.
>>>> 
>>>> How the world would look using helm:
>>>> 
>>>> * When developing an IO IT, someone (either the developer or one of us),
>>>> would need to create a chart (the name for the helm script) - it's
>>>> basically another set of config files but in theory is as simple as a
>>>> couple metadata files plus a templatized version of a regular k8 script.
>>>> This should be trivial compared to the task of creating a k8 script.
>>>> 
>>>> *  When creating an instance of a data store, the developer (or the beam
>>> CI
>>>> server) would first build the docker image for the data store and push to
>>>> their container registry, then run a command like `helm install -f
>>>> mydb.yaml --set imageRepo=1.2.3.4`
>>>> 
>>>> * when done running tests/developing/etc…  the developer/beam CI server
>>>> would run `helm delete -f mydb.yaml`
>>>> 
>>>> Upsides:
>>>> 
>>>> * Something like helm is pretty interesting - we talked about it as an
>>>> upside and something we wanted to do when we talked about using
>>> kubernetes
>>>> 
>>>> * We pick up a set of working kubernetes scripts this way. The full list
>>> is
>>>> at [2], but some ones that stood out: mongodb, memcached, mysql,
>>> postgres,
>>>> redis, elasticsearch (incubating), kafka (incubating), zookeeper
>>>> (incubating) - this could speed development
>>>> 
>>>> Downsides:
>>>> 
>>>> * Adds an additional dependency to run our ITs (helm or another k8
>>>> templating tool)
>>>> 
>>>> * Requires people to build their own images and run a container registry if
>>>> they don't already have one (it will not surprise you that there's a
>>> docker
>>>> image for running the registry [0] - so it's not crazy. :) I *think* this
>>>> will probably just be a simple one/two line command once we have it
>>>> scripted.
>>>> 
>>>> * Helm in particular is kind of heavyweight for what we really need - it
>>>> requires running a service in the k8 cluster and adds additional
>>>> complexity.
>>>> 
>>>> * Adds to the complexity of creating a new kubernetes script. Until I've
>>>> tried it, I can't really speak to the complexity, but taking a look at
>>> the
>>>> instructions [4], it doesn't seem too bad.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 2. Push images to docker hub
>>>> 
>>>> =======================
>>>> 
>>>> This requires that users push images that we want to use to docker hub,
>>> and
>>>> then our IO ITs will rely on that. I  think the developer of the
>>> dockerfile
>>>> should be responsible for the image - having the beam project responsible
>>>> for a publicly available artifact (like the docker images) outside of our
>>>> core deliverables doesn't seem like the right move.
>>>> 
>>>> We would still retain a copy of the source dockerfiles and could
>>> regenerate
>>>> the images at any time, so I'm not concerned about a scenario where
>>> docker
>>>> hub went away (it would be pretty simple to switch to another repo - just
>>>> change some config files.)
>>>> 
>>>> For someone running the k8 scripts (ie, running the IO ITs), this is
>>> pretty
>>>> easy - they just run the k8 script like they do today.
>>>> 
>>>> For someone creating the k8 scripts (ie, creating the IO ITs), this is
>>> more
>>>> complex - either they or we have to push this to docker hub and make sure
>>>> it's up to date, etc..
>>>> 
>>>> 
>>>> Upsides:
>>>> 
>>>> * No additional complexity for IO IT runners.
>>>> 
>>>> Downsides:
>>>> 
>>>> * Higher bar for creating the image in the first place - someone has to
>>>> maintain the publicly available docker hub image.
>>>> 
>>>> * It seems weird to have a custom docker image up on docker hub - maybe
>>>> that's common, but if we need specific changes to images for our needs,
>>> I'd
>>>> prefer it be private.
>>>> 
>>>> 
>>>> 3. Run our own *public* container registry
>>>> 
>>>> ==============================================
>>>> 
>>>> We would run a beam-specific container registry service - it would be
>>> used
>>>> by the apache beam CI servers, but it would also be available for use by
>>>> anyone running beam IO ITs on their local dev setup.
>>>> 
>>>> From an IO IT creator's perspective, this would look pretty similar to how
>>>> things are now - they just check in a dockerfile. For someone running the
>>>> k8 scripts, they similarly don't need to think about it.
>>>> 
>>>> Upsides:
>>>> 
>>>> * we're not adding any additional complexity for end developers
>>>> 
>>>> Downsides:
>>>> 
>>>> * Have to keep docker registry software up to date
>>>> 
>>>> * The service is a single point of failure for any beam devs running IO ITs
>>>> 
>>>> * It can incur costs, etc… As an open source project, it doesn't seem
>>> great
>>>> for us to be running a public service.
>>>> 
>>>> 
>>>> 
>>>> My thoughts on this
>>>> 
>>>> ===============
>>>> 
>>>> In spite of the additional complexity, I think using k8 helm is probably
>>>> the best option. The general goal behind the IO ITs has been to keep
>>>> ourselves self-contained: avoid having centralized infrastructure for
>>> those
>>>> running the ITs. Helm is a good match for those criteria. I will admit
>>> that
>>>> I find the additional dependencies/complexity to be worrisome. However, I
>>>> really like the idea of picking up additional data store configs for
>>> free -
>>>> if we were doing this in 5 years, we'd say "we should just use the
>>>> ecosystem of helm charts" and go from there.
>>>> 
>>>> I do think that pushing images to docker hub is a viable option, and if
>>> the
>>>> community is more excited to do that/wants to push the images there, I'd
>>>> support it. I can see how folks would be hesitant. I would like for the
>>>> developer of the docker file to do
>>>> 
>>>> Of the 3 options, I would strongly push back against running a public
>>>> container registry - I would not want to administer it, and I don't think
>>>> we as a project want to be paying for the costs associated with it.
>>>> 
>>>> Next steps
>>>> 
>>>> =========
>>>> 
>>>> Let me know what you think! This is definitely a topic where
>>> understanding
>>>> what the community of IO devs wants is helpful. As we discuss, I'll
>>>> probably spend a little time exploring helm since I want to play around
>>>> with it and understand if there are other drawbacks. I ran into this
>>>> question while working on getting the HIFIO cassandra cluster running,
>>> so I
>>>> might prototype with that.
>>>> 
>>>> I'll create a JIRA issue for this in the next day or so.
>>>> 
>>>> Stephen
>>>> 
>>>> 
>>>> 
>>>> [0] docker registry container - https://hub.docker.com/_/registry/
>>>> 
>>>> [1] kubernetes issue open for supporting templates -
>>>> https://github.com/kubernetes/kubernetes/issues/23896
>>>> 
>>>> [2] set of available charts - https://github.com/kubernetes/charts
>>>> 
>>>> [3] kubernetes helm introduction -
>>>> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
>>>> [4] kubernetes charts instructions -
>>>> https://github.com/kubernetes/helm/blob/master/docs/charts.md
> 
> -- 
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

Re: IO ITs: Hosting Docker images

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.

Hi Stephen,

I think we should go with 1 and 4:

1. Try to use existing images that provide what we need. If we don't find an
existing image, we can always ask another community to provide one, and help
them do so.
4. If we don't find a suitable image, then while waiting for one we can store
the image in our own "IT dockerhub".

Regards
JB

On 04/08/2017 01:03 AM, Stephen Sisk wrote:
> Wanted to see if anyone else had opinions on this, and to provide a quick update.
>
> I think for both elasticsearch and HIFIO that we can find existing,
> supported images that could serve those purposes - HIFIO is looking like
> it'll be able to do so for cassandra, which was proving tricky.
>
> So to summarize my current proposed solutions: (ordered by my preference)
> 1. (new) Strongly urge people to find existing docker images that meet our
> image criteria - regularly updated/security checked
> 2. Start using helm
> 3. Push our docker images to docker hub
> 4. Host our own public container registry
>
> S
>
> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <si...@google.com> wrote:
>
>> I'd like to hear what direction folks want to go in, and from there look
>> at the options. I think for some of these options (like running our own
>> public registry), Apache infra may be able to help and it's something we should look at,
>> but I don't assume they have time to work on this type of issue.
>>
>> S
>>
>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
>> wrote:
>>
>> Is this something that Apache infra could help us with?
>>
>> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <si...@google.com.invalid>
>> wrote:
>>
>>> Summary:
>>>
>>> For IO ITs that use data stores that need custom docker images in order
>> to
>>> run, we can't currently use them in a kubernetes cluster (which is where
>> we
>>> host our data stores.) I have a couple options for how to solve this and
>> am
>>> looking for feedback from folks involved in creating IO ITs/opinions on
>>> kubernetes.
>>>
>>>
>>> Details:
>>>
>>> We've discussed in the past that we'll want to allow developers to submit
>>> just a dockerfile, and then we'll use that when creating the data store
>> on
>>> kubernetes. This is the case for ElasticsearchIO and I assume more data
>>> stores in the future will want to do this. It's also looking like it'll
>> be
>>> necessary to use custom docker images for the HadoopInputFormatIO's
>>> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a
>> good
>>> image you can use out of the box.
>>>
>>> In either case, in order to retrieve a docker image, kubernetes needs a
>>> container registry - it will read the docker images from there. A simple
>>> private container registry doesn't work because kubernetes config files
>> are
>>> static - this means that if local devs try to use the kubernetes files,
>>> they point at the private container registry and they wouldn't be able to
>>> retrieve the images since they don't have access. They'd have to manually
>>> edit the files, which in theory is an option, but I don't consider that
>> to
>>> be acceptable since it feels pretty unfriendly (it is simple, so if we
>>> really don't like the below options we can revisit it.)
>>>
>>> Quick summary of the options
>>>
>>> =======================
>>>
>>> We can:
>>>
>>> * Start using something like k8 helm - this adds more dependencies, adds
>> a
>>> small amount of complexity (this is my recommendation, but only by a
>>> little)
>>>
>>> * Start pushing images to docker hub - this means they'll be publicly
>>> visible and raises the bar for maintenance of those images
>>>
>>> * Host our own public container registry - this means running our own
>>> public service with costs, etc..
>>>
>>> Below are detailed discussions of these options. You can skip to the "My
>>> thoughts on this" section if you're not interested in the details.
>>>
>>>
>>> 1. Templated kubernetes images
>>>
>>> =========================
>>>
>>> Kubernetes (k8) does not currently have built in support for
>> parameterizing
>>> scripts - there's an issue open for this [1], but it doesn't seem to be
>>> very active.
>>>
>>> There are tools like Kubernetes helm that allow users to specify
>> parameters
>>> when running their kubernetes scripts. They also enable a lot more
>> (they're
>>> probably closer to a package manager like apt-get) - see this
>>> description[3] for an overview.
>>>
>>> I'm open to other options besides helm, but it seems to be the officially
>>> supported one.
>>>
>>> How the world would look using helm:
>>>
>>> * When developing an IO IT, someone (either the developer or one of us),
>>> would need to create a chart (the name for the helm script) - it's
>>> basically another set of config files but in theory is as simple as a
>>> couple metadata files plus a templatized version of a regular k8 script.
>>> This should be trivial compared to the task of creating a k8 script.
>>>
>>> *  When creating an instance of a data store, the developer (or the beam
>> CI
>>> server) would first build the docker image for the data store and push to
>>> their container registry, then run a command like `helm install -f
>>> mydb.yaml --set imageRepo=1.2.3.4`
>>>
>>> * when done running tests/developing/etc…  the developer/beam CI server
>>> would run `helm delete -f mydb.yaml`
>>>
>>> Upsides:
>>>
>>> * Something like helm is pretty interesting - we talked about it as an
>>> upside and something we wanted to do when we talked about using
>> kubernetes
>>>
>>> * We pick up a set of working kubernetes scripts this way. The full list
>> is
>>> at [2], but some ones that stood out: mongodb, memcached, mysql,
>> postgres,
>>> redis, elasticsearch (incubating), kafka (incubating), zookeeper
>>> (incubating) - this could speed development
>>>
>>> Downsides:
>>>
>>> * Adds an additional dependency to run our ITs (helm or another k8
>>> templating tool)
>>>
>>> * Requires people to build their own images and run a container registry if
>>> they don't already have one (it will not surprise you that there's a
>> docker
>>> image for running the registry [0] - so it's not crazy. :) I *think* this
>>> will probably just be a simple one/two line command once we have it
>>> scripted.
>>>
>>> * Helm in particular is kind of heavyweight for what we really need - it
>>> requires running a service in the k8 cluster and adds additional
>>> complexity.
>>>
>>> * Adds to the complexity of creating a new kubernetes script. Until I've
>>> tried it, I can't really speak to the complexity, but taking a look at
>> the
>>> instructions [4], it doesn't seem too bad.
>>>
>>>
>>>
>>>
>>> 2. Push images to docker hub
>>>
>>> =======================
>>>
>>> This requires that users push images that we want to use to docker hub,
>> and
>>> then our IO ITs will rely on that. I  think the developer of the
>> dockerfile
>>> should be responsible for the image - having the beam project responsible
>>> for a publicly available artifact (like the docker images) outside of our
>>> core deliverables doesn't seem like the right move.
>>>
>>> We would still retain a copy of the source dockerfiles and could
>> regenerate
>>> the images at any time, so I'm not concerned about a scenario where
>> docker
>>> hub went away (it would be pretty simple to switch to another repo - just
>>> change some config files.)
>>>
>>> For someone running the k8 scripts (ie, running the IO ITs), this is
>> pretty
>>> easy - they just run the k8 script like they do today.
>>>
>>> For someone creating the k8 scripts (ie, creating the IO ITs), this is
>> more
>>> complex - either they or we have to push this to docker hub and make sure
>>> it's up to date, etc..
>>>
>>>
>>> Upsides:
>>>
>>> * No additional complexity for IO IT runners.
>>>
>>> Downsides:
>>>
>>> * Higher bar for creating the image in the first place - someone has to
>>> maintain the publicly available docker hub image.
>>>
>>> * It seems weird to have a custom docker image up on docker hub - maybe
>>> that's common, but if we need specific changes to images for our needs,
>> I'd
>>> prefer it be private.
>>>
>>>
>>> 3. Run our own *public* container registry
>>>
>>> ==============================================
>>>
>>> We would run a beam-specific container registry service - it would be
>> used
>>> by the apache beam CI servers, but it would also be available for use by
>>> anyone running beam IO ITs on their local dev setup.
>>>
>>> From an IO IT creator's perspective, this would look pretty similar to how
>>> things are now - they just check in a dockerfile. For someone running the
>>> k8 scripts, they similarly don't need to think about it.
>>>
>>> Upsides:
>>>
>>> * we're not adding any additional complexity for end developers
>>>
>>> Downsides:
>>>
>>> * Have to keep docker registry software up to date
>>>
>>> * The service is a single point of failure for any beam devs running IO ITs
>>>
>>> * It can incur costs, etc… As an open source project, it doesn't seem
>> great
>>> for us to be running a public service.
>>>
>>>
>>>
>>> My thoughts on this
>>>
>>> ===============
>>>
>>> In spite of the additional complexity, I think using k8 helm is probably
>>> the best option. The general goal behind the IO ITs has been to keep
>>> ourselves self-contained: avoid having centralized infrastructure for
>> those
>>> running the ITs. Helm is a good match for those criteria. I will admit
>> that
>>> I find the additional dependencies/complexity to be worrisome. However, I
>>> really like the idea of picking up additional data store configs for
>> free -
>>> if we were doing this in 5 years, we'd say "we should just use the
>>> ecosystem of helm charts" and go from there.
>>>
>>> I do think that pushing images to docker hub is a viable option, and if
>> the
>>> community is more excited to do that/wants to push the images there, I'd
>>> support it. I can see how folks would be hesitant. I would like for the
>>> developer of the docker file to do
>>>
>>> Of the 3 options, I would strongly push back against running a public
>>> container registry - I would not want to administer it, and I don't think
>>> we as a project want to be paying for the costs associated with it.
>>>
>>> Next steps
>>>
>>> =========
>>>
>>> Let me know what you think! This is definitely a topic where
>> understanding
>>> what the community of IO devs wants is helpful. As we discuss, I'll
>>> probably spend a little time exploring helm since I want to play around
>>> with it and understand if there are other drawbacks. I ran into this
>>> question while working on getting the HIFIO cassandra cluster running,
>> so I
>>> might prototype with that.
>>>
>>> I'll create a JIRA issue for this in the next day or so.
>>>
>>> Stephen
>>>
>>>
>>>
>>> [0] docker registry container - https://hub.docker.com/_/registry/
>>>
>>> [1] kubernetes issue open for supporting templates -
>>> https://github.com/kubernetes/kubernetes/issues/23896
>>>
>>> [2] set of available charts - https://github.com/kubernetes/charts
>>>
>>> [3] kubernetes helm introduction -
>>> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
>>> [4] kubernetes charts instructions -
>>> https://github.com/kubernetes/helm/blob/master/docs/charts.md
>>>
>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: IO ITs: Hosting Docker images

Posted by Stephen Sisk <si...@google.com.INVALID>.

Wanted to see if anyone else had opinions on this, and to provide a quick update.

I think for both elasticsearch and HIFIO that we can find existing,
supported images that could serve those purposes - HIFIO is looking like
it'll be able to do so for cassandra, which was proving tricky.

So to summarize my current proposed solutions: (ordered by my preference)
1. (new) Strongly urge people to find existing docker images that meet our
image criteria - regularly updated/security checked
2. Start using helm
3. Push our docker images to docker hub
4. Host our own public container registry
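
For anyone weighing option 2, here is a minimal sketch of the templating idea
behind it, written in plain Python rather than helm so that it runs anywhere.
It is only an illustration: the pod spec, the `render_pod_config` helper and
the `mydb` name are hypothetical, and a real helm chart would use helm's own
template syntax instead.

```python
from string import Template

# Hypothetical pod spec: the registry address is a placeholder instead of
# being hard-coded, which is the part static kubernetes files cannot do.
POD_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  containers:
  - name: ${name}
    image: ${image_repo}/${name}:${tag}
""")


def render_pod_config(name, image_repo, tag="latest"):
    """Fill in the image location for one data store (illustrative helper)."""
    return POD_TEMPLATE.substitute(name=name, image_repo=image_repo, tag=tag)


if __name__ == "__main__":
    # Each dev (or the CI server) passes the registry they actually have access to.
    print(render_pod_config("mydb", "1.2.3.4"))
```

The point is that the registry address becomes an input supplied by whoever
runs the IT, rather than a value baked into a checked-in file.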

S

On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <si...@google.com> wrote:

> I'd like to hear what direction folks want to go in, and from there look
> at the options. I think for some of these options (like running our own
> public registry), Apache infra may be able to help and it's something we should look at,
> but I don't assume they have time to work on this type of issue.
>
> S
>
> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
> wrote:
>
> Is this something that Apache infra could help us with?
>
> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <si...@google.com.invalid>
> wrote:
>
> > Summary:
> >
> > For IO ITs that use data stores that need custom docker images in order
> to
> > run, we can't currently use them in a kubernetes cluster (which is where
> we
> > host our data stores.) I have a couple options for how to solve this and
> am
> > looking for feedback from folks involved in creating IO ITs/opinions on
> > kubernetes.
> >
> >
> > Details:
> >
> > We've discussed in the past that we'll want to allow developers to submit
> > just a dockerfile, and then we'll use that when creating the data store
> on
> > kubernetes. This is the case for ElasticsearchIO and I assume more data
> > stores in the future will want to do this. It's also looking like it'll
> be
> > necessary to use custom docker images for the HadoopInputFormatIO's
> > cassandra ITs - to run a cassandra cluster, there doesn't seem to be a
> good
> > image you can use out of the box.
> >
> > In either case, in order to retrieve a docker image, kubernetes needs a
> > container registry - it will read the docker images from there. A simple
> > private container registry doesn't work because kubernetes config files
> are
> > static - this means that if local devs try to use the kubernetes files,
> > they point at the private container registry and they wouldn't be able to
> > retrieve the images since they don't have access. They'd have to manually
> > edit the files, which in theory is an option, but I don't consider that
> to
> > be acceptable since it feels pretty unfriendly (it is simple, so if we
> > really don't like the below options we can revisit it.)
> >
> > Quick summary of the options
> >
> > =======================
> >
> > We can:
> >
> > * Start using something like k8 helm - this adds more dependencies, adds
> a
> > small amount of complexity (this is my recommendation, but only by a
> > little)
> >
> > * Start pushing images to docker hub - this means they'll be publicly
> > visible and raises the bar for maintenance of those images
> >
> > * Host our own public container registry - this means running our own
> > public service with costs, etc..
> >
> > Below are detailed discussions of these options. You can skip to the "My
> > thoughts on this" section if you're not interested in the details.
> >
> >
> > 1. Templated kubernetes images
> >
> > =========================
> >
> > Kubernetes (k8) does not currently have built in support for
> parameterizing
> > scripts - there's an issue open for this [1], but it doesn't seem to be
> > very active.
> >
> > There are tools like Kubernetes helm that allow users to specify
> parameters
> > when running their kubernetes scripts. They also enable a lot more
> (they're
> > probably closer to a package manager like apt-get) - see this
> > description[3] for an overview.
> >
> > I'm open to other options besides helm, but it seems to be the officially
> > supported one.
> >
> > How the world would look using helm:
> >
> > * When developing an IO IT, someone (either the developer or one of us),
> > would need to create a chart (the name for the helm script) - it's
> > basically another set of config files but in theory is as simple as a
> > couple metadata files plus a templatized version of a regular k8 script.
> > This should be trivial compared to the task of creating a k8 script.
> >
> > *  When creating an instance of a data store, the developer (or the beam
> CI
> > server) would first build the docker image for the data store and push to
> > their container registry, then run a command like `helm install -f
> > mydb.yaml --set imageRepo=1.2.3.4`
> >
> > * when done running tests/developing/etc…  the developer/beam CI server
> > would run `helm delete -f mydb.yaml`
> >
> > Upsides:
> >
> > * Something like helm is pretty interesting - we talked about it as an
> > upside and something we wanted to do when we talked about using
> kubernetes
> >
> > * We pick up a set of working kubernetes scripts this way. The full list
> is
> > at [2], but some ones that stood out: mongodb, memcached, mysql,
> postgres,
> > redis, elasticsearch (incubating), kafka (incubating), zookeeper
> > (incubating) - this could speed development
> >
> > Downsides:
> >
> > * Adds an additional dependency to run our ITs (helm or another k8
> > templating tool)
> >
> > * Requires people to build their own images and run a container registry if
> > they don't already have one (it will not surprise you that there's a
> docker
> > image for running the registry [0] - so it's not crazy. :) I *think* this
> > will probably just be a simple one/two line command once we have it
> > scripted.
> >
> > * Helm in particular is kind of heavyweight for what we really need - it
> > requires running a service in the k8 cluster and adds additional
> > complexity.
> >
> > * Adds to the complexity of creating a new kubernetes script. Until I've
> > tried it, I can't really speak to the complexity, but taking a look at
> the
> > instructions [4], it doesn't seem too bad.
> >
> >
> >
> >
> > 2. Push images to docker hub
> >
> > =======================
> >
> > This requires that users push images that we want to use to docker hub,
> and
> > then our IO ITs will rely on that. I  think the developer of the
> dockerfile
> > should be responsible for the image - having the beam project responsible
> > for a publicly available artifact (like the docker images) outside of our
> > core deliverables doesn't seem like the right move.
> >
> > We would still retain a copy of the source dockerfiles and could
> regenerate
> > the images at any time, so I'm not concerned about a scenario where
> docker
> > hub went away (it would be pretty simple to switch to another repo - just
> > change some config files.)
> >
> > For someone running the k8 scripts (ie, running the IO ITs), this is
> pretty
> > easy - they just run the k8 script like they do today.
> >
> > For someone creating the k8 scripts (ie, creating the IO ITs), this is
> more
> > complex - either they or we have to push this to docker hub and make sure
> > it's up to date, etc..
> >
> >
> > Upsides:
> >
> > * No additional complexity for IO IT runners.
> >
> > Downsides:
> >
> > * Higher bar for creating the image in the first place - someone has to
> > maintain the publicly available docker hub image.
> >
> > * It seems weird to have a custom docker image up on docker hub - maybe
> > that's common, but if we need specific changes to images for our needs,
> I'd
> > prefer it be private.
> >
> >
> > 3. Run our own *public* container registry
> >
> > ==============================================
> >
> > We would run a beam-specific container registry service - it would be
> used
> > by the apache beam CI servers, but it would also be available for use by
> > anyone running beam IO ITs on their local dev setup.
> >
> > From a IO IT creator's perspective, this would look pretty similar to how
> > things are now - they just check in a dockerfile. For someone running the
> > k8 scripts, they similarly don't need to think about it.
> >
> > Upsides:
> >
> > * we're not adding any additional complexity for end developer
> >
> > Downsides:
> >
> > * Have to keep docker registry software up to date
> >
> > * The service is a single of failure for any beam devs running IO ITs
> >
> > * It can incur costs, etc… As an open source project, it doesn't seem
> great
> > for us to be running a public service.
> >
> >
> >
> > My thoughts on this
> >
> > ===============
> >
> > In spite of the additional complexity, I think using k8 helm is probably
> > the best option. The general goal behind the IO ITs has been to keep
> > ourselves self-contained: avoid having centralized infrastructure for
> those
> > running the ITs. Helm is a good match for those criteria. I will admit
> that
> > I find the additional dependencies/complexity to be worrisome. However, I
> > really like the idea of picking up additional data store configs for
> free -
> > if we were doing this in 5 years, we'd say "we should just use the
> > ecosystem of helm charts" and go from there.
> >
> > I do think that pushing images to docker hub is a viable option, and if
> the
> > community is more excited to do that/wants to push the images there, I'd
> > support it. I can see how folks would be hesitant. I would like for the
> > developer of the docker file to do
> >
> > Of the 3 options, I would strongly push back against running a public
> > container registry - I would not want to administer it, and I don't think
> > we as a project want to be paying for the costs associated with it.
> >
> > Next steps
> >
> > =========
> >
> > Let me know what you think! This is definitely a topic where
> understanding
> > what the community of IO devs wants is helpful. As we discuss, I'll
> > probably spend a little time exploring helm since I want to play around
> > with it and understand if there are other drawbacks. I ran into this
> > question while working on getting the HIFIO cassandra cluster running,
> so I
> > might prototype with that.
> >
> > I'll create JIRA for this in the next day or so.
> >
> > Stephen
> >
> >
> >
> > [0] docker registry container - https://hub.docker.com/_/registry/
> >
> > [1] kubernetes issue open for supporting templates -
> > https://github.com/kubernetes/kubernetes/issues/23896
> >
> > [2] set of available charts - https://github.com/kubernetes/charts
> >
> > [3] kubernetes helm introduction -
> > https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> > [4] kubernetes charts instructions -
> > https://github.com/kubernetes/helm/blob/master/docs/charts.md
> >
>
>

Re: IO ITs: Hosting Docker images

Posted by Stephen Sisk <si...@google.com.INVALID>.

I'd like to hear what direction folks want to go in, and from there look at
the options. For some of these options (like running our own public
registry), Apache infra may be able to help, and that's something we should
look at, but I don't assume they have time to work on this type of issue.

S

On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
wrote:

> Is this something that Apache infra could help us with?

Re: IO ITs: Hosting Docker images

Posted by Lukasz Cwik <lc...@google.com.INVALID>.

Is this something that Apache infra could help us with?

On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <si...@google.com.invalid>
wrote:

> Summary:
>
> IO ITs whose data stores need custom docker images can't currently run
> those data stores in a kubernetes cluster (which is where we host our data
> stores). I have a couple of options for how to solve this and am looking
> for feedback from folks involved in creating IO ITs and from anyone with
> opinions on kubernetes.
>
>
> Details:
>
> We've discussed in the past that we'll want to allow developers to submit
> just a dockerfile, and then we'll use that when creating the data store on
> kubernetes. This is the case for ElasticsearchIO and I assume more data
> stores in the future will want to do this. It's also looking like it'll be
> necessary to use custom docker images for the HadoopInputFormatIO's
> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
> image you can use out of the box.
>
> In either case, in order to retrieve a docker image, kubernetes needs a
> container registry - it will read the docker images from there. A simple
> private container registry doesn't work because kubernetes config files are
> static - this means that if local devs try to use the kubernetes files,
> they point at the private container registry and they wouldn't be able to
> retrieve the images since they don't have access. They'd have to manually
> edit the files, which in theory is an option, but I don't consider that to
> be acceptable since it feels pretty unfriendly (it is simple, so if we
> really don't like the below options we can revisit it.)
>
> Quick summary of the options
>
> =======================
>
> We can:
>
> * Start using something like k8 helm - this adds more dependencies, adds a
> small amount of complexity (this is my recommendation, but only by a
> little)
>
> * Start pushing images to docker hub - this means they'll be publicly
> visible and raises the bar for maintenance of those images
>
> * Host our own public container registry - this means running our own
> public service with costs, etc.
>
> Below are detailed discussions of these options. You can skip to the "My
> thoughts on this" section if you're not interested in the details.
>
>
> 1. Templated kubernetes images
>
> =========================
>
> Kubernetes (k8) does not currently have built-in support for parameterizing
> scripts - there's an issue open for this [1], but it doesn't seem to be
> very active.
>
> There are tools like Kubernetes helm that allow users to specify parameters
> when running their kubernetes scripts. They also enable a lot more (they're
> probably closer to a package manager like apt-get) - see this
> description[3] for an overview.
>
> I'm open to other options besides helm, but it seems to be the officially
> supported one.
>
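To make the parameterization problem concrete, here is a minimal Python
sketch of what a templating tool does with an otherwise static kubernetes
config: the registry address is left as a placeholder and filled in at
deploy time. The pod manifest and the `mydb` names are illustrative
assumptions, and `string.Template` only stands in for helm's own template
engine, which has a different syntax.

```python
from string import Template

# Hypothetical pod manifest; ${imageRepo} is the one value that differs
# between the beam CI server and a local developer's setup.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: mydb
spec:
  containers:
  - name: mydb
    image: ${imageRepo}/mydb:latest
""")


def render_manifest(image_repo: str) -> str:
    """Return the manifest text with the registry address filled in."""
    return MANIFEST_TEMPLATE.substitute(imageRepo=image_repo)


if __name__ == "__main__":
    print(render_manifest("localhost:5000"))
```

With a static file, each developer would have to edit the `image:` line by
hand; with a template, the same checked-in file serves every registry.
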
> How the world would look using helm:
>
> * When developing an IO IT, someone (either the developer or one of us)
> would need to create a chart (the name for the helm script) - it's
> basically another set of config files, but in theory it is as simple as a
> couple of metadata files plus a templatized version of a regular k8 script.
> This should be trivial compared to the task of creating a k8 script.
>
> * When creating an instance of a data store, the developer (or the beam CI
> server) would first build the docker image for the data store and push it
> to their container registry, then run a command like `helm install -f
> mydb.yaml --set imageRepo=1.2.3.4`
>
> * When done running tests, developing, etc., the developer/beam CI server
> would run `helm delete -f mydb.yaml`
>
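The steps above can be sketched as one build/push/install/delete cycle. This
is a rough Python sketch that only assembles the command lines rather than
running them. The image name, chart directory, and release name are
hypothetical, and the argument order for `helm install` (release name, then
chart) is the Helm 3 form, which is an assumption; other helm versions order
the arguments differently.

```python
import shlex
from typing import List


def lifecycle_commands(image: str, registry: str, chart_dir: str,
                       release: str) -> List[List[str]]:
    """Return the commands for one build/push/install/delete cycle, in order."""
    tag = f"{registry}/{image}"
    return [
        # Build the data store image from the Dockerfile in the current dir.
        ["docker", "build", "-t", tag, "."],
        # Publish it to the developer's (or CI server's) own registry.
        ["docker", "push", tag],
        # Assumed Helm 3 argument order: release name, then chart.
        ["helm", "install", release, chart_dir,
         "--set", f"imageRepo={registry}"],
        # Run once the tests are finished.
        ["helm", "delete", release],
    ]


if __name__ == "__main__":
    for cmd in lifecycle_commands("mydb", "localhost:5000",
                                  "./mydb-chart", "mydb-it"):
        print(shlex.join(cmd))
```

Keeping the registry address as the only per-user input is the point: the
chart itself stays identical for CI and for local development.
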
> Upsides:
>
> * Something like helm is pretty interesting - we talked about it as an
> upside and something we wanted to do when we talked about using kubernetes
>
> * We pick up a set of working kubernetes scripts this way. The full list is
> at [2], but some ones that stood out: mongodb, memcached, mysql, postgres,
> redis, elasticsearch (incubating), kafka (incubating), zookeeper
> (incubating) - this could speed development
>
> Downsides:
>
> * Adds an additional dependency to run our ITs (helm or another k8
> templating tool)
>
> * Requires people to build their own images and run a container registry if
> they don't already have one. It will not surprise you that there's a docker
> image for running the registry [0], so it's not crazy. :) I *think* this
> will probably just be a simple one- or two-line command once we have it
> scripted.
>
> * Helm in particular is kind of heavyweight for what we really need - it
> requires running a service in the k8 cluster and adds additional
> complexity.
>
> * Adds to the complexity of creating a new kubernetes script. Until I've
> tried it, I can't really speak to the complexity, but taking a look at the
> instructions [4], it doesn't seem too bad.
>
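As a rough idea of what the "one/two line command" for a personal registry
might look like, here is a Python sketch that assembles the docker commands
for starting the registry image from [0] and publishing one locally built
image to it. The `mydb` image name is hypothetical, and whether a registry
on `localhost` is reachable from the kubernetes nodes depends on the cluster
setup, so treat this as a starting point rather than a tested recipe.

```python
from typing import List

REGISTRY_PORT = 5000  # port the registry image listens on by default


def local_registry_commands(image: str,
                            port: int = REGISTRY_PORT) -> List[List[str]]:
    """Commands to start a local registry and publish one image to it."""
    target = f"localhost:{port}/{image}"
    return [
        # Start the registry container in the background.
        ["docker", "run", "-d", "-p", f"{port}:{REGISTRY_PORT}",
         "--name", "registry", "registry:2"],
        # Re-tag the locally built image so it points at that registry.
        ["docker", "tag", image, target],
        ["docker", "push", target],
    ]


if __name__ == "__main__":
    for cmd in local_registry_commands("mydb"):
        print(" ".join(cmd))
```
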
>
>
>
> 2. Push images to docker hub
>
> =======================
>
> This requires that users push images that we want to use to docker hub, and
> then our IO ITs will rely on that. I think the developer of the dockerfile
> should be responsible for the image - having the beam project responsible
> for a publicly available artifact (like the docker images) outside of our
> core deliverables doesn't seem like the right move.
>
> We would still retain a copy of the source dockerfiles and could regenerate
> the images at any time, so I'm not concerned about a scenario where docker
> hub went away (it would be pretty simple to switch to another repo - just
> change some config files.)
>
> For someone running the k8 scripts (i.e., running the IO ITs), this is
> pretty easy - they just run the k8 script like they do today.
>
> For someone creating the k8 scripts (i.e., creating the IO ITs), this is
> more complex - either they or we have to push the image to docker hub, make
> sure it's up to date, etc.
>
>
> Upsides:
>
> * No additional complexity for IO IT runners.
>
> Downsides:
>
> * Higher bar for creating the image in the first place - someone has to
> maintain the publicly available docker hub image.
>
> * It seems weird to have a custom docker image up on docker hub - maybe
> that's common, but if we need specific changes to images for our needs, I'd
> prefer it be private.
>
>
> 3. Run our own *public* container registry
>
> ==============================================
>
> We would run a beam-specific container registry service - it would be used
> by the apache beam CI servers, but it would also be available for use by
> anyone running beam IO ITs on their local dev setup.
>
> From an IO IT creator's perspective, this would look pretty similar to how
> things are now - they just check in a dockerfile. For someone running the
> k8 scripts, they similarly don't need to think about it.
>
> Upsides:
>
> * We're not adding any additional complexity for the end developer
>
> Downsides:
>
> * Have to keep docker registry software up to date
>
> * The service is a single point of failure for any beam devs running IO ITs
>
> * It can incur costs, etc… As an open source project, it doesn't seem great
> for us to be running a public service.
>
>
>
> My thoughts on this
>
> ===============
>
> In spite of the additional complexity, I think using k8 helm is probably
> the best option. The general goal behind the IO ITs has been to keep
> ourselves self-contained: avoid having centralized infrastructure for those
> running the ITs. Helm is a good match for those criteria. I will admit that
> I find the additional dependencies/complexity to be worrisome. However, I
> really like the idea of picking up additional data store configs for free -
> if we were doing this in 5 years, we'd say "we should just use the
> ecosystem of helm charts" and go from there.
>
> I do think that pushing images to docker hub is a viable option, and if the
> community is more excited to do that/wants to push the images there, I'd
> support it. I can see how folks would be hesitant. I would like for the
> developer of the docker file to do
>
> Of the 3 options, I would strongly push back against running a public
> container registry - I would not want to administer it, and I don't think
> we as a project want to be paying for the costs associated with it.
>
> Next steps
>
> =========
>
> Let me know what you think! This is definitely a topic where understanding
> what the community of IO devs wants is helpful. As we discuss, I'll
> probably spend a little time exploring helm since I want to play around
> with it and understand if there are other drawbacks. I ran into this
> question while working on getting the HIFIO cassandra cluster running, so I
> might prototype with that.
>
> I'll create a JIRA for this in the next day or so.
>
> Stephen
>
>
>
> [0] docker registry container - https://hub.docker.com/_/registry/
>
> [1] kubernetes issue open for supporting templates -
> https://github.com/kubernetes/kubernetes/issues/23896
>
> [2] set of available charts - https://github.com/kubernetes/charts
>
> [3] kubernetes helm introduction -
> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> [4] kubernetes charts instructions -
> https://github.com/kubernetes/helm/blob/master/docs/charts.md
>