Posted to dev@beam.apache.org by Stephen Sisk <si...@google.com.INVALID> on 2017/01/18 01:36:58 UTC

Re: Hosting data stores for IO Transform testing

hi!

I've been continuing this investigation, and have some more info to report,
and hopefully we can start making some decisions.

To support performance testing, I've been investigating mesos+marathon and
kubernetes for running data stores in their high availability modes, and have
been examining the features each uses to support this.

Setting up a multi-node cluster in a high availability mode tends to be
more expensive time-wise than the single node instances I've played around
with in the past. Rather than do a full build out with both kubernetes and
mesos, I'd like to pick one of the two options to build the prototype
cluster with. If the prototype doesn't go well, we could still go back to
the other option, but I'd like to change us from a mode of "let's look at
all the options" to one of "here's the favorite, let's prove that works for
us".

Below are the features that I've found to be important for multi-node instances
of data stores. I'm sure other folks on the list have done this before, so
feel free to pipe up if I'm missing a good solution to a problem.

DNS/Discovery

--------------------

Necessary for talking between nodes (eg, cassandra nodes all need to be
able to talk to a set of seed nodes.)

* Kubernetes has built-in DNS/discovery between nodes.

* Mesos supports this via mesos-dns, which isn't part of core mesos but is
included in DC/OS, the mesos distribution I've been using and that I
would expect us to use.
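
To make the kubernetes side concrete: the stable per-pod names come from a
"headless" Service (clusterIP: None). A rough, untested sketch for a
cassandra cluster (the name and port here are just placeholders):

  apiVersion: v1
  kind: Service
  metadata:
    name: cassandra
  spec:
    clusterIP: None    # headless: no load balancing, just DNS records
    ports:
    - port: 9042       # cassandra's CQL port
    selector:
      app: cassandra

With that in place, the pods of a StatefulSet that points at this service
resolve as cassandra-0.cassandra, cassandra-1.cassandra, etc. - exactly the
kind of stable names you'd hand to cassandra as its seed node list.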

Instances properly distributed across nodes

------------------------------------------------------------

If multiple instances of a data source end up on the same underlying VM, we
may not get good performance out of those instances since the underlying VM
may be more taxed than other VMs.

* Kubernetes has a beta feature, StatefulSets [1], which allows containers to
be distributed so that there's one container per underlying machine (as well
as a lot of other useful features, like easy stable DNS names.)

* Mesos can support this via the built-in UNIQUE constraint [2]
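
For example, on the marathon side this is just a constraint in the app
definition (a minimal sketch based on the docs in [2], not a config I've
run):

  {
    "id": "cassandra",
    "instances": 3,
    "constraints": [["hostname", "UNIQUE"]]
  }

which tells marathon never to place two of the three instances on the same
host.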

Load balancing

--------------------

Incoming requests from users need to be distributed to the various machines
- this is important for many data stores' high availability modes.

* Kubernetes makes it easy to hook up to an external load balancer when
running on a cloud (and can be configured to work with a built-in load
balancer if not)

* Mesos supports this via marathon-lb [3], which is an installable package
in DC/OS
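
On the kubernetes side, putting a data store behind a cloud load balancer is
one field in a Service definition. A sketch (untested; again using
cassandra's port as an example):

  apiVersion: v1
  kind: Service
  metadata:
    name: cassandra-external
  spec:
    type: LoadBalancer   # on a cloud, provisions the provider's load balancer
    ports:
    - port: 9042
    selector:
      app: cassandra

The cloud provider assigns an external IP, and tests can point at that single
address rather than at individual nodes.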

Persistent Volumes tied to specific instances

------------------------------------------------------------

Databases often need persistent state (for example to store the data :), so
it's an important part of running our service.

* Kubernetes StatefulSets support this

* Mesos+marathon apps with persistent volumes support this [4] [5]
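
To show how the kubernetes pieces fit together, here's a rough sketch of a
3-node cassandra StatefulSet (using the beta API as of kubernetes 1.5 -
untested, and the image and volume size are placeholders):

  apiVersion: apps/v1beta1
  kind: StatefulSet
  metadata:
    name: cassandra
  spec:
    serviceName: cassandra   # the headless service providing per-pod DNS
    replicas: 3
    template:
      metadata:
        labels:
          app: cassandra
      spec:
        containers:
        - name: cassandra
          image: cassandra:3.9
          volumeMounts:
          - name: data
            mountPath: /var/lib/cassandra
    volumeClaimTemplates:    # one persistent volume per pod
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi

Each pod gets its own persistent volume claim (data-cassandra-0,
data-cassandra-1, ...), so a restarted pod comes back with both its data and
its DNS name intact.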

As I mentioned above, I'd like to focus on either kubernetes or mesos for
my investigation, and as I go further along, I'm seeing kubernetes as
better suited to our needs.

(1) It supports more of the features we want out of the box: with
StatefulSets, Kubernetes handles them all together neatly, whereas DC/OS
requires marathon-lb to be installed and mesos-dns to be configured.

(2) I'm also finding that there seem to be more examples of using
kubernetes to solve the types of problems we're working on. This is
somewhat subjective, but as I've tried to learn both kubernetes and mesos,
I personally found it easier to get kubernetes running, thanks to the
tutorials/examples available for it.

(3) Lower cost of initial setup - as I discussed in a previous mail [6],
kubernetes was far easier to get set up even when I knew the exact steps.
Mesos took me around 27 steps [7], which involved a lot of config that was
easy to get wrong (it took me about 5 tries to get all the steps correct in
one go.) Kubernetes took me around 8 steps and very little config.

Given that, I'd like to focus my investigation/prototyping on Kubernetes. To
be clear, it's fairly close and I think both Mesos and Kubernetes could
support what we need, so if we run into issues with kubernetes, Mesos still
seems like a viable option that we could fall back to.

Thanks,
Stephen


[1] Kubernetes StatefulSets
https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/

[2] mesos unique constraint -
https://mesosphere.github.io/marathon/docs/constraints.html

[3]
https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html
 and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/

[4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html

[5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/

[6] Container Orchestration software for hosting data stores
https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E

[7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md


On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org> wrote:

> Just a quick drive-by comment: how tests are laid out has non-trivial
> tradeoffs on how/where continuous integration runs, and how results are
> integrated into the tooling. The current state is certainly not ideal
> (e.g., due to multiple test executions some links in Jenkins point where
> they shouldn't), but most other alternatives had even bigger drawbacks at
> the time. If someone has great ideas that don't explode the number of
> modules, please share ;-)
>
> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <ec...@gmail.com>
> wrote:
>
> > Hi Stephen,
> >
> > Thanks for taking the time to comment.
> >
> > My comments are below in the email:
> >
> >
> > On 24/12/2016 at 00:07, Stephen Sisk wrote:
> >
> >> hey Etienne -
> >>
> >> thanks for your thoughts and thanks for sharing your experiences. I
> >> generally agree with what you're saying. Quick comments below:
> >>
> >>> IT are stored alongside UT in the src/test directory of the IO but they
> >>> might go to a dedicated module, waiting for a consensus
> >> I don't have a strong opinion or feel that I've worked enough with maven
> >> to
> >> understand all the consequences - I'd love for someone with more maven
> >> experience to weigh in. If this becomes blocking, I'd say check it in,
> and
> >> we can refactor later if it proves problematic.
> >>
> > Sure, not a blocking point, it could be refactored afterwards. Just as a
> > reminder, JB mentioned that storing IT in a separate module allows us to
> > have more coherence between all ITs (same behavior) and to do cross-IO
> > integration tests. JB, have you experienced any long term drawbacks of
> > storing IT in a separate module, like, for example, more difficult
> > maintenance due to "distance" from the production code?
> >
> >
> >>> Also IMHO, it is better that tests load/clean data than making
> >>> assumptions about the running order of the tests.
> >> I definitely agree that we don't want to make assumptions about the
> >> running
> >> order of the tests - that way lies pain. :) It will be interesting to
> see
> >> how the performance tests work out since they will need more data (and
> >> thus
> >> loading data can take much longer.)
> >>
> > Yes, performance testing might push in the direction of data loading from
> > outside the tests due to loading time.
> >
> >>   This should also be an easier problem
> >> for read tests than for write tests - if we have long running instances,
> >> read tests don't really need cleanup. And if write tests only write a
> >> small
> >> amount of data, as long as we are sure we're writing to uniquely
> >> identifiable locations (ie, new table per test or something similar), we
> >> can clean up the write test data on a slower schedule.
> >>
> > I agree
> >
> >>
> >>> this will tend to go in the direction of long running data store
> >>> instances rather than data store instances started (and optionally
> >>> loaded) before tests.
> >> It may be easiest to start with a "data stores stay running"
> >> implementation, and then if we see issues with that move towards tests
> >> that
> >> start/stop the data stores on each run. One thing I'd like to make sure
> is
> >> that we're not manually tweaking the configurations for data stores. One
> >> way we could do that is to destroy/recreate the data stores on a slower
> >> schedule - maybe once per week. That way if the script is changed or the
> >> data store instances are changed, we'd be able to detect it relatively
> >> soon
> >> while still removing the need for the tests to manage the data stores.
> >>
> > I agree. In addition to manual configuration tweaking, there might be
> > cases in which a data store re-partitions data during a test or after some
> > tests as the dataset changes. The IO must be tolerant of that, but the
> > asserts (number of bundles, for example) in the test must not fail in that
> > case. I would also prefer, if possible, that the tests do not manage data
> > stores (not set them up, not start them, not stop them)
> >
> >
> >> as a general note, I suspect many of the folks in the states will be on
> >> holiday until Jan 2nd/3rd.
> >>
> >> S
> >>
> >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <ec...@gmail.com>
> >> wrote:
> >>
> >> Hi,
> >>>
> >>> Recently we had a discussion about integration tests of IOs. I'm
> >>> preparing a PR for integration tests of the Elasticsearch IO
> >>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> >>> as a first shot) which are very important IMHO because they helped catch
> >>> some bugs that UT could not (volume, data store instance sharing, real
> >>> data store instance ...)
> >>>
> >>> I would like to have your thoughts/remarks about the points below. Some of
> >>> these points are also discussed here:
> >>>
> >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >>>
> >>> - UT and IT have a similar architecture, but while UT focus on testing
> >>> the correct behavior of the code, including corner cases, and use an
> >>> embedded in-memory data store, IT assume that the behavior is correct
> >>> (strong UT) and focus on higher volume testing and testing against real
> >>> data store instance(s)
> >>>
> >>> - For now, IT are stored alongside UT in the src/test directory of the
> >>> IO but they might go to a dedicated module, waiting for a consensus.
> >>> Maven is not configured to run them automatically because the data store
> >>> is not available on the Jenkins server yet
> >>>
> >>> - For now, they only use DirectRunner, but they will be run against
> >>> each runner.
> >>>
> >>> - IT do not set up the data store instance (as stated in the above
> >>> document); they assume that one is already running (hardcoded
> >>> configuration in the test for now, waiting for a common solution to pass
> >>> configuration to IT). A docker container script is provided in the
> >>> contrib directory as a starting point for whatever orchestration software
> >>> will be chosen.
> >>>
> >>> - IT load and clean test data before and after each test if needed. It
> >>> is simpler to do so because some tests need an empty data store (write
> >>> tests) and because, as discussed in the document, tests might not be the
> >>> only users of the data store. Also IMHO, it is better that tests
> >>> load/clean data than making assumptions about the running order of
> >>> the tests.
> >>>
> >>> If we generalize this pattern to all IT tests, this will tend to go in
> >>> the direction of long-running data store instances rather than data
> >>> store instances started (and optionally loaded) before tests.
> >>>
> >>> Besides, if we were to change our minds and load data from outside the
> >>> tests, a logstash script is provided.
> >>>
> >>> If you have any thoughts or remarks I'm all ears :)
> >>>
> >>> Regards,
> >>>
> >>> Etienne
> >>>
> >>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> >>>
> >>>> Hi Stephen,
> >>>>
> >>>> the purpose of having it in a specific module is to share resources and
> >>>> apply the same behavior from an IT perspective and be able to have IT
> >>>> "cross" IO (for instance, reading from JMS and sending to Kafka; I
> >>>> think that's the key idea for integration tests).
> >>>>
> >>>> For instance, in Karaf, we have:
> >>>> - utest in each module
> >>>> - itest module containing itests for all modules all together
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >>>>
> >>>>> Hi Etienne,
> >>>>>
> >>>>> thanks for following up and answering my questions.
> >>>>>
> >>>>> re: where to store integration tests - having them all in a separate
> >>>>> module
> >>>>> is an interesting idea. I couldn't find JB's comments about moving
> them
> >>>>> into a separate module in the PR - can you share the reasons for
> >>>>> doing so?
> >>>>> The IO integration/perf tests do seem like they'll need to be
> >>>>> treated in a special manner, but given that there is already an
> >>>>> IO-specific module, it may just be that we need to treat all the ITs
> >>>>> in the IO module the same way. I don't have strong opinions either way
> >>>>> right now.
> >>>>>
> >>>>> S
> >>>>>
> >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
> echauchot@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Hi guys,
> >>>>>
> >>>>> @Stephen: I addressed all your comments directly in the PR, thanks!
> >>>>> I just wanted to comment here about the docker image I used: the only
> >>>>> official Elastic image contains only Elasticsearch. But for testing I
> >>>>> needed Logstash (for ingestion) and Kibana (not for integration tests,
> >>>>> but to easily test REST requests to ES using Sense). This is why I use
> >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released
> >>>>> under the Apache 2 license.
> >>>>>
> >>>>>
> >>>>> Besides, there is also a point about where to store integration tests:
> >>>>> JB proposed in the PR to store integration tests in a dedicated module
> >>>>> rather than directly in the IO module (like I did).
> >>>>>
> >>>>>
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> >>>>>
> >>>>>> hey!
> >>>>>>
> >>>>>> thanks for sending this. I'm very excited to see this change. I
> >>>>>> added some
> >>>>>> detail-oriented code review comments in addition to what I've
> >>>>>> discussed
> >>>>>> here.
> >>>>>>
> >>>>>> The general goal is to allow for re-usable instantiation of
> particular
> >>>>>>
> >>>>> data
> >>>>>
> >>>>>> store instances and this seems like a good start. Looks like you
> >>>>>> also have
> >>>>>> a script to generate test data for your tests - that's great.
> >>>>>>
> >>>>>> The next steps (definitely not blocking your work) will be to have
> >>>>>> ways to
> >>>>>> create instances from the docker images you have here, and use them
> >>>>>> in the
> >>>>>> tests. We'll need support in the test framework for that since it'll
> >>>>>> be
> >>>>>> different on developer machines and in the beam jenkins cluster, but
> >>>>>> your
> >>>>>> scripts here allow someone running these tests locally to not have
> to
> >>>>>>
> >>>>> worry
> >>>>>
> >>>>>> about getting the instance set up and can manually adjust, so this
> is
> >>>>>> a
> >>>>>> good incremental step.
> >>>>>>
> >>>>>> I have some thoughts now that I'm reviewing your scripts (that I
> >>>>>> didn't
> >>>>>> have previously, so we are learning this together):
> >>>>>> * It may be useful to try and document why we chose a particular
> >>>>>> docker
> >>>>>> image as the base (ie, "this is the official supported elastic
> search
> >>>>>> docker image" or "this image has several data stores together that
> >>>>>> can be
> >>>>>> used for a couple different tests")  - I'm curious as to whether the
> >>>>>> community thinks that is important
> >>>>>>
> >>>>>> One thing that I called out in the comment that's worth mentioning
> >>>>>> on the
> >>>>>> larger list - if you want to specify which specific runners a test
> >>>>>> uses,
> >>>>>> that can be controlled in the pom for the module. I updated the
> >>>>>> testing
> >>>>>>
> >>>>> doc
> >>>>>
> >>>>>> mentioned previously in this thread with a TODO to talk about this
> >>>>>> more. I
> >>>>>> think we should also make it so that IO modules have that
> >>>>>> automatically,
> >>>>>>
> >>>>> so
> >>>>>
> >>>>>> developers don't have to worry about it.
> >>>>>>
> >>>>>> S
> >>>>>>
> >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
> echauchot@gmail.com>
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Stephen,
> >>>>>>
> >>>>>> As discussed, I added an injection script, docker container scripts and
> >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
> >>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
> >>>>>> directory in that PR:
> >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >>>>>>
> >>>>>> These work well but they are a first shot. Do you have any comments
> >>>>>> about
> >>>>>> those?
> >>>>>>
> >>>>>> Besides, I am not very sure that these files should be in the IO itself
> >>>>>> (even in the contrib directory, out of maven source directories). Any
> >>>>>> thoughts?
> >>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Etienne
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> >>>>>>
> >>>>>>> It's great to hear more experiences.
> >>>>>>>
> >>>>>>> I'm also glad to hear that people see real value in the high
> >>>>>>> volume/performance benchmark tests. I tried to capture that in the
> >>>>>>>
> >>>>>> Testing
> >>>>>
> >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >>>>>>>
> >>>>>>> It does generally sound like we're in agreement here. Areas of
> >>>>>>> discussion
> >>>>>>>
> >>>>>> I
> >>>>>>
> >>>>>>> see:
> >>>>>>> 1.  People like the idea of bringing up fresh instances for each
> test
> >>>>>>> rather than keeping instances running all the time, since that
> >>>>>>> ensures no
> >>>>>>> contamination between tests. That seems reasonable to me. If we see
> >>>>>>> flakiness in the tests or we note that setting up/tearing down
> >>>>>>> instances
> >>>>>>>
> >>>>>> is
> >>>>>>
> >>>>>>> taking a lot of time,
> >>>>>>> 2. Deciding on cluster management software/orchestration software
> - I
> >>>>>>>
> >>>>>> want
> >>>>>
> >>>>>> to make sure we land on the right tool here since choosing the
> >>>>>>> wrong tool
> >>>>>>> could result in administration of the instances taking more work. I
> >>>>>>>
> >>>>>> suspect
> >>>>>>
> >>>>>>> that's a good place for a follow up discussion, so I'll start a
> >>>>>>> separate
> >>>>>>> thread on that. I'm happy with whatever tool we choose, but I want
> to
> >>>>>>>
> >>>>>> make
> >>>>>
> >>>>>> sure we take a moment to consider different options and have a
> >>>>>>> reason for
> >>>>>>> choosing one.
> >>>>>>>
> >>>>>>> Etienne - thanks for being willing to port your creation/other
> >>>>>>> scripts
> >>>>>>> over. You might be a good early tester of whether this system works
> >>>>>>> well
> >>>>>>> for everyone.
> >>>>>>>
> >>>>>>> Stephen
> >>>>>>>
> >>>>>>> [1]  Reasons for Beam Test Strategy -
> >>>>>>>
> >>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >>>
> >>>>
> >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> >>>>>>> <jb...@nanthrax.net>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> I second Etienne there.
> >>>>>>>>
> >>>>>>>> We worked together on the ElasticsearchIO and definitely, the high
> >>>>>>>> valuable test we did were integration tests with ES on docker and
> >>>>>>>> high
> >>>>>>>> volume.
> >>>>>>>>
> >>>>>>>> I think we have to distinguish the two kinds of tests:
> >>>>>>>> 1. utests are located in the IO itself and basically they should
> >>>>>>>> cover
> >>>>>>>> the core behaviors of the IO
> >>>>>>>> 2. itests are located as contrib in the IO (they could be part of
> >>>>>>>> the IO
> >>>>>>>> but executed by the integration-test plugin or a specific profile)
> >>>>>>>> that
> >>>>>>>> deals with "real" backend and high volumes. The resources required
> >>>>>>>> by
> >>>>>>>> the itest can be bootstrapped by Jenkins (for instance using
> >>>>>>>> Mesos/Marathon and docker images as already discussed, and it's
> >>>>>>>> what I'm
> >>>>>>>> doing on my own "server").
> >>>>>>>>
> >>>>>>>> It's basically what Stephen described.
> >>>>>>>>
> >>>>>>>> We have to not rely only on itests: utests are very important and
> >>>>>>>> they validate the core behavior.
> >>>>>>>>
> >>>>>>>> My $0.01 ;)
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>> JB
> >>>>>>>>
> >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> >>>>>>>>
> >>>>>>>>> Hi Stephen,
> >>>>>>>>>
> >>>>>>>>> I like your proposition very much and I also agree that docker +
> >>>>>>>>> some
> >>>>>>>>> orchestration software would be great !
> >>>>>>>>>
> >>>>>>>>> On the elasticsearchIO (PR to be created this week) there is
> docker
> >>>>>>>>> container creation scripts and logstash data ingestion script for
> >>>>>>>>> IT
> >>>>>>>>> environment available in contrib directory alongside with
> >>>>>>>>> integration
> >>>>>>>>> tests themselves. I'll be happy to make them compliant to new IT
> >>>>>>>>> environment.
> >>>>>>>>>
> >>>>>>>>> What you say below about the need for an external IT environment is
> >>>>>>>>> particularly true. As an example with ES, what came out in the first
> >>>>>>>>> implementation was that there were problems starting at some high
> >>>>>>>>> volume of data (timeouts, ES windowing overflow...) that could not
> >>>>>>>>> have been seen on the embedded ES version. Also there were some
> >>>>>>>>> particularities of the external instance, like secondary (replica)
> >>>>>>>>> shards, that were not visible on the embedded instance.
> >>>>>>>>>
> >>>>>>>>> Besides, I also favor bringing up instances before the tests
> >>>>>>>>> because it allows us (amongst other things) to be sure to start on
> >>>>>>>>> a fresh dataset so that the test is deterministic.
> >>>>>>>>>
> >>>>>>>>> Etienne
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I'm excited we're getting lots of discussion going. There are
> many
> >>>>>>>>>> threads
> >>>>>>>>>> of conversation here, we may choose to split some of them off
> >>>>>>>>>> into a
> >>>>>>>>>> different email thread. I'm also betting I missed some of the
> >>>>>>>>>> questions in
> >>>>>>>>>> this thread, so apologies ahead of time for that. Also apologies
> >>>>>>>>>> for
> >>>>>>>>>>
> >>>>>>>>> the
> >>>>>>
> >>>>>>> amount of text, I provided some quick summaries at the top of each
> >>>>>>>>>> section.
> >>>>>>>>>>
> >>>>>>>>>> Amit - thanks for your thoughts. I've responded in detail below.
> >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of work
> >>>>>>>>>> here to
> >>>>>>>>>>
> >>>>>>>>> go
> >>>>>
> >>>>>> around. I'll try and think about how we can divide up some next
> >>>>>>>>>> steps
> >>>>>>>>>> (probably in a separate thread.) The main next step I see is
> >>>>>>>>>> deciding
> >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm working on
> >>>>>>>>>> that,
> >>>>>>>>>>
> >>>>>>>>> but
> >>>>>>>>
> >>>>>>>>> having lots of different thoughts on what the
> >>>>>>>>>> advantages/disadvantages
> >>>>>>>>>>
> >>>>>>>>> of
> >>>>>>>>
> >>>>>>>>> those are would be helpful (I'm not entirely sure of the
> >>>>>>>>>> protocol for
> >>>>>>>>>> collaborating on sub-projects like this.)
> >>>>>>>>>>
> >>>>>>>>>> These issues are all related to what kind of tests we want to
> >>>>>>>>>> write. I
> >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all the use
> >>>>>>>>>> cases
> >>>>>>>>>> we've discussed here (and thus should not block moving forward
> >>>>>>>>>> with
> >>>>>>>>>> this),
> >>>>>>>>>> but understanding what we want to test will help us understand
> >>>>>>>>>> how the
> >>>>>>>>>> cluster will be used. I'm working on a proposed user guide for
> >>>>>>>>>> testing
> >>>>>>>>>>
> >>>>>>>>> IO
> >>>>>>>>
> >>>>>>>>> Transforms, and I'm going to send out a link to that + a short
> >>>>>>>>>> summary
> >>>>>>>>>>
> >>>>>>>>> to
> >>>>>>>>
> >>>>>>>>> the list shortly so folks can get a better sense of where I'm
> >>>>>>>>>> coming
> >>>>>>>>>> from.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Here's my thinking on the questions we've raised here -
> >>>>>>>>>>
> >>>>>>>>>> Embedded versions of data stores for testing
> >>>>>>>>>> --------------------
> >>>>>>>>>> Summary: yes! But we still need real data stores to test
> against.
> >>>>>>>>>>
> >>>>>>>>>> I am a gigantic fan of using embedded versions of the various
> data
> >>>>>>>>>> stores.
> >>>>>>>>>> I think we should test everything we possibly can using them,
> >>>>>>>>>> and do
> >>>>>>>>>>
> >>>>>>>>> the
> >>>>>>
> >>>>>>> majority of our correctness testing using embedded versions + the
> >>>>>>>>>>
> >>>>>>>>> direct
> >>>>>>
> >>>>>>> runner. However, it's also important to have at least one test that
> >>>>>>>>>> actually connects to an actual instance, so we can get coverage
> >>>>>>>>>> for
> >>>>>>>>>> things
> >>>>>>>>>> like credentials, real connection strings, etc...
> >>>>>>>>>>
> >>>>>>>>>> The key point is that embedded versions definitely can't cover
> the
> >>>>>>>>>> performance tests, so we need to host instances if we want to
> test
> >>>>>>>>>>
> >>>>>>>>> that.
> >>>>>>
> >>>>>>> I consider the integration tests/performance benchmarks to be
> >>>>>>>>>> costly
> >>>>>>>>>> things
> >>>>>>>>>> that we do only for the IO transforms with large amounts of
> >>>>>>>>>> community
> >>>>>>>>>> support/usage. A random IO transform used by a few users doesn't
> >>>>>>>>>> necessarily need integration & perf tests, but for heavily used
> IO
> >>>>>>>>>> transforms, there's a lot of community value in these tests. The
> >>>>>>>>>> maintenance proposal below scales with the amount of community
> >>>>>>>>>> support
> >>>>>>>>>> for
> >>>>>>>>>> a particular IO transform.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Reusing data stores ("use the data stores across executions.")
> >>>>>>>>>> ------------------
> >>>>>>>>>> Summary: I favor a hybrid approach: some frequently used, very
> >>>>>>>>>> small
> >>>>>>>>>> instances that we keep up all the time + larger multi-container
> >>>>>>>>>> data
> >>>>>>>>>> store
> >>>>>>>>>> instances that we spin up for perf tests.
> >>>>>>>>>>
> >>>>>>>>>> I don't think we need to have a strong answer to this question,
> >>>>>>>>>> but I
> >>>>>>>>>> think
> >>>>>>>>>> we do need to know what range of capabilities we need, and use
> >>>>>>>>>> that to
> >>>>>>>>>> inform our requirements on the hosting infrastructure. I think
> >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios I
> discuss
> >>>>>>>>>>
> >>>>>>>>> below.
> >>>>>>
> >>>>>>> I had been thinking of a hybrid approach - reuse some instances and
> >>>>>>>>>>
> >>>>>>>>> don't
> >>>>>>>>
> >>>>>>>>> reuse others. Some tests require isolation from other tests (eg.
> >>>>>>>>>> performance benchmarking), while others can easily re-use the
> same
> >>>>>>>>>> database/data store instance over time, provided they are
> >>>>>>>>>> written in
> >>>>>>>>>>
> >>>>>>>>> the
> >>>>>>
> >>>>>>> correct manner (eg. a simple read or write correctness integration
> >>>>>>>>>>
> >>>>>>>>> tests)
> >>>>>>>>
> >>>>>>>>> To me, the question of whether to use one instance over time for
> a
> >>>>>>>>>> test vs
> >>>>>>>>>> spin up an instance for each test comes down to a trade off
> >>>>>>>>>> between
> >>>>>>>>>>
> >>>>>>>>> these
> >>>>>>>>
> >>>>>>>>> factors:
> >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky,
> >>>>>>>>>> we'll
> >>>>>>>>>> want to
> >>>>>>>>>> keep more instances up and running rather than bring them
> up/down.
> >>>>>>>>>>
> >>>>>>>>> (this
> >>>>>>
> >>>>>>> may also vary by the data store in question)
> >>>>>>>>>> 2. Frequency of testing - if we are running tests every 5
> >>>>>>>>>> minutes, it
> >>>>>>>>>>
> >>>>>>>>> may
> >>>>>>>>
> >>>>>>>>> be wasteful to bring machines up/down every time. If we run
> >>>>>>>>>> tests once
> >>>>>>>>>>
> >>>>>>>>> a
> >>>>>>
> >>>>>>> day or week, it seems wasteful to keep the machines up the whole
> >>>>>>>>>> time.
> >>>>>>>>>> 3. Isolation requirements - If tests must be isolated, it means
> we
> >>>>>>>>>>
> >>>>>>>>> either
> >>>>>>>>
> >>>>>>>>> have to bring up the instances for each test, or we have to have
> >>>>>>>>>> some
> >>>>>>>>>> sort
> >>>>>>>>>> of signaling mechanism to indicate that a given instance is in
> >>>>>>>>>> use. I
> >>>>>>>>>> strongly favor bringing up an instance per test.
> >>>>>>>>>> 4. Number/size of containers - if we need a large number of
> >>>>>>>>>> machines
> >>>>>>>>>> for a
> >>>>>>>>>> particular test, keeping them running all the time will use more
> >>>>>>>>>> resources.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The major unknown to me is how flaky it'll be to spin these up.
> >>>>>>>>>> I'm
> >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up, but I
> >>>>>>>>>> think the
> >>>>>>>>>> best
> >>>>>>>>>> way to test that is to start doing it.
> >>>>>>>>>>
> >>>>>>>>>> I suspect the sweet spot is the following: have a set of very
> >>>>>>>>>> small
> >>>>>>>>>>
> >>>>>>>>> data
> >>>>>>
> >>>>>>> store instances that stay up to support small-data-size post-commit
> >>>>>>>>>> end to
> >>>>>>>>>> end tests (post-commits run frequently and the data size means
> the
> >>>>>>>>>> instances would not use many resources), combined with the
> >>>>>>>>>> ability to
> >>>>>>>>>> spin
> >>>>>>>>>> up larger instances for once a day/week performance benchmarks
> >>>>>>>>>> (these
> >>>>>>>>>>
> >>>>>>>>> use
> >>>>>>>>
> >>>>>>>>> up more resources and are used less frequently.) That's the mix
> >>>>>>>>>> I'll
> >>>>>>>>>> propose in my docs on testing IO transforms.  If spinning up new
> >>>>>>>>>> instances
> >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of spinning up
> >>>>>>>>>> instances
> >>>>>>>>>> for
> >>>>>>>>>> each test.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Management ("what's the overhead of managing such a deployment")
> >>>>>>>>>> --------------------
> >>>>>>>>>> Summary: I propose that anyone can contribute scripts for
> >>>>>>>>>> setting up
> >>>>>>>>>>
> >>>>>>>>> data
> >>>>>>>>
> >>>>>>>>> store instances + integration/perf tests, but if the community
> >>>>>>>>>> doesn't
> >>>>>>>>>> maintain a particular data store's tests, we disable the tests
> and
> >>>>>>>>>> turn off
> >>>>>>>>>> the data store instances.
> >>>>>>>>>>
> >>>>>>>>>> Management of these instances is a crucial question. First,
> let's
> >>>>>>>>>>
> >>>>>>>>> break
> >>>>>
> >>>>>> down what tasks we'll need to do on a recurring basis:
> >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance &
> >>>>>>>>>> dependencies) - we don't want to have a lot of old versions
> >>>>>>>>>> vulnerable
> >>>>>>>>>>
> >>>>>>>>> to
> >>>>>>>>
> >>>>>>>>> attacks/buggy
> >>>>>>>>>> 2. Investigate breakages/regressions
> >>>>>>>>>> (I'm betting there will be more things we'll discover - let me
> >>>>>>>>>> know if
> >>>>>>>>>> you
> >>>>>>>>>> have suggestions)
> >>>>>>>>>>
> >>>>>>>>>> There's a couple goals I see:
> >>>>>>>>>> 1. We should only do sys admin work for things that give us a
> >>>>>>>>>> lot of
> >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up scripts for
> >>>>>>>>>> data
> >>>>>>>>>> stores
> >>>>>>>>>> without a large community)
> >>>>>>>>>> 2. We should do as much as possible of testing via
> >>>>>>>>>> in-memory/embedded
> >>>>>>>>>> testing (as you brought up).
> >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> >>>>>>>>>>
> >>>>>>>>>> As I discussed above, I think that integration tests/performance
> >>>>>>>>>> benchmarks
> >>>>>>>>>> are costly things that we should do only for the IO transforms
> >>>>>>>>>> with
> >>>>>>>>>>
> >>>>>>>>> large
> >>>>>>>>
> >>>>>>>>> amounts of community support/usage. Thus, I propose that we
> >>>>>>>>>> limit the
> >>>>>>>>>>
> >>>>>>>>> IO
> >>>>>>
> >>>>>>> transforms that get integration tests & performance benchmarks to
> >>>>>>>>>>
> >>>>>>>>> those
> >>>>>
> >>>>>> that have community support for maintaining the data store
> >>>>>>>>>> instances.
> >>>>>>>>>>
> >>>>>>>>>> We can enforce this organically using some simple rules:
> >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> >>>>>>>>>> integration/perf
> >>>>>>>>>>
> >>>>>>>>> test
> >>>>>>
> >>>>>>> starts failing and no one investigates it within a set period of
> >>>>>>>>>> time
> >>>>>>>>>>
> >>>>>>>>> (a
> >>>>>>
> >>>>>>> week?), we disable the tests and shut off the data store
> >>>>>>>>>> instances if
> >>>>>>>>>>
> >>>>>>>>> we
> >>>>>>
> >>>>>>> have instances running. When someone wants to step up and
> >>>>>>>>>> support it
> >>>>>>>>>> again,
> >>>>>>>>>> they can fix the test, check it in, and re-enable the test.
> >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that
> >>>>>>>>>> is just
> >>>>>>>>>> "is
> >>>>>>>>>> the IO Transform X data store up to date?" - if the jira is not
> >>>>>>>>>> resolved in
> >>>>>>>>>> a set period of time (1 month?), the perf/integration tests are
> >>>>>>>>>>
> >>>>>>>>> disabled,
> >>>>>>>>
> >>>>>>>>> and the data store instances shut off.
> >>>>>>>>>>
> >>>>>>>>>> This is pretty flexible -
> >>>>>>>>>> * If a particular person or organization wants to support an IO
> >>>>>>>>>> transform,
> >>>>>>>>>> they can. If a group of people all organically organize to keep
> >>>>>>>>>> the
> >>>>>>>>>>
> >>>>>>>>> tests
> >>>>>>>>
> >>>>>>>>> running, they can.
> >>>>>>>>>> * It can be mostly automated - there's not a lot of central
> >>>>>>>>>> organizing
> >>>>>>>>>> work
> >>>>>>>>>> that needs to be done.
> >>>>>>>>>>
> >>>>>>>>>> Exposing the information about what IO transforms currently have
> >>>>>>>>>>
> >>>>>>>>> running
> >>>>>>
> >>>>>>> IT/perf benchmarks on the website will let users know what IO
> >>>>>>>>>>
> >>>>>>>>> transforms
> >>>>>>
> >>>>>>> are well supported.
> >>>>>>>>>>
> >>>>>>>>>> I like this solution, but I also recognize this is a tricky
> >>>>>>>>>> problem.
> >>>>>>>>>>
> >>>>>>>>> This
> >>>>>>>>
> >>>>>>>>> is something the community needs to be supportive of, so I'm
> >>>>>>>>>> open to
> >>>>>>>>>> other
> >>>>>>>>>> thoughts.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Simulating failures in real nodes ("programmatic tests to
> simulate
> >>>>>>>>>> failure")
> >>>>>>>>>> -----------------
> >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We should
> >>>>>>>>>> encourage a
> >>>>>>>>>> design pattern separating out network/retry logic from the main
> IO
> >>>>>>>>>> transform logic
> >>>>>>>>>>
> >>>>>>>>>> We *could* create instance failure in any container management
> >>>>>>>>>>
> >>>>>>>>> software
> >>>>>
> >>>>>> -
> >>>>>>>>
> >>>>>>>>> we can use their programmatic APIs to determine which containers
> >>>>>>>>>> are
> >>>>>>>>>> running the instances, and ask them to kill the container in
> >>>>>>>>>> question.
> >>>>>>>>>>
> >>>>>>>>> A
> >>>>>>
> >>>>>>> slow node would be trickier, but I'm sure we could figure it out
> >>>>>>>>>> - for
> >>>>>>>>>> example, add a network proxy that would delay responses.
> >>>>>>>>>>
> >>>>>>>>>> However, I would argue that this type of testing doesn't gain
> us a
> >>>>>>>>>> lot, and
> >>>>>>>>>> is complicated to set up. I think it will be easier to test
> >>>>>>>>>> network
> >>>>>>>>>> errors
> >>>>>>>>>> and retry behavior in unit tests for the IO transforms.
> >>>>>>>>>>
> >>>>>>>>>> Part of the way to handle this is to separate out the read code
> >>>>>>>>>> from
> >>>>>>>>>>
> >>>>>>>>> the
> >>>>>>
> >>>>>>> network code (eg. bigtable has BigtableService). If you put the
> >>>>>>>>>>
> >>>>>>>>> "handle
> >>>>>
> >>>>>> errors/retry logic" code in a separate MySourceService class,
> >>>>>>>>>> you can
> >>>>>>>>>> test
> >>>>>>>>>> MySourceService on the wide variety of networks errors/data
> store
> >>>>>>>>>> problems,
> >>>>>>>>>> and then your main IO transform tests focus on the read behavior
> >>>>>>>>>> and
> >>>>>>>>>> handling the small set of errors the MySourceService class will
> >>>>>>>>>>
> >>>>>>>>> return.
> >>>>>
> >>>>>> I also think we should focus on testing the IO Transform, not
> >>>>>>>>>> the data
> >>>>>>>>>> store - if we kill a node in a data store, it's that data
> store's
> >>>>>>>>>> problem,
> >>>>>>>>>> not beam's problem. As you were pointing out, there are a
> *large*
> >>>>>>>>>> number of
> >>>>>>>>>> possible ways that a particular data store can fail, and we
> >>>>>>>>>> would like
> >>>>>>>>>>
> >>>>>>>>> to
> >>>>>>>>
> >>>>>>>>> support many different data stores. Rather than try to test that
> >>>>>>>>>> each
> >>>>>>>>>> data
> >>>>>>>>>> store behaves well, we should ensure that we handle
> >>>>>>>>>> generic/expected
> >>>>>>>>>> errors
> >>>>>>>>>> in a graceful manner.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Ismaël had a couple other quick comments/questions, I'll answer
> >>>>>>>>>> here
> >>>>>>>>>>
> >>>>>>>>> -
> >>>>>
> >>>>>> We can use this to test other runners running on multiple
> >>>>>>>>>> machines - I
> >>>>>>>>>> agree. This is also necessary for a good performance benchmark
> >>>>>>>>>> test.
> >>>>>>>>>>
> >>>>>>>>>> "providing the test machines to mount the cluster" - we can
> >>>>>>>>>> discuss
> >>>>>>>>>>
> >>>>>>>>> this
> >>>>>>
> >>>>>>> further, but one possible option is that google may be willing to
> >>>>>>>>>>
> >>>>>>>>> donate
> >>>>>>
> >>>>>>> something to support this.
> >>>>>>>>>>
> >>>>>>>>>> "IO Consistency" - let's follow up on those questions in another
> >>>>>>>>>>
> >>>>>>>>> thread.
> >>>>>>
> >>>>>>> That's as much about the public interface we provide to users as
> >>>>>>>>>>
> >>>>>>>>> anything
> >>>>>>>>
> >>>>>>>>> else. I agree with your sentiment that a user should be able to
> >>>>>>>>>> expect
> >>>>>>>>>> predictable behavior from the different IO transforms.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for everyone's questions/comments - I really am excited
> >>>>>>>>>> to see
> >>>>>>>>>> that
> >>>>>>>>>> people care about this :)
> >>>>>>>>>>
> >>>>>>>>>> Stephen
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <iemejia@gmail.com
> >
> >>>>>>>>>>
> >>>>>>>>> wrote:
> >>>>>
> >>>>>> ​Hello,
> >>>>>>>>>>>
> >>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting, I
> >>>>>>>>>>> would
> >>>>>>>>>>> really
> >>>>>>>>>>> like to help with this. I have never played with Kubernetes but
> >>>>>>>>>>> this
> >>>>>>>>>>> seems
> >>>>>>>>>>> a really nice chance to do something useful with it.
> >>>>>>>>>>>
> >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
> container
> >>>>>>>>>>>
> >>>>>>>>>> images
> >>>>>>>>
> >>>>>>>>> and in some particular cases ‘clusters’ of containers using
> >>>>>>>>>>> docker-compose
> >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be really
> >>>>>>>>>>> nice to
> >>>>>>>>>>>
> >>>>>>>>>> have
> >>>>>>>>
> >>>>>>>>> this at the Beam level, in particular to try to test more complex
> >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is to
> achieve
> >>>>>>>>>>> this for
> >>>>>>>>>>> example:
> >>>>>>>>>>>
> >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka nodes, I
> >>>>>>>>>>> would
> >>>>>>>>>>> like to
> >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill a node),
> >>>>>>>>>>> or
> >>>>>>>>>>> simulate
> >>>>>>>>>>> a really slow node, to ensure that the IO behaves as expected
> >>>>>>>>>>> in the
> >>>>>>>>>>> Beam
> >>>>>>>>>>> pipeline for the given runner.
> >>>>>>>>>>>
> >>>>>>>>>>> Another related idea is to improve IO consistency: Today the
> >>>>>>>>>>> different IOs
> >>>>>>>>>>> have small differences in their failure behavior, I really
> >>>>>>>>>>> would like
> >>>>>>>>>>> to be
> >>>>>>>>>>> able to predict with more precision what will happen in case of
> >>>>>>>>>>>
> >>>>>>>>>> errors,
> >>>>>>
> >>>>>>> e.g. what is the correct behavior if I am writing to a Kafka
> >>>>>>>>>>> node and
> >>>>>>>>>>> there
> >>>>>>>>>>> is a network partition, does the Kafka sink retry or not? and
> >>>>>>>>>>> what
> >>>>>>>>>>> if it
> >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming
> >>>>>>>>>>> checkpointing?
> >>>>>>>>>>> Or do
> >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am not sure
> >>>>>>>>>>> about
> >>>>>>>>>>> what
> >>>>>>>>>>> happens (or if the expected behavior depends on the runner),
> >>>>>>>>>>> but well
> >>>>>>>>>>> maybe
> >>>>>>>>>>> it is just that I don’t know and we have tests to ensure this.
> >>>>>>>>>>>
> >>>>>>>>>>> Of course both are really hard problems, but I think with your
> >>>>>>>>>>> proposal we
> >>>>>>>>>>> can try to tackle them, as well as the performance ones. And
> >>>>>>>>>>> apart of
> >>>>>>>>>>> the
> >>>>>>>>>>> data stores, I think it will be also really nice to be able to
> >>>>>>>>>>> test
> >>>>>>>>>>>
> >>>>>>>>>> the
> >>>>>>
> >>>>>>> runners in a distributed manner.
> >>>>>>>>>>>
> >>>>>>>>>>> So what is the next step? How do you imagine such integration
> >>>>>>>>>>> tests?
> >>>>>>>>>>> ? Who
> >>>>>>>>>>> can provide the test machines so we can mount the cluster?
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial setup, but
> >>>>>>>>>>> it
> >>>>>>>>>>> will be
> >>>>>>>>>>> really nice to start working on this.
> >>>>>>>>>>>
> >>>>>>>>>>> Ismaël
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <
> >>>>>>>>>>> amitsela33@gmail.com
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Stephen,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I was wondering about how we plan to use the data stores
> across
> >>>>>>>>>>>>
> >>>>>>>>>>> executions.
> >>>>>>>>>>>
> >>>>>>>>>>>> Clearly, it's best to set up a new instance (container) for every
> every
> >>>>>>>>>>>>
> >>>>>>>>>>> test,
> >>>>>>
> >>>>>>> running a "standalone" store (say HBase/Cassandra for
> >>>>>>>>>>>> example), and
> >>>>>>>>>>>> once
> >>>>>>>>>>>> the test is done, teardown the instance. It should also be
> >>>>>>>>>>>> agnostic
> >>>>>>>>>>>>
> >>>>>>>>>>> to
> >>>>>>
> >>>>>>> the
> >>>>>>>>>>>
> >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes).
> >>>>>>>>>>>> I'm wondering though what's the overhead of managing such a
> >>>>>>>>>>>>
> >>>>>>>>>>> deployment
> >>>>>>
> >>>>>>> which could become heavy and complicated as more IOs are
> >>>>>>>>>>>> supported
> >>>>>>>>>>>>
> >>>>>>>>>>> and
> >>>>>>
> >>>>>>> more
> >>>>>>>>>>>
> >>>>>>>>>>>> test cases introduced.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another way to go would be to have small clusters of different
> >>>>>>>>>>>> data
> >>>>>>>>>>>>
> >>>>>>>>>>> stores
> >>>>>>>>>>>
> >>>>>>>>>>>> and run against new "namespaces" (while lazily evicting old
> >>>>>>>>>>>> ones),
> >>>>>>>>>>>> but I
> >>>>>>>>>>>> think this is less likely as maintaining a distributed
> instance
> >>>>>>>>>>>>
> >>>>>>>>>>> (even
> >>>>>
> >>>>>> a
> >>>>>>>>
> >>>>>>>>> small one) for each data store sounds even more complex.
> >>>>>>>>>>>>
> >>>>>>>>>>>> A third approach would be to to simply have an "embedded"
> >>>>>>>>>>>> in-memory
> >>>>>>>>>>>> instance of a data store as part of a test that runs against
> it
> >>>>>>>>>>>> (such as
> >>>>>>>>>>>>
> >>>>>>>>>>> an
> >>>>>>>>>>>
> >>>>>>>>>>>> embedded Kafka, though not a data store).
> >>>>>>>>>>>> This is probably the simplest solution in terms of
> >>>>>>>>>>>> orchestration,
> >>>>>>>>>>>> but it
> >>>>>>>>>>>> depends on having a proper "embedded" implementation for an
> IO.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Does this make sense to you ? have you considered it ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Amit
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <
> >>>>>>>>>>>>
> >>>>>>>>>>> jb@nanthrax.net
> >>>>>
> >>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Stephen,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> as already discussed a bit together, it sounds great !
> >>>>>>>>>>>>> Especially I
> >>>>>>>>>>>>>
> >>>>>>>>>>>> like
> >>>>>>>>>>>
> >>>>>>>>>>>> it as a both integration test platform and good coverage for
> >>>>>>>>>>>>> IOs.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm very late on this but, as said, I will share with you my
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Marathon
> >>>>>>
> >>>>>>> JSON and Mesos docker images.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> By the way, I started to experiment a bit with kubernetes and
> >>>>>>>>>>>>> swarm but
> >>>>>>>>>>>>> it's
> >>>>>>>>>>>>> not yet complete. I will share what I have on the same github
> >>>>>>>>>>>>> repo.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks !
> >>>>>>>>>>>>> Regards
> >>>>>>>>>>>>> JB
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi everyone!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently we have a good set of unit tests for our IO
> >>>>>>>>>>>>>> Transforms -
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> those
> >>>>>>>>>>>>
> >>>>>>>>>>>>> tend to run against in-memory versions of the data stores.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> However,
> >>>>>
> >>>>>> we'd
> >>>>>>>>>>>>
> >>>>>>>>>>>>> like to further increase our test coverage to include
> >>>>>>>>>>>>>> running them
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> against
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> real instances of the data stores that the IO Transforms
> work
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> against
> >>>>>>>>
> >>>>>>>>> (e.g.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc…), which means we'll need to
> >>>>>>>>>>>>>> have
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> real
> >>>>>>>>
> >>>>>>>>> instances of various data stores.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Additionally, if we want to do performance regression
> >>>>>>>>>>>>>> detection,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> it's
> >>>>>>>>
> >>>>>>>>> important to have instances of the services that behave
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> realistically,
> >>>>>>>>>>>
> >>>>>>>>>>>> which isn't true of in-memory or dev versions of the services.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Proposed solution
> >>>>>>>>>>>>>> -------------------------
> >>>>>>>>>>>>>> If we accept this proposal, we would create an
> >>>>>>>>>>>>>> infrastructure for
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> running
> >>>>>>>>>>>>
> >>>>>>>>>>>>> real instances of data stores inside of containers, using
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> container
> >>>>>
> >>>>>> management software like mesos/marathon, kubernetes, docker
> >>>>>>>>>>>>>> swarm,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> etc…
> >>>>>>>>>>>
> >>>>>>>>>>>> to
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> manage the instances.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This would enable us to build integration tests that run
> >>>>>>>>>>>>>> against
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> those
> >>>>>>>>>>>
> >>>>>>>>>>>> real
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> instances and performance tests that run against those real
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> instances
> >>>>>>>>
> >>>>>>>>> (like
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Why do we need one centralized set of instances vs just
> having
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> various
> >>>>>>>>>>>
> >>>>>>>>>>>> people host their own instances?
> >>>>>>>>>>>>>> -------------------------
> >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having
> dependencies
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> from
> >>>>>
> >>>>>> the
> >>>>>>>>>>>
> >>>>>>>>>>>> core project on external services/instances of data stores
> >>>>>>>>>>>>>> we have
> >>>>>>>>>>>>>> guaranteed access to the services and the group can fix
> issues
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> that
> >>>>>
> >>>>>> arise.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> An exception would be something that has an ops team
> >>>>>>>>>>>>>> supporting it
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> (eg,
> >>>>>>>>>>>
> >>>>>>>>>>>> AWS, Google Cloud or other professionally managed service) -
> >>>>>>>>>>>>>> those
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> we
> >>>>>>>>
> >>>>>>>>> trust
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> will be stable.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> There may be a lot of different data stores needed - how
> >>>>>>>>>>>>>> will we
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> maintain
> >>>>>>>>>>>>
> >>>>>>>>>>>>> them?
> >>>>>>>>>>>>>> -------------------------
> >>>>>>>>>>>>>> It will take work above and beyond that of a normal set of
> >>>>>>>>>>>>>> unit
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> tests
> >>>>>>>>
> >>>>>>>>> to
> >>>>>>>>>>>>
> >>>>>>>>>>>>> build and maintain integration/performance tests & their data
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> store
> >>>>>
> >>>>>> instances.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Setup & maintenance of the data store containers and data
> >>>>>>>>>>>>>> store
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> instances
> >>>>>>>>>>>>
> >>>>>>>>>>>>> on it must be automated. It also has to be as simple of a
> >>>>>>>>>>>>>> setup as
> >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the containers -
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> expecting
> >>>>>>>>>>>
> >>>>>>>>>>>> checked in scripts/dockerfiles is key.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Aligned with the community ownership approach of Apache, as
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> members
> >>>>>
> >>>>>> of
> >>>>>>>>>>>
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> community are excited to contribute & maintain those tests
> >>>>>>>>>>>>>> and the
> >>>>>>>>>>>>>> integration/performance tests, people will be able to step
> >>>>>>>>>>>>>> up and
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> do
> >>>>>>
> >>>>>>> that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> If there is no longer support for maintaining a particular
> >>>>>>>>>>>>>> set of
> >>>>>>>>>>>>>> integration & performance tests and their data store
> >>>>>>>>>>>>>> instances,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> then
> >>>>>>
> >>>>>>> we
> >>>>>>>>>>>
> >>>>>>>>>>>> can
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> disable those tests. We may document on the website what IO
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Transforms
> >>>>>>>>>>>
> >>>>>>>>>>>> have
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> current integration/performance tests so users know what
> >>>>>>>>>>>>>> level of
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> testing
> >>>>>>>>>>>>
> >>>>>>>>>>>>> the various IO Transforms have.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What about requirements for the container management
> software
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> itself?
> >>>>>>>>
> >>>>>>>>> -------------------------
> >>>>>>>>>>>>>> * We should have the data store instances themselves in
> >>>>>>>>>>>>>> Docker.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Docker
> >>>>>>>>>>>
> >>>>>>>>>>>> allows new instances to be spun up in a quick, reproducible
> way
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> and
> >>>>>
> >>>>>> is
> >>>>>>>>>>>
> >>>>>>>>>>>> fairly platform independent. It has wide support from a
> >>>>>>>>>>>>>> variety of
> >>>>>>>>>>>>>> different container management services.
> >>>>>>>>>>>>>> * As little admin work required as possible. Crashing
> >>>>>>>>>>>>>> instances
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> should
> >>>>>>>>>>>
> >>>>>>>>>>>> be
> >>>>>>>>>>>>
> >>>>>>>>>>>>> restarted, setup should be simple, everything possible
> >>>>>>>>>>>>>> should be
> >>>>>>>>>>>>>> scripted/scriptable.
> >>>>>>>>>>>>>> * Logs and test output should be on a publicly available
> >>>>>>>>>>>>>> website,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> without
> >>>>>>>>>>>>
> >>>>>>>>>>>>> needing to log into test execution machine. Centralized
> >>>>>>>>>>>>>> capture of
> >>>>>>>>>>>>>> monitoring info/logs from instances running in the
> containers
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> would
> >>>>>
> >>>>>> support
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> this. Ideally, this would just be supported by the container
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> software
> >>>>>>>>
> >>>>>>>>> out
> >>>>>>>>>>>>
> >>>>>>>>>>>>> of the box.
> >>>>>>>>>>>>>> * It'd be useful to have good persistent volume in the
> >>>>>>>>>>>>>> container
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> management
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> software so that databases don't have to reload large data
> >>>>>>>>>>>>>> sets
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> every
> >>>>>>>>
> >>>>>>>>> time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> * The containers may be a place to execute runners
> >>>>>>>>>>>>>> themselves if
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> we
> >>>>>
> >>>>>> need
> >>>>>>>>>>>>
> >>>>>>>>>>>>> larger runner instances, so it should play well with Spark,
> >>>>>>>>>>>>>> Flink,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> etc…
> >>>>>>>>>>>
> >>>>>>>>>>>> As I discussed earlier on the mailing list, it looks like
> >>>>>>>>>>>>>> hosting
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> docker
> >>>>>>>>>>>>
> >>>>>>>>>>>>> containers on kubernetes, docker swarm or mesos+marathon
> >>>>>>>>>>>>>> would be
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> a
> >>>>>
> >>>>>> good
> >>>>>>>>>>>>
> >>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Stephen Sisk
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>> Jean-Baptiste Onofré
> >>>>>>>>>>>>> jbonofre@apache.org
> >>>>>>>>>>>>> http://blog.nanthrax.net
> >>>>>>>>>>>>> Talend - http://www.talend.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>> Jean-Baptiste Onofré
> >>>>>>>> jbonofre@apache.org
> >>>>>>>> http://blog.nanthrax.net
> >>>>>>>> Talend - http://www.talend.com
> >>>>>>>>
> >>>>>>>>
> >>>
> >
> >
>

Re: Hosting data stores for IO Transform testing

Posted by Stephen Sisk <si...@google.com.INVALID>.
hey folks! I wanted to gather any last thoughts that people might have. I'd
like to get started setting this up - anyone else have input?

S

On Thu, Jan 19, 2017 at 11:41 AM Stephen Sisk <si...@google.com> wrote:

> Glad to hear you support kubernetes (although to be clear, I'm rooting for
> the right solution for us in the long run - if anyone has a strong reason
> for dcos, I'm excited to hear it.)
>
> I agree with you that testing IO in failure scenarios seems like a
> fruitful area for future work, but that I don't want to tackle it just yet
> (and I'm not hearing that we think it affects our current decision - if
> someone does, I'd like to hear about it.) I am going to split off a thread
> for that discussion because I think the discussion informs how we write our
> unit tests currently, and want to clarify it.
>
> On Wed, Jan 18, 2017 at 1:42 PM Ismaël Mejía <ie...@gmail.com> wrote:
>
> ​Hello again,
>
> Stephen, I agree with you the real question is what is the scope of the
> tests, maybe the discussion so far has been more about testing a ‘real’
> data store and finding infra/performance issues (and future regressions),
> but having a modern cluster manager opens the door to create more
> interesting integration tests like the ones I mentioned, in particular my
> idea is more oriented towards the validation of the ‘correct’ expected
> behavior of the IOs and runners. But this is quite ambitious for a first
> goal, maybe we should first get things working and let this for later (if
> there is still interest).
>
> I am not sure that unit tests are enough to test distribution issues
> because they are harder to simulate in particular if we add the fact that
> we can have too many moving pieces. For example, imagine that we run a Beam
> pipeline deployed via Spark on a YARN cluster (where some nodes can fail)
> that reads from Kafka (with some slow partition) and writes to Cassandra
> (with a partition that goes down). You see, this is a quite complex
> combination of pieces (and possible issues), but it is not a totally
> artificial scenario, in fact this is a common architecture, and this can
> (at least in theory) be simulated with a cluster manager, but I don’t see
> how can I easily reproduce this with a unit test.
>
> Anyway, this scenario makes me think that the boundaries of what we want to
> test are really important. Complexity can be huge.
>
> About the Mesos package question, effectively I referred to Mesos Universe
> (the repo you linked), and what you said is sadly true, it is not easy to
> find multi-node instance packages that are the most interesting ones for
> our tests (in both k8s or mesos). I agree with your decision of using
> Kubernetes, I just wanted to mention that in some cases we will need to
> produce these multi-node packages to have interesting tests.
>
> Ismaël
>
>
> On Wed, Jan 18, 2017 at 10:09 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > Yes, for both DCOS (Mesos+Marathon) and Kubernetes, I think we may find
> > single node config but not sure for multi-node setup. Anyway, even if we
> > find a multi-node configuration, I'm not sure it would cover our needs.
> >
> > Regards
> > JB
> >
> > On 01/18/2017 12:52 PM, Stephen Sisk wrote:
> >
> >> ah! I looked around a bit more and found the dcos package repo -
> >> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
> >>
> >> poking around a bit, I can find a lot of packages for single node
> >> instances, but not many packages for multi-node instances. Single node
> >> instance packages are kind of useful, but I don't think they're *too*
> >> helpful.
> >> The multi-node instance packages that run the data store's high
> >> availability mode are where the real work is, and it seems like both
> >> kubernetes helm and dcos' package universe don't have a lot of those.
> >>
> >> S
> >>
> >> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <si...@google.com> wrote:
> >>
> >> Hi Ismaël,
> >>>
> >>> these are good questions, thanks for raising them.
> >>>
> >>> Ability to modify network/compute resources to simulate failures
> >>> =================================================
> >>> I see two real questions here:
> >>> 1. Is this something we want to do?
> >>> 2. Is it possible with both/either?
> >>>
> >>> So far, the test strategy I've been advocating is that we test problems
> >>> like this in unit tests rather than in ITs/Perf tests. Otherwise,
> >>> it's hard to re-create the same conditions.
> >>>
> >>> I can investigate whether it's possible, but I want to clarify whether
> >>> this is something that we care about. I know both support killing
> >>> individual nodes. I haven't seen a lot of network control in either, but
> >>> haven't tried to look for it.
> >>>
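> >>> For what it's worth, killing one pod from a test harness looks simple on
> >>> the kubernetes side. A sketch, assuming the fabric8 kubernetes-client
> >>> (the pod name is illustrative, and I haven't verified this end to end):
> >>>
> >>> try (KubernetesClient client = new DefaultKubernetesClient()) {
> >>>   // kill one member of the data store cluster mid-test
> >>>   client.pods().inNamespace("default").withName("cassandra-0").delete();
> >>> }
> >>>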
> >>> Availability of ready to play packages
> >>> ============================
> >>> I did look at this, and as far as I could tell, mesos didn't have any
> >>> pre-built packages for multi-node clusters of data stores. If there's a
> >>> good repository of them that we trust, that would definitely save us
> >>> time.
> >>> Can you point me at the mesos repository?
> >>>
> >>> S
> >>>
> >>>
> >>>
> >>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> >>> wrote:
> >>>
> >>> Hi Ismaël
> >>>
> >>> Stephen will reply with details but I know he did a comparison and
> >>> evaluate different options.
> >>>
> >>> He tested with the jdbc Io itests.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <ie...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thanks for your analysis Stephen, good arguments / references.
> >>>>
> >>>> One quick question. Have you checked the APIs of both
> >>>> (Mesos/Kubernetes) to see if we can programmatically do more complex
> >>>> tests (I suppose so, but you don't mention how easy or if those are
> >>>> possible), for example to simulate a slow networking slave (to test
> >>>> stragglers), or to arbitrarily kill one slave (e.g. if I want to test
> >>>> the correct behavior of a runner/IO that is reading from it)?
> >>>>
> >>>> Another missing point in the review is the availability of ready-to-play
> >>>> packages. I think in this area mesos/dcos seems more advanced, no? I
> >>>> haven't looked recently, but at least 6 months ago there were not many
> >>>> helm packages ready, for example to test kafka or the hadoop ecosystem
> >>>> stuff (hdfs, hbase, etc). Has this been improved? Because preparing
> >>>> these is also a considerable amount of work; on the other hand, this
> >>>> could also be a chance to contribute to kubernetes.
> >>>>
> >>>> Regards,
> >>>> Ismaël
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <sisk@google.com.invalid
> >
> >>>> wrote:
> >>>>
> >>>> hi!
> >>>>>
> >>>>> I've been continuing this investigation, and have some more info to
> >>>>>
> >>>> report,
> >>>>
> >>>>> and hopefully we can start making some decisions.
> >>>>>
> >>>>> To support performance testing, I've been investigating
> >>>>>
> >>>> mesos+marathon and
> >>>>
> >>>>> kubernetes for running data stores in their high availability mode. I
> >>>>>
> >>>> have
> >>>>
> >>>>> been examining features that kubernetes/mesos+marathon use to support
> >>>>>
> >>>> this.
> >>>>
> >>>>>
> >>>>> Setting up a multi-node cluster in a high availability mode tends to
> >>>>>
> >>>> be
> >>>>
> >>>>> more expensive time-wise than the single node instances I've played
> >>>>>
> >>>> around
> >>>>
> >>>>> with in the past. Rather than do a full build out with both
> >>>>>
> >>>> kubernetes and
> >>>>
> >>>>> mesos, I'd like to pick one of the two options to build the prototype
> >>>>> cluster with. If the prototype doesn't go well, we could still go
> >>>>>
> >>>> back to
> >>>>
> >>>>> the other option, but I'd like to change us from a mode of "let's
> >>>>>
> >>>> look at
> >>>>
> >>>>> all the options" to one of "here's the favorite, let's prove that
> >>>>>
> >>>> works for
> >>>>
> >>>>> us".
> >>>>>
> >>>>> Below are the features that I've seen are important to multi-node
> >>>>>
> >>>> instances
> >>>>
> >>>>> of data stores. I'm sure other folks on the list have done this
> >>>>>
> >>>> before, so
> >>>>
> >>>>> feel free to pipe up if I'm missing a good solution to a problem.
> >>>>>
> >>>>> DNS/Discovery
> >>>>>
> >>>>> --------------------
> >>>>>
> >>>>> Necessary for talking between nodes (eg, cassandra nodes all need to
> >>>>>
> >>>> be
> >>>>
> >>>>> able to talk to a set of seed nodes.)
> >>>>>
> >>>>> * Kubernetes has built-in DNS/discovery between nodes.
> >>>>>
> >>>>> * Mesos supports this via mesos-dns, which isn't a part of core
> >>>>>
> >>>> mesos,
> >>>>
> >>>>> but is in dcos, which is the mesos distribution I've been using and
> >>>>>
> >>>> that I
> >>>>
> >>>>> would expect us to use.
> >>>>>
> >>>>> Instances properly distributed across nodes
> >>>>>
> >>>>> ------------------------------------------------------------
> >>>>>
> >>>>> If multiple instances of a data source end up on the same underlying
> >>>>>
> >>>> VM, we
> >>>>
> >>>>> may not get good performance out of those instances since the
> >>>>>
> >>>> underlying VM
> >>>>
> >>>>> may be more taxed than other VMs.
> >>>>>
> >>>>> * Kubernetes has a beta feature StatefulSets[1] which allow for
> >>>>>
> >>>> containers
> >>>>
> >>>>> distributed so that there's one container per underlying machine (as
> >>>>>
> >>>> well
> >>>>
> >>>>> as a lot of other useful features like easy stable dns names.)
> >>>>>
> >>>>> * Mesos can support this via the built in UNIQUE constraint [2]
> >>>>>
> >>>>> Load balancing
> >>>>>
> >>>>> --------------------
> >>>>>
> >>>>> Incoming requests from users need to be distributed to the various
> >>>>>
> >>>> machines
> >>>>
> >>>>> - this is important for many data stores' high availability modes.
> >>>>>
> >>>>> * Kubernetes supports easily hooking up to an external load balancer
> >>>>>
> >>>> when
> >>>>
> >>>>> on a cloud (and can be configured to work with a built-in load
> >>>>>
> >>>> balancer if
> >>>>
> >>>>> not)
> >>>>>
> >>>>> * Mesos supports this via marathon-lb [3], which is an install-able
> >>>>>
> >>>> package
> >>>>
> >>>>> in DC/OS
> >>>>>
> >>>>> Persistent Volumes tied to specific instances
> >>>>>
> >>>>> ------------------------------------------------------------
> >>>>>
> >>>>> Databases often need persistent state (for example to store the data
> >>>>>
> >>>> :), so
> >>>>
> >>>>> it's an important part of running our service.
> >>>>>
> >>>>> * Kubernetes StatefulSets supports this
> >>>>>
> >>>>> * Mesos+marathon apps with persistent volumes supports this [4] [5]
> >>>>>
> >>>>> As I mentioned above, I'd like to focus on either kubernetes or mesos
> >>>>>
> >>>> for
> >>>>
> >>>>> my investigation, and as I go further along, I'm seeing kubernetes as
> >>>>> better suited to our needs.
> >>>>>
> >>>>> (1) It supports more of the features we want out of the box and with
> >>>>> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
> >>>>> requires marathon-lb to be installed and mesos-dns to be configured.
> >>>>>
> >>>>> (2) I'm also finding that there seem to be more examples of using
> >>>>> kubernetes to solve the types of problems we're working on. This is
> >>>>> somewhat subjective, but in my experience as I've tried to learn both
> >>>>> kubernetes and mesos, I personally found it generally easier to get
> >>>>> kubernetes running than mesos due to the tutorials/examples available
> >>>>>
> >>>> for
> >>>>
> >>>>> kubernetes.
> >>>>>
> >>>>> (3) Lower cost of initial setup - as I discussed in a previous
> >>>>>
> >>>> mail[6],
> >>>>
> >>>>> kubernetes was far easier to get set up even when I knew the exact
> >>>>>
> >>>> steps.
> >>>>
> >>>>> Mesos took me around 27 steps [7], which involved a lot of config
> >>>>>
> >>>> that was
> >>>>
> >>>>> easy to get wrong (it took me about 5 tries to get all the steps
> >>>>>
> >>>> correct in
> >>>>
> >>>>> one go.) Kubernetes took me around 8 steps and very little config.
> >>>>>
> >>>>> Given that, I'd like to focus my investigation/prototyping on
> >>>>>
> >>>> Kubernetes.
> >>>>
> >>>>> To
> >>>>> be clear, it's fairly close and I think both Mesos and Kubernetes
> >>>>>
> >>>> could
> >>>>
> >>>>> support what we need, so if we run into issues with kubernetes, Mesos
> >>>>>
> >>>> still
> >>>>
> >>>>> seems like a viable option that we could fall back to.
> >>>>>
> >>>>> Thanks,
> >>>>> Stephen
> >>>>>
> >>>>>
> >>>>> [1] Kubernetes StatefulSets
> >>>>>
> >>>>>
> >>>> https://kubernetes.io/docs/concepts/abstractions/controllers
> >>> /statefulsets/
> >>>
> >>>>
> >>>>> [2] mesos unique constraint -
> >>>>> https://mesosphere.github.io/marathon/docs/constraints.html
> >>>>>
> >>>>> [3]
> >>>>> https://mesosphere.github.io/marathon/docs/service-
> >>>>> discovery-load-balancing.html
> >>>>>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
> >>>>>
> >>>>> [4]
> >>>>>
> >>>> https://mesosphere.github.io/marathon/docs/persistent-volumes.html
> >>>>
> >>>>>
> >>>>> [5]
> >>>>>
> >>>> https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
> >>>>
> >>>>>
> >>>>> [6] Container Orchestration software for hosting data stores
> >>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
> >>>>> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
> >>>>>
> >>>>> [7]
> https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
> >>>>>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>> Just a quick drive-by comment: how tests are laid out has
> >>>>>>
> >>>>> non-trivial
> >>>>
> >>>>> tradeoffs on how/where continuous integration runs, and how results
> >>>>>>
> >>>>> are
> >>>>
> >>>>> integrated into the tooling. The current state is certainly not
> >>>>>>
> >>>>> ideal
> >>>>
> >>>>> (e.g., due to multiple test executions some links in Jenkins point
> >>>>>>
> >>>>> where
> >>>>
> >>>>> they shouldn't), but most other alternatives had even bigger
> >>>>>>
> >>>>> drawbacks at
> >>>>
> >>>>> the time. If someone has great ideas that don't explode the number
> >>>>>>
> >>>>> of
> >>>>
> >>>>> modules, please share ;-)
> >>>>>>
> >>>>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
> >>>>>>
> >>>>> <ec...@gmail.com>
> >>>>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi Stephen,
> >>>>>>>
> >>>>>>> Thanks for taking the time to comment.
> >>>>>>>
> >>>>>>> My comments are below in the email:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 24/12/2016 at 00:07, Stephen Sisk wrote:
> >>>>>>>
> >>>>>>> hey Etienne -
> >>>>>>>>
> >>>>>>>> thanks for your thoughts and thanks for sharing your
> >>>>>>>>
> >>>>>>> experiences. I
> >>>>
> >>>>> generally agree with what you're saying. Quick comments below:
> >>>>>>>>
> >>>>>>>> IT are stored alongside with UT in src/test directory of the IO
> >>>>>>>>
> >>>>>>> but
> >>>>
> >>>>> they
> >>>>>
> >>>>>>
> >>>>>>>>> might go to dedicated module, waiting for a consensus
> >>>>>>>> I don't have a strong opinion or feel that I've worked enough
> >>>>>>>>
> >>>>>>> with
> >>>>
> >>>>> maven
> >>>>>
> >>>>>> to
> >>>>>>>> understand all the consequences - I'd love for someone with more
> >>>>>>>>
> >>>>>>> maven
> >>>>
> >>>>> experience to weigh in. If this becomes blocking, I'd say check
> >>>>>>>>
> >>>>>>> it in,
> >>>>
> >>>>> and
> >>>>>>
> >>>>>>> we can refactor later if it proves problematic.
> >>>>>>>>
> >>>>>>>> Sure, not a blocking point, it could be refactored afterwards.
> >>>>>>>
> >>>>>> Just as
> >>>>
> >>>>> a
> >>>>>
> >>>>>> reminder, JB mentioned that storing IT in separate module allows
> >>>>>>>
> >>>>>> to
> >>>>
> >>>>> have
> >>>>>
> >>>>>> more coherence between all IT (same behavior) and to do cross IO
> >>>>>>> integration tests. JB, have you experienced some long term
> >>>>>>>
> >>>>>> drawbacks of
> >>>>
> >>>>> storing IT in a separate module, like, for example, more
> >>>>>>>
> >>>>>> difficult
> >>>>
> >>>>> maintenance due to "distance" with production code?
> >>>>>>>
> >>>>>>>
> >>>>>>>   Also IMHO, it is better that tests load/clean data than doing
> >>>>>>>>
> >>>>>>> some
> >>>>
> >>>>>
> >>>>>>>>> assumptions about the running order of the tests.
> >>>>>>>> I definitely agree that we don't want to make assumptions about
> >>>>>>>>
> >>>>>>> the
> >>>>
> >>>>> running
> >>>>>>>> order of the tests - that way lies pain. :) It will be
> >>>>>>>>
> >>>>>>> interesting to
> >>>>
> >>>>> see
> >>>>>>
> >>>>>>> how the performance tests work out since they will need more
> >>>>>>>>
> >>>>>>> data (and
> >>>>
> >>>>> thus
> >>>>>>>> loading data can take much longer.)
> >>>>>>>>
> >>>>>>>> Yes, performance testing might push in the direction of data
> >>>>>>>
> >>>>>> loading
> >>>>
> >>>>> from
> >>>>>
> >>>>>> outside the tests due to loading time.
> >>>>>>>
> >>>>>>>   This should also be an easier problem
> >>>>>>>> for read tests than for write tests - if we have long running
> >>>>>>>>
> >>>>>>> instances,
> >>>>>
> >>>>>> read tests don't really need cleanup. And if write tests only
> >>>>>>>>
> >>>>>>> write a
> >>>>
> >>>>> small
> >>>>>>>> amount of data, as long as we are sure we're writing to uniquely
> >>>>>>>> identifiable locations (ie, new table per test or something
> >>>>>>>>
> >>>>>>> similar),
> >>>>
> >>>>> we
> >>>>>
> >>>>>> can clean up the write test data on a slower schedule.
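> >>>>>>>> A sketch of what a uniquely identifiable write location could look
> >>>>>>>> like (the prefix is arbitrary and imports are omitted):
> >>>>>>>>
> >>>>>>>> // one table per test run, so cleanup can happen on its own schedule
> >>>>>>>> String tableName = String.format("beam_write_it_%d_%s",
> >>>>>>>>     System.currentTimeMillis(),
> >>>>>>>>     UUID.randomUUID().toString().substring(0, 8));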
> >>>>>>>>
> >>>>>>>> I agree
> >>>>>>>
> >>>>>>>
> >>>>>>>> this will tend to go to the direction of long running data store
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> instances rather than data store instances started (and
> >>>>>>>>
> >>>>>>> optionally
> >>>>
> >>>>> loaded)
> >>>>>>
> >>>>>>> before tests.
> >>>>>>>> It may be easiest to start with a "data stores stay running"
> >>>>>>>> implementation, and then if we see issues with that move towards
> >>>>>>>>
> >>>>>>> tests
> >>>>
> >>>>> that
> >>>>>>>> start/stop the data stores on each run. One thing I'd like to
> >>>>>>>>
> >>>>>>> make
> >>>>
> >>>>> sure
> >>>>>
> >>>>>> is
> >>>>>>
> >>>>>>> that we're not manually tweaking the configurations for data
> >>>>>>>>
> >>>>>>> stores.
> >>>>
> >>>>> One
> >>>>>
> >>>>>> way we could do that is to destroy/recreate the data stores on a
> >>>>>>>>
> >>>>>>> slower
> >>>>>
> >>>>>> schedule - maybe once per week. That way if the script is
> >>>>>>>>
> >>>>>>> changed or
> >>>>
> >>>>> the
> >>>>>
> >>>>>> data store instances are changed, we'd be able to detect it
> >>>>>>>>
> >>>>>>> relatively
> >>>>
> >>>>> soon
> >>>>>>>> while still removing the need for the tests to manage the data
> >>>>>>>>
> >>>>>>> stores.
> >>>>
> >>>>>
> >>>>>>> I agree. In addition to manual configuration tweaking, there might
> >>>>>>> be cases in which a data store re-partitions data during a test or
> >>>>>>> after some tests while the dataset changes. The IO must be tolerant
> >>>>>>> to that, but the asserts (number of bundles for example) in the test
> >>>>>>> must not fail in that case.
> >>>>>>>
> >>>>>>> I would also prefer, if possible, that the tests do not manage data
> >>>>>>> stores (not set them up, not start them, not stop them)
> >>>>>>>
> >>>>>>>
> >>>>>>> as a general note, I suspect many of the folks in the states
> >>>>>>>>
> >>>>>>> will be
> >>>>
> >>>>> on
> >>>>>
> >>>>>> holiday until Jan 2nd/3rd.
> >>>>>>>>
> >>>>>>>> S
> >>>>>>>>
> >>>>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
> >>>>>>>>
> >>>>>>> <echauchot@gmail.com
> >>>>
> >>>>>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Recently we had a discussion about integration tests of IOs. I'm
> >>>>>>>>> preparing a PR for integration tests of the elasticSearch IO
> >>>>>>>>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> >>>>>>>>> as a first shot), which are very important IMHO because they
> >>>>>>>>> helped catch some bugs that UT could not (volume, data store
> >>>>>>>>> instance sharing, real data store instance ...)
> >>>>>>>>>
> >>>>>>>>> I would like to have your thoughts/remarks about the points below.
> >>>>>>>>> Some of these points are also discussed here
> >>>>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >>>>>>>>> :
> >>>>>>>>>
> >>>>>>>>> - UT and IT have a similar architecture, but while UT focus on
> >>>>>>>>> testing the correct behavior of the code, including corner cases,
> >>>>>>>>> and use an embedded in-memory data store, IT assume that the
> >>>>>>>>> behavior is correct (strong UT) and focus on higher volume testing
> >>>>>>>>> and testing against real data store instance(s)
> >>>>>>>>>
> >>>>>>>>> - For now, IT are stored alongside UT in the src/test directory of
> >>>>>>>>> the IO, but they might go to a dedicated module, waiting for a
> >>>>>>>>> consensus. Maven is not configured to run them automatically
> >>>>>>>>> because the data store is not available on the jenkins server yet
> >>>>>>>>>
> >>>>>>>>> - For now, they only use DirectRunner, but they will be run
> >>>>>>>>> against each runner.
> >>>>>>>>>
> >>>>>>>>> - IT do not setup the data store instance (like stated in the
> >>>>>>>>> above document); they assume that one is already running (hardcoded
> >>>>>>>>> configuration in test for now, waiting for a common solution to
> >>>>>>>>> pass configuration to IT; see the sketch below). A docker container
> >>>>>>>>> script is provided in the contrib directory as a starting point to
> >>>>>>>>> whatever orchestration software will be chosen.
> >>>>>>>>>
> >>>>>>>>> - IT load and clean test data before and after each test if
> >>>>>>>>> needed. It is simpler to do so because some tests need an empty
> >>>>>>>>> data store (write test) and because, as discussed in the document,
> >>>>>>>>> tests might not be the only users of the data store. Also IMHO, it
> >>>>>>>>> is better that tests load/clean data than doing some assumptions
> >>>>>>>>> about the running order of the tests.
> >>>>>>>>>
> >>>>>>>>> If we generalize this pattern to all IT tests, this will tend to
> >>>>>>>>> go in the direction of long running data store instances rather
> >>>>>>>>> than data store instances started (and optionally loaded) before
> >>>>>>>>> tests.
> >>>>>>>>>
> >>>>>>>>> Besides, if we were to change our minds and load data from outside
> >>>>>>>>> the tests, a logstash script is provided.
> >>>>>>>>>
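> >>>>>>>>> As an illustration, here is the kind of thing I mean by passing
> >>>>>>>>> configuration to IT instead of hardcoding it. Just a sketch; the
> >>>>>>>>> property names are hypothetical and JUnit imports are omitted:
> >>>>>>>>>
> >>>>>>>>> // read the data store location from system properties, with
> >>>>>>>>> // defaults suitable for a local docker container
> >>>>>>>>> private static final String ES_HOST =
> >>>>>>>>>     System.getProperty("elasticsearchHost", "localhost");
> >>>>>>>>> private static final int ES_PORT =
> >>>>>>>>>     Integer.parseInt(System.getProperty("elasticsearchPort", "9300"));
> >>>>>>>>>
> >>>>>>>>> @Before
> >>>>>>>>> public void loadTestData() throws Exception {
> >>>>>>>>>   // connect to ES_HOST:ES_PORT and index the test dataset;
> >>>>>>>>>   // a matching @After cleans it up again
> >>>>>>>>> }
> >>>>>>>>>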
> >>>>>>>>> If you have any thoughts or remarks I'm all ears :)
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> Etienne
> >>>>>>>>>
> >>>>>>>>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Stephen,
> >>>>>>>>>>
> >>>>>>>>>> the purpose of having them in a specific module is to share
> >>>>>>>>>> resources and
> >>>>
> >>>>> apply the same behavior from IT perspective and be able to
> >>>>>>>>>>
> >>>>>>>>> have IT
> >>>>
> >>>>> "cross" IO (for instance, reading from JMS and sending to
> >>>>>>>>>>
> >>>>>>>>> Kafka, I
> >>>>
> >>>>> think that's the key idea for integration tests).
> >>>>>>>>>>
> >>>>>>>>>> For instance, in Karaf, we have:
> >>>>>>>>>> - utest in each module
> >>>>>>>>>> - itest module containing itests for all modules all together
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>> JB
> >>>>>>>>>>
> >>>>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Etienne,
> >>>>>>>>>>>
> >>>>>>>>>>> thanks for following up and answering my questions.
> >>>>>>>>>>>
> >>>>>>>>>>> re: where to store integration tests - having them all in a
> >>>>>>>>>>>
> >>>>>>>>>> separate
> >>>>>
> >>>>>> module
> >>>>>>>>>>> is an interesting idea. I couldn't find JB's comments about
> >>>>>>>>>>>
> >>>>>>>>>> moving
> >>>>
> >>>>> them
> >>>>>>
> >>>>>>> into a separate module in the PR - can you share the reasons
> >>>>>>>>>>>
> >>>>>>>>>> for
> >>>>
> >>>>> doing so?
> >>>>>>>>>>> The IO integration/perf tests do seem like they'll
> >>>>>>>>>>>
> >>>>>>>>>> need to
> >>>>
> >>>>> be
> >>>>>
> >>>>>> treated in a special manner, but given that there is already
> >>>>>>>>>>>
> >>>>>>>>>> an IO
> >>>>
> >>>>> specific
> >>>>>>>>>>> module, it may just be that we need to treat all the ITs in
> >>>>>>>>>>>
> >>>>>>>>>> the IO
> >>>>
> >>>>> module
> >>>>>>>>>>> the same way. I don't have strong opinions either way right
> >>>>>>>>>>>
> >>>>>>>>>> now.
> >>>>
> >>>>>
> >>>>>>>>>>> S
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
> >>>>>>>>>>>
> >>>>>>>>>> echauchot@gmail.com>
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi guys,
> >>>>>>>>>>>
> >>>>>>>>>>> @Stephen: I addressed all your comments directly in the PR,
> >>>>>>>>>>>
> >>>>>>>>>> thanks!
> >>>>
> >>>>>>>>>>> I just wanted to comment here about the docker image I used: the
> >>>>>>>>>>> only official Elastic image contains only ElasticSearch. But for
> >>>>>>>>>>> testing I needed logstash (for ingestion) and kibana (not for
> >>>>>>>>>>> integration tests, but to easily test REST requests to ES using
> >>>>>>>>>>> sense). This is why I use an ELK (Elasticsearch+Logstash+Kibana)
> >>>>>>>>>>> image. This one is released under the Apache 2 license.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Besides, there is also a point about where to store
> >>>>>>>>>>>
> >>>>>>>>>> integration
> >>>>
> >>>>> tests:
> >>>>>>
> >>>>>>> JB proposed in the PR to store integration tests to dedicated
> >>>>>>>>>>>
> >>>>>>>>>> module
> >>>>>
> >>>>>> rather than directly in the IO module (like I did).
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Etienne
> >>>>>>>>>>>
> >>>>>>>>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> hey!
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks for sending this. I'm very excited to see this
> >>>>>>>>>>>>
> >>>>>>>>>>> change. I
> >>>>
> >>>>> added some
> >>>>>>>>>>>> detail-oriented code review comments in addition to what
> >>>>>>>>>>>>
> >>>>>>>>>>> I've
> >>>>
> >>>>> discussed
> >>>>>>>>>>>> here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The general goal is to allow for re-usable instantiation of
> >>>>>>>>>>>>
> >>>>>>>>>>> particular
> >>>>>>
> >>>>>>>
> >>>>>>>>>>>> data
> >>>>>>>>>>>
> >>>>>>>>>>> store instances and this seems like a good start. Looks like
> >>>>>>>>>>>>
> >>>>>>>>>>> you
> >>>>
> >>>>> also have
> >>>>>>>>>>>> a script to generate test data for your tests - that's
> >>>>>>>>>>>>
> >>>>>>>>>>> great.
> >>>>
> >>>>>
> >>>>>>>>>>>> The next steps (definitely not blocking your work) will be
> >>>>>>>>>>>>
> >>>>>>>>>>> to have
> >>>>
> >>>>> ways to
> >>>>>>>>>>>> create instances from the docker images you have here, and
> >>>>>>>>>>>>
> >>>>>>>>>>> use
> >>>>
> >>>>> them
> >>>>>
> >>>>>> in the
> >>>>>>>>>>>> tests. We'll need support in the test framework for that
> >>>>>>>>>>>>
> >>>>>>>>>>> since
> >>>>
> >>>>> it'll
> >>>>>
> >>>>>> be
> >>>>>>>>>>>> different on developer machines and in the beam jenkins
> >>>>>>>>>>>>
> >>>>>>>>>>> cluster,
> >>>>
> >>>>> but
> >>>>>
> >>>>>> your
> >>>>>>>>>>>> scripts here allow someone running these tests locally to
> >>>>>>>>>>>>
> >>>>>>>>>>> not have
> >>>>
> >>>>> to
> >>>>>>
> >>>>>>>
> >>>>>>>>>>>> worry
> >>>>>>>>>>>
> >>>>>>>>>>> about getting the instance set up and can manually adjust,
> >>>>>>>>>>>>
> >>>>>>>>>>> so this
> >>>>
> >>>>> is
> >>>>>>
> >>>>>>> a
> >>>>>>>>>>>> good incremental step.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts
> >>>>>>>>>>>>
> >>>>>>>>>>> (that I
> >>>>
> >>>>> didn't
> >>>>>>>>>>>> have previously, so we are learning this together):
> >>>>>>>>>>>> * It may be useful to try and document why we chose a
> >>>>>>>>>>>>
> >>>>>>>>>>> particular
> >>>>
> >>>>> docker
> >>>>>>>>>>>> image as the base (ie, "this is the official supported
> >>>>>>>>>>>>
> >>>>>>>>>>> elastic
> >>>>
> >>>>> search
> >>>>>>
> >>>>>>> docker image" or "this image has several data stores
> >>>>>>>>>>>>
> >>>>>>>>>>> together that
> >>>>
> >>>>> can be
> >>>>>>>>>>>> used for a couple different tests")  - I'm curious as to
> >>>>>>>>>>>>
> >>>>>>>>>>> whether
> >>>>
> >>>>> the
> >>>>>
> >>>>>> community thinks that is important
> >>>>>>>>>>>>
> >>>>>>>>>>>> One thing that I called out in the comment that's worth
> >>>>>>>>>>>>
> >>>>>>>>>>> mentioning
> >>>>
> >>>>> on the
> >>>>>>>>>>>> larger list - if you want to specify which specific runners
> >>>>>>>>>>>>
> >>>>>>>>>>> a test
> >>>>
> >>>>> uses,
> >>>>>>>>>>>> that can be controlled in the pom for the module. I updated
> >>>>>>>>>>>>
> >>>>>>>>>>> the
> >>>>
> >>>>> testing
> >>>>>>>>>>>>
> >>>>>>>>>>>> doc
> >>>>>>>>>>>
> >>>>>>>>>>> mentioned previously in this thread with a TODO to talk
> >>>>>>>>>>>>
> >>>>>>>>>>> about this
> >>>>
> >>>>> more. I
> >>>>>>>>>>>> think we should also make it so that IO modules have that
> >>>>>>>>>>>> automatically,
> >>>>>>>>>>>>
> >>>>>>>>>>>> so
> >>>>>>>>>>>
> >>>>>>>>>>> developers don't have to worry about it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> S
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
> >>>>>>>>>>>>
> >>>>>>>>>>> echauchot@gmail.com>
> >>>>>>
> >>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Stephen,
> >>>>>>>>>>>>
> >>>>>>>>>>>> As discussed, I added an injection script, docker container
> >>>>>>>>>>>> scripts and integration tests to the
> >>>>>>>>>>>> sdks/java/io/elasticsearch/contrib
> >>>>>>>>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
> >>>>>>>>>>>> directory in that PR:
> >>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >>>>>>>>>>>>
> >>>>>>>>>>>> These work well, but they are a first shot. Do you have any
> >>>>>>>>>>>>
> >>>>>>>>>>> comments
> >>>>
> >>>>> about
> >>>>>>>>>>>> those?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Besides I am not very sure that these files should be in the
> >>>>>>>>>>>>
> >>>>>>>>>>> IO
> >>>>
> >>>>> itself
> >>>>>>
> >>>>>>> (even in contrib directory, out of maven source
> >>>>>>>>>>>>
> >>>>>>>>>>> directories). Any
> >>>>
> >>>>>
> >>>>>>>>>>>> thoughts?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Etienne
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> It's great to hear more experiences.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm also glad to hear that people see real value in the
> >>>>>>>>>>>>>
> >>>>>>>>>>>> high
> >>>>
> >>>>> volume/performance benchmark tests. I tried to capture that
> >>>>>>>>>>>>>
> >>>>>>>>>>>> in
> >>>>
> >>>>> the
> >>>>>
> >>>>>>
> >>>>>>>>>>>>> Testing
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas
> >>>>>>>>>>>>>
> >>>>>>>>>>>> of
> >>>>
> >>>>> discussion
> >>>>>>>>>>>>
> >>>>>>>>>>>>
>
>

Re: Hosting data stores for IO Transform testing

Posted by Stephen Sisk <si...@google.com.INVALID>.
Glad to hear you support kubernetes (although to be clear, I'm rooting for
the right solution for us in the long run - if anyone has a strong reason
for dcos, I'm excited to hear it.)

I agree with you that testing IO in failure scenarios seems like a fruitful
area for future work, but that I don't want to tackle it just yet (and I'm
not hearing that we think it affects our current decision - if someone
does, I'd like to hear about it.) I am going to split off a thread for that
discussion because I think that discussion informs how we write our unit
tests currently, and I want to clarify it.

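To make the unit-test angle concrete, the kind of failure simulation I have
in mind looks roughly like this. A sketch only: FlakyReader is a
hypothetical stand-in for an IO's reader, not an existing Beam API.

import static org.junit.Assert.assertEquals;

import java.io.IOException;
import org.junit.Test;

// A fake reader that fails its first reads, so a unit test can exercise
// retry/error handling under the exact same conditions on every run.
public class FlakyReaderTest {
  static class FlakyReader {
    private final int failuresBeforeSuccess;
    private int attempts = 0;

    FlakyReader(int failuresBeforeSuccess) {
      this.failuresBeforeSuccess = failuresBeforeSuccess;
    }

    String read() throws IOException {
      attempts++;
      if (attempts <= failuresBeforeSuccess) {
        throw new IOException("simulated transient failure #" + attempts);
      }
      return "record-" + attempts;
    }
  }

  @Test
  public void readerRecoversFromTransientFailures() throws Exception {
    FlakyReader reader = new FlakyReader(2);
    String record = null;
    for (int attempt = 0; attempt < 3 && record == null; attempt++) {
      try {
        record = reader.read();
      } catch (IOException e) {
        // a real IO would apply its retry/backoff policy here
      }
    }
    assertEquals("record-3", record);
  }
}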
> >>>>>>>>>>>>
> >>>>>>>>>>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
> >>>>>>>>> d824cefcb3ed0b9
> >>>>>>>>>
> >>>>>>>>> directory in that PR:
> >>>>>>>>>>
> >>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >>>>>>>>>>>>
> >>>>>>>>>>>> These work well but they are first shot. Do you have any
> >>>>>>>>>>>>
> >>>>>>>>>>> comments
> >>>>
> >>>>> about
> >>>>>>>>>>>> those?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Besides I am not very sure that these files should be in the
> >>>>>>>>>>>>
> >>>>>>>>>>> IO
> >>>>
> >>>>> itself
> >>>>>>
> >>>>>>> (even in contrib directory, out of maven source
> >>>>>>>>>>>>
> >>>>>>>>>>> directories). Any
> >>>>
> >>>>>
> >>>>>>>>>>>> thoughts?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Etienne
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
> >>>>>>>>>>>>
> >>>>>>>>>>>> It's great to hear more experiences.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm also glad to hear that people see real value in the
> >>>>>>>>>>>>>
> >>>>>>>>>>>> high
> >>>>
> >>>>> volume/performance benchmark tests. I tried to capture that
> >>>>>>>>>>>>>
> >>>>>>>>>>>> in
> >>>>
> >>>>> the
> >>>>>
> >>>>>>
> >>>>>>>>>>>>> Testing
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas
> >>>>>>>>>>>>>
> >>>>>>>>>>>> of
> >>>>
> >>>>> discussion
> >>>>>>>>>>>>
> >>>>>>>>>>>>
>

Re: Hosting data stores for IO Transform testing

Posted by Ismaël Mejía <ie...@gmail.com>.
Hello again,

Stephen, I agree with you that the real question is the scope of the tests.
The discussion so far has been more about testing a ‘real’ data store and
finding infra/performance issues (and future regressions), but having a
modern cluster manager opens the door to more interesting integration tests
like the ones I mentioned; in particular, my idea is oriented more towards
validating the ‘correct’ expected behavior of the IOs and runners. But this
is quite ambitious for a first goal, so maybe we should first get things
working and leave this for later (if there is still interest).

I am not sure that unit tests are enough to test distribution issues,
because those are harder to simulate, especially once we add the fact that
there can be many moving pieces. For example, imagine that we run a Beam
pipeline deployed via Spark on a YARN cluster (where some nodes can fail)
that reads from Kafka (with a slow partition) and writes to Cassandra
(with a partition that goes down). This is a quite complex combination of
pieces (and possible issues), but it is not a totally artificial scenario;
in fact this is a common architecture. It can (at least in theory) be
simulated with a cluster manager, but I don’t see how I could easily
reproduce it with a unit test.

Anyway, this scenario makes me think that the boundaries of what we want to
test are really important. Complexity can be huge.
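
To make this concrete: the "kill one node" part of such a scenario could in
principle be driven from test code through the cluster manager's API. The
sketch below is only an illustration of that idea, using the Fabric8
Kubernetes Java client; the "beam-it" namespace and the "app=cassandra"
label are assumptions for the example, not anything that exists today.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class KillOneCassandraPod {
  public static void main(String[] args) {
    // Connects using the local kubeconfig, like kubectl does.
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      // Find the pods of the (hypothetical) Cassandra cluster and delete
      // the first one, simulating a partition going down mid-pipeline.
      for (Pod pod : client.pods()
          .inNamespace("beam-it")
          .withLabel("app", "cassandra")
          .list()
          .getItems()) {
        client.pods()
            .inNamespace("beam-it")
            .withName(pod.getMetadata().getName())
            .delete();
        break;
      }
    }
  }
}

Simulating a slow partition or degraded network is harder; neither API seems
to offer that directly, so it would probably need node-level tooling (e.g.
traffic shaping on the VM) rather than the cluster manager itself.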

About the Mesos package question: I was indeed referring to the Mesos
Universe (the repo you linked), and what you said is sadly true, it is not
easy to find multi-node instance packages, which are the most interesting
ones for our tests (in both k8s and mesos). I agree with your decision to
use Kubernetes; I just wanted to mention that in some cases we will need to
produce these multi-node packages ourselves to have interesting tests.

Ismaël


On Wed, Jan 18, 2017 at 10:09 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Yes, for both DC/OS (Mesos+Marathon) and Kubernetes, I think we may find
> single-node configs, but I am not sure about multi-node setups. Anyway, even
> if we find a multi-node configuration, I'm not sure it would cover our needs.
>
> Regards
> JB
>
> On 01/18/2017 12:52 PM, Stephen Sisk wrote:
>
>> ah! I looked around a bit more and found the dcos package repo -
>> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
>>
>> poking around a bit, I can find a lot of packages for single node
>> instances, but not many packages for multi-node instances. Single node
>> instance packages are kind of useful, but I don't think it's *too*
>> helpful.
>> The multi-node instance packages that run the data store's high
>> availability mode are where the real work is, and it seems like both
>> kubernetes helm and dcos' package universe don't have a lot of those.
>>
>> S
>>
>> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <si...@google.com> wrote:
>>
>>> Hi Ismaël,
>>>
>>> these are good questions, thanks for raising them.
>>>
>>> Ability to modify network/compute resources to simulate failures
>>> =================================================
>>> I see two real questions here:
>>> 1. Is this something we want to do?
>>> 2. Is it possible with both/either?
>>>
>>> So far, the test strategy I've been advocating is that we test problems
>>> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
>>> it's hard to re-create the same conditions.
>>>
>>> I can investigate whether it's possible, but I want to clarify whether
>>> this is something that we care about. I know both support killing
>>> individual nodes. I haven't seen a lot of network control in either, but
>>> haven't tried to look for it.
>>>
>>> Availability of ready to play packages
>>> ============================
>>> I did look at this, and as far as I could tell, mesos didn't have any
>>> pre-built packages for multi-node clusters of data stores. If there's a
>>> good repository of them that we trust, that would definitely save us
>>> time.
>>> Can you point me at the mesos repository?
>>>
>>> S
>>>
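
To illustrate the distinction Stephen is drawing: a deterministic failure is
easy to re-create in a unit test with a fake client, whereas the same
condition is hard to reproduce on a real cluster. A minimal sketch (the
FlakyService and readWithRetries names are made up for the example, not
Beam code):

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class RetryBehaviorTest {

  /** Fake service that fails a fixed number of calls before succeeding. */
  static class FlakyService {
    private int failuresLeft;

    FlakyService(int failures) {
      this.failuresLeft = failures;
    }

    String read() {
      if (failuresLeft-- > 0) {
        throw new RuntimeException("simulated transient failure");
      }
      return "record";
    }
  }

  /** Stand-in for the retrying read logic an IO might contain. */
  static String readWithRetries(FlakyService service, int maxAttempts) {
    RuntimeException last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return service.read();
      } catch (RuntimeException e) {
        last = e;
      }
    }
    throw last;
  }

  @Test
  public void readerRecoversFromTransientFailures() {
    // Two simulated failures, three attempts: the read should succeed,
    // and the test re-creates exactly the same conditions on every run.
    assertEquals("record", readWithRetries(new FlakyService(2), 3));
  }
}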
>>>
>>>
>>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>>> wrote:
>>>
>>> Hi Ismaël
>>>
>>> Stephen will reply with details but I know he did a comparison and
>>> evaluated different options.
>>>
>>> He tested with the JDBC IO itests.
>>>
>>> Regards
>>> JB
>>>
>>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <ie...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your analysis Stephen, good arguments / references.
>>>>
>>>> One quick question. Have you checked the APIs of both
>>>> (Mesos/Kubernetes) to see
>>>> if we can programmatically do more complex tests (I suppose so, but
>>>> you
>>>> don't mention how easy or if those are possible), for example to
>>>> simulate a
>>>> slow networking slave (to test stragglers), or to arbitrarily kill one
>>>> slave (e.g. if I want to test the correct behavior of a runner/IO that
>>>> is
>>>> reading from it)?
>>>>
>>>> Another missing point in the review is the availability of
>>>> ready-to-play packages; I think in this area mesos/dcos seems more
>>>> advanced, no? I haven't looked recently, but at least 6 months ago
>>>> there were not many helm packages ready, for example to test kafka or
>>>> the hadoop ecosystem stuff (hdfs, hbase, etc). Has this been improved?
>>>> Because preparing this is also a considerable amount of work; on the
>>>> other hand this could also be a chance to contribute to kubernetes.
>>>>
>>>> Regards,
>>>> Ismaël
>>>>

Re: Hosting data stores for IO Transform testing

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Yes, for both DC/OS (Mesos+Marathon) and Kubernetes, I think we may find
single-node configs, but I am not sure about multi-node setups. Anyway, even
if we find a multi-node configuration, I'm not sure it would cover our needs.

Regards
JB

On 01/18/2017 12:52 PM, Stephen Sisk wrote:
> ah! I looked around a bit more and found the dcos package repo -
> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
>
> poking around a bit, I can find a lot of packages for single node
> instances, but not many packages for multi-node instances. Single node
> instance packages are kind of useful, but I don't think they're *too* helpful.
> The multi-node instance packages that run the data store's high
> availability mode are where the real work is, and it seems like both
> kubernetes helm and dcos' package universe don't have a lot of those.
>
> S
>
> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <si...@google.com> wrote:
>
>> Hi Ismaël,
>>
>> these are good questions, thanks for raising them.
>>
>> Ability to modify network/compute resources to simulate failures
>> =================================================
>> I see two real questions here:
>> 1. Is this something we want to do?
>> 2. Is it possible with both/either?
>>
>> So far, the test strategy I've been advocating is that we test problems
>> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
>> it's hard to re-create the same conditions.
>>
>> I can investigate whether it's possible, but I want to clarify whether
>> this is something that we care about. I know both support killing
>> individual nodes. I haven't seen a lot of network control in either, but
>> haven't tried to look for it.
>>
>> Availability of ready to play packages
>> ============================
>> I did look at this, and as far as I could tell, mesos didn't have any
>> pre-built packages for multi-node clusters of data stores. If there's a
>> good repository of them that we trust, that would definitely save us time.
>> Can you point me at the mesos repository?
>>
>> S
>>
>>
>>
>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>> Hi Ismaël
>>
>> Stephen will reply with details but I know he did a comparison and
>> evaluated different options.
>>
>> He tested with the JDBC IO itests.
>>
>> Regards
>> JB
>>
>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <ie...@gmail.com>
>> wrote:
>>> Thanks for your analysis Stephen, good arguments / references.
>>>
>>> One quick question. Have you checked the APIs of both
>>> (Mesos/Kubernetes) to see
>>> if we can programmatically do more complex tests (I suppose so, but
>>> you
>>> don't mention how easy or if those are possible), for example to
>>> simulate a
>>> slow networking slave (to test stragglers), or to arbitrarily kill one
>>> slave (e.g. if I want to test the correct behavior of a runner/IO that
>>> is
>>> reading from it)?
>>>
>>> Another missing point in the review is the availability of ready-to-play
>>> packages; I think in this area mesos/dcos seems more advanced, no? I
>>> haven't looked recently, but at least 6 months ago there were not many
>>> helm packages ready, for example to test kafka or the hadoop ecosystem
>>> stuff (hdfs, hbase, etc). Has this been improved? Because preparing this
>>> is also a considerable amount of work; on the other hand this could also
>>> be a chance to contribute to kubernetes.
>>>
>>> Regards,
>>> Ismaël
>>>
>>>
>>>
>>>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
>>>> wrote:
>>>>
>>>>> Just a quick drive-by comment: how tests are laid out has non-trivial
>>>>> tradeoffs on how/where continuous integration runs, and how results are
>>>>> integrated into the tooling. The current state is certainly not ideal
>>>>> (e.g., due to multiple test executions some links in Jenkins point where
>>>>> they shouldn't), but most other alternatives had even bigger drawbacks at
>>>>> the time. If someone has great ideas that don't explode the number of
>>>>> modules, please share ;-)
>>>>>
>>>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <ec...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Stephen,
>>>>>>
>>>>>> Thanks for taking the time to comment.
>>>>>>
>>>>>> My comments are below in the email:
>>>>>>
>>>>>> On 24/12/2016 at 00:07, Stephen Sisk wrote:
>>>>>>
>>>>>>> hey Etienne -
>>>>>>>
>>>>>>> thanks for your thoughts and thanks for sharing your experiences. I
>>>>>>> generally agree with what you're saying. Quick comments below:
>>>>>>>
>>>>>>>> IT are stored alongside UT in the src/test directory of the IO but
>>>>>>>> they might go to a dedicated module, waiting for a consensus
>>>>>>>
>>>>>>> I don't have a strong opinion or feel that I've worked enough with
>>>>>>> maven to understand all the consequences - I'd love for someone with
>>>>>>> more maven experience to weigh in. If this becomes blocking, I'd say
>>>>>>> check it in, and we can refactor later if it proves problematic.
>>>>>>>
>>>>>> Sure, not a blocking point, it could be refactored afterwards. Just as
>>>>>> a reminder, JB mentioned that storing IT in a separate module allows us
>>>>>> to have more coherence between all IT (same behavior) and to do cross-IO
>>>>>> integration tests. JB, have you experienced some long-term drawbacks of
>>>>>> storing IT in a separate module, like, for example, more difficult
>>>>>> maintenance due to "distance" from the production code?
>>>>>>
>>>>>>>> Also IMHO, it is better that tests load/clean data than making some
>>>>>>>> assumptions about the running order of the tests.
>>>>>>>
>>>>>>> I definitely agree that we don't want to make assumptions about the
>>>>>>> running order of the tests - that way lies pain. :) It will be
>>>>>>> interesting to see how the performance tests work out since they will
>>>>>>> need more data (and thus loading data can take much longer.)
>>>>>>>
>>>>>> Yes, performance testing might push in the direction of data loading
>>>>>> from outside the tests due to loading time.
>>>>>>
>>>>>>> This should also be an easier problem for read tests than for write
>>>>>>> tests - if we have long running instances, read tests don't really
>>>>>>> need cleanup. And if write tests only write a small amount of data, as
>>>>>>> long as we are sure we're writing to uniquely identifiable locations
>>>>>>> (ie, new table per test or something similar), we can clean up the
>>>>>>> write test data on a slower schedule.
>>>>>>>
>>>>>> I agree
>>>>>>
>>>>>>>> this will tend to go in the direction of long running data store
>>>>>>>> instances rather than data store instances started (and optionally
>>>>>>>> loaded) before tests.
>>>>>>>
>>>>>>> It may be easiest to start with a "data stores stay running"
>>>>>>> implementation, and then if we see issues with that move towards tests
>>>>>>> that start/stop the data stores on each run. One thing I'd like to
>>>>>>> make sure is that we're not manually tweaking the configurations for
>>>>>>> data stores. One way we could do that is to destroy/recreate the data
>>>>>>> stores on a slower schedule - maybe once per week. That way if the
>>>>>>> script is changed or the data store instances are changed, we'd be
>>>>>>> able to detect it relatively soon while still removing the need for
>>>>>>> the tests to manage the data stores.
>>>>>>>
>>>>>> I agree. In addition to manual configuration tweaking, there might be
>>>>>> cases in which a data store re-partitions data during a test or after
>>>>>> some tests while the dataset changes. The IO must be tolerant to that,
>>>>>> but the asserts (number of bundles, for example) in tests must not fail
>>>>>> in that case. I would also prefer, if possible, that the tests do not
>>>>>> manage data stores (not set them up, not start them, not stop them)
>>>>>>
>>>>>>> as a general note, I suspect many of the folks in the states will be
>>>>>>> on holiday until Jan 2nd/3rd.
>>>>>>>
>>>>>>> S
>>>>>>>
>>>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <echauchot@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Recently we had a discussion about integration tests of IOs. I'm
>>>>>>>> preparing a PR for integration tests of the elasticSearch IO
>>>>>>>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
>>>>>>>> as a first shot) which are very important IMHO because they helped
>>>>>>>> catch some bugs that UT could not (volume, data store instance
>>>>>>>> sharing, real data store instance ...)
>>>>>>>>
>>>>>>>> I would like to have your thoughts/remarks about the points below.
>>>>>>>> Some of these points are also discussed here
>>>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
>>>>>>>> :
>>>>>>>>
>>>>>>>> - UT and IT have a similar architecture, but while UT focus on
>>>>>>>> testing the correct behavior of the code, including corner cases, and
>>>>>>>> use an embedded in-memory data store, IT assume that the behavior is
>>>>>>>> correct (strong UT) and focus on higher volume testing and testing
>>>>>>>> against real data store instance(s)
>>>>>>>>
>>>>>>>> - For now, IT are stored alongside UT in the src/test directory of
>>>>>>>> the IO, but they might go to a dedicated module, waiting for a
>>>>>>>> consensus. Maven is not configured to run them automatically because
>>>>>>>> the data store is not available on the jenkins server yet
>>>>>>>>
>>>>>>>> - For now, they only use DirectRunner, but they will be run against
>>>>>>>> each runner.
>>>>>>>>
>>>>>>>> - IT do not set up the data store instance (like stated in the above
>>>>>>>> document); they assume that one is already running (hardcoded
>>>>>>>> configuration in the test for now, waiting for a common solution to
>>>>>>>> pass configuration to IT). A docker container script is provided in
>>>>>>>> the contrib directory as a starting point for whatever orchestration
>>>>>>>> software will be chosen.
>>>>>>>>
>>>>>>>> - IT load and clean test data before and after each test if needed.
>>>>>>>> It is simpler to do so because some tests need an empty data store
>>>>>>>> (write test) and because, as discussed in the document, tests might
>>>>>>>> not be the only users of the data store. Also IMHO, it is better that
>>>>>>>> tests load/clean data than making some assumptions about the running
>>>>>>>> order of the tests.
>>>>>>>>
>>>>>>>> If we generalize this pattern to all IT tests, this will tend to go
>>>>>>>> in the direction of long running data store instances rather than
>>>>>>>> data store instances started (and optionally loaded) before tests.
>>>>>>>>
>>>>>>>> Besides, if we were to change our minds and load data from outside
>>>>>>>> the tests, a logstash script is provided.
>>>>>>>>
>>>>>>>> If you have any thoughts or remarks I'm all ears :)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Etienne
>>>>>>>>
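
As a sketch of the load/clean pattern Etienne describes (an already-running
Elasticsearch instance at a hardcoded address, data created before and
removed after each test), something like the following could live in an IT;
the index name and the use of the bare REST API are illustrative
assumptions, not the actual code from the PR:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ElasticsearchIOIT {

  // Hardcoded for now, pending a common way to pass configuration to IT.
  private static final String BASE_URL = "http://localhost:9200";
  private static final String TEST_INDEX = "beam-it-data";

  private static int send(String method, String path) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(BASE_URL + path).openConnection();
    conn.setRequestMethod(method);
    int code = conn.getResponseCode();
    conn.disconnect();
    return code;
  }

  @Before
  public void loadTestData() throws IOException {
    // Create a fresh index so the test makes no assumptions about the
    // running order of tests; a real IT would bulk-load documents here.
    send("PUT", "/" + TEST_INDEX);
  }

  @After
  public void cleanTestData() throws IOException {
    // Drop the index so the shared instance stays clean for other users.
    send("DELETE", "/" + TEST_INDEX);
  }

  @Test
  public void readFromElasticsearch() {
    // The pipeline under test would read from TEST_INDEX here.
  }
}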
>>>>>>>> Le 14/12/2016 � 17:07, Jean-Baptiste Onofr� a �crit :
>>>>>>>>
>>>>>>>>> Hi Stephen,
>>>>>>>>>
>>>>>>>>> the purpose of having in a specific module is to share
>>> resources and
>>>>>>>>> apply the same behavior from IT perspective and be able to
>>> have IT
>>>>>>>>> "cross" IO (for instance, reading from JMS and sending to
>>> Kafka, I
>>>>>>>>> think that's the key idea for integration tests).
>>>>>>>>>
>>>>>>>>> For instance, in Karaf, we have:
>>>>>>>>> - utest in each module
>>>>>>>>> - itest module containing itests for all modules all together
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> JB
>>>>>>>>>
>>>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
>>>>>>>>>
>>>>>>>>>> Hi Etienne,
>>>>>>>>>>
>>>>>>>>>> thanks for following up and answering my questions.
>>>>>>>>>>
>>>>>>>>>> re: where to store integration tests - having them all in a
>>>> separate
>>>>>>>>>> module
>>>>>>>>>> is an interesting idea. I couldn't find JB's comments about
>>> moving
>>>>> them
>>>>>>>>>> into a separate module in the PR - can you share the reasons
>>> for
>>>>>>>>>> doing so?
>>>>>>>>>> The IO integration/perf tests so it does seem like they'll
>>> need to
>>>> be
>>>>>>>>>> treated in a special manner, but given that there is already
>>> an IO
>>>>>>>>>> specific
>>>>>>>>>> module, it may just be that we need to treat all the ITs in
>>> the IO
>>>>>>>>>> module
>>>>>>>>>> the same way. I don't have strong opinions either way right
>>> now.
>>>>>>>>>>
>>>>>>>>>> S
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
>>>>> echauchot@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> @Stephen: I addressed all your comments directly in the PR,
>>> thanks!
>>>>>>>>>> I just wanted to comment here about the docker image I used:
>>> the
>>>> only
>>>>>>>>>> official Elastic image contains only ElasticSearch. But for
>>>> testing I
>>>>>>>>>> needed logstash (for ingestion) and kibana (not for
>>> integration
>>>>> tests,
>>>>>>>>>> but to easily test REST requests to ES using sense). This is
>>> why I
>>>>> use
>>>>>>>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one
>>> isreleased
>>>>>>>>>> under
>>>>>>>>>> theapache 2 license.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Besides, there is also a point about where to store
>>> integration
>>>>> tests:
>>>>>>>>>> JB proposed in the PR to store integration tests to dedicated
>>>> module
>>>>>>>>>> rather than directly in the IO module (like I did).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Etienne
>>>>>>>>>>
>>>>>>>>>> Le 01/12/2016 � 20:14, Stephen Sisk a �crit :
>>>>>>>>>>
>>>>>>>>>>> hey!
>>>>>>>>>>>
>>>>>>>>>>> thanks for sending this. I'm very excited to see this
>>> change. I
>>>>>>>>>>> added some
>>>>>>>>>>> detail-oriented code review comments in addition to what
>>> I've
>>>>>>>>>>> discussed
>>>>>>>>>>> here.
>>>>>>>>>>>
>>>>>>>>>>> The general goal is to allow for re-usable instantiation of
>>>>> particular
>>>>>>>>>>>
>>>>>>>>>> data
>>>>>>>>>>
>>>>>>>>>>> store instances and this seems like a good start. Looks like
>>> you
>>>>>>>>>>> also have
>>>>>>>>>>> a script to generate test data for your tests - that's
>>> great.
>>>>>>>>>>>
>>>>>>>>>>> The next steps (definitely not blocking your work) will be
>>> to have
>>>>>>>>>>> ways to
>>>>>>>>>>> create instances from the docker images you have here, and
>>> use
>>>> them
>>>>>>>>>>> in the
>>>>>>>>>>> tests. We'll need support in the test framework for that
>>> since
>>>> it'll
>>>>>>>>>>> be
>>>>>>>>>>> different on developer machines and in the beam jenkins
>>> cluster,
>>>> but
>>>>>>>>>>> your
>>>>>>>>>>> scripts here allow someone running these tests locally to
>>> not have
>>>>> to
>>>>>>>>>>>
>>>>>>>>>> worry
>>>>>>>>>>
>>>>>>>>>>> about getting the instance set up and can manually adjust,
>>> so this
>>>>> is
>>>>>>>>>>> a
>>>>>>>>>>> good incremental step.
>>>>>>>>>>>
>>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts
>>> (that I
>>>>>>>>>>> didn't
>>>>>>>>>>> have previously, so we are learning this together):
>>>>>>>>>>> * It may be useful to try and document why we chose a
>>> particular
>>>>>>>>>>> docker
>>>>>>>>>>> image as the base (ie, "this is the official supported
>>> elastic
>>>>> search
>>>>>>>>>>> docker image" or "this image has several data stores
>>> together that
>>>>>>>>>>> can be
>>>>>>>>>>> used for a couple different tests")  - I'm curious as to
>>> whether
>>>> the
>>>>>>>>>>> community thinks that is important
>>>>>>>>>>>
>>>>>>>>>>> One thing that I called out in the comment that's worth
>>> mentioning
>>>>>>>>>>> on the
>>>>>>>>>>> larger list - if you want to specify which specific runners
>>> a test
>>>>>>>>>>> uses,
>>>>>>>>>>> that can be controlled in the pom for the module. I updated
>>> the
>>>>>>>>>>> testing
>>>>>>>>>>>
>>>>>>>>>> doc
>>>>>>>>>>
>>>>>>>>>>> mentioned previously in this thread with a TODO to talk
>>> about this
>>>>>>>>>>> more. I
>>>>>>>>>>> think we should also make it so that IO modules have that
>>>>>>>>>>> automatically,
>>>>>>>>>>>
>>>>>>>>>> so
>>>>>>>>>>
>>>>>>>>>>> developers don't have to worry about it.
>>>>>>>>>>>
>>>>>>>>>>> S
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
>>>>> echauchot@gmail.com>
>>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Stephen,
>>>>>>>>>>>
>>>>>>>>>>> As discussed, I added injection script, docker containers
>>> scripts
>>>>> and
>>>>>>>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
>>>>>>>>>>> <
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
>>>>>>>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
>>>>>>>> d824cefcb3ed0b9
>>>>>>>>
>>>>>>>>> directory in that PR:
>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439.
>>>>>>>>>>>
>>>>>>>>>>> These work well but they are first shot. Do you have any
>>> comments
>>>>>>>>>>> about
>>>>>>>>>>> those?
>>>>>>>>>>>
>>>>>>>>>>> Besides I am not very sure that these files should be in the
>>> IO
>>>>> itself
>>>>>>>>>>> (even in contrib directory, out of maven source
>>> directories). Any
>>>>>>>>>>>
>>>>>>>>>> thoughts?
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Etienne
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It's great to hear more experiences.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also glad to hear that people see real value in the
>>> high
>>>>>>>>>>>> volume/performance benchmark tests. I tried to capture that
>>> in
>>>> the
>>>>>>>>>>>>
>>>>>>>>>>> Testing
>>>>>>>>>>
>>>>>>>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
>>>>>>>>>>>>
>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas
>>> of
>>>>>>>>>>>> discussion
>>>>>>>>>>>>
>>>>>>>>>>> I
>>>>>>>>>>>
>>>>>>>>>>>> see:
>>>>>>>>>>>> 1.  People like the idea of bringing up fresh instances for
>>> each
>>>>> test
>>>>>>>>>>>> rather than keeping instances running all the time, since
>>> that
>>>>>>>>>>>> ensures no
>>>>>>>>>>>> contamination between tests. That seems reasonable to me.
>>> If we
>>>> see
>>>>>>>>>>>> flakiness in the tests or we note that setting up/tearing
>>> down
>>>>>>>>>>>> instances
>>>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>>>>>
>>>>>>>>>>>> taking a lot of time,
>>>>>>>>>>>> 2. Deciding on cluster management software/orchestration
>>> software
>>>>> - I
>>>>>>>>>>>>
>>>>>>>>>>> want
>>>>>>>>>>
>>>>>>>>>>> to make sure we land on the right tool here since choosing
>>> the
>>>>>>>>>>>> wrong tool
>>>>>>>>>>>> could result in administration of the instances taking more
>>>> work. I
>>>>>>>>>>>>
>>>>>>>>>>> suspect
>>>>>>>>>>>
>>>>>>>>>>>> that's a good place for a follow up discussion, so I'll
>>> start a
>>>>>>>>>>>> separate
>>>>>>>>>>>> thread on that. I'm happy with whatever tool we choose, but
>>> I
>>>> want
>>>>> to
>>>>>>>>>>>>
>>>>>>>>>>> make
>>>>>>>>>>
>>>>>>>>>>> sure we take a moment to consider different options and have
>>> a
>>>>>>>>>>>> reason for
>>>>>>>>>>>> choosing one.
>>>>>>>>>>>>
>>>>>>>>>>>> Etienne - thanks for being willing to port your
>>> creation/other
>>>>>>>>>>>> scripts
>>>>>>>>>>>> over. You might be a good early tester of whether this
>>> system
>>>> works
>>>>>>>>>>>> well
>>>>>>>>>>>> for everyone.
>>>>>>>>>>>>
>>>>>>>>>>>> Stephen
>>>>>>>>>>>>
>>>>>>>>>>>> [1]  Reasons for Beam Test Strategy -
>>>>>>>>>>>>
>>>>>>>>>>>>
>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
>>>>>>>> rQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
>>>>>>>>>>>> <jb...@nanthrax.net>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I second Etienne there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We worked together on the ElasticsearchIO and definitely,
>>> the
>>>> high
>>>>>>>>>>>>> valuable test we did were integration tests with ES on
>>> docker
>>>> and
>>>>>>>>>>>>> high
>>>>>>>>>>>>> volume.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think we have to distinguish the two kinds of tests:
>>>>>>>>>>>>> 1. utests are located in the IO itself and basically they
>>> should
>>>>>>>>>>>>> cover
>>>>>>>>>>>>> the core behaviors of the IO
>>>>>>>>>>>>> 2. itests are located as contrib in the IO (they could be
>>> part
>>>> of
>>>>>>>>>>>>> the IO
>>>>>>>>>>>>> but executed by the integration-test plugin or a specific
>>>> profile)
>>>>>>>>>>>>> that
>>>>>>>>>>>>> deals with "real" backend and high volumes. The resources
>>>> required
>>>>>>>>>>>>> by
>>>>>>>>>>>>> the itest can be bootstrapped by Jenkins (for instance
>>> using
>>>>>>>>>>>>> Mesos/Marathon and docker images as already discussed, and
>>> it's
>>>>>>>>>>>>> what I'm
>>>>>>>>>>>>> doing on my own "server").
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's basically what Stephen described.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We must not rely only on itests: utests are very important and
>>>>>>>>>>>>> they validate the core behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My $0.01 ;)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> JB
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Stephen,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like your proposition very much and I also agree that
>>> docker
>>>> +
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> orchestration software would be great !
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the elasticsearchIO (PR to be created this week) there
>>> is
>>>>> docker
>>>>>>>>>>>>>> container creation scripts and logstash data ingestion
>>> script
>>>> for
>>>>>>>>>>>>>> IT
>>>>>>>>>>>>>> environment available in contrib directory alongside with
>>>>>>>>>>>>>> integration
>>>>>>>>>>>>>> tests themselves. I'll be happy to make them compliant to
>>> new
>>>> IT
>>>>>>>>>>>>>> environment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What you say below about the need for an external IT environment
>>>>>>>>>>>>>> is particularly true. As an example with ES, what came out in the
>>>>>>>>>>>>>> first implementation was that there were problems starting at some
>>>>>>>>>>>>>> high volume of data (timeouts, ES windowing overflow...) that
>>>>>>>>>>>>>> could not have been seen on the embedded ES version. Also there
>>>>>>>>>>>>>> were some particularities to the external instance, like secondary
>>>>>>>>>>>>>> (replica) shards, that were not visible on the embedded instance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Besides, I also favor bringing up instances before test
>>> because
>>>>> it
>>>>>>>>>>>>>> allows (amongst other things) to be sure to start on a
>>> fresh
>>>>>>>>>>>>>> dataset
>>>>>>>>>>>>>>
>>>>>>>>>>>>> for
>>>>>>>>>>
>>>>>>>>>>> the test to be deterministic.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Etienne
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm excited we're getting lots of discussion going.
>>> There are
>>>>> many
>>>>>>>>>>>>>>> threads
>>>>>>>>>>>>>>> of conversation here, we may choose to split some of
>>> them off
>>>>>>>>>>>>>>> into a
>>>>>>>>>>>>>>> different email thread. I'm also betting I missed some
>>> of the
>>>>>>>>>>>>>>> questions in
>>>>>>>>>>>>>>> this thread, so apologies ahead of time for that. Also
>>>> apologies
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>>> amount of text, I provided some quick summaries at the top
>>> of
>>>> each
>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Amit - thanks for your thoughts. I've responded in
>>> detail
>>>> below.
>>>>>>>>>>>>>>> Ismael - thanks for offering to help. There's plenty of
>>> work
>>>>>>>>>>>>>>> here to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> go
>>>>>>>>>>
>>>>>>>>>>> around. I'll try and think about how we can divide up some
>>> next
>>>>>>>>>>>>>>> steps
>>>>>>>>>>>>>>> (probably in a separate thread.) The main next step I
>>> see is
>>>>>>>>>>>>>>> deciding
>>>>>>>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm
>>> working
>>>> on
>>>>>>>>>>>>>>> that,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> but
>>>>>>>>>>>>>
>>>>>>>>>>>>>> having lots of different thoughts on what the
>>>>>>>>>>>>>>> advantages/disadvantages
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> of
>>>>>>>>>>>>>
>>>>>>>>>>>>>> those are would be helpful (I'm not entirely sure of the
>>>>>>>>>>>>>>> protocol for
>>>>>>>>>>>>>>> collaborating on sub-projects like this.)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These issues are all related to what kind of tests we
>>> want to
>>>>>>>>>>>>>>> write. I
>>>>>>>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all
>>> the
>>>> use
>>>>>>>>>>>>>>> cases
>>>>>>>>>>>>>>> we've discussed here (and thus should not block moving
>>> forward
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> this),
>>>>>>>>>>>>>>> but understanding what we want to test will help us
>>> understand
>>>>>>>>>>>>>>> how the
>>>>>>>>>>>>>>> cluster will be used. I'm working on a proposed user
>>> guide for
>>>>>>>>>>>>>>> testing
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IO
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Transforms, and I'm going to send out a link to that + a
>>> short
>>>>>>>>>>>>>>> summary
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>
>>>>>>>>>>>>>> the list shortly so folks can get a better sense of where
>>> I'm
>>>>>>>>>>>>>>> coming
>>>>>>>>>>>>>>> from.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's my thinking on the questions we've raised here -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Embedded versions of data stores for testing
>>>>>>>>>>>>>>> --------------------
>>>>>>>>>>>>>>> Summary: yes! But we still need real data stores to test
>>>>> against.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am a gigantic fan of using embedded versions of the
>>> various
>>>>> data
>>>>>>>>>>>>>>> stores.
>>>>>>>>>>>>>>> I think we should test everything we possibly can using
>>> them,
>>>>>>>>>>>>>>> and do
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>>> majority of our correctness testing using embedded versions
>>> + the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> direct
>>>>>>>>>>>
>>>>>>>>>>>> runner. However, it's also important to have at least one
>>> test
>>>> that
>>>>>>>>>>>>>>> actually connects to an actual instance, so we can get
>>>> coverage
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>> like credentials, real connection strings, etc...
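>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (For example - a sketch with made-up property names - the IT could
>>>>>>>>>>>>>>> read the real endpoint from system properties, defaulting to a
>>>>>>>>>>>>>>> local docker instance:)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> public class ConnectionConfig {
>>>>>>>>>>>>>>>   // Hypothetical property names; Jenkins would override these to
>>>>>>>>>>>>>>>   // point at the hosted instances.
>>>>>>>>>>>>>>>   static final String HOST = System.getProperty("io.it.host", "localhost");
>>>>>>>>>>>>>>>   static final int PORT = Integer.getInteger("io.it.port", 9300);
>>>>>>>>>>>>>>> }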
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The key point is that embedded versions definitely can't
>>> cover
>>>>> the
>>>>>>>>>>>>>>> performance tests, so we need to host instances if we
>>> want to
>>>>> test
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> that.
>>>>>>>>>>>
>>>>>>>>>>>> I consider the integration tests/performance benchmarks to
>>> be
>>>>>>>>>>>>>>> costly
>>>>>>>>>>>>>>> things
>>>>>>>>>>>>>>> that we do only for the IO transforms with large amounts
>>> of
>>>>>>>>>>>>>>> community
>>>>>>>>>>>>>>> support/usage. A random IO transform used by a few users
>>>> doesn't
>>>>>>>>>>>>>>> necessarily need integration & perf tests, but for
>>> heavily
>>>> used
>>>>> IO
>>>>>>>>>>>>>>> transforms, there's a lot of community value in these
>>> tests.
>>>> The
>>>>>>>>>>>>>>> maintenance proposal below scales with the amount of
>>> community
>>>>>>>>>>>>>>> support
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> a particular IO transform.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reusing data stores ("use the data stores across
>>> executions.")
>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>> Summary: I favor a hybrid approach: some frequently
>>> used, very
>>>>>>>>>>>>>>> small
>>>>>>>>>>>>>>> instances that we keep up all the time + larger
>>>> multi-container
>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>> instances that we spin up for perf tests.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't think we need to have a strong answer to this
>>>> question,
>>>>>>>>>>>>>>> but I
>>>>>>>>>>>>>>> think
>>>>>>>>>>>>>>> we do need to know what range of capabilities we need,
>>> and use
>>>>>>>>>>>>>>> that to
>>>>>>>>>>>>>>> inform our requirements on the hosting infrastructure. I
>>> think
>>>>>>>>>>>>>>> kubernetes/mesos + docker can support all the scenarios
>>> I
>>>>> discuss
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> below.
>>>>>>>>>>>
>>>>>>>>>>>> I had been thinking of a hybrid approach - reuse some
>>> instances
>>>> and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>
>>>>>>>>>>>>>> reuse others. Some tests require isolation from other
>>> tests
>>>> (eg.
>>>>>>>>>>>>>>> performance benchmarking), while others can easily
>>> re-use the
>>>>> same
>>>>>>>>>>>>>>> database/data store instance over time, provided they
>>> are
>>>>>>>>>>>>>>> written in
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>
>>>>>>>>>>>> correct manner (eg. a simple read or write correctness
>>>> integration
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> tests)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> To me, the question of whether to use one instance over
>>> time
>>>> for
>>>>> a
>>>>>>>>>>>>>>> test vs
>>>>>>>>>>>>>>> spin up an instance for each test comes down to a trade
>>> off
>>>>>>>>>>>>>>> between
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> these
>>>>>>>>>>>>>
>>>>>>>>>>>>>> factors:
>>>>>>>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super
>>> flaky,
>>>>>>>>>>>>>>> we'll
>>>>>>>>>>>>>>> want to
>>>>>>>>>>>>>>> keep more instances up and running rather than bring
>>> them
>>>>> up/down.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (this
>>>>>>>>>>>
>>>>>>>>>>>> may also vary by the data store in question)
>>>>>>>>>>>>>>> 2. Frequency of testing - if we are running tests every
>>> 5
>>>>>>>>>>>>>>> minutes, it
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>
>>>>>>>>>>>>>> be wasteful to bring machines up/down every time. If we
>>> run
>>>>>>>>>>>>>>> tests once
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> a
>>>>>>>>>>>
>>>>>>>>>>>> day or week, it seems wasteful to keep the machines up the
>>> whole
>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>> 3. Isolation requirements - If tests must be isolated,
>>> it
>>>> means
>>>>> we
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> either
>>>>>>>>>>>>>
>>>>>>>>>>>>>> have to bring up the instances for each test, or we have
>>> to
>>>> have
>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>> sort
>>>>>>>>>>>>>>> of signaling mechanism to indicate that a given instance
>>> is in
>>>>>>>>>>>>>>> use. I
>>>>>>>>>>>>>>> strongly favor bringing up an instance per test.
>>>>>>>>>>>>>>> 4. Number/size of containers - if we need a large number
>>> of
>>>>>>>>>>>>>>> machines
>>>>>>>>>>>>>>> for a
>>>>>>>>>>>>>>> particular test, keeping them running all the time will
>>> use
>>>> more
>>>>>>>>>>>>>>> resources.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The major unknown to me is how flaky it'll be to spin
>>> these
>>>> up.
>>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up,
>>> but I
>>>>>>>>>>>>>>> think the
>>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>> way to test that is to start doing it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I suspect the sweet spot is the following: have a set of
>>> very
>>>>>>>>>>>>>>> small
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> data
>>>>>>>>>>>
>>>>>>>>>>>> store instances that stay up to support small-data-size
>>>> post-commit
>>>>>>>>>>>>>>> end to
>>>>>>>>>>>>>>> end tests (post-commits run frequently and the data size
>>> means
>>>>> the
>>>>>>>>>>>>>>> instances would not use many resources), combined with
>>> the
>>>>>>>>>>>>>>> ability to
>>>>>>>>>>>>>>> spin
>>>>>>>>>>>>>>> up larger instances for once a day/week performance
>>> benchmarks
>>>>>>>>>>>>>>> (these
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> use
>>>>>>>>>>>>>
>>>>>>>>>>>>>> up more resources and are used less frequently.) That's
>>> the mix
>>>>>>>>>>>>>>> I'll
>>>>>>>>>>>>>>> propose in my docs on testing IO transforms.  If
>>> spinning up
>>>> new
>>>>>>>>>>>>>>> instances
>>>>>>>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of
>>> spinning up
>>>>>>>>>>>>>>> instances
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> each test.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management ("what's the overhead of managing such a
>>>> deployment")
>>>>>>>>>>>>>>> --------------------
>>>>>>>>>>>>>>> Summary: I propose that anyone can contribute scripts
>>> for
>>>>>>>>>>>>>>> setting up
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> data
>>>>>>>>>>>>>
>>>>>>>>>>>>>> store instances + integration/perf tests, but if the
>>> community
>>>>>>>>>>>>>>> doesn't
>>>>>>>>>>>>>>> maintain a particular data store's tests, we disable the
>>> tests
>>>>> and
>>>>>>>>>>>>>>> turn off
>>>>>>>>>>>>>>> the data store instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Management of these instances is a crucial question.
>>> First,
>>>>> let's
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> break
>>>>>>>>>>
>>>>>>>>>>> down what tasks we'll need to do on a recurring basis:
>>>>>>>>>>>>>>> 1. Ongoing maintenance (update to new versions, both
>>> instance
>>>> &
>>>>>>>>>>>>>>> dependencies) - we don't want to have a lot of old
>>> versions
>>>>>>>>>>>>>>> vulnerable
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>
>>>>>>>>>>>>>> attacks/buggy
>>>>>>>>>>>>>>> 2. Investigate breakages/regressions
>>>>>>>>>>>>>>> (I'm betting there will be more things we'll discover -
>>> let me
>>>>>>>>>>>>>>> know if
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> have suggestions)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There's a couple goals I see:
>>>>>>>>>>>>>>> 1. We should only do sys admin work for things that give
>>> us a
>>>>>>>>>>>>>>> lot of
>>>>>>>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up
>>> scripts
>>>> for
>>>>>>>>>>>>>>> data
>>>>>>>>>>>>>>> stores
>>>>>>>>>>>>>>> without a large community)
>>>>>>>>>>>>>>> 2. We should do as much as possible of testing via
>>>>>>>>>>>>>>> in-memory/embedded
>>>>>>>>>>>>>>> testing (as you brought up).
>>>>>>>>>>>>>>> 3. Reduce the amount of manual administration overhead
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I discussed above, I think that integration
>>>> tests/performance
>>>>>>>>>>>>>>> benchmarks
>>>>>>>>>>>>>>> are costly things that we should do only for the IO
>>> transforms
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> large
>>>>>>>>>>>>>
>>>>>>>>>>>>>> amounts of community support/usage. Thus, I propose that
>>> we
>>>>>>>>>>>>>>> limit the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IO
>>>>>>>>>>>
>>>>>>>>>>>> transforms that get integration tests & performance
>>> benchmarks to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> those
>>>>>>>>>>
>>>>>>>>>>> that have community support for maintaining the data store
>>>>>>>>>>>>>>> instances.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can enforce this organically using some simple rules:
>>>>>>>>>>>>>>> 1. Investigating breakages/regressions: if a given
>>>>>>>>>>>>>>> integration/perf
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> test
>>>>>>>>>>>
>>>>>>>>>>>> starts failing and no one investigates it within a set
>>> period of
>>>>>>>>>>>>>>> time
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (a
>>>>>>>>>>>
>>>>>>>>>>>> week?), we disable the tests and shut off the data store
>>>>>>>>>>>>>>> instances if
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> we
>>>>>>>>>>>
>>>>>>>>>>>> have instances running. When someone wants to step up and
>>>>>>>>>>>>>>> support it
>>>>>>>>>>>>>>> again,
>>>>>>>>>>>>>>> they can fix the test, check it in, and re-enable the
>>> test.
>>>>>>>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira
>>> issue that
>>>>>>>>>>>>>>> is just
>>>>>>>>>>>>>>> "is
>>>>>>>>>>>>>>> the IO Transform X data store up to date?" - if the jira
>>> is
>>>> not
>>>>>>>>>>>>>>> resolved in
>>>>>>>>>>>>>>> a set period of time (1 month?), the perf/integration
>>> tests
>>>> are
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> disabled,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and the data store instances shut off.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is pretty flexible -
>>>>>>>>>>>>>>> * If a particular person or organization wants to
>>> support an
>>>> IO
>>>>>>>>>>>>>>> transform,
>>>>>>>>>>>>>>> they can. If a group of people all organically organize
>>> to
>>>> keep
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> tests
>>>>>>>>>>>>>
>>>>>>>>>>>>>> running, they can.
>>>>>>>>>>>>>>> * It can be mostly automated - there's not a lot of
>>> central
>>>>>>>>>>>>>>> organizing
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>> that needs to be done.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Exposing the information about what IO transforms
>>> currently
>>>> have
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> running
>>>>>>>>>>>
>>>>>>>>>>>> IT/perf benchmarks on the website will let users know what
>>> IO
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> transforms
>>>>>>>>>>>
>>>>>>>>>>>> are well supported.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I like this solution, but I also recognize this is a
>>> tricky
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This
>>>>>>>>>>>>>
>>>>>>>>>>>>>> is something the community needs to be supportive of, so
>>> I'm
>>>>>>>>>>>>>>> open to
>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>> thoughts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Simulating failures in real nodes ("programmatic tests
>>> to
>>>>> simulate
>>>>>>>>>>>>>>> failure")
>>>>>>>>>>>>>>> -----------------
>>>>>>>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We
>>> should
>>>>>>>>>>>>>>> encourage a
>>>>>>>>>>>>>>> design pattern separating out network/retry logic from
>>> the
>>>> main
>>>>> IO
>>>>>>>>>>>>>>> transform logic
>>>>>>>>
>>
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hosting data stores for IO Transform testing

Posted by Stephen Sisk <si...@google.com.INVALID>.
ah! I looked around a bit more and found the dcos package repo -
https://github.com/mesosphere/universe/tree/version-3.x/repo/packages

poking around a bit, I can find a lot of packages for single node
instances, but not many packages for multi-node instances. Single node
instance packages are kind of useful, but I don't think they're *too* helpful.
The multi-node instance packages that run the data store's high
availability mode are where the real work is, and it seems like both
kubernetes helm and dcos' package universe don't have a lot of those.

S

On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <si...@google.com> wrote:

> Hi Ismaël,
>
> these are good questions, thanks for raising them.
>
> Ability to modify network/compute resources to simulate failures
> =================================================
> I see two real questions here:
> 1. Is this something we want to do?
> 2. Is it possible with both/either?
>
> So far, the test strategy I've been advocating is that we test problems
> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
> it's hard to re-create the same conditions.
>
> I can investigate whether it's possible, but I want to clarify whether
> this is something that we care about. I know both support killing
> individual nodes. I haven't seen a lot of network control in either, but
> haven't tried to look for it.
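>
> To give a flavor of the "kill a node" case, here is a minimal sketch -
> assuming a Java kubernetes client such as fabric8, and made-up
> pod/namespace names (I believe marathon's REST API could do the
> equivalent on the mesos side):
>
> import io.fabric8.kubernetes.client.DefaultKubernetesClient;
> import io.fabric8.kubernetes.client.KubernetesClient;
>
> public class NodeKiller {
>   public static void main(String[] args) {
>     // Connects using the local kubeconfig, the same way kubectl does.
>     try (KubernetesClient client = new DefaultKubernetesClient()) {
>       // Deleting one pod of a (hypothetical) 3-node cassandra cluster
>       // simulates a node failure; the IO under test should keep reading
>       // from the surviving nodes.
>       client.pods().inNamespace("io-testing").withName("cassandra-1").delete();
>     }
>   }
> }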
>
> Availability of ready to play packages
> ============================
> I did look at this, and as far as I could tell, mesos didn't have any
> pre-built packages for multi-node clusters of data stores. If there's a
> good repository of them that we trust, that would definitely save us time.
> Can you point me at the mesos repository?
>
> S
>
>
>
> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> Hi Ismaël,
>
> Stephen will reply with details but I know he did a comparison and
> evaluated different options.
>
> He tested with the JdbcIO itests.
>
> Regards
> JB
>
> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <ie...@gmail.com>
> wrote:
> >Thanks for your analysis Stephen, good arguments / references.
> >
> >One quick question. Have you checked the APIs of both
> >(Mesos/Kubernetes) to
> >see
> >if we can programmatically do more complex tests (I suppose so, but
> >you
> >don't mention how easy or if those are possible), for example to
> >simulate a
> >slow networking slave (to test stragglers), or to arbitrarily kill one
> >slave (e.g. if I want to test the correct behavior of a runner/IO that
> >is
> >reading from it) ?
> >
> >Another point missing from the review is the availability of ready-to-play
> >packages,
> >I think in this area mesos/dcos seems more advanced no? I haven't
> >looked
> >recently but at least 6 months ago there were not many helm packages
> >ready
> >for
> >example to test kafka or the hadoop ecosystem stuff (hdfs, hbase,
> >etc). Has
> >this been improved? Because preparing this is also a considerable
> >amount of
> >work; on the other hand, this could also be a chance to contribute to
> >kubernetes.
> >
> >Regards,
> >Ismaël
> >
> >
> >
> >On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <si...@google.com.invalid>
> >wrote:
> >
> >> hi!
> >>
> >> I've been continuing this investigation, and have some more info to
> >report,
> >> and hopefully we can start making some decisions.
> >>
> >> To support performance testing, I've been investigating
> >mesos+marathon and
> >> kubernetes for running data stores in their high availability mode. I
> >have
> >> been examining features that kubernetes/mesos+marathon use to support
> >this.
> >>
> >> Setting up a multi-node cluster in a high availability mode tends to
> >be
> >> more expensive time-wise than the single node instances I've played
> >around
> >> with in the past. Rather than do a full build out with both
> >kubernetes and
> >> mesos, I'd like to pick one of the two options to build the prototype
> >> cluster with. If the prototype doesn't go well, we could still go
> >back to
> >> the other option, but I'd like to change us from a mode of "let's
> >look at
> >> all the options" to one of "here's the favorite, let's prove that
> >works for
> >> us".
> >>
> >> Below are the features that I've seen are important to multi-node
> >instances
> >> of data stores. I'm sure other folks on the list have done this
> >before, so
> >> feel free to pipe up if I'm missing a good solution to a problem.
> >>
> >> DNS/Discovery
> >>
> >> --------------------
> >>
> >> Necessary for talking between nodes (eg, cassandra nodes all need to
> >be
> >> able to talk to a set of seed nodes.)
> >>
> >> * Kubernetes has built-in DNS/discovery between nodes.
> >>
> >> * Mesos has supports this via mesos-dns, which isn't a part of core
> >mesos,
> >> but is in dcos, which is the mesos distribution I've been using and
> >that I
> >> would expect us to use.
> >>
> >> Instances properly distributed across nodes
> >>
> >> ------------------------------------------------------------
> >>
> >> If multiple instances of a data source end up on the same underlying
> >VM, we
> >> may not get good performance out of those instances since the
> >underlying VM
> >> may be more taxed than other VMs.
> >>
> >> * Kubernetes has a beta feature StatefulSets[1] which allow for
> >containers
> >> distributed so that there's one container per underlying machine (as
> >well
> >> as a lot of other useful features like easy stable dns names.)
> >>
> >> * Mesos can support this via the built in UNIQUE constraint [2]
> >>
> >> Load balancing
> >>
> >> --------------------
> >>
> >> Incoming requests from users need to be distributed to the various
> >machines
> >> - this is important for many data stores' high availability modes.
> >>
> >> * Kubernetes supports easily hooking up to an external load balancer
> >when
> >> on a cloud (and can be configured to work with a built-in load
> >balancer if
> >> not)
> >>
> >> * Mesos supports this via marathon-lb [3], which is an install-able
> >package
> >> in DC/OS
> >>
> >> Persistent Volumes tied to specific instances
> >>
> >> ------------------------------------------------------------
> >>
> >> Databases often need persistent state (for example to store the data
> >:), so
> >> it's an important part of running our service.
> >>
> >> * Kubernetes StatefulSets supports this
> >>
> >> * Mesos+marathon apps with persistent volumes supports this [4] [5]
> >>
> >> As I mentioned above, I'd like to focus on either kubernetes or mesos
> >for
> >> my investigation, and as I go further along, I'm seeing kubernetes as
> >> better suited to our needs.
> >>
> >> (1) It supports more of the features we want out of the box and with
> >> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
> >> requires marathon-lb to be installed and mesos-dns to be configured.
> >>
> >> (2) I'm also finding that there seem to be more examples of using
> >> kubernetes to solve the types of problems we're working on. This is
> >> somewhat subjective, but in my experience as I've tried to learn both
> >> kubernetes and mesos, I personally found it generally easier to get
> >> kubernetes running than mesos due to the tutorials/examples available
> >for
> >> kubernetes.
> >>
> >> (3) Lower cost of initial setup - as I discussed in a previous
> >mail[6],
> >> kubernetes was far easier to get set up even when I knew the exact
> >steps.
> >> Mesos took me around 27 steps [7], which involved a lot of config
> >that was
> >> easy to get wrong (it took me about 5 tries to get all the steps
> >correct in
> >> one go.) Kubernetes took me around 8 steps and very little config.
> >>
> >> Given that, I'd like to focus my investigation/prototyping on
> >Kubernetes.
> >> To
> >> be clear, it's fairly close and I think both Mesos and Kubernetes
> >could
> >> support what we need, so if we run into issues with kubernetes, Mesos
> >still
> >> seems like a viable option that we could fall back to.
> >>
> >> Thanks,
> >> Stephen
> >>
> >>
> >> [1] Kubernetes StatefulSets
> >>
> >
> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
> >>
> >> [2] mesos unique constraint -
> >> https://mesosphere.github.io/marathon/docs/constraints.html
> >>
> >> [3]
> >> https://mesosphere.github.io/marathon/docs/service-
> >> discovery-load-balancing.html
> >>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
> >>
> >> [4]
> >https://mesosphere.github.io/marathon/docs/persistent-volumes.html
> >>
> >> [5]
> >https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
> >>
> >> [6] Container Orchestration software for hosting data stores
> >> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
> >> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
> >>
> >> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
> >>
> >>
> >> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
> >wrote:
> >>
> >> > Just a quick drive-by comment: how tests are laid out has
> >non-trivial
> >> > tradeoffs on how/where continuous integration runs, and how results
> >are
> >> > integrated into the tooling. The current state is certainly not
> >ideal
> >> > (e.g., due to multiple test executions some links in Jenkins point
> >where
> >> > they shouldn't), but most other alternatives had even bigger
> >drawbacks at
> >> > the time. If someone has great ideas that don't explode the number
> >of
> >> > modules, please share ;-)
> >> >
> >> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
> ><ec...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Stephen,
> >> > >
> >> > > Thanks for taking the time to comment.
> >> > >
> >> > > My comments are below in the email:
> >> > >
> >> > >
> >> > > On 24/12/2016 at 00:07, Stephen Sisk wrote:
> >> > >
> >> > >> hey Etienne -
> >> > >>
> >> > >> thanks for your thoughts and thanks for sharing your
> >experiences. I
> >> > >> generally agree with what you're saying. Quick comments below:
> >> > >>
> >> > >> IT are stored alongside with UT in src/test directory of the IO
> >but
> >> they
> >> > >>>
> >> > >> might go to dedicated module, waiting for a consensus
> >> > >> I don't have a strong opinion or feel that I've worked enough
> >with
> >> maven
> >> > >> to
> >> > >> understand all the consequences - I'd love for someone with more
> >maven
> >> > >> experience to weigh in. If this becomes blocking, I'd say check
> >it in,
> >> > and
> >> > >> we can refactor later if it proves problematic.
> >> > >>
> >> > > Sure, not a blocking point, it could be refactored afterwards.
> >Just as
> >> a
> >> > > reminder, JB mentioned that storing IT in separate module allows
> >to
> >> have
> >> > > more coherence between all IT (same behavior) and to do cross IO
> >> > > integration tests. JB, have you experienced some long term
> >drawbacks of
> >> > > storing IT in a separate module, like, for example, more difficult
> >> > > maintenance due to "distance" from production code?
> >> > >
> >> > >
> >> > >>   Also IMHO, it is better that tests load/clean data than making
> >> > >> assumptions about the running order of the tests.
> >> > >> I definitely agree that we don't want to make assumptions about
> >the
> >> > >> running
> >> > >> order of the tests - that way lies pain. :) It will be
> >interesting to
> >> > see
> >> > >> how the performance tests work out since they will need more
> >data (and
> >> > >> thus
> >> > >> loading data can take much longer.)
> >> > >>
> >> > > Yes, performance testing might push in the direction of data
> >loading
> >> from
> >> > > outside the tests due to loading time.
> >> > >
> >> > >>   This should also be an easier problem
> >> > >> for read tests than for write tests - if we have long running
> >> instances,
> >> > >> read tests don't really need cleanup. And if write tests only
> >write a
> >> > >> small
> >> > >> amount of data, as long as we are sure we're writing to uniquely
> >> > >> identifiable locations (ie, new table per test or something
> >similar),
> >> we
> >> > >> can clean up the write test data on a slower schedule.
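> >> > >>
> >> > >> (A sketch of "uniquely identifiable locations" - naming is purely
> >> > >> illustrative:)
> >> > >>
> >> > >> import java.util.UUID;
> >> > >>
> >> > >> public class WriteTargets {
> >> > >>   // One uniquely named, easily identifiable table per test run, so
> >> > >>   // leftover data can be attributed and garbage-collected later.
> >> > >>   public static String newTable(String testName) {
> >> > >>     return "beam_write_it_" + testName + "_"
> >> > >>         + UUID.randomUUID().toString().replace("-", "");
> >> > >>   }
> >> > >> }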
> >> > >>
> >> > > I agree
> >> > >
> >> > >>
> >> > >> this will tend to go to the direction of long running data store
> >> > >>>
> >> > >> instances rather than data store instances started (and
> >optionally
> >> > loaded)
> >> > >> before tests.
> >> > >> It may be easiest to start with a "data stores stay running"
> >> > >> implementation, and then if we see issues with that move towards
> >tests
> >> > >> that
> >> > >> start/stop the data stores on each run. One thing I'd like to
> >make
> >> sure
> >> > is
> >> > >> that we're not manually tweaking the configurations for data
> >stores.
> >> One
> >> > >> way we could do that is to destroy/recreate the data stores on a
> >> slower
> >> > >> schedule - maybe once per week. That way if the script is
> >changed or
> >> the
> >> > >> data store instances are changed, we'd be able to detect it
> >relatively
> >> > >> soon
> >> > >> while still removing the need for the tests to manage the data
> >stores.
> >> > >>
> >> > > I agree. In addition to manual configuration tweaking, there might be
> >> > > cases in which a data store re-partitions data during a test, or after
> >> > > some tests, while the dataset changes. The IO must be tolerant to
> >> > > that, but the asserts (number of bundles, for example) in the test
> >> > > must not fail in that case (see the sketch below).
> >> > > I would also prefer, if possible, that the tests do not manage data
> >> > > stores (neither set them up, nor start or stop them).
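> >> > >
> >> > > (A sketch of what I mean by a tolerant assert - names and the dataset
> >> > > size are made up:)
> >> > >
> >> > > import static org.junit.Assert.assertEquals;
> >> > > import static org.junit.Assert.assertTrue;
> >> > >
> >> > > public class TolerantAsserts {
> >> > >   // The store may re-partition between runs, so only assert what must
> >> > >   // hold regardless of how the source was split into bundles.
> >> > >   static void checkRead(int bundleCount, long totalRecords) {
> >> > >     assertTrue("at least one bundle expected", bundleCount >= 1);
> >> > >     assertEquals("no records lost or duplicated", 1000L, totalRecords);
> >> > >   }
> >> > > }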
> >> > >
> >> > >
> >> > >> as a general note, I suspect many of the folks in the states
> >will be
> >> on
> >> > >> holiday until Jan 2nd/3rd.
> >> > >>
> >> > >> S
> >> > >>
> >> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
> ><echauchot@gmail.com
> >> >
> >> > >> wrote:
> >> > >>
> >> > >> Hi,
> >> > >>>
> >> > >>> Recently we had a discussion about integration tests of IOs.
> >I'm
> >> > >>> preparing a PR for integration tests of the elasticSearch IO
> >> > >>> (
> >> > >>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-E
> >> > >>> LASTICSEARCH-IO
> >> > >>> as a first shot) which are very important IMHO because they
> >helped
> >> > catch
> >> > >>> some bugs that UT could not (volume, data store instance
> >sharing,
> >> real
> >> > >>> data store instance ...)
> >> > >>>
> >> > >>> I would like to have your thoughts/remarks about the points below.
> >> > >>> Some of
> >> > >>> these points are also discussed here
> >> > >>>
> >> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
> >> > >>> rQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >> > >>> :
> >> > >>>
> >> > >>> - UT and IT have a similar architecture, but while UT focus on
> >> > >>> testing the correct behavior of the code, including corner cases, and
> >> > >>> use an embedded in-memory data store, IT assume that the behavior is
> >> > >>> correct (strong UT) and focus on higher-volume testing and testing
> >> > >>> against real data store instance(s)
> >> > >>>
> >> > >>> - For now, IT are stored alongside UT in the src/test directory of
> >> > >>> the IO, but they might go to a dedicated module, waiting for a
> >> > >>> consensus. Maven is not configured to run them automatically because
> >> > >>> the data store is not available on the jenkins server yet
> >> > >>>
> >> > >>> - For now, they only use DirectRunner, but they will be run against
> >> > >>> each runner.
> >> > >>>
> >> > >>> - IT do not set up the data store instance (as stated in the above
> >> > >>> document); they assume that one is already running (hardcoded
> >> > >>> configuration in test for now, waiting for a common solution to
> >pass
> >> > >>> configuration to IT). A docker container script is provided in
> >the
> >> > >>> contrib directory as a starting point to whatever orchestration
> >> > software
> >> > >>> will be chosen.
> >> > >>>
> >> > >>> - IT load and clean test data before and after each test if
> >needed.
> >> It
> >> > >>> is simpler to do so because some tests need empty data store
> >(write
> >> > >>> test) and because, as discussed in the document, tests might
> >not be
> >> the
> >> > >>> only users of the data store. Also IMHO, it is better that tests
> >> > >>> load/clean data than making assumptions about the running order of
> >> > >>> the tests (see the sketch after this list).
> >> > >>>
> >> > >>> If we generalize this pattern to all IT tests, this will tend
> >to go
> >> to
> >> > >>> the direction of long running data store instances rather than
> >data
> >> > >>> store instances started (and optionally loaded) before tests.
> >> > >>>
> >> > >>> Besides, if we were to change our minds and load data from outside
> >> > >>> the tests, a logstash script is provided.
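> >> > >>>
> >> > >>> To make the load/clean point concrete, here is roughly the shape of
> >> > >>> each IT (a self-contained sketch with illustrative names; the real
> >> > >>> tests of course go through the ES client against the running
> >> > >>> instance):
> >> > >>>
> >> > >>> import java.util.HashMap;
> >> > >>> import java.util.Map;
> >> > >>> import org.junit.After;
> >> > >>> import org.junit.Before;
> >> > >>> import org.junit.Test;
> >> > >>> import static org.junit.Assert.assertEquals;
> >> > >>>
> >> > >>> public class DataStoreReadIT {
> >> > >>>   // Stand-in for the real data store client, so the sketch stays
> >> > >>>   // self-contained.
> >> > >>>   private Map<String, String> store;
> >> > >>>
> >> > >>>   @Before
> >> > >>>   public void loadTestData() {
> >> > >>>     store = new HashMap<>();
> >> > >>>     for (int i = 0; i < 1000; i++) {
> >> > >>>       store.put("doc-" + i, "payload-" + i); // deterministic dataset
> >> > >>>     }
> >> > >>>   }
> >> > >>>
> >> > >>>   @After
> >> > >>>   public void cleanTestData() {
> >> > >>>     store.clear(); // leave the shared instance empty for the next test
> >> > >>>   }
> >> > >>>
> >> > >>>   @Test
> >> > >>>   public void readSeesTheWholeDataset() {
> >> > >>>     assertEquals(1000, store.size());
> >> > >>>   }
> >> > >>> }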
> >> > >>>
> >> > >>> If you have any thoughts or remarks I'm all ears :)
> >> > >>>
> >> > >>> Regards,
> >> > >>>
> >> > >>> Etienne
> >> > >>>
> >> > >>>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> >> > >>>
> >> > >>>> Hi Stephen,
> >> > >>>>
> >> > >>>> the purpose of having in a specific module is to share
> >resources and
> >> > >>>> apply the same behavior from IT perspective and be able to
> >have IT
> >> > >>>> "cross" IO (for instance, reading from JMS and sending to
> >Kafka, I
> >> > >>>> think that's the key idea for integration tests).
> >> > >>>>
> >> > >>>> For instance, in Karaf, we have:
> >> > >>>> - utest in each module
> >> > >>>> - itest module containing itests for all modules all together
> >> > >>>>
> >> > >>>> Regards
> >> > >>>> JB
> >> > >>>>
> >> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >> > >>>>
> >> > >>>>> Hi Etienne,
> >> > >>>>>
> >> > >>>>> thanks for following up and answering my questions.
> >> > >>>>>
> >> > >>>>> re: where to store integration tests - having them all in a
> >> separate
> >> > >>>>> module
> >> > >>>>> is an interesting idea. I couldn't find JB's comments about
> >moving
> >> > them
> >> > >>>>> into a separate module in the PR - can you share the reasons
> >for
> >> > >>>>> doing so?
> >> > >>>>> The IO integration/perf tests so it does seem like they'll
> >need to
> >> be
> >> > >>>>> treated in a special manner, but given that there is already
> >an IO
> >> > >>>>> specific
> >> > >>>>> module, it may just be that we need to treat all the ITs in
> >the IO
> >> > >>>>> module
> >> > >>>>> the same way. I don't have strong opinions either way right
> >now.
> >> > >>>>>
> >> > >>>>> S
> >> > >>>>>
> >> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
> >> > echauchot@gmail.com>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>> Hi guys,
> >> > >>>>>
> >> > >>>>> @Stephen: I addressed all your comments directly in the PR,
> >thanks!
> >> > >>>>> I just wanted to comment here about the docker image I used:
> >the
> >> only
> >> > >>>>> official Elastic image contains only ElasticSearch. But for
> >> testing I
> >> > >>>>> needed logstash (for ingestion) and kibana (not for
> >integration
> >> > tests,
> >> > >>>>> but to easily test REST requests to ES using sense). This is
> >why I
> >> > use
> >> > >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released
> >> > >>>>> under the Apache 2 license.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Besides, there is also a point about where to store
> >integration
> >> > tests:
> >> > >>>>> JB proposed in the PR to store integration tests to dedicated
> >> module
> >> > >>>>> rather than directly in the IO module (like I did).
> >> > >>>>>
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Etienne
> >> > >>>>>
> >> > >>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> >> > >>>>>
> >> > >>>>>> hey!
> >> > >>>>>>
> >> > >>>>>> thanks for sending this. I'm very excited to see this
> >change. I
> >> > >>>>>> added some
> >> > >>>>>> detail-oriented code review comments in addition to what
> >I've
> >> > >>>>>> discussed
> >> > >>>>>> here.
> >> > >>>>>>
> >> > >>>>>> The general goal is to allow for re-usable instantiation of
> >> > particular
> >> > >>>>>>
> >> > >>>>> data
> >> > >>>>>
> >> > >>>>>> store instances and this seems like a good start. Looks like
> >you
> >> > >>>>>> also have
> >> > >>>>>> a script to generate test data for your tests - that's
> >great.
> >> > >>>>>>
> >> > >>>>>> The next steps (definitely not blocking your work) will be
> >to have
> >> > >>>>>> ways to
> >> > >>>>>> create instances from the docker images you have here, and
> >use
> >> them
> >> > >>>>>> in the
> >> > >>>>>> tests. We'll need support in the test framework for that
> >since
> >> it'll
> >> > >>>>>> be
> >> > >>>>>> different on developer machines and in the beam jenkins
> >cluster,
> >> but
> >> > >>>>>> your
> >> > >>>>>> scripts here allow someone running these tests locally to
> >not have
> >> > to
> >> > >>>>>>
> >> > >>>>> worry
> >> > >>>>>
> >> > >>>>>> about getting the instance set up and can manually adjust,
> >so this
> >> > is
> >> > >>>>>> a
> >> > >>>>>> good incremental step.
> >> > >>>>>>
> >> > >>>>>> I have some thoughts now that I'm reviewing your scripts
> >(that I
> >> > >>>>>> didn't
> >> > >>>>>> have previously, so we are learning this together):
> >> > >>>>>> * It may be useful to try and document why we chose a
> >particular
> >> > >>>>>> docker
> >> > >>>>>> image as the base (ie, "this is the official supported
> >elastic
> >> > search
> >> > >>>>>> docker image" or "this image has several data stores
> >together that
> >> > >>>>>> can be
> >> > >>>>>> used for a couple different tests")  - I'm curious as to
> >whether
> >> the
> >> > >>>>>> community thinks that is important
> >> > >>>>>>
> >> > >>>>>> One thing that I called out in the comment that's worth
> >mentioning
> >> > >>>>>> on the
> >> > >>>>>> larger list - if you want to specify which specific runners
> >a test
> >> > >>>>>> uses,
> >> > >>>>>> that can be controlled in the pom for the module. I updated
> >the
> >> > >>>>>> testing
> >> > >>>>>>
> >> > >>>>> doc
> >> > >>>>>
> >> > >>>>>> mentioned previously in this thread with a TODO to talk
> >about this
> >> > >>>>>> more. I
> >> > >>>>>> think we should also make it so that IO modules have that
> >> > >>>>>> automatically,
> >> > >>>>>>
> >> > >>>>> so
> >> > >>>>>
> >> > >>>>>> developers don't have to worry about it.
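> >> > >>>>>>
> >> > >>>>>> (On the code side, the IT itself can stay runner-agnostic by taking
> >> > >>>>>> the runner from pipeline options - a sketch, not the actual test
> >> > >>>>>> code:)
> >> > >>>>>>
> >> > >>>>>> import org.apache.beam.sdk.Pipeline;
> >> > >>>>>> import org.apache.beam.sdk.options.PipelineOptions;
> >> > >>>>>> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> >> > >>>>>>
> >> > >>>>>> public class RunnerAgnosticIT {
> >> > >>>>>>   public static void main(String[] args) {
> >> > >>>>>>     // e.g. --runner=DirectRunner on a dev box; Jenkins can pass a
> >> > >>>>>>     // different runner without the test changing.
> >> > >>>>>>     PipelineOptions options =
> >> > >>>>>>         PipelineOptionsFactory.fromArgs(args).withValidation().create();
> >> > >>>>>>     Pipeline pipeline = Pipeline.create(options);
> >> > >>>>>>     // ... build the pipeline under test here ...
> >> > >>>>>>     pipeline.run();
> >> > >>>>>>   }
> >> > >>>>>> }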
> >> > >>>>>>
> >> > >>>>>> S
> >> > >>>>>>
> >> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
> >> > echauchot@gmail.com>
> >> > >>>>>>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Stephen,
> >> > >>>>>>
> >> > >>>>>> As discussed, I added injection script, docker containers
> >scripts
> >> > and
> >> > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
> >> > >>>>>> <
> >> > >>>>>>
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
> >> > >>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
> >> > >>> d824cefcb3ed0b9
> >> > >>>
> >> > >>>> directory in that PR:
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >> > >>>>>>
> >> > >>>>>> These work well but they are first shot. Do you have any
> >comments
> >> > >>>>>> about
> >> > >>>>>> those?
> >> > >>>>>>
> >> > >>>>>> Besides I am not very sure that these files should be in the
> >IO
> >> > itself
> >> > >>>>>> (even in contrib directory, out of maven source
> >directories). Any
> >> > >>>>>>
> >> > >>>>> thoughts?
> >> > >>>>>
> >> > >>>>>> Thanks,
> >> > >>>>>>
> >> > >>>>>> Etienne
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> >> > >>>>>>
> >> > >>>>>>> It's great to hear more experiences.
> >> > >>>>>>>
> >> > >>>>>>> I'm also glad to hear that people see real value in the
> >high
> >> > >>>>>>> volume/performance benchmark tests. I tried to capture that
> >in
> >> the
> >> > >>>>>>>
> >> > >>>>>> Testing
> >> > >>>>>
> >> > >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >> > >>>>>>>
> >> > >>>>>>> It does generally sound like we're in agreement here. Areas
> >of
> >> > >>>>>>> discussion
> >> > >>>>>>>
> >> > >>>>>> I
> >> > >>>>>>
> >> > >>>>>>> see:
> >> > >>>>>>> 1.  People like the idea of bringing up fresh instances for
> >each
> >> > test
> >> > >>>>>>> rather than keeping instances running all the time, since
> >that
> >> > >>>>>>> ensures no
> >> > >>>>>>> contamination between tests. That seems reasonable to me.
> >If we
> >> see
> >> > >>>>>>> flakiness in the tests or we note that setting up/tearing
> >down
> >> > >>>>>>> instances
> >> > >>>>>>>
> >> > >>>>>> is
> >> > >>>>>>
> >> > >>>>>>> taking a lot of time,
> >> > >>>>>>> 2. Deciding on cluster management software/orchestration
> >software
> >> > - I
> >> > >>>>>>>
> >> > >>>>>> want
> >> > >>>>>
> >> > >>>>>> to make sure we land on the right tool here since choosing
> >the
> >> > >>>>>>> wrong tool
> >> > >>>>>>> could result in administration of the instances taking more
> >> work. I
> >> > >>>>>>>
> >> > >>>>>> suspect
> >> > >>>>>>
> >> > >>>>>>> that's a good place for a follow up discussion, so I'll
> >start a
> >> > >>>>>>> separate
> >> > >>>>>>> thread on that. I'm happy with whatever tool we choose, but
> >I
> >> want
> >> > to
> >> > >>>>>>>
> >> > >>>>>> make
> >> > >>>>>
> >> > >>>>>> sure we take a moment to consider different options and have
> >a
> >> > >>>>>>> reason for
> >> > >>>>>>> choosing one.
> >> > >>>>>>>
> >> > >>>>>>> Etienne - thanks for being willing to port your
> >creation/other
> >> > >>>>>>> scripts
> >> > >>>>>>> over. You might be a good early tester of whether this
> >system
> >> works
> >> > >>>>>>> well
> >> > >>>>>>> for everyone.
> >> > >>>>>>>
> >> > >>>>>>> Stephen
> >> > >>>>>>>
> >> > >>>>>>> [1]  Reasons for Beam Test Strategy -
> >> > >>>>>>>
> >> > >>>>>>>
> >https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
> >> > >>> rQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >> > >>>
> >> > >>>>
> >> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> >> > >>>>>>> <jb...@nanthrax.net>
> >> > >>>>>>> wrote:
> >> > >>>>>>>
> >> > >>>>>>> I second Etienne there.
> >> > >>>>>>>>
> >> > >>>>>>>> We worked together on the ElasticsearchIO and definitely,
> >the
> >> high
> >> > >>>>>>>> valuable test we did were integration tests with ES on
> >docker
> >> and
> >> > >>>>>>>> high
> >> > >>>>>>>> volume.
> >> > >>>>>>>>
> >> > >>>>>>>> I think we have to distinguish the two kinds of tests:
> >> > >>>>>>>> 1. utests are located in the IO itself and basically they
> >should
> >> > >>>>>>>> cover
> >> > >>>>>>>> the core behaviors of the IO
> >> > >>>>>>>> 2. itests are located as contrib in the IO (they could be
> >part
> >> of
> >> > >>>>>>>> the IO
> >> > >>>>>>>> but executed by the integration-test plugin or a specific
> >> profile)
> >> > >>>>>>>> that
> >> > >>>>>>>> deals with "real" backend and high volumes. The resources
> >> required
> >> > >>>>>>>> by
> >> > >>>>>>>> the itest can be bootstrapped by Jenkins (for instance
> >using
> >> > >>>>>>>> Mesos/Marathon and docker images as already discussed, and
> >it's
> >> > >>>>>>>> what I'm
> >> > >>>>>>>> doing on my own "server").
> >> > >>>>>>>>
> >> > >>>>>>>> It's basically what Stephen described.
> >> > >>>>>>>>
> >> > >>>>>>>> We must not rely only on itests: utests are very important and
> >> > >>>>>>>> they validate the core behavior.
> >> > >>>>>>>>
> >> > >>>>>>>> My $0.01 ;)
> >> > >>>>>>>>
> >> > >>>>>>>> Regards
> >> > >>>>>>>> JB
> >> > >>>>>>>>
> >> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>
> >> > >>>>>>>>> I like your proposition very much and I also agree that
> >docker
> >> +
> >> > >>>>>>>>> some
> >> > >>>>>>>>> orchestration software would be great !
> >> > >>>>>>>>>
> >> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there
> >is
> >> > docker
> >> > >>>>>>>>> container creation scripts and logstash data ingestion
> >script
> >> for
> >> > >>>>>>>>> IT
> >> > >>>>>>>>> environment available in contrib directory alongside with
> >> > >>>>>>>>> integration
> >> > >>>>>>>>> tests themselves. I'll be happy to make them compliant to
> >new
> >> IT
> >> > >>>>>>>>> environment.
> >> > >>>>>>>>>
> >> > >>>>>>>>> What you say below about the need for an external IT environment
> >> > >>>>>>>>> is particularly true. As an example with ES, what came out in
> >> > >>>>>>>>> the first implementation was that there were problems starting
> >> > >>>>>>>>> at some high volume of data (timeouts, ES windowing overflow...)
> >> > >>>>>>>>> that could not have been seen on the embedded ES version. Also
> >> > >>>>>>>>> there were some particularities to the external instance, like
> >> > >>>>>>>>> secondary (replica) shards, that were not visible on the
> >> > >>>>>>>>> embedded instance.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Besides, I also favor bringing up instances before test
> >because
> >> > it
> >> > >>>>>>>>> allows (amongst other things) to be sure to start on a
> >fresh
> >> > >>>>>>>>> dataset
> >> > >>>>>>>>>
> >> > >>>>>>>> for
> >> > >>>>>
> >> > >>>>>> the test to be deterministic.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Etienne
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hi,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I'm excited we're getting lots of discussion going.
> >There are
> >> > many
> >> > >>>>>>>>>> threads
> >> > >>>>>>>>>> of conversation here, we may choose to split some of
> >them off
> >> > >>>>>>>>>> into a
> >> > >>>>>>>>>> different email thread. I'm also betting I missed some
> >of the
> >> > >>>>>>>>>> questions in
> >> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also
> >> apologies
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> amount of text, I provided some quick summaries at the top
> >of
> >> each
> >> > >>>>>>>>>> section.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in
> >detail
> >> below.
> >> > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of
> >work
> >> > >>>>>>>>>> here to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> go
> >> > >>>>>
> >> > >>>>>> around. I'll try and think about how we can divide up some
> >next
> >> > >>>>>>>>>> steps
> >> > >>>>>>>>>> (probably in a separate thread.) The main next step I
> >see is
> >> > >>>>>>>>>> deciding
> >> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm
> >working
> >> on
> >> > >>>>>>>>>> that,
> >> > >>>>>>>>>>
> >> > >>>>>>>>> but
> >> > >>>>>>>>
> >> > >>>>>>>>> having lots of different thoughts on what the
> >> > >>>>>>>>>> advantages/disadvantages
> >> > >>>>>>>>>>
> >> > >>>>>>>>> of
> >> > >>>>>>>>
> >> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the
> >> > >>>>>>>>>> protocol for
> >> > >>>>>>>>>> collaborating on sub-projects like this.)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> These issues are all related to what kind of tests we
> >want to
> >> > >>>>>>>>>> write. I
> >> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all
> >the
> >> use
> >> > >>>>>>>>>> cases
> >> > >>>>>>>>>> we've discussed here (and thus should not block moving
> >forward
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>> this),
> >> > >>>>>>>>>> but understanding what we want to test will help us
> >understand
> >> > >>>>>>>>>> how the
> >> > >>>>>>>>>> cluster will be used. I'm working on a proposed user
> >guide for
> >> > >>>>>>>>>> testing
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>>>
> >> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a
> >short
> >> > >>>>>>>>>> summary
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> the list shortly so folks can get a better sense of where
> >I'm
> >> > >>>>>>>>>> coming
> >> > >>>>>>>>>> from.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Embedded versions of data stores for testing
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
> >> > against.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the
> >various
> >> > data
> >> > >>>>>>>>>> stores.
> >> > >>>>>>>>>> I think we should test everything we possibly can using
> >them,
> >> > >>>>>>>>>> and do
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> majority of our correctness testing using embedded versions
> >+ the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> direct
> >> > >>>>>>
> >> > >>>>>>> runner. However, it's also important to have at least one
> >test
> >> that
> >> > >>>>>>>>>> actually connects to an actual instance, so we can get
> >> coverage
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> like credentials, real connection strings, etc...
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> The key point is that embedded versions definitely can't
> >cover
> >> > the
> >> > >>>>>>>>>> performance tests, so we need to host instances if we
> >want to
> >> > test
> >> > >>>>>>>>>>
> >> > >>>>>>>>> that.
> >> > >>>>>>
> >> > >>>>>>> I consider the integration tests/performance benchmarks to
> >be
> >> > >>>>>>>>>> costly
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> that we do only for the IO transforms with large amounts
> >of
> >> > >>>>>>>>>> community
> >> > >>>>>>>>>> support/usage. A random IO transform used by a few users
> >> doesn't
> >> > >>>>>>>>>> necessarily need integration & perf tests, but for
> >heavily
> >> used
> >> > IO
> >> > >>>>>>>>>> transforms, there's a lot of community value in these
> >tests.
> >> The
> >> > >>>>>>>>>> maintenance proposal below scales with the amount of
> >community
> >> > >>>>>>>>>> support
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> a particular IO transform.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Reusing data stores ("use the data stores across
> >executions.")
> >> > >>>>>>>>>> ------------------
> >> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently
> >used, very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>> instances that we keep up all the time + larger
> >> multi-container
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> store
> >> > >>>>>>>>>> instances that we spin up for perf tests.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I don't think we need to have a strong answer to this
> >> question,
> >> > >>>>>>>>>> but I
> >> > >>>>>>>>>> think
> >> > >>>>>>>>>> we do need to know what range of capabilities we need,
> >and use
> >> > >>>>>>>>>> that to
> >> > >>>>>>>>>> inform our requirements on the hosting infrastructure. I
> >think
> >> > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios
> >I
> >> > discuss
> >> > >>>>>>>>>>
> >> > >>>>>>>>> below.
> >> > >>>>>>
> >> > >>>>>>> I had been thinking of a hybrid approach - reuse some
> >instances
> >> and
> >> > >>>>>>>>>>
> >> > >>>>>>>>> don't
> >> > >>>>>>>>
> >> > >>>>>>>>> reuse others. Some tests require isolation from other
> >tests
> >> (eg.
> >> > >>>>>>>>>> performance benchmarking), while others can easily
> >re-use the
> >> > same
> >> > >>>>>>>>>> database/data store instance over time, provided they
> >are
> >> > >>>>>>>>>> written in
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> correct manner (eg. a simple read or write correctness
> >> integration
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests)
> >> > >>>>>>>>
> >> > >>>>>>>>> To me, the question of whether to use one instance over
> >time
> >> for
> >> > a
> >> > >>>>>>>>>> test vs
> >> > >>>>>>>>>> spin up an instance for each test comes down to a trade
> >off
> >> > >>>>>>>>>> between
> >> > >>>>>>>>>>
> >> > >>>>>>>>> these
> >> > >>>>>>>>
> >> > >>>>>>>>> factors:
> >> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super
> >flaky,
> >> > >>>>>>>>>> we'll
> >> > >>>>>>>>>> want to
> >> > >>>>>>>>>> keep more instances up and running rather than bring
> >them
> >> > up/down.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (this
> >> > >>>>>>
> >> > >>>>>>> may also vary by the data store in question)
> >> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every
> >5
> >> > >>>>>>>>>> minutes, it
> >> > >>>>>>>>>>
> >> > >>>>>>>>> may
> >> > >>>>>>>>
> >> > >>>>>>>>> be wasteful to bring machines up/down every time. If we
> >run
> >> > >>>>>>>>>> tests once
> >> > >>>>>>>>>>
> >> > >>>>>>>>> a
> >> > >>>>>>
> >> > >>>>>>> day or week, it seems wasteful to keep the machines up the
> >whole
> >> > >>>>>>>>>> time.
> >> > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated,
> >it
> >> means
> >> > we
> >> > >>>>>>>>>>
> >> > >>>>>>>>> either
> >> > >>>>>>>>
> >> > >>>>>>>>> have to bring up the instances for each test, or we have
> >to
> >> have
> >> > >>>>>>>>>> some
> >> > >>>>>>>>>> sort
> >> > >>>>>>>>>> of signaling mechanism to indicate that a given instance
> >is in
> >> > >>>>>>>>>> use. I
> >> > >>>>>>>>>> strongly favor bringing up an instance per test.
> >> > >>>>>>>>>> 4. Number/size of containers - if we need a large number
> >of
> >> > >>>>>>>>>> machines
> >> > >>>>>>>>>> for a
> >> > >>>>>>>>>> particular test, keeping them running all the time will
> >use
> >> more
> >> > >>>>>>>>>> resources.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin
> >these
> >> up.
> >> > >>>>>>>>>> I'm
> >> > >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up,
> >but I
> >> > >>>>>>>>>> think the
> >> > >>>>>>>>>> best
> >> > >>>>>>>>>> way to test that is to start doing it.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of
> >very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>
> >> > >>>>>>> store instances that stay up to support small-data-size
> >> post-commit
> >> > >>>>>>>>>> end to
> >> > >>>>>>>>>> end tests (post-commits run frequently and the data size
> >means
> >> > the
> >> > >>>>>>>>>> instances would not use many resources), combined with
> >the
> >> > >>>>>>>>>> ability to
> >> > >>>>>>>>>> spin
> >> > >>>>>>>>>> up larger instances for once a day/week performance
> >benchmarks
> >> > >>>>>>>>>> (these
> >> > >>>>>>>>>>
> >> > >>>>>>>>> use
> >> > >>>>>>>>
> >> > >>>>>>>>> up more resources and are used less frequently.) That's
> >the mix
> >> > >>>>>>>>>> I'll
> >> > >>>>>>>>>> propose in my docs on testing IO transforms.  If
> >spinning up
> >> new
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of
> >spinning up
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> each test.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management ("what's the overhead of managing such a
> >> deployment")
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts
> >for
> >> > >>>>>>>>>> setting up
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>>>
> >> > >>>>>>>>> store instances + integration/perf tests, but if the
> >community
> >> > >>>>>>>>>> doesn't
> >> > >>>>>>>>>> maintain a particular data store's tests, we disable the
> >tests
> >> > and
> >> > >>>>>>>>>> turn off
> >> > >>>>>>>>>> the data store instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management of these instances is a crucial question.
> >First,
> >> > let's
> >> > >>>>>>>>>>
> >> > >>>>>>>>> break
> >> > >>>>>
> >> > >>>>>> down what tasks we'll need to do on a recurring basis:
> >> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both
> >instance
> >> &
> >> > >>>>>>>>>> dependencies) - we don't want to have a lot of old
> >versions
> >> > >>>>>>>>>> vulnerable
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> attacks/buggy
> >> > >>>>>>>>>> 2. Investigate breakages/regressions
> >> > >>>>>>>>>> (I'm betting there will be more things we'll discover -
> >let me
> >> > >>>>>>>>>> know if
> >> > >>>>>>>>>> you
> >> > >>>>>>>>>> have suggestions)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> There's a couple goals I see:
> >> > >>>>>>>>>> 1. We should only do sys admin work for things that give
> >us a
> >> > >>>>>>>>>> lot of
> >> > >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up
> >scripts
> >> for
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> stores
> >> > >>>>>>>>>> without a large community)
> >> > >>>>>>>>>> 2. We should do as much as possible of testing via
> >> > >>>>>>>>>> in-memory/embedded
> >> > >>>>>>>>>> testing (as you brought up).
> >> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> As I discussed above, I think that integration
> >> tests/performance
> >> > >>>>>>>>>> benchmarks
> >> > >>>>>>>>>> are costly things that we should do only for the IO
> >transforms
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>>
> >> > >>>>>>>>> large
> >> > >>>>>>>>
> >> > >>>>>>>>> amounts of community support/usage. Thus, I propose that
> >we
> >> > >>>>>>>>>> limit the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>
> >> > >>>>>>> transforms that get integration tests & performance
> >benchmarks to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> those
> >> > >>>>>
> >> > >>>>>> that have community support for maintaining the data store
> >> > >>>>>>>>>> instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We can enforce this organically using some simple rules:
> >> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> >> > >>>>>>>>>> integration/perf
> >> > >>>>>>>>>>
> >> > >>>>>>>>> test
> >> > >>>>>>
> >> > >>>>>>> starts failing and no one investigates it within a set
> >period of
> >> > >>>>>>>>>> time
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (a
> >> > >>>>>>
> >> > >>>>>>> week?), we disable the tests and shut off the data store
> >> > >>>>>>>>>> instances if
> >> > >>>>>>>>>>
> >> > >>>>>>>>> we
> >> > >>>>>>
> >> > >>>>>>> have instances running. When someone wants to step up and
> >> > >>>>>>>>>> support it
> >> > >>>>>>>>>> again,
> >> > >>>>>>>>>> they can fix the test, check it in, and re-enable the
> >test.
> >> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira
> >issue that
> >> > >>>>>>>>>> is just
> >> > >>>>>>>>>> "is
> >> > >>>>>>>>>> the IO Transform X data store up to date?" - if the jira
> >is
> >> not
> >> > >>>>>>>>>> resolved in
> >> > >>>>>>>>>> a set period of time (1 month?), the perf/integration
> >tests
> >> are
> >> > >>>>>>>>>>
> >> > >>>>>>>>> disabled,
> >> > >>>>>>>>
> >> > >>>>>>>>> and the data store instances shut off.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> This is pretty flexible -
> >> > >>>>>>>>>> * If a particular person or organization wants to
> >support an
> >> IO
> >> > >>>>>>>>>> transform,
> >> > >>>>>>>>>> they can. If a group of people all organically organize
> >to
> >> keep
> >> > >>>>>>>>>> the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests
> >> > >>>>>>>>
> >> > >>>>>>>>> running, they can.
> >> > >>>>>>>>>> * It can be mostly automated - there's not a lot of
> >central
> >> > >>>>>>>>>> organizing
> >> > >>>>>>>>>> work
> >> > >>>>>>>>>> that needs to be done.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Exposing the information about what IO transforms
> >currently
> >> have
> >> > >>>>>>>>>>
> >> > >>>>>>>>> running
> >> > >>>>>>
> >> > >>>>>>> IT/perf benchmarks on the website will let users know what
> >IO
> >> > >>>>>>>>>>
> >> > >>>>>>>>> transforms
> >> > >>>>>>
> >> > >>>>>>> are well supported.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I like this solution, but I also recognize this is a
> >tricky
> >> > >>>>>>>>>> problem.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> This
> >> > >>>>>>>>
> >> > >>>>>>>>> is something the community needs to be supportive of, so
> >I'm
> >> > >>>>>>>>>> open to
> >> > >>>>>>>>>> other
> >> > >>>>>>>>>> thoughts.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests
> >to
> >> > simulate
> >> > >>>>>>>>>> failure")
> >> > >>>>>>>>>> -----------------
> >> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We
> >should
> >> > >>>>>>>>>> encourage a
> >> > >>>>>>>>>> design pattern separating out network/retry logic from
> >the
> >> main
> >> > IO
> >> > >>>>>>>>>> transform logic
> >> > >>>
>
>

Re: Hosting data stores for IO Transform testing

Posted by Stephen Sisk <si...@google.com.INVALID>.
Hi Ismaël,

these are good questions, thanks for raising them.

Ability to modify network/compute resources to simulate failures
=================================================
I see two real questions here:
1. Is this something we want to do?
2. Is it possible with both/either?

So far, the test strategy I've been advocating is that we test problems
like this in unit tests rather than in ITs/perf tests, since it's hard to
re-create the same conditions reliably in a live cluster.

I can investigate whether it's possible, but I want to clarify whether this
is something that we care about. I know both support killing individual
nodes. I haven't seen a lot of network control in either, but haven't tried
to look for it.
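
To make the unit-test approach concrete, here's a minimal sketch of the
kind of seam I have in mind - the interface and names are illustrative,
not existing Beam APIs:

  import java.io.IOException;

  // Hypothetical seam between an IO transform and the network layer.
  interface DataStoreClient {
    String read(String key) throws IOException;
  }

  // A fake that fails a fixed number of times before succeeding, letting
  // a unit test exercise retry handling deterministically.
  class FlakyClient implements DataStoreClient {
    private int failuresLeft;

    FlakyClient(int failures) {
      this.failuresLeft = failures;
    }

    @Override
    public String read(String key) throws IOException {
      if (failuresLeft-- > 0) {
        throw new IOException("simulated network failure");
      }
      return "value-for-" + key;
    }
  }

A test can then assert that a transform survives, say, two failures but
gives up after N, without depending on the timing of a real outage.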

Availability of ready to play packages
============================
I did look at this, and as far as I could tell, mesos didn't have any
pre-built packages for multi-node clusters of data stores. If there's a
good repository of them that we trust, that would definitely save us time.
Can you point me at the mesos repository?

S



On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Ismaël
>
> Stephen will reply with details but I know he did a comparison and
> evaluate different options.
>
> He tested with the jdbc Io itests.
>
> Regards
> JB
>
> On Jan 18, 2017, at 08:26, "Ismaël Mejía" <ie...@gmail.com>
> wrote:
> >Thanks for your analysis Stephen, good arguments / references.
> >
> >One quick question. Have you checked the APIs of both
> >(Mesos/Kubernetes) to see if we can programmatically do more complex
> >tests (I suppose so, but you don't mention how easy or whether those
> >are possible), for example to simulate a slow networking slave (to test
> >stragglers), or to arbitrarily kill one slave (e.g. if I want to test
> >the correct behavior of a runner/IO that is reading from it)?
> >
> >Another point missing from the review is the availability of
> >ready-to-play packages; I think in this area mesos/dcos seems more
> >advanced, no? I haven't looked recently, but at least 6 months ago there
> >were not many helm packages ready, for example to test kafka or the
> >hadoop ecosystem stuff (hdfs, hbase, etc). Has this improved? Preparing
> >these is also a considerable amount of work; on the other hand, it could
> >also be a chance to contribute to kubernetes.
> >
> >Regards,
> >Ismaël
> >
> >
> >
> >On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <si...@google.com.invalid>
> >wrote:
> >
> >> hi!
> >>
> >> I've been continuing this investigation, and have some more info to
> >report,
> >> and hopefully we can start making some decisions.
> >>
> >> To support performance testing, I've been investigating
> >mesos+marathon and
> >> kubernetes for running data stores in their high availability mode. I
> >have
> >> been examining features that kubernetes/mesos+marathon use to support
> >this.
> >>
> >> Setting up a multi-node cluster in a high availability mode tends to
> >be
> >> more expensive time-wise than the single node instances I've played
> >around
> >> with in the past. Rather than do a full build out with both
> >kubernetes and
> >> mesos, I'd like to pick one of the two options to build the prototype
> >> cluster with. If the prototype doesn't go well, we could still go
> >back to
> >> the other option, but I'd like to change us from a mode of "let's
> >look at
> >> all the options" to one of "here's the favorite, let's prove that
> >works for
> >> us".
> >>
> >> Below are the features that I've seen are important to multi-node
> >instances
> >> of data stores. I'm sure other folks on the list have done this
> >before, so
> >> feel free to pipe up if I'm missing a good solution to a problem.
> >>
> >> DNS/Discovery
> >>
> >> --------------------
> >>
> >> Necessary for talking between nodes (eg, cassandra nodes all need to
> >be
> >> able to talk to a set of seed nodes.)
> >>
> >> * Kubernetes has built-in DNS/discovery between nodes.
> >>
> >> * Mesos has supports this via mesos-dns, which isn't a part of core
> >mesos,
> >> but is in dcos, which is the mesos distribution I've been using and
> >that I
> >> would expect us to use.
> >>
> >> Instances properly distributed across nodes
> >>
> >> ------------------------------------------------------------
> >>
> >> If multiple instances of a data source end up on the same underlying
> >VM, we
> >> may not get good performance out of those instances since the
> >underlying VM
> >> may be more taxed than other VMs.
> >>
> >> * Kubernetes has a beta feature StatefulSets[1] which allow for
> >containers
> >> distributed so that there's one container per underlying machine (as
> >well
> >> as a lot of other useful features like easy stable dns names.)
> >>
> >> * Mesos can support this via the built in UNIQUE constraint [2]
> >>
> >> Load balancing
> >>
> >> --------------------
> >>
> >> Incoming requests from users need to be distributed to the various
> >machines
> >> - this is important for many data stores' high availability modes.
> >>
> >> * Kubernetes supports easily hooking up to an external load balancer
> >when
> >> on a cloud (and can be configured to work with a built-in load
> >balancer if
> >> not)
> >>
> >> * Mesos supports this via marathon-lb [3], which is an install-able
> >package
> >> in DC/OS
> >>
> >> Persistent Volumes tied to specific instances
> >>
> >> ------------------------------------------------------------
> >>
> >> Databases often need persistent state (for example to store the data
> >:), so
> >> it's an important part of running our service.
> >>
> >> * Kubernetes StatefulSets supports this
> >>
> >> * Mesos+marathon apps with persistent volumes supports this [4] [5]
> >>
> >> As I mentioned above, I'd like to focus on either kubernetes or mesos
> >for
> >> my investigation, and as I go further along, I'm seeing kubernetes as
> >> better suited to our needs.
> >>
> >> (1) It supports more of the features we want out of the box and with
> >> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
> >> requires marathon-lb to be installed and mesos-dns to be configured.
> >>
> >> (2) I'm also finding that there seem to be more examples of using
> >> kubernetes to solve the types of problems we're working on. This is
> >> somewhat subjective, but in my experience as I've tried to learn both
> >> kubernetes and mesos, I personally found it generally easier to get
> >> kubernetes running than mesos due to the tutorials/examples available
> >for
> >> kubernetes.
> >>
> >> (3) Lower cost of initial setup - as I discussed in a previous
> >mail[6],
> >> kubernetes was far easier to get set up even when I knew the exact
> >steps.
> >> Mesos took me around 27 steps [7], which involved a lot of config
> >that was
> >> easy to get wrong (it took me about 5 tries to get all the steps
> >correct in
> >> one go.) Kubernetes took me around 8 steps and very little config.
> >>
> >> Given that, I'd like to focus my investigation/prototyping on
> >Kubernetes.
> >> To
> >> be clear, it's fairly close and I think both Mesos and Kubernetes
> >could
> >> support what we need, so if we run into issues with kubernetes, Mesos
> >still
> >> seems like a viable option that we could fall back to.
> >>
> >> Thanks,
> >> Stephen
> >>
> >>
> >> [1] Kubernetes StatefulSets
> >>
> >
> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
> >>
> >> [2] mesos unique constraint -
> >> https://mesosphere.github.io/marathon/docs/constraints.html
> >>
> >> [3]
> >> https://mesosphere.github.io/marathon/docs/service-
> >> discovery-load-balancing.html
> >>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
> >>
> >> [4]
> >https://mesosphere.github.io/marathon/docs/persistent-volumes.html
> >>
> >> [5]
> >https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
> >>
> >> [6] Container Orchestration software for hosting data stores
> >> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
> >> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
> >>
> >> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
> >>
> >>
> >> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
> >wrote:
> >>
> >> > Just a quick drive-by comment: how tests are laid out has
> >non-trivial
> >> > tradeoffs on how/where continuous integration runs, and how results
> >are
> >> > integrated into the tooling. The current state is certainly not
> >ideal
> >> > (e.g., due to multiple test executions some links in Jenkins point
> >where
> >> > they shouldn't), but most other alternatives had even bigger
> >drawbacks at
> >> > the time. If someone has great ideas that don't explode the number
> >of
> >> > modules, please share ;-)
> >> >
> >> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
> ><ec...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Stephen,
> >> > >
> >> > > Thanks for taking the time to comment.
> >> > >
> >> > > My comments are below in the email:
> >> > >
> >> > >
> >> > > Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
> >> > >
> >> > >> hey Etienne -
> >> > >>
> >> > >> thanks for your thoughts and thanks for sharing your
> >experiences. I
> >> > >> generally agree with what you're saying. Quick comments below:
> >> > >>
> >> > >> IT are stored alongside with UT in src/test directory of the IO
> >but
> >> they
> >> > >>>
> >> > >> might go to dedicated module, waiting for a consensus
> >> > >> I don't have a strong opinion or feel that I've worked enough
> >with
> >> maven
> >> > >> to
> >> > >> understand all the consequences - I'd love for someone with more
> >maven
> >> > >> experience to weigh in. If this becomes blocking, I'd say check
> >it in,
> >> > and
> >> > >> we can refactor later if it proves problematic.
> >> > >>
> >> > > Sure, not a blocking point, it could be refactored afterwards.
> >Just as
> >> a
> >> > > reminder, JB mentioned that storing IT in a separate module allows
> >> > > us to have more coherence between all IT (same behavior) and to do
> >> > > cross-IO integration tests. JB, have you experienced any long-term
> >> > > drawbacks of storing IT in a separate module, like, for example,
> >> > > more difficult maintenance due to "distance" from production code?
> >> > >
> >> > >
> >> > >>   Also IMHO, it is better that tests load/clean data than to make
> >> > >> assumptions about the running order of the tests.
> >> > >> I definitely agree that we don't want to make assumptions about
> >the
> >> > >> running
> >> > >> order of the tests - that way lies pain. :) It will be
> >interesting to
> >> > see
> >> > >> how the performance tests work out since they will need more
> >data (and
> >> > >> thus
> >> > >> loading data can take much longer.)
> >> > >>
> >> > > Yes, performance testing might push in the direction of data
> >loading
> >> from
> >> > > outside the tests due to loading time.
> >> > >
> >> > >>   This should also be an easier problem
> >> > >> for read tests than for write tests - if we have long running
> >> instances,
> >> > >> read tests don't really need cleanup. And if write tests only
> >write a
> >> > >> small
> >> > >> amount of data, as long as we are sure we're writing to uniquely
> >> > >> identifiable locations (ie, new table per test or something
> >similar),
> >> we
> >> > >> can clean up the write test data on a slower schedule.
> >> > >>
> >> > > I agree
> >> > >
> >> > >>
> >> > >> this will tend to go to the direction of long running data store
> >> > >>>
> >> > >> instances rather than data store instances started (and
> >optionally
> >> > loaded)
> >> > >> before tests.
> >> > >> It may be easiest to start with a "data stores stay running"
> >> > >> implementation, and then if we see issues with that move towards
> >tests
> >> > >> that
> >> > >> start/stop the data stores on each run. One thing I'd like to
> >make
> >> sure
> >> > is
> >> > >> that we're not manually tweaking the configurations for data
> >stores.
> >> One
> >> > >> way we could do that is to destroy/recreate the data stores on a
> >> slower
> >> > >> schedule - maybe once per week. That way if the script is
> >changed or
> >> the
> >> > >> data store instances are changed, we'd be able to detect it
> >relatively
> >> > >> soon
> >> > >> while still removing the need for the tests to manage the data
> >stores.
> >> > >>
> >> > > I agree. In addition to manual configuration tweaking, there might
> >> > > be cases in which a data store re-partitions data during a test, or
> >> > > after some tests as the dataset changes. The IO must be tolerant of
> >> > > that, but the asserts (number of bundles, for example) in a test must
> >> > > not fail in that case.
> >> > > I would also prefer, if possible, that the tests do not manage data
> >> > > stores (do not set them up, start them, or stop them)
> >> > >
> >> > >
> >> > >> as a general note, I suspect many of the folks in the states
> >will be
> >> on
> >> > >> holiday until Jan 2nd/3rd.
> >> > >>
> >> > >> S
> >> > >>
> >> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
> ><echauchot@gmail.com
> >> >
> >> > >> wrote:
> >> > >>
> >> > >> Hi,
> >> > >>>
> >> > >>> Recently we had a discussion about integration tests of IOs.
> >I'm
> >> > >>> preparing a PR for integration tests of the elasticSearch IO
> >> > >>> (
> >> > >>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-E
> >> > >>> LASTICSEARCH-IO
> >> > >>> as a first shot) which are very important IMHO because they
> >helped
> >> > catch
> >> > >>> some bugs that UT could not (volume, data store instance
> >sharing,
> >> real
> >> > >>> data store instance ...)
> >> > >>>
> >> > >>> I would like to have your thoughts/remarks about the points below.
> >Some
> >> of
> >> > >>> these points are also discussed here
> >> > >>>
> >> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
> >> > >>> rQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >> > >>> :
> >> > >>>
> >> > >>> - UT and IT have a similar architecture, but while UT focus on
> >> > >>> testing the correct behavior of the code, including corner cases,
> >> > >>> and use embedded in-memory data stores, IT assume that the behavior
> >> > >>> is correct (strong UT) and focus on higher-volume testing and
> >> > >>> testing against real data store instance(s)
> >> > >>>
> >> > >>> - For now, IT are stored alongside UT in the src/test directory of
> >> > >>> the IO, but they might go to a dedicated module, pending consensus.
> >> > >>> Maven is not configured to run them automatically because the data
> >> > >>> store is not available on the jenkins server yet
> >> > >>>
> >> > >>> - For now, they only use the DirectRunner, but they will be run
> >against
> >> > >>> each runner.
> >> > >>>
> >> > >>> - IT do not set up data store instances (as stated in the above
> >> > >>> document); they assume that one is already running (hardcoded
> >> > >>> configuration in the test for now, waiting for a common solution to
> >> > >>> pass configuration to IT). A docker container script is provided in
> >> > >>> the contrib directory as a starting point for whatever orchestration
> >> > >>> software will be chosen.
> >> > >>>
> >> > >>> - IT load and clean test data before and after each test if
> >> > >>> needed. It is simpler to do so because some tests need an empty
> >> > >>> data store (write tests) and because, as discussed in the document,
> >> > >>> tests might not be the only users of the data store. Also IMHO, it
> >> > >>> is better that tests load/clean data than to make assumptions about
> >> > >>> the running order of the tests; a minimal sketch of this pattern
> >> > >>> follows below.
> >> > >>>
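> >> > >>> A sketch of that load/clean pattern, assuming JUnit 4; TestDataStore
> >> > >>> is a placeholder for a thin client around the running instance, not
> >> > >>> an existing Beam utility:
> >> > >>>
> >> > >>>   import org.junit.After;
> >> > >>>   import org.junit.Before;
> >> > >>>
> >> > >>>   public class ElasticsearchIOIT {
> >> > >>>     private TestDataStore store; // hypothetical helper class
> >> > >>>
> >> > >>>     @Before
> >> > >>>     public void loadTestData() {
> >> > >>>       store = TestDataStore.connect(
> >> > >>>           System.getProperty("dataStoreAddress"));
> >> > >>>       store.ingestStandardDocuments(); // known dataset
> >> > >>>     }
> >> > >>>
> >> > >>>     @After
> >> > >>>     public void cleanTestData() {
> >> > >>>       store.deleteAll(); // leave the instance empty for next test
> >> > >>>     }
> >> > >>>   }
> >> > >>>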
> >> > >>> If we generalize this pattern to all IT tests, this will tend
> >to go
> >> to
> >> > >>> the direction of long running data store instances rather than
> >data
> >> > >>> store instances started (and optionally loaded) before tests.
> >> > >>>
> >> > >>> Besides, if we were to change our minds and load data from outside
> >> > >>> the tests, a logstash script is provided.
> >> > >>>
> >> > >>> If you have any thoughts or remarks I'm all ears :)
> >> > >>>
> >> > >>> Regards,
> >> > >>>
> >> > >>> Etienne
> >> > >>>
> >> > >>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
> >> > >>>
> >> > >>>> Hi Stephen,
> >> > >>>>
> >> > >>>> the purpose of having them in a specific module is to share
> >> > >>>> resources, apply the same behavior from the IT perspective, and be
> >> > >>>> able to have "cross-IO" ITs (for instance, reading from JMS and
> >> > >>>> sending to Kafka; I think that's the key idea for integration
> >> > >>>> tests).
> >> > >>>>
> >> > >>>> For instance, in Karaf, we have:
> >> > >>>> - utest in each module
> >> > >>>> - itest module containing itests for all modules all together
> >> > >>>>
> >> > >>>> Regards
> >> > >>>> JB
> >> > >>>>
> >> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >> > >>>>
> >> > >>>>> Hi Etienne,
> >> > >>>>>
> >> > >>>>> thanks for following up and answering my questions.
> >> > >>>>>
> >> > >>>>> re: where to store integration tests - having them all in a
> >> separate
> >> > >>>>> module
> >> > >>>>> is an interesting idea. I couldn't find JB's comments about
> >moving
> >> > them
> >> > >>>>> into a separate module in the PR - can you share the reasons
> >for
> >> > >>>>> doing so?
> >> > >>>>> The IO integration/perf tests do seem like they'll need to be
> >> > >>>>> treated in a special manner, but given that there is already an
> >> > >>>>> IO-specific module, it may just be that we need to treat all the
> >> > >>>>> ITs in the IO module the same way. I don't have strong opinions
> >> > >>>>> either way right
> >now.
> >> > >>>>>
> >> > >>>>> S
> >> > >>>>>
> >> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
> >> > echauchot@gmail.com>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>> Hi guys,
> >> > >>>>>
> >> > >>>>> @Stephen: I addressed all your comments directly in the PR,
> >thanks!
> >> > >>>>> I just wanted to comment here about the docker image I used:
> >the
> >> only
> >> > >>>>> official Elastic image contains only ElasticSearch. But for
> >> testing I
> >> > >>>>> needed logstash (for ingestion) and kibana (not for
> >integration
> >> > tests,
> >> > >>>>> but to easily test REST requests to ES using sense). This is
> >why I
> >> > use
> >> > >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one
> >isreleased
> >> > >>>>> under
> >> > >>>>> theapache 2 license.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Besides, there is also a point about where to store
> >integration
> >> > tests:
> >> > >>>>> JB proposed in the PR to store integration tests in a dedicated
> >> > >>>>> module rather than directly in the IO module (like I did).
> >> > >>>>>
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Etienne
> >> > >>>>>
> >> > >>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
> >> > >>>>>
> >> > >>>>>> hey!
> >> > >>>>>>
> >> > >>>>>> thanks for sending this. I'm very excited to see this
> >change. I
> >> > >>>>>> added some
> >> > >>>>>> detail-oriented code review comments in addition to what
> >I've
> >> > >>>>>> discussed
> >> > >>>>>> here.
> >> > >>>>>>
> >> > >>>>>> The general goal is to allow for re-usable instantiation of
> >> > particular
> >> > >>>>>>
> >> > >>>>> data
> >> > >>>>>
> >> > >>>>>> store instances and this seems like a good start. Looks like
> >you
> >> > >>>>>> also have
> >> > >>>>>> a script to generate test data for your tests - that's
> >great.
> >> > >>>>>>
> >> > >>>>>> The next steps (definitely not blocking your work) will be
> >to have
> >> > >>>>>> ways to
> >> > >>>>>> create instances from the docker images you have here, and
> >use
> >> them
> >> > >>>>>> in the
> >> > >>>>>> tests. We'll need support in the test framework for that
> >since
> >> it'll
> >> > >>>>>> be
> >> > >>>>>> different on developer machines and in the beam jenkins
> >cluster,
> >> but
> >> > >>>>>> your
> >> > >>>>>> scripts here allow someone running these tests locally not to
> >> > >>>>>> have to worry about getting the instance set up, and to adjust it
> >> > >>>>>> manually, so this is a good incremental step.
> >> > >>>>>>
> >> > >>>>>> I have some thoughts now that I'm reviewing your scripts
> >(that I
> >> > >>>>>> didn't
> >> > >>>>>> have previously, so we are learning this together):
> >> > >>>>>> * It may be useful to try and document why we chose a
> >particular
> >> > >>>>>> docker
> >> > >>>>>> image as the base (ie, "this is the official supported
> >elastic
> >> > search
> >> > >>>>>> docker image" or "this image has several data stores
> >together that
> >> > >>>>>> can be
> >> > >>>>>> used for a couple different tests")  - I'm curious as to
> >whether
> >> the
> >> > >>>>>> community thinks that is important
> >> > >>>>>>
> >> > >>>>>> One thing that I called out in the comment that's worth
> >mentioning
> >> > >>>>>> on the
> >> > >>>>>> larger list - if you want to specify which specific runners
> >a test
> >> > >>>>>> uses,
> >> > >>>>>> that can be controlled in the pom for the module. I updated
> >the
> >> > >>>>>> testing
> >> > >>>>>>
> >> > >>>>> doc
> >> > >>>>>
> >> > >>>>>> mentioned previously in this thread with a TODO to talk
> >about this
> >> > >>>>>> more. I
> >> > >>>>>> think we should also make it so that IO modules have that
> >> > >>>>>> automatically,
> >> > >>>>>>
> >> > >>>>> so
> >> > >>>>>
> >> > >>>>>> developers don't have to worry about it.
> >> > >>>>>>
> >> > >>>>>> S
> >> > >>>>>>
> >> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
> >> > echauchot@gmail.com>
> >> > >>>>>>
> >> > >>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Stephen,
> >> > >>>>>>
> >> > >>>>>> As discussed, I added injection script, docker containers
> >scripts
> >> > and
> >> > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
> >> > >>>>>> <
> >> > >>>>>>
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
> >> > >>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
> >> > >>> d824cefcb3ed0b9
> >> > >>>
> >> > >>>> directory in that PR:
> >> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >> > >>>>>>
> >> > >>>>>> These work well but they are a first shot. Do you have any
> >comments
> >> > >>>>>> about
> >> > >>>>>> those?
> >> > >>>>>>
> >> > >>>>>> Besides I am not very sure that these files should be in the
> >IO
> >> > itself
> >> > >>>>>> (even in the contrib directory, outside the maven source
> >directories). Any
> >> > >>>>>>
> >> > >>>>> thoughts?
> >> > >>>>>
> >> > >>>>>> Thanks,
> >> > >>>>>>
> >> > >>>>>> Etienne
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
> >> > >>>>>>
> >> > >>>>>>> It's great to hear more experiences.
> >> > >>>>>>>
> >> > >>>>>>> I'm also glad to hear that people see real value in the
> >high
> >> > >>>>>>> volume/performance benchmark tests. I tried to capture that
> >in
> >> the
> >> > >>>>>>>
> >> > >>>>>> Testing
> >> > >>>>>
> >> > >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
> >> > >>>>>>>
> >> > >>>>>>> It does generally sound like we're in agreement here. Areas
> >of
> >> > >>>>>>> discussion
> >> > >>>>>>>
> >> > >>>>>> I
> >> > >>>>>>
> >> > >>>>>>> see:
> >> > >>>>>>> 1.  People like the idea of bringing up fresh instances for
> >each
> >> > test
> >> > >>>>>>> rather than keeping instances running all the time, since
> >that
> >> > >>>>>>> ensures no
> >> > >>>>>>> contamination between tests. That seems reasonable to me.
> >If we
> >> see
> >> > >>>>>>> flakiness in the tests, or we note that setting up/tearing down
> >> > >>>>>>> instances is taking a lot of time, we can revisit this.
> >> > >>>>>>> 2. Deciding on cluster management software/orchestration
> >software
> >> > - I
> >> > >>>>>>>
> >> > >>>>>> want
> >> > >>>>>
> >> > >>>>>> to make sure we land on the right tool here since choosing
> >the
> >> > >>>>>>> wrong tool
> >> > >>>>>>> could result in administration of the instances taking more
> >> work. I
> >> > >>>>>>>
> >> > >>>>>> suspect
> >> > >>>>>>
> >> > >>>>>>> that's a good place for a follow up discussion, so I'll
> >start a
> >> > >>>>>>> separate
> >> > >>>>>>> thread on that. I'm happy with whatever tool we choose, but
> >I
> >> want
> >> > to
> >> > >>>>>>>
> >> > >>>>>> make
> >> > >>>>>
> >> > >>>>>> sure we take a moment to consider different options and have
> >a
> >> > >>>>>>> reason for
> >> > >>>>>>> choosing one.
> >> > >>>>>>>
> >> > >>>>>>> Etienne - thanks for being willing to port your
> >creation/other
> >> > >>>>>>> scripts
> >> > >>>>>>> over. You might be a good early tester of whether this
> >system
> >> works
> >> > >>>>>>> well
> >> > >>>>>>> for everyone.
> >> > >>>>>>>
> >> > >>>>>>> Stephen
> >> > >>>>>>>
> >> > >>>>>>> [1]  Reasons for Beam Test Strategy -
> >> > >>>>>>>
> >> > >>>>>>>
> >https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
> >> > >>> rQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >> > >>>
> >> > >>>>
> >> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> >> > >>>>>>> <jb...@nanthrax.net>
> >> > >>>>>>> wrote:
> >> > >>>>>>>
> >> > >>>>>>> I second Etienne there.
> >> > >>>>>>>>
> >> > >>>>>>>> We worked together on the ElasticsearchIO and definitely, the
> >> > >>>>>>>> most valuable tests we did were integration tests with ES on
> >> > >>>>>>>> docker and high volume.
> >> > >>>>>>>>
> >> > >>>>>>>> I think we have to distinguish the two kinds of tests:
> >> > >>>>>>>> 1. utests are located in the IO itself and basically they
> >should
> >> > >>>>>>>> cover
> >> > >>>>>>>> the core behaviors of the IO
> >> > >>>>>>>> 2. itests are located as contrib in the IO (they could be
> >part
> >> of
> >> > >>>>>>>> the IO
> >> > >>>>>>>> but executed by the integration-test plugin or a specific
> >> profile)
> >> > >>>>>>>> that
> >> > >>>>>>>> deals with "real" backend and high volumes. The resources
> >> required
> >> > >>>>>>>> by
> >> > >>>>>>>> the itest can be bootstrapped by Jenkins (for instance
> >using
> >> > >>>>>>>> Mesos/Marathon and docker images as already discussed, and
> >it's
> >> > >>>>>>>> what I'm
> >> > >>>>>>>> doing on my own "server").
> >> > >>>>>>>>
> >> > >>>>>>>> It's basically what Stephen described.
> >> > >>>>>>>>
> >> > >>>>>>>> We must not rely only on itests: utests are very important and
> >> > >>>>>>>> they validate the core behavior.
> >> > >>>>>>>>
> >> > >>>>>>>> My $0.01 ;)
> >> > >>>>>>>>
> >> > >>>>>>>> Regards
> >> > >>>>>>>> JB
> >> > >>>>>>>>
> >> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> >> > >>>>>>>>
> >> > >>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>
> >> > >>>>>>>>> I like your proposition very much and I also agree that
> >docker
> >> +
> >> > >>>>>>>>> some
> >> > >>>>>>>>> orchestration software would be great !
> >> > >>>>>>>>>
> >> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there are
> >> > >>>>>>>>> docker container creation scripts and a logstash data ingestion
> >> > >>>>>>>>> script for the IT environment available in the contrib
> >> > >>>>>>>>> directory, alongside the integration tests themselves. I'll be
> >> > >>>>>>>>> happy to make them compliant with the new IT environment.
> >> > >>>>>>>>>
> >> > >>>>>>>>> What you say below about the need for an external IT
> >> > >>>>>>>>> environment is particularly true. As an example with ES, what
> >> > >>>>>>>>> came out in the first implementation was that there were
> >> > >>>>>>>>> problems starting at some high volume of data (timeouts, ES
> >> > >>>>>>>>> windowing overflow...) that could not have been seen on the
> >> > >>>>>>>>> embedded ES version. Also there were some particularities of an
> >> > >>>>>>>>> external instance, like secondary (replica) shards, that were
> >> > >>>>>>>>> not visible on an embedded instance.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Besides, I also favor bringing up instances before tests
> >> > >>>>>>>>> because it ensures (amongst other things) that we start on a
> >> > >>>>>>>>> fresh dataset, so the test is deterministic.
> >> > >>>>>>>>>
> >> > >>>>>>>>> Etienne
> >> > >>>>>>>>>
> >> > >>>>>>>>>
> >> > >>>>>>>>> Le 23/11/2016 à 02:00, Stephen Sisk a écrit :
> >> > >>>>>>>>>
> >> > >>>>>>>>>> Hi,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I'm excited we're getting lots of discussion going.
> >There are
> >> > many
> >> > >>>>>>>>>> threads
> >> > >>>>>>>>>> of conversation here, we may choose to split some of
> >them off
> >> > >>>>>>>>>> into a
> >> > >>>>>>>>>> different email thread. I'm also betting I missed some
> >of the
> >> > >>>>>>>>>> questions in
> >> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also
> >> apologies
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> amount of text, I provided some quick summaries at the top
> >of
> >> each
> >> > >>>>>>>>>> section.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in
> >detail
> >> below.
> >> > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of
> >work
> >> > >>>>>>>>>> here to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> go
> >> > >>>>>
> >> > >>>>>> around. I'll try and think about how we can divide up some
> >next
> >> > >>>>>>>>>> steps
> >> > >>>>>>>>>> (probably in a separate thread.) The main next step I
> >see is
> >> > >>>>>>>>>> deciding
> >> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm
> >working
> >> on
> >> > >>>>>>>>>> that,
> >> > >>>>>>>>>>
> >> > >>>>>>>>> but
> >> > >>>>>>>>
> >> > >>>>>>>>> having lots of different thoughts on what the
> >> > >>>>>>>>>> advantages/disadvantages
> >> > >>>>>>>>>>
> >> > >>>>>>>>> of
> >> > >>>>>>>>
> >> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the
> >> > >>>>>>>>>> protocol for
> >> > >>>>>>>>>> collaborating on sub-projects like this.)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> These issues are all related to what kind of tests we
> >want to
> >> > >>>>>>>>>> write. I
> >> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all
> >the
> >> use
> >> > >>>>>>>>>> cases
> >> > >>>>>>>>>> we've discussed here (and thus should not block moving
> >forward
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>> this),
> >> > >>>>>>>>>> but understanding what we want to test will help us
> >understand
> >> > >>>>>>>>>> how the
> >> > >>>>>>>>>> cluster will be used. I'm working on a proposed user
> >guide for
> >> > >>>>>>>>>> testing
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>>>
> >> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a
> >short
> >> > >>>>>>>>>> summary
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> the list shortly so folks can get a better sense of where
> >I'm
> >> > >>>>>>>>>> coming
> >> > >>>>>>>>>> from.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Embedded versions of data stores for testing
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
> >> > against.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the
> >various
> >> > data
> >> > >>>>>>>>>> stores.
> >> > >>>>>>>>>> I think we should test everything we possibly can using
> >them,
> >> > >>>>>>>>>> and do
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> majority of our correctness testing using embedded versions
> >+ the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> direct
> >> > >>>>>>
> >> > >>>>>>> runner. However, it's also important to have at least one
> >test
> >> that
> >> > >>>>>>>>>> actually connects to an actual instance, so we can get
> >> coverage
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> like credentials, real connection strings, etc...
> >> > >>>>>>>>>>
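> >> > >>>>>>>>>> As a concrete example of the embedded style for a JDBC-like
> >> > >>>>>>>>>> store - a sketch using H2's in-memory mode (assuming the H2
> >> > >>>>>>>>>> driver is on the classpath), not tied to any particular IO:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   import java.sql.Connection;
> >> > >>>>>>>>>>   import java.sql.DriverManager;
> >> > >>>>>>>>>>   import java.sql.ResultSet;
> >> > >>>>>>>>>>   import java.sql.Statement;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   public class EmbeddedStoreExample {
> >> > >>>>>>>>>>     public static void main(String[] args) throws Exception {
> >> > >>>>>>>>>>       // jdbc:h2:mem: gives a throwaway in-memory database.
> >> > >>>>>>>>>>       try (Connection conn = DriverManager.getConnection(
> >> > >>>>>>>>>>                "jdbc:h2:mem:iotest");
> >> > >>>>>>>>>>            Statement stmt = conn.createStatement()) {
> >> > >>>>>>>>>>         stmt.execute("CREATE TABLE t (id INT)");
> >> > >>>>>>>>>>         stmt.execute("INSERT INTO t VALUES (1), (2)");
> >> > >>>>>>>>>>         // The transform under test would point at this
> >> > >>>>>>>>>>         // connection string and run on the direct runner.
> >> > >>>>>>>>>>         ResultSet rs =
> >> > >>>>>>>>>>             stmt.executeQuery("SELECT COUNT(*) FROM t");
> >> > >>>>>>>>>>         rs.next();
> >> > >>>>>>>>>>         assert rs.getInt(1) == 2;
> >> > >>>>>>>>>>       }
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>>   }
> >> > >>>>>>>>>>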
> >> > >>>>>>>>>> The key point is that embedded versions definitely can't
> >cover
> >> > the
> >> > >>>>>>>>>> performance tests, so we need to host instances if we
> >want to
> >> > test
> >> > >>>>>>>>>>
> >> > >>>>>>>>> that.
> >> > >>>>>>
> >> > >>>>>>> I consider the integration tests/performance benchmarks to
> >be
> >> > >>>>>>>>>> costly
> >> > >>>>>>>>>> things
> >> > >>>>>>>>>> that we do only for the IO transforms with large amounts
> >of
> >> > >>>>>>>>>> community
> >> > >>>>>>>>>> support/usage. A random IO transform used by a few users
> >> doesn't
> >> > >>>>>>>>>> necessarily need integration & perf tests, but for
> >heavily
> >> used
> >> > IO
> >> > >>>>>>>>>> transforms, there's a lot of community value in these
> >tests.
> >> The
> >> > >>>>>>>>>> maintenance proposal below scales with the amount of
> >community
> >> > >>>>>>>>>> support
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> a particular IO transform.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Reusing data stores ("use the data stores across
> >executions.")
> >> > >>>>>>>>>> ------------------
> >> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently
> >used, very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>> instances that we keep up all the time + larger
> >> multi-container
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> store
> >> > >>>>>>>>>> instances that we spin up for perf tests.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I don't think we need to have a strong answer to this
> >> question,
> >> > >>>>>>>>>> but I
> >> > >>>>>>>>>> think
> >> > >>>>>>>>>> we do need to know what range of capabilities we need,
> >and use
> >> > >>>>>>>>>> that to
> >> > >>>>>>>>>> inform our requirements on the hosting infrastructure. I
> >think
> >> > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios
> >I
> >> > discuss
> >> > >>>>>>>>>>
> >> > >>>>>>>>> below.
> >> > >>>>>>
> >> > >>>>>>> I had been thinking of a hybrid approach - reuse some
> >instances
> >> and
> >> > >>>>>>>>>>
> >> > >>>>>>>>> don't
> >> > >>>>>>>>
> >> > >>>>>>>>> reuse others. Some tests require isolation from other
> >tests
> >> (eg.
> >> > >>>>>>>>>> performance benchmarking), while others can easily
> >re-use the
> >> > same
> >> > >>>>>>>>>> database/data store instance over time, provided they
> >are
> >> > >>>>>>>>>> written in
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> correct manner (eg. a simple read or write correctness
> >> integration
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests)
> >> > >>>>>>>>
> >> > >>>>>>>>> To me, the question of whether to use one instance over
> >time
> >> for
> >> > a
> >> > >>>>>>>>>> test vs
> >> > >>>>>>>>>> spin up an instance for each test comes down to a trade
> >off
> >> > >>>>>>>>>> between
> >> > >>>>>>>>>>
> >> > >>>>>>>>> these
> >> > >>>>>>>>
> >> > >>>>>>>>> factors:
> >> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super
> >flaky,
> >> > >>>>>>>>>> we'll
> >> > >>>>>>>>>> want to
> >> > >>>>>>>>>> keep more instances up and running rather than bring
> >them
> >> > up/down.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (this
> >> > >>>>>>
> >> > >>>>>>> may also vary by the data store in question)
> >> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every
> >5
> >> > >>>>>>>>>> minutes, it
> >> > >>>>>>>>>>
> >> > >>>>>>>>> may
> >> > >>>>>>>>
> >> > >>>>>>>>> be wasteful to bring machines up/down every time. If we
> >run
> >> > >>>>>>>>>> tests once
> >> > >>>>>>>>>>
> >> > >>>>>>>>> a
> >> > >>>>>>
> >> > >>>>>>> day or week, it seems wasteful to keep the machines up the
> >whole
> >> > >>>>>>>>>> time.
> >> > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated,
> >it
> >> means
> >> > we
> >> > >>>>>>>>>>
> >> > >>>>>>>>> either
> >> > >>>>>>>>
> >> > >>>>>>>>> have to bring up the instances for each test, or we have
> >to
> >> have
> >> > >>>>>>>>>> some
> >> > >>>>>>>>>> sort
> >> > >>>>>>>>>> of signaling mechanism to indicate that a given instance
> >is in
> >> > >>>>>>>>>> use. I
> >> > >>>>>>>>>> strongly favor bringing up an instance per test.
> >> > >>>>>>>>>> 4. Number/size of containers - if we need a large number
> >of
> >> > >>>>>>>>>> machines
> >> > >>>>>>>>>> for a
> >> > >>>>>>>>>> particular test, keeping them running all the time will
> >use
> >> more
> >> > >>>>>>>>>> resources.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin
> >these
> >> up.
> >> > >>>>>>>>>> I'm
> >> > >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up,
> >but I
> >> > >>>>>>>>>> think the
> >> > >>>>>>>>>> best
> >> > >>>>>>>>>> way to test that is to start doing it.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of
> >very
> >> > >>>>>>>>>> small
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>
> >> > >>>>>>> store instances that stay up to support small-data-size
> >> post-commit
> >> > >>>>>>>>>> end to
> >> > >>>>>>>>>> end tests (post-commits run frequently and the data size
> >means
> >> > the
> >> > >>>>>>>>>> instances would not use many resources), combined with
> >the
> >> > >>>>>>>>>> ability to
> >> > >>>>>>>>>> spin
> >> > >>>>>>>>>> up larger instances for once a day/week performance
> >benchmarks
> >> > >>>>>>>>>> (these
> >> > >>>>>>>>>>
> >> > >>>>>>>>> use
> >> > >>>>>>>>
> >> > >>>>>>>>> up more resources and are used less frequently.) That's
> >the mix
> >> > >>>>>>>>>> I'll
> >> > >>>>>>>>>> propose in my docs on testing IO transforms.  If
> >spinning up
> >> new
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of
> >spinning up
> >> > >>>>>>>>>> instances
> >> > >>>>>>>>>> for
> >> > >>>>>>>>>> each test.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management ("what's the overhead of managing such a
> >> deployment")
> >> > >>>>>>>>>> --------------------
> >> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts
> >for
> >> > >>>>>>>>>> setting up
> >> > >>>>>>>>>>
> >> > >>>>>>>>> data
> >> > >>>>>>>>
> >> > >>>>>>>>> store instances + integration/perf tests, but if the
> >community
> >> > >>>>>>>>>> doesn't
> >> > >>>>>>>>>> maintain a particular data store's tests, we disable the
> >tests
> >> > and
> >> > >>>>>>>>>> turn off
> >> > >>>>>>>>>> the data store instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Management of these instances is a crucial question.
> >First,
> >> > let's
> >> > >>>>>>>>>>
> >> > >>>>>>>>> break
> >> > >>>>>
> >> > >>>>>> down what tasks we'll need to do on a recurring basis:
> >> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both
> >instance
> >> &
> >> > >>>>>>>>>> dependencies) - we don't want to have a lot of old
> >versions
> >> > >>>>>>>>>> vulnerable
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> attacks/buggy
> >> > >>>>>>>>>> 2. Investigate breakages/regressions
> >> > >>>>>>>>>> (I'm betting there will be more things we'll discover -
> >let me
> >> > >>>>>>>>>> know if
> >> > >>>>>>>>>> you
> >> > >>>>>>>>>> have suggestions)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> There's a couple goals I see:
> >> > >>>>>>>>>> 1. We should only do sys admin work for things that give
> >us a
> >> > >>>>>>>>>> lot of
> >> > >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up
> >scripts
> >> for
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> stores
> >> > >>>>>>>>>> without a large community)
> >> > >>>>>>>>>> 2. We should do as much as possible of testing via
> >> > >>>>>>>>>> in-memory/embedded
> >> > >>>>>>>>>> testing (as you brought up).
> >> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> As I discussed above, I think that integration
> >> tests/performance
> >> > >>>>>>>>>> benchmarks
> >> > >>>>>>>>>> are costly things that we should do only for the IO
> >transforms
> >> > >>>>>>>>>> with
> >> > >>>>>>>>>>
> >> > >>>>>>>>> large
> >> > >>>>>>>>
> >> > >>>>>>>>> amounts of community support/usage. Thus, I propose that
> >we
> >> > >>>>>>>>>> limit the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> IO
> >> > >>>>>>
> >> > >>>>>>> transforms that get integration tests & performance
> >benchmarks to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> those
> >> > >>>>>
> >> > >>>>>> that have community support for maintaining the data store
> >> > >>>>>>>>>> instances.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We can enforce this organically using some simple rules:
> >> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> >> > >>>>>>>>>> integration/perf
> >> > >>>>>>>>>>
> >> > >>>>>>>>> test
> >> > >>>>>>
> >> > >>>>>>> starts failing and no one investigates it within a set
> >period of
> >> > >>>>>>>>>> time
> >> > >>>>>>>>>>
> >> > >>>>>>>>> (a
> >> > >>>>>>
> >> > >>>>>>> week?), we disable the tests and shut off the data store
> >> > >>>>>>>>>> instances if
> >> > >>>>>>>>>>
> >> > >>>>>>>>> we
> >> > >>>>>>
> >> > >>>>>>> have instances running. When someone wants to step up and
> >> > >>>>>>>>>> support it
> >> > >>>>>>>>>> again,
> >> > >>>>>>>>>> they can fix the test, check it in, and re-enable the
> >test.
> >> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira
> >issue that
> >> > >>>>>>>>>> is just
> >> > >>>>>>>>>> "is
> >> > >>>>>>>>>> the IO Transform X data store up to date?" - if the jira
> >is
> >> not
> >> > >>>>>>>>>> resolved in
> >> > >>>>>>>>>> a set period of time (1 month?), the perf/integration
> >tests
> >> are
> >> > >>>>>>>>>>
> >> > >>>>>>>>> disabled,
> >> > >>>>>>>>
> >> > >>>>>>>>> and the data store instances shut off.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> This is pretty flexible -
> >> > >>>>>>>>>> * If a particular person or organization wants to
> >support an
> >> IO
> >> > >>>>>>>>>> transform,
> >> > >>>>>>>>>> they can. If a group of people all organically organize
> >to
> >> keep
> >> > >>>>>>>>>> the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> tests
> >> > >>>>>>>>
> >> > >>>>>>>>> running, they can.
> >> > >>>>>>>>>> * It can be mostly automated - there's not a lot of
> >central
> >> > >>>>>>>>>> organizing
> >> > >>>>>>>>>> work
> >> > >>>>>>>>>> that needs to be done.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Exposing the information about what IO transforms
> >currently
> >> have
> >> > >>>>>>>>>>
> >> > >>>>>>>>> running
> >> > >>>>>>
> >> > >>>>>>> IT/perf benchmarks on the website will let users know what
> >IO
> >> > >>>>>>>>>>
> >> > >>>>>>>>> transforms
> >> > >>>>>>
> >> > >>>>>>> are well supported.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> I like this solution, but I also recognize this is a
> >tricky
> >> > >>>>>>>>>> problem.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> This
> >> > >>>>>>>>
> >> > >>>>>>>>> is something the community needs to be supportive of, so
> >I'm
> >> > >>>>>>>>>> open to
> >> > >>>>>>>>>> other
> >> > >>>>>>>>>> thoughts.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests
> >to
> >> > simulate
> >> > >>>>>>>>>> failure")
> >> > >>>>>>>>>> -----------------
> >> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We
> >should
> >> > >>>>>>>>>> encourage a
> >> > >>>>>>>>>> design pattern separating out network/retry logic from
> >the
> >> main
> >> > IO
> >> > >>>>>>>>>> transform logic
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We *could* create instance failure in any container
> >management
> >> > >>>>>>>>>>
> >> > >>>>>>>>> software
> >> > >>>>>
> >> > >>>>>> -
> >> > >>>>>>>>
> >> > >>>>>>>>> we can use their programmatic APIs to determine which
> >> containers
> >> > >>>>>>>>>> are
> >> > >>>>>>>>>> running the instances, and ask them to kill the
> >container in
> >> > >>>>>>>>>> question.
> >> > >>>>>>>>>>
> >> > >>>>>>>>> A
> >> > >>>>>>
> >> > >>>>>>> slow node would be trickier, but I'm sure we could figure
> >it out
> >> > >>>>>>>>>> - for
> >> > >>>>>>>>>> example, add a network proxy that would delay responses.
> >> > >>>>>>>>>>
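> >> > >>>>>>>>>> For reference, the "kill a container" half is little code -
> >> > >>>>>>>>>> a rough sketch assuming the fabric8 kubernetes-client library
> >> > >>>>>>>>>> and an illustrative namespace/pod name:
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   import io.fabric8.kubernetes.client.DefaultKubernetesClient;
> >> > >>>>>>>>>>   import io.fabric8.kubernetes.client.KubernetesClient;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   public class KillNode {
> >> > >>>>>>>>>>     public static void main(String[] args) {
> >> > >>>>>>>>>>       // Deleting a pod simulates an instance failure; the
> >> > >>>>>>>>>>       // controller brings a replacement back up.
> >> > >>>>>>>>>>       try (KubernetesClient client =
> >> > >>>>>>>>>>               new DefaultKubernetesClient()) {
> >> > >>>>>>>>>>         client.pods().inNamespace("io-testing")
> >> > >>>>>>>>>>             .withName("cassandra-0").delete();
> >> > >>>>>>>>>>       }
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>>   }
> >> > >>>>>>>>>>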
> >> > >>>>>>>>>> However, I would argue that this type of testing doesn't
> >gain
> >> > us a
> >> > >>>>>>>>>> lot, and
> >> > >>>>>>>>>> is complicated to set up. I think it will be easier to
> >test
> >> > >>>>>>>>>> network
> >> > >>>>>>>>>> errors
> >> > >>>>>>>>>> and retry behavior in unit tests for the IO transforms.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Part of the way to handle this is to separate out the
> >read
> >> code
> >> > >>>>>>>>>> from
> >> > >>>>>>>>>>
> >> > >>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> network code (eg. bigtable has BigtableService). If you put
> >the
> >> > >>>>>>>>>>
> >> > >>>>>>>>> "handle
> >> > >>>>>
> >> > >>>>>> errors/retry logic" code in a separate MySourceService
> >class,
> >> > >>>>>>>>>> you can
> >> > >>>>>>>>>> test
> >> > >>>>>>>>>> MySourceService on the wide variety of network
> >errors/data
> >> > store
> >> > >>>>>>>>>> problems,
> >> > >>>>>>>>>> and then your main IO transform tests focus on the read
> >> behavior
> >> > >>>>>>>>>> and
> >> > >>>>>>>>>> handling the small set of errors the MySourceService
> >class
> >> will
> >> > >>>>>>>>>>
> >> > >>>>>>>>> return.
> >> > >>>>>
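> >> > >>>>>>>>>> Sketched out, that separation might look like this (the names
> >> > >>>>>>>>>> here are illustrative; BigtableService is the only real
> >> > >>>>>>>>>> precedent mentioned):
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   import java.io.IOException;
> >> > >>>>>>>>>>   import java.util.Arrays;
> >> > >>>>>>>>>>   import java.util.List;
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   // All network/auth/retry concerns live behind this seam;
> >> > >>>>>>>>>>   // the real implementation wraps the store's client library.
> >> > >>>>>>>>>>   interface MySourceService {
> >> > >>>>>>>>>>     List<String> readRecords(String shard) throws IOException;
> >> > >>>>>>>>>>   }
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>   // In unit tests a fake stands in for the network, so the
> >> > >>>>>>>>>>   // transform's tests only handle the narrow set of errors
> >> > >>>>>>>>>>   // the service contract allows.
> >> > >>>>>>>>>>   class FakeSourceService implements MySourceService {
> >> > >>>>>>>>>>     @Override
> >> > >>>>>>>>>>     public List<String> readRecords(String shard) {
> >> > >>>>>>>>>>       return Arrays.asList(shard + "-0", shard + "-1");
> >> > >>>>>>>>>>     }
> >> > >>>>>>>>>>   }
> >> > >>>>>>>>>>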
> >> > >>>>>> I also think we should focus on testing the IO Transform,
> >not
> >> > >>>>>>>>>> the data
> >> > >>>>>>>>>> store - if we kill a node in a data store, it's that
> >data
> >> > store's
> >> > >>>>>>>>>> problem,
> >> > >>>>>>>>>> not beam's problem. As you were pointing out, there are
> >a
> >> > *large*
> >> > >>>>>>>>>> number of
> >> > >>>>>>>>>> possible ways that a particular data store can fail, and
> >we
> >> > >>>>>>>>>> would like
> >> > >>>>>>>>>>
> >> > >>>>>>>>> to
> >> > >>>>>>>>
> >> > >>>>>>>>> support many different data stores. Rather than try to
> >test
> >> that
> >> > >>>>>>>>>> each
> >> > >>>>>>>>>> data
> >> > >>>>>>>>>> store behaves well, we should ensure that we handle
> >> > >>>>>>>>>> generic/expected
> >> > >>>>>>>>>> errors
> >> > >>>>>>>>>> in a graceful manner.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Ismaël had a couple of other quick comments/questions; I'll
> >> > >>>>>>>>>> answer them here -
> >> > >>>>>
> >> > >>>>>> We can use this to test other runners running on multiple
> >> > >>>>>>>>>> machines - I
> >> > >>>>>>>>>> agree. This is also necessary for a good performance
> >benchmark
> >> > >>>>>>>>>> test.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> "providing the test machines to mount the cluster" - we
> >can
> >> > >>>>>>>>>> discuss
> >> > >>>>>>>>>>
> >> > >>>>>>>>> this
> >> > >>>>>>
> >> > >>>>>>> further, but one possible option is that google may be
> >willing to
> >> > >>>>>>>>>>
> >> > >>>>>>>>> donate
> >> > >>>>>>
> >> > >>>>>>> something to support this.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in
> >> another
> >> > >>>>>>>>>>
> >> > >>>>>>>>> thread.
> >> > >>>>>>
> >> > >>>>>>> That's as much about the public interface we provide to
> >users as
> >> > >>>>>>>>>>
> >> > >>>>>>>>> anything
> >> > >>>>>>>>
> >> > >>>>>>>>> else. I agree with your sentiment that a user should be
> >able to
> >> > >>>>>>>>>> expect
> >> > >>>>>>>>>> predictable behavior from the different IO transforms.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Thanks for everyone's questions/comments - I really am
> >excited
> >> > >>>>>>>>>> to see
> >> > >>>>>>>>>> that
> >> > >>>>>>>>>> people care about this :)
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Stephen
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <
> >> iemejia@gmail.com
> >> > >
> >> > >>>>>>>>>>
> >> > >>>>>>>>> wrote:
> >> > >>>>>
> >> > >>>>>> Hello,
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really
> >interesting,
> >> I
> >> > >>>>>>>>>>> would
> >> > >>>>>>>>>>> really
> >> > >>>>>>>>>>> like to help with this. I have never played with
> >Kubernetes
> >> but
> >> > >>>>>>>>>>> this
> >> > >>>>>>>>>>> seems
> >> > >>>>>>>>>>> a really nice chance to do something useful with it.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
> >> > container
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> images
> >> > >>>>>>>>
> >> > >>>>>>>>> and in some particular cases ‘clusters’ of containers
> >using
> >> > >>>>>>>>>>> docker-compose
> >> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be
> >really
> >> > >>>>>>>>>>> nice to
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> have
> >> > >>>>>>>>
> >> > >>>>>>>>> this at the Beam level, in particular to try to test more
> >> complex
> >> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is
> >to
> >> > achieve
> >> > >>>>>>>>>>> this for
> >> > >>>>>>>>>>> example:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka
> >nodes, I
> >> > >>>>>>>>>>> would
> >> > >>>>>>>>>>> like to
> >> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill
> >a
> >> node),
> >> > >>>>>>>>>>> or
> >> > >>>>>>>>>>> simulate
> >> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as
> >expected
> >> > >>>>>>>>>>> in the
> >> > >>>>>>>>>>> Beam
> >> > >>>>>>>>>>> pipeline for the given runner.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Another related idea is to improve IO consistency:
> >Today the
> >> > >>>>>>>>>>> different IOs
> >> > >>>>>>>>>>> have small differences in their failure behavior, I
> >really
> >> > >>>>>>>>>>> would like
> >> > >>>>>>>>>>> to be
> >> > >>>>>>>>>>> able to predict with more precision what will happen in
> >case
> >> of
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> errors,
> >> > >>>>>>
> >> > >>>>>>> e.g. what is the correct behavior if I am writing to a
> >Kafka
> >> > >>>>>>>>>>> node and
> >> > >>>>>>>>>>> there
> >> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or
> >no ?
> >> and
> >> > >>>>>>>>>>> what
> >> > >>>>>>>>>>> if it
> >> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming
> >> > >>>>>>>>>>> checkpointing?
> >> > >>>>>>>>>>> Or do
> >> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am
> >not
> >> sure
> >> > >>>>>>>>>>> about
> >> > >>>>>>>>>>> what
> >> > >>>>>>>>>>> happens (or if the expected behavior depends on the
> >runner),
> >> > >>>>>>>>>>> but well
> >> > >>>>>>>>>>> maybe
> >> > >>>>>>>>>>> it is just that I don’t know and we have tests to
> >ensure
> >> this.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Of course both are really hard problems, but I think
> >with
> >> your
> >> > >>>>>>>>>>> proposal we
> >> > >>>>>>>>>>> can try to tackle them, as well as the performance
> >ones. And
> >> > >>>>>>>>>>> apart of
> >> > >>>>>>>>>>> the
> >> > >>>>>>>>>>> data stores, I think it will be also really nice to be
> >able
> >> to
> >> > >>>>>>>>>>> test
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>> the
> >> > >>>>>>
> >> > >>>>>>> runners in a distributed manner.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> So what is the next step? How do you imagine such
> >integration
> >> > >>>>>>>>>>> tests?
> >> > >>>>>>>>>>> ? Who
> >> > >>>>>>>>>>> can provide the test machines so we can mount the
> >cluster?
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial
> >setup,
> >> but
> >> > >>>>>>>>>>> it
> >> > >>>>>>>>>>> will be
> >> > >>>>>>>>>>> really nice to start working on this.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Ismaël
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <
> >> > >>>>>>>>>>> amitsela33@gmail.com
> >> > >>>>>>>>>>> wrote:
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> I was wondering about how we plan to use the data
> >stores
> >> > across
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> executions.
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> Clearly, it's best to setup a new instance (container)
> >for
> >> > every
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> test,
> >> > >>>>>>
> >> > >>>>>>> running a "standalone" store (say HBase/Cassandra for
> >> > >>>>>>>>>>>> example), and
> >> > >>>>>>>>>>>> once
> >> > >>>>>>>>>>>> the test is done, teardown the instance. It should
> >also be
> >> > >>>>>>>>>>>> agnostic
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> to
> >> > >>>>>>
> >> > >>>>>>> the
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes).
> >> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing
> >such a
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> deployment
> >> > >>>>>>
> >> > >>>>>>> which could become heavy and complicated as more IOs are
> >> > >>>>>>>>>>>> supported
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> and
> >> > >>>>>>
> >> > >>>>>>> more
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> test cases introduced.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Another way to go would be to have small clusters of
> >> different
> >> > >>>>>>>>>>>> data
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> stores
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> and run against new "namespaces" (while lazily
> >evicting old
> >> > >>>>>>>>>>>> ones),
> >> > >>>>>>>>>>>> but I
> >> > >>>>>>>>>>>> think this is less likely as maintaining a distributed
> >> > instance
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> (even
> >> > >>>>>
> >> > >>>>>> a
> >> > >>>>>>>>
> >> > >>>>>>>>> small one) for each data store sounds even more complex.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> A third approach would be to to simply have an
> >"embedded"
> >> > >>>>>>>>>>>> in-memory
> >> > >>>>>>>>>>>> instance of a data store as part of a test that runs
> >against
> >> > it
> >> > >>>>>>>>>>>> (such as
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> an
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> embedded Kafka, though not a data store).
> >> > >>>>>>>>>>>> This is probably the simplest solution in terms of
> >> > >>>>>>>>>>>> orchestration,
> >> > >>>>>>>>>>>> but it
> >> > >>>>>>>>>>>> depends on having a proper "embedded" implementation
> >for an
> >> > IO.
> >> > >>>>>>>>>>>>
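As a concrete (and hypothetical) illustration of this third approach for a JDBC-style IO: a test can run entirely against an in-memory H2 database, assuming the H2 driver is on the test classpath (table and URL names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class EmbeddedJdbcReadTest {
      public static void main(String[] args) throws Exception {
        // Purely in-memory database; it exists only for this JVM.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:h2:mem:iotest;DB_CLOSE_DELAY=-1");
             Statement stmt = conn.createStatement()) {
          stmt.execute("CREATE TABLE beam_test(id INT, name VARCHAR(64))");
          stmt.execute("INSERT INTO beam_test VALUES (1, 'a'), (2, 'b')");

          // A real test would point the IO transform's connection string at
          // this URL and assert on the resulting PCollection; here we only
          // show that the embedded store answers queries.
          try (ResultSet rs =
                   stmt.executeQuery("SELECT COUNT(*) FROM beam_test")) {
            rs.next();
            if (rs.getInt(1) != 2) {
              throw new AssertionError("expected 2 rows");
            }
          }
        }
      }
    }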
> >> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ?
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>> Amit
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>> jb@nanthrax.net
> >> > >>>>>
> >> > >>>>>> wrote:
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Hi Stephen,
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great
> >!
> >> > >>>>>>>>>>>>> Especially I
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>> like
> >> > >>>>>>>>>>>
> >> > >>>>>>>>>>>> it as a both integration test platform and good
> >coverage for
> >> > >>>>>>>>>>>>> IOs.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>

Re: Hosting data stores for IO Transform testing

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Ismaël

Stephen will reply with details, but I know he did a comparison and evaluated different options.

He tested with the JdbcIO itests.

Regards
JB

On Jan 18, 2017, at 08:26, "Ismaël Mejía" <ie...@gmail.com> wrote:
>Thanks for your analysis Stephen, good arguments / references.
>
>One quick question. Have you checked the APIs of both
>(Mesos/Kubernetes) to
>see
>if we can programmatically do more complex tests (I suppose so, but
>you
>don't mention how easy or if those are possible), for example to
>simulate a
>slow networking slave (to test stragglers), or to arbitrarily kill one
>slave (e.g. if I want to test the correct behavior of a runner/IO that
>is
>reading from it) ?
>
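Both systems do expose this programmatically; against Kubernetes the simplest version is for the test to shell out to kubectl, along these lines (a rough sketch; the pod name is hypothetical):

    import java.util.concurrent.TimeUnit;

    public class FailureInjection {

      // Deletes one pod of the data store under test; Kubernetes reschedules
      // it, which looks like a node crash to the IO transform being tested.
      static void killOnePod(String podName) throws Exception {
        Process p = new ProcessBuilder(
                "kubectl", "delete", "pod", podName, "--grace-period=0", "--force")
            .inheritIO()
            .start();
        if (!p.waitFor(60, TimeUnit.SECONDS) || p.exitValue() != 0) {
          throw new IllegalStateException("could not kill pod " + podName);
        }
      }

      public static void main(String[] args) throws Exception {
        killOnePod("cassandra-1"); // hypothetical StatefulSet pod name
      }
    }

A slow node is harder; one option is to front a node with a delaying proxy, though as discussed elsewhere in the thread that case may be better covered by unit tests of the retry logic.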
>Another missing point in the review is the availability of
>ready-to-play packages; I think in this area mesos/dcos seems more
>advanced, no? I haven't looked recently, but at least 6 months ago
>there were not many helm packages ready, for example to test kafka or
>the hadoop ecosystem stuff (hdfs, hbase, etc). Has this been improved?
>Preparing these packages is also a considerable amount of work; on the
>other hand, it could also be a chance to contribute to kubernetes.
>
>Regards,
>Ismaël
>
>
>
>On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <si...@google.com.invalid>
>wrote:
>
>> hi!
>>
>> I've been continuing this investigation, and have some more info to
>report,
>> and hopefully we can start making some decisions.
>>
>> To support performance testing, I've been investigating
>mesos+marathon and
>> kubernetes for running data stores in their high availability mode. I
>have
>> been examining features that kubernetes/mesos+marathon use to support
>this.
>>
>> Setting up a multi-node cluster in a high availability mode tends to
>be
>> more expensive time-wise than the single node instances I've played
>around
>> with in the past. Rather than do a full build out with both
>kubernetes and
>> mesos, I'd like to pick one of the two options to build the prototype
>> cluster with. If the prototype doesn't go well, we could still go
>back to
>> the other option, but I'd like to change us from a mode of "let's
>look at
>> all the options" to one of "here's the favorite, let's prove that
>works for
>> us".
>>
>> Below are the features that I've seen are important to multi-node
>instances
>> of data stores. I'm sure other folks on the list have done this
>before, so
>> feel free to pipe up if I'm missing a good solution to a problem.
>>
>> DNS/Discovery
>>
>> --------------------
>>
>> Necessary for talking between nodes (eg, cassandra nodes all need to
>be
>> able to talk to a set of seed nodes.)
>>
>> * Kubernetes has built-in DNS/discovery between nodes.
>>
>> * Mesos supports this via mesos-dns, which isn't a part of core
>mesos,
>> but is in dcos, which is the mesos distribution I've been using and
>that I
>> would expect us to use.
>>
>> Instances properly distributed across nodes
>>
>> ------------------------------------------------------------
>>
>> If multiple instances of a data source end up on the same underlying
>VM, we
>> may not get good performance out of those instances since the
>underlying VM
>> may be more taxed than other VMs.
>>
>> * Kubernetes has a beta feature StatefulSets[1] which allows containers
>> to be distributed so that there's one container per underlying machine (as
>well
>> as a lot of other useful features like easy stable dns names.)
>>
>> * Mesos can support this via the built in UNIQUE constraint [2]
>>
>> Load balancing
>>
>> --------------------
>>
>> Incoming requests from users need to be distributed to the various
>machines
>> - this is important for many data stores' high availability modes.
>>
>> * Kubernetes supports easily hooking up to an external load balancer
>when
>> on a cloud (and can be configured to work with a built-in load
>balancer if
>> not)
>>
>> * Mesos supports this via marathon-lb [3], which is an install-able
>package
>> in DC/OS
>>
>> Persistent Volumes tied to specific instances
>>
>> ------------------------------------------------------------
>>
>> Databases often need persistent state (for example to store the data
>:), so
>> it's an important part of running our service.
>>
>> * Kubernetes StatefulSets supports this
>>
>> * Mesos+marathon apps with persistent volumes supports this [4] [5]
>>
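One concrete consequence of StatefulSets for these tests: each pod gets a stable, predictable DNS name of the form <set>-<ordinal>.<service>, so a test can address individual nodes deterministically. A small hypothetical check (invented names, run from inside the cluster):

    import java.net.InetAddress;

    public class StatefulSetDnsCheck {
      public static void main(String[] args) throws Exception {
        // Assuming a 3-node StatefulSet named "cassandra" governed by a
        // headless service also named "cassandra": pod N is reachable at
        // cassandra-N.cassandra from within the same namespace.
        for (int i = 0; i < 3; i++) {
          String host = "cassandra-" + i + ".cassandra";
          System.out.println(host + " -> "
              + InetAddress.getByName(host).getHostAddress());
        }
      }
    }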
>> As I mentioned above, I'd like to focus on either kubernetes or mesos
>for
>> my investigation, and as I go further along, I'm seeing kubernetes as
>> better suited to our needs.
>>
>> (1) It supports more of the features we want out of the box and with
>> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
>> requires marathon-lb to be installed and mesos-dns to be configured.
>>
>> (2) I'm also finding that there seem to be more examples of using
>> kubernetes to solve the types of problems we're working on. This is
>> somewhat subjective, but in my experience as I've tried to learn both
>> kubernetes and mesos, I personally found it generally easier to get
>> kubernetes running than mesos due to the tutorials/examples available
>for
>> kubernetes.
>>
>> (3) Lower cost of initial setup - as I discussed in a previous
>mail[6],
>> kubernetes was far easier to get set up even when I knew the exact
>steps.
>> Mesos took me around 27 steps [7], which involved a lot of config
>that was
>> easy to get wrong (it took me about 5 tries to get all the steps
>correct in
>> one go.) Kubernetes took me around 8 steps and very little config.
>>
>> Given that, I'd like to focus my investigation/prototyping on
>Kubernetes.
>> To
>> be clear, it's fairly close and I think both Mesos and Kubernetes
>could
>> support what we need, so if we run into issues with kubernetes, Mesos
>still
>> seems like a viable option that we could fall back to.
>>
>> Thanks,
>> Stephen
>>
>>
>> [1] Kubernetes StatefulSets
>>
>https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
>>
>> [2] mesos unique constraint -
>> https://mesosphere.github.io/marathon/docs/constraints.html
>>
>> [3]
>> https://mesosphere.github.io/marathon/docs/service-
>> discovery-load-balancing.html
>>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>>
>> [4]
>https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>>
>> [5]
>https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>>
>> [6] Container Orchestration software for hosting data stores
>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
>> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>
>> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
>>
>>
>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
>wrote:
>>
>> > Just a quick drive-by comment: how tests are laid out has
>non-trivial
>> > tradeoffs on how/where continuous integration runs, and how results
>are
>> > integrated into the tooling. The current state is certainly not
>ideal
>> > (e.g., due to multiple test executions some links in Jenkins point
>where
>> > they shouldn't), but most other alternatives had even bigger
>drawbacks at
>> > the time. If someone has great ideas that don't explode the number
>of
>> > modules, please share ;-)
>> >
>> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
><ec...@gmail.com>
>> > wrote:
>> >
>> > > Hi Stephen,
>> > >
>> > > Thanks for taking the time to comment.
>> > >
>> > > My comments are bellow in the email:
>> > >
>> > >
>> > > On 24/12/2016 at 00:07, Stephen Sisk wrote:
>> > >
>> > >> hey Etienne -
>> > >>
>> > >> thanks for your thoughts and thanks for sharing your
>experiences. I
>> > >> generally agree with what you're saying. Quick comments below:
>> > >>
>> > >> IT are stored alongside with UT in src/test directory of the IO
>but
>> they
>> > >>>
>> > >> might go to dedicated module, waiting for a consensus
>> > >> I don't have a strong opinion or feel that I've worked enough
>with
>> maven
>> > >> to
>> > >> understand all the consequences - I'd love for someone with more
>maven
>> > >> experience to weigh in. If this becomes blocking, I'd say check
>it in,
>> > and
>> > >> we can refactor later if it proves problematic.
>> > >>
>> > > Sure, not a blocking point, it could be refactored afterwards.
>Just as
>> a
>> > > reminder, JB mentioned that storing IT in separate module allows
>to
>> have
>> > > more coherence between all IT (same behavior) and to do cross IO
>> > > integration tests. JB, have you experienced some long term
>drawbacks of
>> > > storing IT in a separate module, like, for example, more
>difficult
>> > > maintenance due to "distance" with production code?
>> > >
>> > >
>> > >>   Also IMHO, it is better that tests load/clean data than doing
>some
>> > >>>
>> > >> assumptions about the running order of the tests.
>> > >> I definitely agree that we don't want to make assumptions about
>the
>> > >> running
>> > >> order of the tests - that way lies pain. :) It will be
>interesting to
>> > see
>> > >> how the performance tests work out since they will need more
>data (and
>> > >> thus
>> > >> loading data can take much longer.)
>> > >>
>> > > Yes, performance testing might push in the direction of data
>loading
>> from
>> > > outside the tests due to loading time.
>> > >
>> > >>   This should also be an easier problem
>> > >> for read tests than for write tests - if we have long running
>> instances,
>> > >> read tests don't really need cleanup. And if write tests only
>write a
>> > >> small
>> > >> amount of data, as long as we are sure we're writing to uniquely
>> > >> identifiable locations (ie, new table per test or something
>similar),
>> we
>> > >> can clean up the write test data on a slower schedule.
>> > >>
>> > > I agree
>> > >
>> > >>
>> > >> this will tend to go to the direction of long running data store
>> > >>>
>> > >> instances rather than data store instances started (and
>optionally
>> > loaded)
>> > >> before tests.
>> > >> It may be easiest to start with a "data stores stay running"
>> > >> implementation, and then if we see issues with that move towards
>tests
>> > >> that
>> > >> start/stop the data stores on each run. One thing I'd like to
>make
>> sure
>> > is
>> > >> that we're not manually tweaking the configurations for data
>stores.
>> One
>> > >> way we could do that is to destroy/recreate the data stores on a
>> slower
>> > >> schedule - maybe once per week. That way if the script is
>changed or
>> the
>> > >> data store instances are changed, we'd be able to detect it
>relatively
>> > >> soon
>> > >> while still removing the need for the tests to manage the data
>stores.
>> > >>
>> > > I agree. In addition to configuration manual tweaking, there
>might be
>> > > cases in which a data store re-partition data during a test or
>after
>> some
>> > > tests while the dataset changes. The IO must be tolerant to that
>but
>> the
>> > > asserts (number of bundles for example) in test must not fail in
>that
>> > case.
>> > > I would also prefer if possible that the tests do not manage data
>> stores
>> > > (not setup them, not start them, not stop them)
>> > >
>> > >
>> > >> as a general note, I suspect many of the folks in the states
>will be
>> on
>> > >> holiday until Jan 2nd/3rd.
>> > >>
>> > >> S
>> > >>
>> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
><echauchot@gmail.com
>> >
>> > >> wrote:
>> > >>
>> > >> Hi,
>> > >>>
>> > >>> Recently we had a discussion about integration tests of IOs.
>I'm
>> > >>> preparing a PR for integration tests of the elasticSearch IO
>> > >>> (
>> > >>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-E
>> > >>> LASTICSEARCH-IO
>> > >>> as a first shot) which are very important IMHO because they
>helped
>> > catch
>> > >>> some bugs that UT could not (volume, data store instance
>sharing,
>> real
>> > >>> data store instance ...)
>> > >>>
>> > >>> I would like to have your thoughts/remarks about points bellow.
>Some
>> of
>> > >>> these points are also discussed here
>> > >>>
>> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
>> > >>> rQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
>> > >>> :
>> > >>>
>> > >>> - UT and IT have a similar architecture, but while UT focus on
>> testing
>> > >>> the correct behavior of the code including corner cases and use
>> > embedded
>> > >>> in memory data store, IT assume that the behavior is correct
>(strong
>> > UT)
>> > >>> and focus on higher volume testing and testing against real
>data
>> store
>> > >>> instance(s)
>> > >>>
>> > >>> - For now, IT are stored alongside with UT in src/test
>directory of
>> the
>> > >>> IO but they might go to dedicated module, waiting for a
>consensus.
>> > Maven
>> > >>> is not configured to run them automatically because data store
>is not
>> > >>> available on jenkins server yet
>> > >>>
>> > >>> - For now, they only use DirectRunner, but they will  be run
>against
>> > >>> each runner.
>> > >>>
>> > >>> - IT do not setup data store instance (like stated in the above
>> > >>> document) they assume that one is already running (hardcoded
>> > >>> configuration in test for now, waiting for a common solution to
>pass
>> > >>> configuration to IT). A docker container script is provided in
>the
>> > >>> contrib directory as a starting point to whatever orchestration
>> > software
>> > >>> will be chosen.
>> > >>>
>> > >>> - IT load and clean test data before and after each test if
>needed.
>> It
>> > >>> is simpler to do so because some tests need empty data store
>(write
>> > >>> test) and because, as discussed in the document, tests might
>not be
>> the
>> > >>> only users of the data store. Also IMHO, it is better that
>tests
>> > >>> load/clean data than doing some assumptions about the running
>order
>> of
>> > >>> the tests.
>> > >>>
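A sketch of that load/clean pattern in JUnit, using a unique index name per run so tests stay deterministic even on a shared, long-running instance (all names are hypothetical):

    import java.util.UUID;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class ElasticsearchIOIT {
      // Unique per run: concurrent users of a shared instance can't collide.
      private final String index =
          "beam_it_" + UUID.randomUUID().toString().replace("-", "");

      @Before
      public void loadTestData() {
        // Create `index` on the running instance and bulk-load the dataset
        // this test expects (e.g. via the REST client or a logstash script).
      }

      @After
      public void cleanTestData() {
        // Delete only what this run created, whatever the test outcome.
      }

      @Test
      public void testRead() {
        // Run the pipeline against `index` and assert on the results, with
        // no assumptions about test ordering or pre-existing data.
      }
    }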
>> > >>> If we generalize this pattern to all IT tests, this will tend
>to go
>> to
>> > >>> the direction of long running data store instances rather than
>data
>> > >>> store instances started (and optionally loaded) before tests.
>> > >>>
>> > >>> Besides if we where to change our minds and load data from
>outside
>> the
>> > >>> tests, a logstash script is provided.
>> > >>>
>> > >>> If you have any thoughts or remarks I'm all ears :)
>> > >>>
>> > >>> Regards,
>> > >>>
>> > >>> Etienne
>> > >>>
>> > >>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
>> > >>>
>> > >>>> Hi Stephen,
>> > >>>>
>> > >>>> the purpose of having them in a specific module is to share
>resources and
>> > >>>> apply the same behavior from IT perspective and be able to
>have IT
>> > >>>> "cross" IO (for instance, reading from JMS and sending to
>Kafka, I
>> > >>>> think that's the key idea for integration tests).
>> > >>>>
>> > >>>> For instance, in Karaf, we have:
>> > >>>> - utest in each module
>> > >>>> - itest module containing itests for all modules all together
>> > >>>>
>> > >>>> Regards
>> > >>>> JB
>> > >>>>
>> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
>> > >>>>
>> > >>>>> Hi Etienne,
>> > >>>>>
>> > >>>>> thanks for following up and answering my questions.
>> > >>>>>
>> > >>>>> re: where to store integration tests - having them all in a
>> separate
>> > >>>>> module
>> > >>>>> is an interesting idea. I couldn't find JB's comments about
>moving
>> > them
>> > >>>>> into a separate module in the PR - can you share the reasons
>for
>> > >>>>> doing so?
>> > >>>>> The IO integration/perf tests are different, so it does seem like they'll
>need to
>> be
>> > >>>>> treated in a special manner, but given that there is already
>an IO
>> > >>>>> specific
>> > >>>>> module, it may just be that we need to treat all the ITs in
>the IO
>> > >>>>> module
>> > >>>>> the same way. I don't have strong opinions either way right
>now.
>> > >>>>>
>> > >>>>> S
>> > >>>>>
>> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
>> > echauchot@gmail.com>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>> Hi guys,
>> > >>>>>
>> > >>>>> @Stephen: I addressed all your comments directly in the PR,
>thanks!
>> > >>>>> I just wanted to comment here about the docker image I used:
>the
>> only
>> > >>>>> official Elastic image contains only ElasticSearch. But for
>> testing I
>> > >>>>> needed logstash (for ingestion) and kibana (not for
>integration
>> > tests,
>> > >>>>> but to easily test REST requests to ES using sense). This is
>why I
>> > use
>> > >>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one is released
>> > >>>>> under the Apache 2 license.
>> > >>>>>
>> > >>>>>
>> > >>>>> Besides, there is also a point about where to store
>integration
>> > tests:
>> > >>>>> JB proposed in the PR to store integration tests to dedicated
>> module
>> > >>>>> rather than directly in the IO module (like I did).
>> > >>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>> Etienne
>> > >>>>>
>> > >>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
>> > >>>>>
>> > >>>>>> hey!
>> > >>>>>>
>> > >>>>>> thanks for sending this. I'm very excited to see this
>change. I
>> > >>>>>> added some
>> > >>>>>> detail-oriented code review comments in addition to what
>I've
>> > >>>>>> discussed
>> > >>>>>> here.
>> > >>>>>>
>> > >>>>>> The general goal is to allow for re-usable instantiation of
>> > particular
>> > >>>>>>
>> > >>>>> data
>> > >>>>>
>> > >>>>>> store instances and this seems like a good start. Looks like
>you
>> > >>>>>> also have
>> > >>>>>> a script to generate test data for your tests - that's
>great.
>> > >>>>>>
>> > >>>>>> The next steps (definitely not blocking your work) will be
>to have
>> > >>>>>> ways to
>> > >>>>>> create instances from the docker images you have here, and
>use
>> them
>> > >>>>>> in the
>> > >>>>>> tests. We'll need support in the test framework for that
>since
>> it'll
>> > >>>>>> be
>> > >>>>>> different on developer machines and in the beam jenkins
>cluster,
>> but
>> > >>>>>> your
>> > >>>>>> scripts here allow someone running these tests locally to
>not have
>> > to
>> > >>>>>>
>> > >>>>> worry
>> > >>>>>
>> > >>>>>> about getting the instance set up and can manually adjust,
>so this
>> > is
>> > >>>>>> a
>> > >>>>>> good incremental step.
>> > >>>>>>
>> > >>>>>> I have some thoughts now that I'm reviewing your scripts
>(that I
>> > >>>>>> didn't
>> > >>>>>> have previously, so we are learning this together):
>> > >>>>>> * It may be useful to try and document why we chose a
>particular
>> > >>>>>> docker
>> > >>>>>> image as the base (ie, "this is the official supported
>elastic
>> > search
>> > >>>>>> docker image" or "this image has several data stores
>together that
>> > >>>>>> can be
>> > >>>>>> used for a couple different tests")  - I'm curious as to
>whether
>> the
>> > >>>>>> community thinks that is important
>> > >>>>>>
>> > >>>>>> One thing that I called out in the comment that's worth
>mentioning
>> > >>>>>> on the
>> > >>>>>> larger list - if you want to specify which specific runners
>a test
>> > >>>>>> uses,
>> > >>>>>> that can be controlled in the pom for the module. I updated
>the
>> > >>>>>> testing
>> > >>>>>>
>> > >>>>> doc
>> > >>>>>
>> > >>>>>> mentioned previously in this thread with a TODO to talk
>about this
>> > >>>>>> more. I
>> > >>>>>> think we should also make it so that IO modules have that
>> > >>>>>> automatically,
>> > >>>>>>
>> > >>>>> so
>> > >>>>>
>> > >>>>>> developers don't have to worry about it.
>> > >>>>>>
>> > >>>>>> S
>> > >>>>>>
>> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
>> > echauchot@gmail.com>
>> > >>>>>>
>> > >>>>> wrote:
>> > >>>>>
>> > >>>>>> Stephen,
>> > >>>>>>
>> > >>>>>> As discussed, I added injection script, docker containers
>scripts
>> > and
>> > >>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
>> > >>>>>> <
>> > >>>>>>
>> > >>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
>> > >>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
>> > >>> d824cefcb3ed0b9
>> > >>>
>> > >>>> directory in that PR:
>> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
>> > >>>>>>
>> > >>>>>> These work well but they are first shot. Do you have any
>comments
>> > >>>>>> about
>> > >>>>>> those?
>> > >>>>>>
>> > >>>>>> Besides I am not very sure that these files should be in the
>IO
>> > itself
>> > >>>>>> (even in contrib directory, out of maven source
>directories). Any
>> > >>>>>>
>> > >>>>> thoughts?
>> > >>>>>
>> > >>>>>> Thanks,
>> > >>>>>>
>> > >>>>>> Etienne
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
>> > >>>>>>
>> > >>>>>>> It's great to hear more experiences.
>> > >>>>>>>
>> > >>>>>>> I'm also glad to hear that people see real value in the
>high
>> > >>>>>>> volume/performance benchmark tests. I tried to capture that
>in
>> the
>> > >>>>>>>
>> > >>>>>> Testing
>> > >>>>>
>> > >>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
>> > >>>>>>>
>> > >>>>>>> It does generally sound like we're in agreement here. Areas
>of
>> > >>>>>>> discussion
>> > >>>>>>>
>> > >>>>>> I
>> > >>>>>>
>> > >>>>>>> see:
>> > >>>>>>> 1.  People like the idea of bringing up fresh instances for
>each
>> > test
>> > >>>>>>> rather than keeping instances running all the time, since
>that
>> > >>>>>>> ensures no
>> > >>>>>>> contamination between tests. That seems reasonable to me.
>If we
>> see
>> > >>>>>>> flakiness in the tests or we note that setting up/tearing
>down
>> > >>>>>>> instances
>> > >>>>>>>
>> > >>>>>> is
>> > >>>>>>
>> > >>>>>>> taking a lot of time,
>> > >>>>>>> 2. Deciding on cluster management software/orchestration
>software
>> > - I
>> > >>>>>>>
>> > >>>>>> want
>> > >>>>>
>> > >>>>>> to make sure we land on the right tool here since choosing
>the
>> > >>>>>>> wrong tool
>> > >>>>>>> could result in administration of the instances taking more
>> work. I
>> > >>>>>>>
>> > >>>>>> suspect
>> > >>>>>>
>> > >>>>>>> that's a good place for a follow up discussion, so I'll
>start a
>> > >>>>>>> separate
>> > >>>>>>> thread on that. I'm happy with whatever tool we choose, but
>I
>> want
>> > to
>> > >>>>>>>
>> > >>>>>> make
>> > >>>>>
>> > >>>>>> sure we take a moment to consider different options and have
>a
>> > >>>>>>> reason for
>> > >>>>>>> choosing one.
>> > >>>>>>>
>> > >>>>>>> Etienne - thanks for being willing to port your
>creation/other
>> > >>>>>>> scripts
>> > >>>>>>> over. You might be a good early tester of whether this
>system
>> works
>> > >>>>>>> well
>> > >>>>>>> for everyone.
>> > >>>>>>>
>> > >>>>>>> Stephen
>> > >>>>>>>
>> > >>>>>>> [1]  Reasons for Beam Test Strategy -
>> > >>>>>>>
>> > >>>>>>>
>https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
>> > >>> rQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>> > >>>
>> > >>>>
>> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
>> > >>>>>>> <jb...@nanthrax.net>
>> > >>>>>>> wrote:
>> > >>>>>>>
>> > >>>>>>> I second Etienne there.
>> > >>>>>>>>
>> > >>>>>>>> We worked together on the ElasticsearchIO and definitely,
>the
>> high
>> > >>>>>>>> valuable test we did were integration tests with ES on
>docker
>> and
>> > >>>>>>>> high
>> > >>>>>>>> volume.
>> > >>>>>>>>
>> > >>>>>>>> I think we have to distinguish the two kinds of tests:
>> > >>>>>>>> 1. utests are located in the IO itself and basically they
>should
>> > >>>>>>>> cover
>> > >>>>>>>> the core behaviors of the IO
>> > >>>>>>>> 2. itests are located as contrib in the IO (they could be
>part
>> of
>> > >>>>>>>> the IO
>> > >>>>>>>> but executed by the integration-test plugin or a specific
>> profile)
>> > >>>>>>>> that
>> > >>>>>>>> deals with "real" backend and high volumes. The resources
>> required
>> > >>>>>>>> by
>> > >>>>>>>> the itest can be bootstrapped by Jenkins (for instance
>using
>> > >>>>>>>> Mesos/Marathon and docker images as already discussed, and
>it's
>> > >>>>>>>> what I'm
>> > >>>>>>>> doing on my own "server").
>> > >>>>>>>>
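On the Java side, one way to keep the two kinds of tests apart (complementary to the integration-test plugin/profile wiring JB mentions) is a JUnit category that the build can include or exclude; a sketch with hypothetical names:

    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    /** Marker for tests that need a real backend and high volumes. */
    interface NeedsRealBackend {}

    public class MyIOTest {

      @Test
      public void utestCoreBehavior() {
        // Runs everywhere: exercises core behavior on the embedded store.
      }

      @Test
      @Category(NeedsRealBackend.class)
      public void itestAgainstRealCluster() {
        // Only runs when the build is told to include NeedsRealBackend,
        // e.g. by a maven profile that sets the surefire/failsafe groups.
      }
    }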
>> > >>>>>>>> It's basically what Stephen described.
>> > >>>>>>>>
>> > >>>>>>>> We have to not relay only on itest: utests are very
>important
>> and
>> > >>>>>>>> they
>> > >>>>>>>> validate the core behavior.
>> > >>>>>>>>
>> > >>>>>>>> My $0.01 ;)
>> > >>>>>>>>
>> > >>>>>>>> Regards
>> > >>>>>>>> JB
>> > >>>>>>>>
>> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
>> > >>>>>>>>
>> > >>>>>>>>> Hi Stephen,
>> > >>>>>>>>>
>> > >>>>>>>>> I like your proposition very much and I also agree that
>docker
>> +
>> > >>>>>>>>> some
>> > >>>>>>>>> orchestration software would be great !
>> > >>>>>>>>>
>> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there
>is
>> > docker
>> > >>>>>>>>> container creation scripts and logstash data ingestion
>script
>> for
>> > >>>>>>>>> IT
>> > >>>>>>>>> environment available in contrib directory alongside with
>> > >>>>>>>>> integration
>> > >>>>>>>>> tests themselves. I'll be happy to make them compliant to
>new
>> IT
>> > >>>>>>>>> environment.
>> > >>>>>>>>>
>> > >>>>>>>>> What you say bellow about the need for external IT
>environment
>> is
>> > >>>>>>>>> particularly true. As an example with ES what came out in
>first
>> > >>>>>>>>> implementation was that there were problems starting at
>some
>> high
>> > >>>>>>>>>
>> > >>>>>>>> volume
>> > >>>>>
>> > >>>>>> of data (timeouts, ES windowing overflow...) that could not
>have
>> be
>> > >>>>>>>>>
>> > >>>>>>>> seen
>> > >>>>>
>> > >>>>>> on embedded ES version. Also there where some
>particularities to
>> > >>>>>>>>> external instance like secondary (replica) shards that
>where
>> not
>> > >>>>>>>>>
>> > >>>>>>>> visible
>> > >>>>>
>> > >>>>>> on embedded instance.
>> > >>>>>>>>>
>> > >>>>>>>>> Besides, I also favor bringing up instances before test
>because
>> > it
>> > >>>>>>>>> allows (amongst other things) to be sure to start on a
>fresh
>> > >>>>>>>>> dataset
>> > >>>>>>>>>
>> > >>>>>>>> for
>> > >>>>>
>> > >>>>>> the test to be deterministic.
>> > >>>>>>>>>
>> > >>>>>>>>> Etienne
>> > >>>>>>>>>
>> > >>>>>>>>>
>> > >>>>>>>>> On 23/11/2016 at 02:00, Stephen Sisk wrote:
>> > >>>>>>>>>
>> > >>>>>>>>>> Hi,
>> > >>>>>>>>>>
>> > >>>>>>>>>> I'm excited we're getting lots of discussion going.
>There are
>> > many
>> > >>>>>>>>>> threads
>> > >>>>>>>>>> of conversation here, we may choose to split some of
>them off
>> > >>>>>>>>>> into a
>> > >>>>>>>>>> different email thread. I'm also betting I missed some
>of the
>> > >>>>>>>>>> questions in
>> > >>>>>>>>>> this thread, so apologies ahead of time for that. Also
>> apologies
>> > >>>>>>>>>> for
>> > >>>>>>>>>>
>> > >>>>>>>>> the
>> > >>>>>>
>> > >>>>>>> amount of text, I provided some quick summaries at the top
>of
>> each
>> > >>>>>>>>>> section.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in
>detail
>> below.
>> > >>>>>>>>>> Ismael - thanks for offering to help. There's plenty of
>work
>> > >>>>>>>>>> here to
>> > >>>>>>>>>>
>> > >>>>>>>>> go
>> > >>>>>
>> > >>>>>> around. I'll try and think about how we can divide up some
>next
>> > >>>>>>>>>> steps
>> > >>>>>>>>>> (probably in a separate thread.) The main next step I
>see is
>> > >>>>>>>>>> deciding
>> > >>>>>>>>>> between kubernetes/mesos+marathon/docker swarm - I'm
>working
>> on
>> > >>>>>>>>>> that,
>> > >>>>>>>>>>
>> > >>>>>>>>> but
>> > >>>>>>>>
>> > >>>>>>>>> having lots of different thoughts on what the
>> > >>>>>>>>>> advantages/disadvantages
>> > >>>>>>>>>>
>> > >>>>>>>>> of
>> > >>>>>>>>
>> > >>>>>>>>> those are would be helpful (I'm not entirely sure of the
>> > >>>>>>>>>> protocol for
>> > >>>>>>>>>> collaborating on sub-projects like this.)
>> > >>>>>>>>>>
>> > >>>>>>>>>> These issues are all related to what kind of tests we
>want to
>> > >>>>>>>>>> write. I
>> > >>>>>>>>>> think a kubernetes/mesos/swarm cluster could support all
>the
>> use
>> > >>>>>>>>>> cases
>> > >>>>>>>>>> we've discussed here (and thus should not block moving
>forward
>> > >>>>>>>>>> with
>> > >>>>>>>>>> this),
>> > >>>>>>>>>> but understanding what we want to test will help us
>understand
>> > >>>>>>>>>> how the
>> > >>>>>>>>>> cluster will be used. I'm working on a proposed user
>guide for
>> > >>>>>>>>>> testing
>> > >>>>>>>>>>
>> > >>>>>>>>> IO
>> > >>>>>>>>
>> > >>>>>>>>> Transforms, and I'm going to send out a link to that + a
>short
>> > >>>>>>>>>> summary
>> > >>>>>>>>>>
>> > >>>>>>>>> to
>> > >>>>>>>>
>> > >>>>>>>>> the list shortly so folks can get a better sense of where
>I'm
>> > >>>>>>>>>> coming
>> > >>>>>>>>>> from.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
>> > >>>>>>>>>>
>> > >>>>>>>>>> Embedded versions of data stores for testing
>> > >>>>>>>>>> --------------------
>> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
>> > against.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the
>various
>> > data
>> > >>>>>>>>>> stores.
>> > >>>>>>>>>> I think we should test everything we possibly can using
>them,
>> > >>>>>>>>>> and do
>> > >>>>>>>>>>
>> > >>>>>>>>> the
>> > >>>>>>
>> > >>>>>>> majority of our correctness testing using embedded versions
>+ the
>> > >>>>>>>>>>
>> > >>>>>>>>> direct
>> > >>>>>>
>> > >>>>>>> runner. However, it's also important to have at least one
>test
>> that
>> > >>>>>>>>>> actually connects to an actual instance, so we can get
>> coverage
>> > >>>>>>>>>> for
>> > >>>>>>>>>> things
>> > >>>>>>>>>> like credentials, real connection strings, etc...
>> > >>>>>>>>>>
>> > >>>>>>>>>> The key point is that embedded versions definitely can't
>cover
>> > the
>> > >>>>>>>>>> performance tests, so we need to host instances if we
>want to
>> > test
>> > >>>>>>>>>>
>> > >>>>>>>>> that.
>> > >>>>>>
>> > >>>>>>> I consider the integration tests/performance benchmarks to
>be
>> > >>>>>>>>>> costly
>> > >>>>>>>>>> things
>> > >>>>>>>>>> that we do only for the IO transforms with large amounts
>of
>> > >>>>>>>>>> community
>> > >>>>>>>>>> support/usage. A random IO transform used by a few users
>> doesn't
>> > >>>>>>>>>> necessarily need integration & perf tests, but for
>heavily
>> used
>> > IO
>> > >>>>>>>>>> transforms, there's a lot of community value in these
>tests.
>> The
>> > >>>>>>>>>> maintenance proposal below scales with the amount of
>community
>> > >>>>>>>>>> support
>> > >>>>>>>>>> for
>> > >>>>>>>>>> a particular IO transform.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Reusing data stores ("use the data stores across
>executions.")
>> > >>>>>>>>>> ------------------
>> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently
>used, very
>> > >>>>>>>>>> small
>> > >>>>>>>>>> instances that we keep up all the time + larger
>> multi-container
>> > >>>>>>>>>> data
>> > >>>>>>>>>> store
>> > >>>>>>>>>> instances that we spin up for perf tests.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I don't think we need to have a strong answer to this
>> question,
>> > >>>>>>>>>> but I
>> > >>>>>>>>>> think
>> > >>>>>>>>>> we do need to know what range of capabilities we need,
>and use
>> > >>>>>>>>>> that to
>> > >>>>>>>>>> inform our requirements on the hosting infrastructure. I
>think
>> > >>>>>>>>>> kubernetes/mesos + docker can support all the scenarios
>I
>> > discuss
>> > >>>>>>>>>>
>> > >>>>>>>>> below.
>> > >>>>>>
>> > >>>>>>> I had been thinking of a hybrid approach - reuse some
>instances
>> and
>> > >>>>>>>>>>
>> > >>>>>>>>> don't
>> > >>>>>>>>
>> > >>>>>>>>> reuse others. Some tests require isolation from other
>tests
>> (eg.
>> > >>>>>>>>>> performance benchmarking), while others can easily
>re-use the
>> > same
>> > >>>>>>>>>> database/data store instance over time, provided they
>are
>> > >>>>>>>>>> written in
>> > >>>>>>>>>>
>> > >>>>>>>>> the
>> > >>>>>>
>> > >>>>>>> correct manner (eg. a simple read or write correctness
>> integration
>> > >>>>>>>>>>
>> > >>>>>>>>> tests)
>> > >>>>>>>>
>> > >>>>>>>>> To me, the question of whether to use one instance over
>time
>> for
>> > a
>> > >>>>>>>>>> test vs
>> > >>>>>>>>>> spin up an instance for each test comes down to a trade
>off
>> > >>>>>>>>>> between
>> > >>>>>>>>>>
>> > >>>>>>>>> these
>> > >>>>>>>>
>> > >>>>>>>>> factors:
>> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super
>flaky,
>> > >>>>>>>>>> we'll
>> > >>>>>>>>>> want to
>> > >>>>>>>>>> keep more instances up and running rather than bring
>them
>> > up/down.
>> > >>>>>>>>>>
>> > >>>>>>>>> (this
>> > >>>>>>
>> > >>>>>>> may also vary by the data store in question)
>> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every
>5
>> > >>>>>>>>>> minutes, it
>> > >>>>>>>>>>
>> > >>>>>>>>> may
>> > >>>>>>>>
>> > >>>>>>>>> be wasteful to bring machines up/down every time. If we
>run
>> > >>>>>>>>>> tests once
>> > >>>>>>>>>>
>> > >>>>>>>>> a
>> > >>>>>>
>> > >>>>>>> day or week, it seems wasteful to keep the machines up the
>whole
>> > >>>>>>>>>> time.
>> > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated,
>it
>> means
>> > we
>> > >>>>>>>>>>
>> > >>>>>>>>> either
>> > >>>>>>>>
>> > >>>>>>>>> have to bring up the instances for each test, or we have
>to
>> have
>> > >>>>>>>>>> some
>> > >>>>>>>>>> sort
>> > >>>>>>>>>> of signaling mechanism to indicate that a given instance
>is in
>> > >>>>>>>>>> use. I
>> > >>>>>>>>>> strongly favor bringing up an instance per test.
>> > >>>>>>>>>> 4. Number/size of containers - if we need a large number
>of
>> > >>>>>>>>>> machines
>> > >>>>>>>>>> for a
>> > >>>>>>>>>> particular test, keeping them running all the time will
>use
>> more
>> > >>>>>>>>>> resources.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin
>these
>> up.
>> > >>>>>>>>>> I'm
>> > >>>>>>>>>> hopeful/assuming they'll be pretty stable to bring up,
>but I
>> > >>>>>>>>>> think the
>> > >>>>>>>>>> best
>> > >>>>>>>>>> way to test that is to start doing it.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of
>very
>> > >>>>>>>>>> small
>> > >>>>>>>>>>
>> > >>>>>>>>> data
>> > >>>>>>
>> > >>>>>>> store instances that stay up to support small-data-size
>> post-commit
>> > >>>>>>>>>> end to
>> > >>>>>>>>>> end tests (post-commits run frequently and the data size
>means
>> > the
>> > >>>>>>>>>> instances would not use many resources), combined with
>the
>> > >>>>>>>>>> ability to
>> > >>>>>>>>>> spin
>> > >>>>>>>>>> up larger instances for once a day/week performance
>benchmarks
>> > >>>>>>>>>> (these
>> > >>>>>>>>>>
>> > >>>>>>>>> use
>> > >>>>>>>>
>> > >>>>>>>>> up more resources and are used less frequently.) That's
>the mix
>> > >>>>>>>>>> I'll
>> > >>>>>>>>>> propose in my docs on testing IO transforms.  If
>spinning up
>> new
>> > >>>>>>>>>> instances
>> > >>>>>>>>>> is cheap/non-flaky, I'd be fine with the idea of
>spinning up
>> > >>>>>>>>>> instances
>> > >>>>>>>>>> for
>> > >>>>>>>>>> each test.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Management ("what's the overhead of managing such a
>> deployment")
>> > >>>>>>>>>> --------------------
>> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts
>for
>> > >>>>>>>>>> setting up
>> > >>>>>>>>>>
>> > >>>>>>>>> data
>> > >>>>>>>>
>> > >>>>>>>>> store instances + integration/perf tests, but if the
>community
>> > >>>>>>>>>> doesn't
>> > >>>>>>>>>> maintain a particular data store's tests, we disable the
>tests
>> > and
>> > >>>>>>>>>> turn off
>> > >>>>>>>>>> the data store instances.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Management of these instances is a crucial question.
>First,
>> > let's
>> > >>>>>>>>>>
>> > >>>>>>>>> break
>> > >>>>>
>> > >>>>>> down what tasks we'll need to do on a recurring basis:
>> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both
>instance
>> &
>> > >>>>>>>>>> dependencies) - we don't want to have a lot of old
>versions
>> > >>>>>>>>>> vulnerable
>> > >>>>>>>>>>
>> > >>>>>>>>> to
>> > >>>>>>>>
>> > >>>>>>>>> attacks/buggy
>> > >>>>>>>>>> 2. Investigate breakages/regressions
>> > >>>>>>>>>> (I'm betting there will be more things we'll discover -
>let me
>> > >>>>>>>>>> know if
>> > >>>>>>>>>> you
>> > >>>>>>>>>> have suggestions)
>> > >>>>>>>>>>
>> > >>>>>>>>>> There's a couple goals I see:
>> > >>>>>>>>>> 1. We should only do sys admin work for things that give
>us a
>> > >>>>>>>>>> lot of
>> > >>>>>>>>>> benefit. (ie, don't build IT/perf/data store set up
>scripts
>> for
>> > >>>>>>>>>> data
>> > >>>>>>>>>> stores
>> > >>>>>>>>>> without a large community)
>> > >>>>>>>>>> 2. We should do as much as possible of testing via
>> > >>>>>>>>>> in-memory/embedded
>> > >>>>>>>>>> testing (as you brought up).
>> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
>> > >>>>>>>>>>
>> > >>>>>>>>>> As I discussed above, I think that integration
>> tests/performance
>> > >>>>>>>>>> benchmarks
>> > >>>>>>>>>> are costly things that we should do only for the IO
>transforms
>> > >>>>>>>>>> with
>> > >>>>>>>>>>
>> > >>>>>>>>> large
>> > >>>>>>>>
>> > >>>>>>>>> amounts of community support/usage. Thus, I propose that
>we
>> > >>>>>>>>>> limit the
>> > >>>>>>>>>>
>> > >>>>>>>>> IO
>> > >>>>>>
>> > >>>>>>> transforms that get integration tests & performance
>benchmarks to
>> > >>>>>>>>>>
>> > >>>>>>>>> those
>> > >>>>>
>> > >>>>>> that have community support for maintaining the data store
>> > >>>>>>>>>> instances.
>> > >>>>>>>>>>
>> > >>>>>>>>>> We can enforce this organically using some simple rules:
>> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
>> > >>>>>>>>>> integration/perf
>> > >>>>>>>>>>
>> > >>>>>>>>> test
>> > >>>>>>
>> > >>>>>>> starts failing and no one investigates it within a set
>period of
>> > >>>>>>>>>> time
>> > >>>>>>>>>>
>> > >>>>>>>>> (a
>> > >>>>>>
>> > >>>>>>> week?), we disable the tests and shut off the data store
>> > >>>>>>>>>> instances if
>> > >>>>>>>>>>
>> > >>>>>>>>> we
>> > >>>>>>
>> > >>>>>>> have instances running. When someone wants to step up and
>> > >>>>>>>>>> support it
>> > >>>>>>>>>> again,
>> > >>>>>>>>>> they can fix the test, check it in, and re-enable the
>test.
>> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira
>issue that
>> > >>>>>>>>>> is just
>> > >>>>>>>>>> "is
>> > >>>>>>>>>> the IO Transform X data store up to date?" - if the jira
>is
>> not
>> > >>>>>>>>>> resolved in
>> > >>>>>>>>>> a set period of time (1 month?), the perf/integration
>tests
>> are
>> > >>>>>>>>>>
>> > >>>>>>>>> disabled,
>> > >>>>>>>>
>> > >>>>>>>>> and the data store instances shut off.
>> > >>>>>>>>>>
>> > >>>>>>>>>> This is pretty flexible -
>> > >>>>>>>>>> * If a particular person or organization wants to
>support an
>> IO
>> > >>>>>>>>>> transform,
>> > >>>>>>>>>> they can. If a group of people all organically organize
>to
>> keep
>> > >>>>>>>>>> the
>> > >>>>>>>>>>
>> > >>>>>>>>> tests
>> > >>>>>>>>
>> > >>>>>>>>> running, they can.
>> > >>>>>>>>>> * It can be mostly automated - there's not a lot of
>central
>> > >>>>>>>>>> organizing
>> > >>>>>>>>>> work
>> > >>>>>>>>>> that needs to be done.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Exposing the information about what IO transforms
>currently
>> have
>> > >>>>>>>>>>
>> > >>>>>>>>> running
>> > >>>>>>
>> > >>>>>>> IT/perf benchmarks on the website will let users know what
>IO
>> > >>>>>>>>>>
>> > >>>>>>>>> transforms
>> > >>>>>>
>> > >>>>>>> are well supported.
>> > >>>>>>>>>>
>> > >>>>>>>>>> I like this solution, but I also recognize this is a
>tricky
>> > >>>>>>>>>> problem.
>> > >>>>>>>>>>
>> > >>>>>>>>> This
>> > >>>>>>>>
>> > >>>>>>>>> is something the community needs to be supportive of, so
>I'm
>> > >>>>>>>>>> open to
>> > >>>>>>>>>> other
>> > >>>>>>>>>> thoughts.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests
>to
>> > simulate
>> > >>>>>>>>>> failure")
>> > >>>>>>>>>> -----------------
>> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We
>should
>> > >>>>>>>>>> encourage a
>> > >>>>>>>>>> design pattern separating out network/retry logic from
>the
>> main
>> > IO
>> > >>>>>>>>>> transform logic
>> > >>>>>>>>>>
>> > >>>>>>>>>> We *could* create instance failure in any container
>management
>> > >>>>>>>>>>
>> > >>>>>>>>> software
>> > >>>>>
>> > >>>>>> -
>> > >>>>>>>>
>> > >>>>>>>>> we can use their programmatic APIs to determine which
>> containers
>> > >>>>>>>>>> are
>> > >>>>>>>>>> running the instances, and ask them to kill the
>container in
>> > >>>>>>>>>> question.
>> > >>>>>>>>>>
>> > >>>>>>>>> A
>> > >>>>>>
>> > >>>>>>> slow node would be trickier, but I'm sure we could figure
>it out
>> > >>>>>>>>>> - for
>> > >>>>>>>>>> example, add a network proxy that would delay responses.
>> > >>>>>>>>>>
>> > >>>>>>>>>> However, I would argue that this type of testing doesn't
>gain
>> > us a
>> > >>>>>>>>>> lot, and
>> > >>>>>>>>>> is complicated to set up. I think it will be easier to
>test
>> > >>>>>>>>>> network
>> > >>>>>>>>>> errors
>> > >>>>>>>>>> and retry behavior in unit tests for the IO transforms.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Part of the way to handle this is to separate out the
>read
>> code
>> > >>>>>>>>>> from
>> > >>>>>>>>>>
>> > >>>>>>>>> the
>> > >>>>>>
>> > >>>>>>> network code (eg. bigtable has BigtableService). If you put
>the
>> > >>>>>>>>>>
>> > >>>>>>>>> "handle
>> > >>>>>
>> > >>>>>> errors/retry logic" code in a separate MySourceService
>class,
>> > >>>>>>>>>> you can
>> > >>>>>>>>>> test
>> > >>>>>>>>>> MySourceService on the wide variety of networks
>errors/data
>> > store
>> > >>>>>>>>>> problems,
>> > >>>>>>>>>> and then your main IO transform tests focus on the read
>> behavior
>> > >>>>>>>>>> and
>> > >>>>>>>>>> handling the small set of errors the MySourceService
>class
>> will
>> > >>>>>>>>>>
>> > >>>>>>>>> return.
>> > >>>>>
>> > >>>>>> I also think we should focus on testing the IO Transform,
>not
>> > >>>>>>>>>> the data
>> > >>>>>>>>>> store - if we kill a node in a data store, it's that
>data
>> > store's
>> > >>>>>>>>>> problem,
>> > >>>>>>>>>> not beam's problem. As you were pointing out, there are
>a
>> > *large*
>> > >>>>>>>>>> number of
>> > >>>>>>>>>> possible ways that a particular data store can fail, and
>we
>> > >>>>>>>>>> would like
>> > >>>>>>>>>>
>> > >>>>>>>>> to
>> > >>>>>>>>
>> > >>>>>>>>> support many different data stores. Rather than try to
>test
>> that
>> > >>>>>>>>>> each
>> > >>>>>>>>>> data
>> > >>>>>>>>>> store behaves well, we should ensure that we handle
>> > >>>>>>>>>> generic/expected
>> > >>>>>>>>>> errors
>> > >>>>>>>>>> in a graceful manner.
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>>
>> > >>>>>>>>>> Ismaël had a couple other quick comments/questions,
>I'll
>> answer
>> > >>>>>>>>>> here
>> > >>>>>>>>>>
>> > >>>>>>>>> -
>> > >>>>>
>> > >>>>>> We can use this to test other runners running on multiple
>> > >>>>>>>>>> machines - I
>> > >>>>>>>>>> agree. This is also necessary for a good performance
>benchmark
>> > >>>>>>>>>> test.
>> > >>>>>>>>>>
>> > >>>>>>>>>> "providing the test machines to mount the cluster" - we
>can
>> > >>>>>>>>>> discuss
>> > >>>>>>>>>>
>> > >>>>>>>>> this
>> > >>>>>>
>> > >>>>>>> further, but one possible option is that google may be
>willing to
>> > >>>>>>>>>>
>> > >>>>>>>>> donate
>> > >>>>>>
>> > >>>>>>> something to support this.
>> > >>>>>>>>>>
>> > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in
>> another
>> > >>>>>>>>>>
>> > >>>>>>>>> thread.
>> > >>>>>>
>> > >>>>>>> That's as much about the public interface we provide to
>users as
>> > >>>>>>>>>>
>> > >>>>>>>>> anything
>> > >>>>>>>>
>> > >>>>>>>>> else. I agree with your sentiment that a user should be
>able to
>> > >>>>>>>>>> expect
>> > >>>>>>>>>> predictable behavior from the different IO transforms.
>> > >>>>>>>>>>
>> > >>>>>>>>>> Thanks for everyone's questions/comments - I really am
>excited
>> > >>>>>>>>>> to see
>> > >>>>>>>>>> that
>> > >>>>>>>>>> people care about this :)
>> > >>>>>>>>>>
>> > >>>>>>>>>> Stephen
>> > >>>>>>>>>>
>> > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <
>> iemejia@gmail.com
>> > >
>> > >>>>>>>>>>
>> > >>>>>>>>> wrote:
>> > >>>>>
>> > >>>>>> Hello,
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really
>interesting,
>> I
>> > >>>>>>>>>>> would
>> > >>>>>>>>>>> really
>> > >>>>>>>>>>> like to help with this. I have never played with
>Kubernetes
>> but
>> > >>>>>>>>>>> this
>> > >>>>>>>>>>> seems
>> > >>>>>>>>>>> a really nice chance to do something useful with it.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
>> > container
>> > >>>>>>>>>>>
>> > >>>>>>>>>> images
>> > >>>>>>>>
>> > >>>>>>>>> and in some particular cases ‘clusters’ of containers
>using
>> > >>>>>>>>>>> docker-compose
>> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be
>really
>> > >>>>>>>>>>> nice to
>> > >>>>>>>>>>>
>> > >>>>>>>>>> have
>> > >>>>>>>>
>> > >>>>>>>>> this at the Beam level, in particular to try to test more
>> complex
>> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is
>to
>> > achieve
>> > >>>>>>>>>>> this for
>> > >>>>>>>>>>> example:
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka
>nodes, I
>> > >>>>>>>>>>> would
>> > >>>>>>>>>>> like to
>> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill
>a
>> node),
>> > >>>>>>>>>>> or
>> > >>>>>>>>>>> simulate
>> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as
>expected
>> > >>>>>>>>>>> in the
>> > >>>>>>>>>>> Beam
>> > >>>>>>>>>>> pipeline for the given runner.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Another related idea is to improve IO consistency:
>Today the
>> > >>>>>>>>>>> different IOs
>> > >>>>>>>>>>> have small differences in their failure behavior, I
>really
>> > >>>>>>>>>>> would like
>> > >>>>>>>>>>> to be
>> > >>>>>>>>>>> able to predict with more precision what will happen in
>case
>> of
>> > >>>>>>>>>>>
>> > >>>>>>>>>> errors,
>> > >>>>>>
>> > >>>>>>> e.g. what is the correct behavior if I am writing to a
>Kafka
>> > >>>>>>>>>>> node and
>> > >>>>>>>>>>> there
>> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or
>no ?
>> and
>> > >>>>>>>>>>> what
>> > >>>>>>>>>>> if it
>> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming
>> > >>>>>>>>>>> checkpointing?
>> > >>>>>>>>>>> Or do
>> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am
>not
>> sure
>> > >>>>>>>>>>> about
>> > >>>>>>>>>>> what
>> > >>>>>>>>>>> happens (or if the expected behavior depends on the
>runner),
>> > >>>>>>>>>>> but well
>> > >>>>>>>>>>> maybe
>> > >>>>>>>>>>> it is just that I don’t know and we have tests to
>ensure
>> this.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Of course both are really hard problems, but I think
>with
>> your
>> > >>>>>>>>>>> proposal we
>> > >>>>>>>>>>> can try to tackle them, as well as the performance
>ones. And
>> > >>>>>>>>>>> apart of
>> > >>>>>>>>>>> the
>> > >>>>>>>>>>> data stores, I think it will be also really nice to be
>able
>> to
>> > >>>>>>>>>>> test
>> > >>>>>>>>>>>
>> > >>>>>>>>>> the
>> > >>>>>>
>> > >>>>>>> runners in a distributed manner.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> So what is the next step? How do you imagine such
>integration
>> > >>>>>>>>>>> tests?
>> > >>>>>>>>>>> ? Who
>> > >>>>>>>>>>> can provide the test machines so we can mount the
>cluster?
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Maybe my ideas are a bit too far away for an initial
>setup,
>> but
>> > >>>>>>>>>>> it
>> > >>>>>>>>>>> will be
>> > >>>>>>>>>>> really nice to start working on this.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Ismaël
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <
>> > >>>>>>>>>>> amitsela33@gmail.com
>> > >>>>>>>>>>> wrote:
>> > >>>>>>>>>>>
>> > >>>>>>>>>>> Hi Stephen,
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> I was wondering about how we plan to use the data
>stores
>> > across
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> executions.
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> Clearly, it's best to setup a new instance (container)
>for
>> > every
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> test,
>> > >>>>>>
>> > >>>>>>> running a "standalone" store (say HBase/Cassandra for
>> > >>>>>>>>>>>> example), and
>> > >>>>>>>>>>>> once
>> > >>>>>>>>>>>> the test is done, teardown the instance. It should
>also be
>> > >>>>>>>>>>>> agnostic
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> to
>> > >>>>>>
>> > >>>>>>> the
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> runtime environment (e.g., Docker on Kubernetes).
>> > >>>>>>>>>>>> I'm wondering though what's the overhead of managing
>such a
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> deployment
>> > >>>>>>
>> > >>>>>>> which could become heavy and complicated as more IOs are
>> > >>>>>>>>>>>> supported
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> and
>> > >>>>>>
>> > >>>>>>> more
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> test cases introduced.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Another way to go would be to have small clusters of
>> different
>> > >>>>>>>>>>>> data
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> stores
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> and run against new "namespaces" (while lazily
>evicting old
>> > >>>>>>>>>>>> ones),
>> > >>>>>>>>>>>> but I
>> > >>>>>>>>>>>> think this is less likely as maintaining a distributed
>> > instance
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> (even
>> > >>>>>
>> > >>>>>> a
>> > >>>>>>>>
>> > >>>>>>>>> small one) for each data store sounds even more complex.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> A third approach would be to to simply have an
>"embedded"
>> > >>>>>>>>>>>> in-memory
>> > >>>>>>>>>>>> instance of a data store as part of a test that runs
>against
>> > it
>> > >>>>>>>>>>>> (such as
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> an
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> embedded Kafka, though not a data store).
>> > >>>>>>>>>>>> This is probably the simplest solution in terms of
>> > >>>>>>>>>>>> orchestration,
>> > >>>>>>>>>>>> but it
>> > >>>>>>>>>>>> depends on having a proper "embedded" implementation
>for an
>> > IO.
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Does this make sense to you ? have you considered it ?
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Thanks,
>> > >>>>>>>>>>>> Amit
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofr� <
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>> jb@nanthrax.net
>> > >>>>>
>> > >>>>>> wrote:
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>> Hi Stephen,
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> as already discussed a bit together, it sounds great
>!
>> > >>>>>>>>>>>>> Especially I
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>> like
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> it as a both integration test platform and good
>coverage for
>> > >>>>>>>>>>>>> IOs.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> I'm very late on this but, as said, I will share with
>you
>> my
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>> Marathon
>> > >>>>>>
>> > >>>>>>> JSON and Mesos docker images.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> By the way, I started to experiment a bit kubernetes
>and
>> > >>>>>>>>>>>>> swamp but
>> > >>>>>>>>>>>>> it's
>> > >>>>>>>>>>>>> not yet complete. I will share what I have on the
>same
>> github
>> > >>>>>>>>>>>>> repo.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Thanks !
>> > >>>>>>>>>>>>> Regards
>> > >>>>>>>>>>>>> JB
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Hi everyone!
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Currently we have a good set of unit tests for our
>IO
>> > >>>>>>>>>>>>>> Transforms -
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> those
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> tend to run against in-memory versions of the data
>stores.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> However,
>> > >>>>>
>> > >>>>>> we'd
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> like to further increase our test coverage to include
>> > >>>>>>>>>>>>>> running them
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> against
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> real instances of the data stores that the IO
>Transforms
>> > work
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> against
>> > >>>>>>>>
>> > >>>>>>>>> (e.g.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> cassandra, mongodb, kafka, etc\u2026), which means we'll
>need
>> to
>> > >>>>>>>>>>>>>> have
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> real
>> > >>>>>>>>
>> > >>>>>>>>> instances of various data stores.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Additionally, if we want to do performance
>regression
>> > >>>>>>>>>>>>>> detection,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> it's
>> > >>>>>>>>
>> > >>>>>>>>> important to have instances of the services that behave
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> realistically,
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> which isn't true of in-memory or dev versions of the
>> services.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Proposed solution
>> > >>>>>>>>>>>>>> -------------------------
>> > >>>>>>>>>>>>>> If we accept this proposal, we would create an
>> > >>>>>>>>>>>>>> infrastructure for
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> running
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> real instances of data stores inside of containers,
>using
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> container
>> > >>>>>
>> > >>>>>> management software like mesos/marathon, kubernetes, docker
>> > >>>>>>>>>>>>>> swarm,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> etc\u2026
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> to
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> manage the instances.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> This would enable us to build integration tests that
>run
>> > >>>>>>>>>>>>>> against
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> those
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> real
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> instances and performance tests that run against
>those
>> real
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> instances
>> > >>>>>>>>
>> > >>>>>>>>> (like
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> those that Jason Kuster is proposing elsewhere.)
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Why do we need one centralized set of instances vs
>just
>> > having
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> various
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> people host their own instances?
>> > >>>>>>>>>>>>>> -------------------------
>> > >>>>>>>>>>>>>> Reducing flakiness of tests is key. By not having
>> > dependencies
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> from
>> > >>>>>
>> > >>>>>> the
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> core project on external services/instances of data
>stores
>> > >>>>>>>>>>>>>> we have
>> > >>>>>>>>>>>>>> guaranteed access to the services and the group can
>fix
>> > issues
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> that
>> > >>>>>
>> > >>>>>> arise.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> An exception would be something that has an ops team
>> > >>>>>>>>>>>>>> supporting it
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> (eg,
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> AWS, Google Cloud or other professionally managed
>service) -
>> > >>>>>>>>>>>>>> those
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> we
>> > >>>>>>>>
>> > >>>>>>>>> trust
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> will be stable.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> There may be a lot of different data stores needed -
>how
>> > >>>>>>>>>>>>>> will we
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> maintain
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> them?
>> > >>>>>>>>>>>>>> -------------------------
>> > >>>>>>>>>>>>>> It will take work above and beyond that of a normal
>set of
>> > >>>>>>>>>>>>>> unit
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> tests
>> > >>>>>>>>
>> > >>>>>>>>> to
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> build and maintain integration/performance tests &
>their
>> data
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> store
>> > >>>>>
>> > >>>>>> instances.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Setup & maintenance of the data store containers and
>data
>> > >>>>>>>>>>>>>> store
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> instances
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> on it must be automated. It also has to be as simple
>of a
>> > >>>>>>>>>>>>>> setup as
>> > >>>>>>>>>>>>>> possible, and we should avoid hand tweaking the
>> containers -
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> expecting
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> checked in scripts/dockerfiles is key.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Aligned with the community ownership approach of
>Apache,
>> as
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> members
>> > >>>>>
>> > >>>>>> of
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> the
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> community are excited to contribute & maintain those
>tests
>> > >>>>>>>>>>>>>> and the
>> > >>>>>>>>>>>>>> integration/performance tests, people will be able
>to step
>> > >>>>>>>>>>>>>> up and
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> do
>> > >>>>>>
>> > >>>>>>> that.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> If there is no longer support for maintaining a
>particular
>> > >>>>>>>>>>>>>> set of
>> > >>>>>>>>>>>>>> integration & performance tests and their data store
>> > >>>>>>>>>>>>>> instances,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> then
>> > >>>>>>
>> > >>>>>>> we
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> can
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> disable those tests. We may document on the website
>what
>> IO
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Transforms
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> have
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> current integration/performance tests so users know
>what
>> > >>>>>>>>>>>>>> level of
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> testing
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> the various IO Transforms have.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> What about requirements for the container management
>> > software
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> itself?
>> > >>>>>>>>
>> > >>>>>>>>> -------------------------
>> > >>>>>>>>>>>>>> * We should have the data store instances themselves
>in
>> > >>>>>>>>>>>>>> Docker.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> Docker
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> allows new instances to be spun up in a quick,
>reproducible
>> > way
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> and
>> > >>>>>
>> > >>>>>> is
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> fairly platform independent. It has wide support from
>a
>> > >>>>>>>>>>>>>> variety of
>> > >>>>>>>>>>>>>> different container management services.
>> > >>>>>>>>>>>>>> * As little admin work required as possible.
>Crashing
>> > >>>>>>>>>>>>>> instances
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> should
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> be
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> restarted, setup should be simple, everything
>possible
>> > >>>>>>>>>>>>>> should be
>> > >>>>>>>>>>>>>> scripted/scriptable.
>> > >>>>>>>>>>>>>> * Logs and test output should be on a publicly
>available
>> > >>>>>>>>>>>>>> website,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> without
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> needing to log into test execution machine.
>Centralized
>> > >>>>>>>>>>>>>> capture of
>> > >>>>>>>>>>>>>> monitoring info/logs from instances running in the
>> > containers
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> would
>> > >>>>>
>> > >>>>>> support
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> this. Ideally, this would just be supported by the
>> container
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> software
>> > >>>>>>>>
>> > >>>>>>>>> out
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> of the box.
>> > >>>>>>>>>>>>>> * It'd be useful to have good persistent volume in
>the
>> > >>>>>>>>>>>>>> container
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> management
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> software so that databases don't have to reload
>large data
>> > >>>>>>>>>>>>>> sets
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> every
>> > >>>>>>>>
>> > >>>>>>>>> time.
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> * The containers may be a place to execute runners
>> > >>>>>>>>>>>>>> themselves if
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> we
>> > >>>>>
>> > >>>>>> need
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> larger runner instances, so it should play well with
>Spark,
>> > >>>>>>>>>>>>>> Flink,
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> etc\u2026
>> > >>>>>>>>>>>
>> > >>>>>>>>>>>> As I discussed earlier on the mailing list, it looks
>like
>> > >>>>>>>>>>>>>> hosting
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> docker
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> containers on kubernetes, docker swarm or
>mesos+marathon
>> > >>>>>>>>>>>>>> would be
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>> a
>> > >>>>>
>> > >>>>>> good
>> > >>>>>>>>>>>>
>> > >>>>>>>>>>>>> solution.
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> Thanks,
>> > >>>>>>>>>>>>>> Stephen Sisk
>> > >>>>>>>>>>>>>>
>> > >>>>>>>>>>>>>> --
>> > >>>>>>>>>>>>> Jean-Baptiste Onofr�
>> > >>>>>>>>>>>>> jbonofre@apache.org
>> > >>>>>>>>>>>>> http://blog.nanthrax.net
>> > >>>>>>>>>>>>> Talend - http://www.talend.com
>> > >>>>>>>>>>>>>
>> > >>>>>>>>>>>>> --
>> > >>>>>>>> Jean-Baptiste Onofr�
>> > >>>>>>>> jbonofre@apache.org
>> > >>>>>>>> http://blog.nanthrax.net
>> > >>>>>>>> Talend - http://www.talend.com
>> > >>>>>>>>
>> > >>>>>>>>
>> > >>>
>> > >
>> > >
>> >
>>

Re: Hosting data stores for IO Transform testing

Posted by Ismaël Mejía <ie...@gmail.com>.
Thanks for your analysis Stephen, good arguments / references.

One quick question. Have you checked the APIs of both (Mesos/Kubernetes) to
see if we can programmatically do more complex tests (I suppose so, but you
don't mention how easy or whether those are possible), for example to
simulate a slow networking slave (to test stragglers), or to arbitrarily
kill one slave (e.g. if I want to test the correct behavior of a runner/IO
that is reading from it)?
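
To make the question concrete, what I have in mind is something like the
following sketch, using the fabric8 kubernetes-client (the "io-datastores"
namespace and the "cassandra-2" pod name are invented placeholders, and I
have not tried this against a real cluster):

    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class KillDataStoreNode {
      public static void main(String[] args) {
        // Connects using the local kubeconfig credentials.
        try (KubernetesClient client = new DefaultKubernetesClient()) {
          // Simulate a node failure by deleting one member of the cluster;
          // whatever controller manages the pod should then restart it.
          client.pods().inNamespace("io-datastores")
                .withName("cassandra-2").delete();
        }
      }
    }

A slow node would probably need a traffic-shaping proxy in front of the
container, which seems harder to script.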

Another missing point in the review is the availability of ready-to-play
packages. I think in this area mesos/dcos seems more advanced, no? I haven't
looked recently, but at least 6 months ago there were not many helm packages
ready, for example to test kafka or the hadoop ecosystem stuff (hdfs, hbase,
etc). Has this been improved? Preparing these is also a considerable amount
of work; on the other hand, this could also be a chance to contribute to
kubernetes.

Regards,
Ismaël



On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <si...@google.com.invalid>
wrote:

> hi!
>
> I've been continuing this investigation, and have some more info to report,
> and hopefully we can start making some decisions.
>
> To support performance testing, I've been investigating mesos+marathon and
> kubernetes for running data stores in their high availability mode. I have
> been examining features that kubernetes/mesos+marathon use to support this.
>
> Setting up a multi-node cluster in a high availability mode tends to be
> more expensive time-wise than the single node instances I've played around
> with in the past. Rather than do a full build out with both kubernetes and
> mesos, I'd like to pick one of the two options to build the prototype
> cluster with. If the prototype doesn't go well, we could still go back to
> the other option, but I'd like to change us from a mode of "let's look at
> all the options" to one of "here's the favorite, let's prove that works for
> us".
>
> Below are the features that I've seen are important to multi-node instances
> of data stores. I'm sure other folks on the list have done this before, so
> feel free to pipe up if I'm missing a good solution to a problem.
>
> DNS/Discovery
>
> --------------------
>
> Necessary for talking between nodes (eg, cassandra nodes all need to be
> able to talk to a set of seed nodes.)
>
> * Kubernetes has built-in DNS/discovery between nodes.
>
> * Mesos has supports this via mesos-dns, which isn't a part of core mesos,
> but is in dcos, which is the mesos distribution I've been using and that I
> would expect us to use.
>
> Instances properly distributed across nodes
>
> ------------------------------------------------------------
>
> If multiple instances of a data source end up on the same underlying VM, we
> may not get good performance out of those instances since the underlying VM
> may be more taxed than other VMs.
>
> * Kubernetes has a beta feature StatefulSets[1] which allow for containers
> distributed so that there's one container per underlying machine (as well
> as a lot of other useful features like easy stable dns names.)
>
> * Mesos can support this via the built in UNIQUE constraint [2]
>
> Load balancing
>
> --------------------
>
> Incoming requests from users need to be distributed to the various machines
> - this is important for many data stores' high availability modes.
>
> * Kubernetes supports easily hooking up to an external load balancer when
> on a cloud (and can be configured to work with a built-in load balancer if
> not)
>
> * Mesos supports this via marathon-lb [3], which is an install-able package
> in DC/OS
>
> Persistent Volumes tied to specific instances
>
> ------------------------------------------------------------
>
> Databases often need persistent state (for example to store the data :), so
> it's an important part of running our service.
>
> * Kubernetes StatefulSets supports this
>
> * Mesos+marathon apps with persistent volumes supports this [4] [5]
>
> As I mentioned above, I'd like to focus on either kubernetes or mesos for
> my investigation, and as I go further along, I'm seeing kubernetes as
> better suited to our needs.
>
> (1) It supports more of the features we want out of the box and with
> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
> requires marathon-lb to be installed and mesos-dns to be configured.
>
> (2) I'm also finding that there seem to be more examples of using
> kubernetes to solve the types of problems we're working on. This is
> somewhat subjective, but in my experience as I've tried to learn both
> kubernetes and mesos, I personally found it generally easier to get
> kubernetes running than mesos due to the tutorials/examples available for
> kubernetes.
>
> (3) Lower cost of initial setup - as I discussed in a previous mail[6],
> kubernetes was far easier to get set up even when I knew the exact steps.
> Mesos took me around 27 steps [7], which involved a lot of config that was
> easy to get wrong (it took me about 5 tries to get all the steps correct in
> one go.) Kubernetes took me around 8 steps and very little config.
>
> Given that, I'd like to focus my investigation/prototyping on Kubernetes.
> To
> be clear, it's fairly close and I think both Mesos and Kubernetes could
> support what we need, so if we run into issues with kubernetes, Mesos still
> seems like a viable option that we could fall back to.
>
> Thanks,
> Stephen
>
>
> [1] Kubernetes StatefulSets
> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
>
> [2] mesos unique constraint -
> https://mesosphere.github.io/marathon/docs/constraints.html
>
> [3]
> https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html
>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>
> [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>
> [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>
> [6] Container Orchestration software for hosting data stores
> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>
> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
>
>
> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org> wrote:
>
> > Just a quick drive-by comment: how tests are laid out has non-trivial
> > tradeoffs on how/where continuous integration runs, and how results are
> > integrated into the tooling. The current state is certainly not ideal
> > (e.g., due to multiple test executions some links in Jenkins point where
> > they shouldn't), but most other alternatives had even bigger drawbacks at
> > the time. If someone has great ideas that don't explode the number of
> > modules, please share ;-)
> >
> > On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <ec...@gmail.com>
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > Thanks for taking the time to comment.
> > >
> > > My comments are below in the email:
> > >
> > >
> > > Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
> > >
> > >> hey Etienne -
> > >>
> > >> thanks for your thoughts and thanks for sharing your experiences. I
> > >> generally agree with what you're saying. Quick comments below:
> > >>
> > >>> IT are stored alongside with UT in src/test directory of the IO but
> > >>> they might go to dedicated module, waiting for a consensus
> > >>
> > >> I don't have a strong opinion or feel that I've worked enough with
> > >> maven to understand all the consequences - I'd love for someone with
> > >> more maven experience to weigh in. If this becomes blocking, I'd say
> > >> check it in, and we can refactor later if it proves problematic.
> > >>
> > > Sure, not a blocking point, it could be refactored afterwards. Just as
> > > a reminder, JB mentioned that storing IT in a separate module allows to
> > > have more coherence between all IT (same behavior) and to do cross-IO
> > > integration tests. JB, have you experienced some long term drawbacks of
> > > storing IT in a separate module, like, for example, more difficult
> > > maintenance due to "distance" from production code?
> > >
> > >
> > >>> Also IMHO, it is better that tests load/clean data than making
> > >>> assumptions about the running order of the tests.
> > >>
> > >> I definitely agree that we don't want to make assumptions about the
> > >> running order of the tests - that way lies pain. :) It will be
> > >> interesting to see how the performance tests work out since they will
> > >> need more data (and thus loading data can take much longer.)
> > >>
> > > Yes, performance testing might push in the direction of data loading
> > > from outside the tests due to loading time.
> > >
> > >> This should also be an easier problem for read tests than for write
> > >> tests - if we have long running instances, read tests don't really
> > >> need cleanup. And if write tests only write a small amount of data, as
> > >> long as we are sure we're writing to uniquely identifiable locations
> > >> (ie, new table per test or something similar), we can clean up the
> > >> write test data on a slower schedule.
> > >>
> > > I agree
> > >
> > >>> this will tend to go in the direction of long running data store
> > >>> instances rather than data store instances started (and optionally
> > >>> loaded) before tests.
> > >>
> > >> It may be easiest to start with a "data stores stay running"
> > >> implementation, and then if we see issues with that move towards tests
> > >> that start/stop the data stores on each run. One thing I'd like to
> > >> make sure is that we're not manually tweaking the configurations for
> > >> data stores. One way we could do that is to destroy/recreate the data
> > >> stores on a slower schedule - maybe once per week. That way if the
> > >> script is changed or the data store instances are changed, we'd be
> > >> able to detect it relatively soon while still removing the need for
> > >> the tests to manage the data stores.
> > >>
> > > I agree. In addition to manual configuration tweaking, there might be
> > > cases in which a data store re-partitions data during a test or after
> > > some tests while the dataset changes. The IO must be tolerant to that,
> > > but the asserts (number of bundles for example) in the test must not
> > > fail in that case. I would also prefer, if possible, that the tests do
> > > not manage data stores (not set them up, not start them, not stop them)
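> > >
> > > To make that "volume-tolerant assert" idea concrete, here is a rough
> > > sketch (not the code from the PR; the read API shape and the
> > > 1000-document count are assumptions) that asserts on the total element
> > > count rather than on bundle or shard counts, so a re-partition
> > > mid-test would not break it:
> > >
> > >     // Fragment from a test method; assumes a TestPipeline "pipeline"
> > >     // and a connection configuration "connConfig" already exist.
> > >     PCollection<String> docs = pipeline.apply(
> > >         ElasticsearchIO.read().withConnectionConfiguration(connConfig));
> > >     // Assert on the overall count, not on how the runner bundled the read.
> > >     PAssert.thatSingleton(docs.apply(Count.<String>globally()))
> > >         .isEqualTo(1000L);
> > >     pipeline.run().waitUntilFinish();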
> > >
> > >
> > >> as a general note, I suspect many of the folks in the states will be
> > >> on holiday until Jan 2nd/3rd.
> > >>
> > >> S
> > >>
> > >> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
> > >> <echauchot@gmail.com> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> Recently we had a discussion about integration tests of IOs. I'm
> > >>> preparing a PR for integration tests of the elasticSearch IO
> > >>> (https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> > >>> as a first shot) which are very important IMHO because they helped
> > >>> catch some bugs that UT could not (volume, data store instance
> > >>> sharing, real data store instance ...)
> > >>>
> > >>> I would like to have your thoughts/remarks about the points below.
> > >>> Some of these points are also discussed here
> > >>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> > >>> :
> > >>>
> > >>> - UT and IT have a similar architecture, but while UT focus on
> > >>> testing the correct behavior of the code including corner cases and
> > >>> use an embedded in-memory data store, IT assume that the behavior is
> > >>> correct (strong UT) and focus on higher volume testing and testing
> > >>> against real data store instance(s)
> > >>>
> > >>> - For now, IT are stored alongside with UT in the src/test directory
> > >>> of the IO but they might go to a dedicated module, waiting for a
> > >>> consensus. Maven is not configured to run them automatically because
> > >>> the data store is not available on the jenkins server yet
> > >>>
> > >>> - For now, they only use DirectRunner, but they will be run against
> > >>> each runner.
> > >>>
> > >>> - IT do not setup the data store instance (as stated in the above
> > >>> document); they assume that one is already running (hardcoded
> > >>> configuration in the test for now, waiting for a common solution to
> > >>> pass configuration to IT). A docker container script is provided in
> > >>> the contrib directory as a starting point for whatever orchestration
> > >>> software will be chosen.
> > >>>
> > >>> - IT load and clean test data before and after each test if needed.
> > >>> It is simpler to do so because some tests need an empty data store
> > >>> (write test) and because, as discussed in the document, tests might
> > >>> not be the only users of the data store. Also IMHO, it is better that
> > >>> tests load/clean data than making assumptions about the running order
> > >>> of the tests. (A rough sketch of such a test follows at the end of
> > >>> this mail.)
> > >>>
> > >>> If we generalize this pattern to all IT tests, this will tend to go
> > >>> in the direction of long running data store instances rather than
> > >>> data store instances started (and optionally loaded) before tests.
> > >>>
> > >>> Besides, if we were to change our minds and load data from outside
> > >>> the tests, a logstash script is provided.
> > >>>
> > >>> If you have any thoughts or remarks I'm all ears :)
> > >>>
> > >>> Regards,
> > >>>
> > >>> Etienne
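> > >>>
> > >>> As a rough illustration of the "external instance + per-test
> > >>> load/clean" pattern above, assuming JUnit and an invented
> > >>> ElasticsearchTestUtils helper - this is a sketch, not the code from
> > >>> the PR:
> > >>>
> > >>>     import org.junit.After;
> > >>>     import org.junit.Before;
> > >>>     import org.junit.Test;
> > >>>
> > >>>     public class ElasticsearchIOIT {
> > >>>       // Hardcoded for now; to be replaced by whatever common
> > >>>       // configuration-passing mechanism is agreed on.
> > >>>       private static final String ES_HOST = "localhost";
> > >>>       private static final int ES_PORT = 9200;
> > >>>
> > >>>       @Before
> > >>>       public void loadTestData() throws Exception {
> > >>>         // Load documents into a dedicated index on the already
> > >>>         // running external instance (hypothetical helper class).
> > >>>         ElasticsearchTestUtils.loadData(ES_HOST, ES_PORT, "beam-it");
> > >>>       }
> > >>>
> > >>>       @After
> > >>>       public void cleanTestData() throws Exception {
> > >>>         ElasticsearchTestUtils.deleteIndex(ES_HOST, ES_PORT, "beam-it");
> > >>>       }
> > >>>
> > >>>       @Test
> > >>>       public void testReadVolume() throws Exception {
> > >>>         // Run the pipeline against the external instance and assert
> > >>>         // on totals, as discussed above.
> > >>>       }
> > >>>     }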
> > >>>
> > >>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
> > >>>
> > >>>> Hi Stephen,
> > >>>>
> > >>>> the purpose of having them in a specific module is to share
> > >>>> resources and apply the same behavior from the IT perspective and be
> > >>>> able to have IT "cross" IO (for instance, reading from JMS and
> > >>>> sending to Kafka, I think that's the key idea for integration
> > >>>> tests).
> > >>>>
> > >>>> For instance, in Karaf, we have:
> > >>>> - utest in each module
> > >>>> - itest module containing itests for all modules all together
> > >>>>
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> > >>>>
> > >>>>> Hi Etienne,
> > >>>>>
> > >>>>> thanks for following up and answering my questions.
> > >>>>>
> > >>>>> re: where to store integration tests - having them all in a
> > >>>>> separate module is an interesting idea. I couldn't find JB's
> > >>>>> comments about moving them into a separate module in the PR - can
> > >>>>> you share the reasons for doing so? The IO integration/perf tests
> > >>>>> do seem like they'll need to be treated in a special manner, but
> > >>>>> given that there is already an IO specific module, it may just be
> > >>>>> that we need to treat all the ITs in the IO module the same way. I
> > >>>>> don't have strong opinions either way right now.
> > >>>>>
> > >>>>> S
> > >>>>>
> > >>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot
> > >>>>> <echauchot@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi guys,
> > >>>>>
> > >>>>> @Stephen: I addressed all your comments directly in the PR, thanks!
> > >>>>> I just wanted to comment here about the docker image I used: the
> > >>>>> only official Elastic image contains only ElasticSearch. But for
> > >>>>> testing I needed logstash (for ingestion) and kibana (not for
> > >>>>> integration tests, but to easily test REST requests to ES using
> > >>>>> sense). This is why I use an ELK
> > >>>>> (Elasticsearch+Logstash+Kibana) image. This one is released under
> > >>>>> the Apache 2 license.
> > >>>>>
> > >>>>> Besides, there is also a point about where to store integration
> > >>>>> tests: JB proposed in the PR to store integration tests in a
> > >>>>> dedicated module rather than directly in the IO module (like I
> > >>>>> did).
> > >>>>>
> > >>>>> Etienne
> > >>>>>
> > >>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
> > >>>>>
> > >>>>>> hey!
> > >>>>>>
> > >>>>>> thanks for sending this. I'm very excited to see this change. I
> > >>>>>> added some detail-oriented code review comments in addition to
> > >>>>>> what I've discussed here.
> > >>>>>>
> > >>>>>> The general goal is to allow for re-usable instantiation of
> > >>>>>> particular data store instances and this seems like a good start.
> > >>>>>> Looks like you also have a script to generate test data for your
> > >>>>>> tests - that's great.
> > >>>>>>
> > >>>>>> The next steps (definitely not blocking your work) will be to have
> > >>>>>> ways to create instances from the docker images you have here, and
> > >>>>>> use them in the tests. We'll need support in the test framework
> > >>>>>> for that since it'll be different on developer machines and in the
> > >>>>>> beam jenkins cluster, but your scripts here allow someone running
> > >>>>>> these tests locally to not have to worry about getting the
> > >>>>>> instance set up and can manually adjust, so this is a good
> > >>>>>> incremental step.
> > >>>>>>
> > >>>>>> I have some thoughts now that I'm reviewing your scripts (that I
> > >>>>>> didn't have previously, so we are learning this together):
> > >>>>>> * It may be useful to try and document why we chose a particular
> > >>>>>> docker image as the base (ie, "this is the official supported
> > >>>>>> elastic search docker image" or "this image has several data
> > >>>>>> stores together that can be used for a couple different tests") -
> > >>>>>> I'm curious as to whether the community thinks that is important
> > >>>>>>
> > >>>>>> One thing that I called out in the comment that's worth mentioning
> > >>>>>> on the larger list - if you want to specify which specific runners
> > >>>>>> a test uses, that can be controlled in the pom for the module. I
> > >>>>>> updated the testing doc mentioned previously in this thread with a
> > >>>>>> TODO to talk about this more. I think we should also make it so
> > >>>>>> that IO modules have that automatically, so developers don't have
> > >>>>>> to worry about it.
> > >>>>>>
> > >>>>>> S
> > >>>>>>
> > >>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot
> > >>>>>> <echauchot@gmail.com> wrote:
> > >>>>>>
> > >>>>>> Stephen,
> > >>>>>>
> > >>>>>> As discussed, I added injection script, docker containers scripts
> > >>>>>> and integration tests to the sdks/java/io/elasticsearch/contrib
> > >>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
> > >>>>>> directory in that PR:
> > >>>>>> https://github.com/apache/incubator-beam/pull/1439.
> > >>>>>>
> > >>>>>> These work well but they are a first shot. Do you have any
> > >>>>>> comments about those?
> > >>>>>>
> > >>>>>> Besides, I am not very sure that these files should be in the IO
> > >>>>>> itself (even in the contrib directory, out of maven source
> > >>>>>> directories). Any thoughts?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> Etienne
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
> > >>>>>>
> > >>>>>>> It's great to hear more experiences.
> > >>>>>>>
> > >>>>>>> I'm also glad to hear that people see real value in the high
> > >>>>>>> volume/performance benchmark tests. I tried to capture that in
> > >>>>>>> the Testing doc I shared, under "Reasons for Beam Test
> > >>>>>>> Strategy". [1]
> > >>>>>>>
> > >>>>>>> It does generally sound like we're in agreement here. Areas of
> > >>>>>>> discussion I see:
> > >>>>>>> 1. People like the idea of bringing up fresh instances for each
> > >>>>>>> test rather than keeping instances running all the time, since
> > >>>>>>> that ensures no contamination between tests. That seems
> > >>>>>>> reasonable to me; if we see flakiness in the tests or we note
> > >>>>>>> that setting up/tearing down instances is taking a lot of time,
> > >>>>>>> we can revisit.
> > >>>>>>> 2. Deciding on cluster management software/orchestration software
> > >>>>>>> - I want to make sure we land on the right tool here since
> > >>>>>>> choosing the wrong tool could result in administration of the
> > >>>>>>> instances taking more work. I suspect that's a good place for a
> > >>>>>>> follow up discussion, so I'll start a separate thread on that.
> > >>>>>>> I'm happy with whatever tool we choose, but I want to make sure
> > >>>>>>> we take a moment to consider different options and have a reason
> > >>>>>>> for choosing one.
> > >>>>>>>
> > >>>>>>> Etienne - thanks for being willing to port your creation/other
> > >>>>>>> scripts over. You might be a good early tester of whether this
> > >>>>>>> system works well for everyone.
> > >>>>>>>
> > >>>>>>> Stephen
> > >>>>>>>
> > >>>>>>> [1] Reasons for Beam Test Strategy -
> > >>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> > >>>>>>>
> > >>>>>>> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
> > >>>>>>> <jb...@nanthrax.net> wrote:
> > >>>>>>>
> > >>>>>>>> I second Etienne there.
> > >>>>>>>>
> > >>>>>>>> We worked together on the ElasticsearchIO and definitely, the
> > >>>>>>>> most valuable tests we did were integration tests with ES on
> > >>>>>>>> docker and high volume.
> > >>>>>>>>
> > >>>>>>>> I think we have to distinguish the two kinds of tests:
> > >>>>>>>> 1. utests are located in the IO itself and basically they should
> > >>>>>>>> cover the core behaviors of the IO
> > >>>>>>>> 2. itests are located as contrib in the IO (they could be part
> > >>>>>>>> of the IO but executed by the integration-test plugin or a
> > >>>>>>>> specific profile) that deal with "real" backends and high
> > >>>>>>>> volumes. The resources required by the itest can be bootstrapped
> > >>>>>>>> by Jenkins (for instance using Mesos/Marathon and docker images
> > >>>>>>>> as already discussed, and it's what I'm doing on my own
> > >>>>>>>> "server").
> > >>>>>>>>
> > >>>>>>>> It's basically what Stephen described.
> > >>>>>>>>
> > >>>>>>>> We have to not rely only on itests: utests are very important
> > >>>>>>>> and they validate the core behavior.
> > >>>>>>>>
> > >>>>>>>> My $0.01 ;)
> > >>>>>>>>
> > >>>>>>>> Regards
> > >>>>>>>> JB
> > >>>>>>>>
> > >>>>>>>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Stephen,
> > >>>>>>>>>
> > >>>>>>>>> I like your proposition very much and I also agree that docker
> > >>>>>>>>> + some orchestration software would be great !
> > >>>>>>>>>
> > >>>>>>>>> On the elasticsearchIO (PR to be created this week) there are
> > >>>>>>>>> docker container creation scripts and a logstash data ingestion
> > >>>>>>>>> script for the IT environment available in the contrib
> > >>>>>>>>> directory alongside the integration tests themselves. I'll be
> > >>>>>>>>> happy to make them compliant with the new IT environment.
> > >>>>>>>>>
> > >>>>>>>>> What you say below about the need for an external IT
> > >>>>>>>>> environment is particularly true. As an example with ES, what
> > >>>>>>>>> came out in the first implementation was that there were
> > >>>>>>>>> problems starting at some high volume of data (timeouts, ES
> > >>>>>>>>> windowing overflow...) that could not have been seen on the
> > >>>>>>>>> embedded ES version. Also there were some particularities of
> > >>>>>>>>> the external instance, like secondary (replica) shards, that
> > >>>>>>>>> were not visible on the embedded instance.
> > >>>>>>>>>
> > >>>>>>>>> Besides, I also favor bringing up instances before the test
> > >>>>>>>>> because it allows (amongst other things) to be sure to start on
> > >>>>>>>>> a fresh dataset for the test to be deterministic.
> > >>>>>>>>>
> > >>>>>>>>> Etienne
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Le 23/11/2016 à 02:00, Stephen Sisk a écrit :
> > >>>>>>>>>
> > >>>>>>>>>> Hi,
> > >>>>>>>>>>
> > >>>>>>>>>> I'm excited we're getting lots of discussion going. There are
> > >>>>>>>>>> many threads of conversation here, we may choose to split some
> > >>>>>>>>>> of them off into a different email thread. I'm also betting I
> > >>>>>>>>>> missed some of the questions in this thread, so apologies
> > >>>>>>>>>> ahead of time for that. Also apologies for the amount of text,
> > >>>>>>>>>> I provided some quick summaries at the top of each section.
> > >>>>>>>>>>
> > >>>>>>>>>> Amit - thanks for your thoughts. I've responded in detail
> > >>>>>>>>>> below.
> > >>>>>>>>>> Ismaël - thanks for offering to help. There's plenty of work
> > >>>>>>>>>> here to go around. I'll try and think about how we can divide
> > >>>>>>>>>> up some next steps (probably in a separate thread.) The main
> > >>>>>>>>>> next step I see is deciding between
> > >>>>>>>>>> kubernetes/mesos+marathon/docker swarm - I'm working on that,
> > >>>>>>>>>> but having lots of different thoughts on what the
> > >>>>>>>>>> advantages/disadvantages of those are would be helpful (I'm
> > >>>>>>>>>> not entirely sure of the protocol for collaborating on
> > >>>>>>>>>> sub-projects like this.)
> > >>>>>>>>>>
> > >>>>>>>>>> These issues are all related to what kind of tests we want to
> > >>>>>>>>>> write. I think a kubernetes/mesos/swarm cluster could support
> > >>>>>>>>>> all the use cases we've discussed here (and thus should not
> > >>>>>>>>>> block moving forward with this), but understanding what we
> > >>>>>>>>>> want to test will help us understand how the cluster will be
> > >>>>>>>>>> used. I'm working on a proposed user guide for testing IO
> > >>>>>>>>>> Transforms, and I'm going to send out a link to that + a short
> > >>>>>>>>>> summary to the list shortly so folks can get a better sense of
> > >>>>>>>>>> where I'm coming from.
> > >>>>>>>>>>
> > >>>>>>>>>> Here's my thinking on the questions we've raised here -
> > >>>>>>>>>>
> > >>>>>>>>>> Embedded versions of data stores for testing
> > >>>>>>>>>> --------------------
> > >>>>>>>>>> Summary: yes! But we still need real data stores to test
> > >>>>>>>>>> against.
> > >>>>>>>>>>
> > >>>>>>>>>> I am a gigantic fan of using embedded versions of the various
> > >>>>>>>>>> data stores. I think we should test everything we possibly can
> > >>>>>>>>>> using them, and do the majority of our correctness testing
> > >>>>>>>>>> using embedded versions + the direct runner. However, it's
> > >>>>>>>>>> also important to have at least one test that actually
> > >>>>>>>>>> connects to an actual instance, so we can get coverage for
> > >>>>>>>>>> things like credentials, real connection strings, etc...
> > >>>>>>>>>>
> > >>>>>>>>>> The key point is that embedded versions definitely can't cover
> > >>>>>>>>>> the performance tests, so we need to host instances if we want
> > >>>>>>>>>> to test that. I consider the integration tests/performance
> > >>>>>>>>>> benchmarks to be costly things that we do only for the IO
> > >>>>>>>>>> transforms with large amounts of community support/usage. A
> > >>>>>>>>>> random IO transform used by a few users doesn't necessarily
> > >>>>>>>>>> need integration & perf tests, but for heavily used IO
> > >>>>>>>>>> transforms, there's a lot of community value in these tests.
> > >>>>>>>>>> The maintenance proposal below scales with the amount of
> > >>>>>>>>>> community support for a particular IO transform.
> > >>>>>>>>>>
> > >>>>>>>>>> Reusing data stores ("use the data stores across executions.")
> > >>>>>>>>>> ------------------
> > >>>>>>>>>> Summary: I favor a hybrid approach: some frequently used, very
> > >>>>>>>>>> small instances that we keep up all the time + larger
> > >>>>>>>>>> multi-container data store instances that we spin up for perf
> > >>>>>>>>>> tests.
> > >>>>>>>>>>
> > >>>>>>>>>> I don't think we need to have a strong answer to this
> > >>>>>>>>>> question, but I think we do need to know what range of
> > >>>>>>>>>> capabilities we need, and use that to inform our requirements
> > >>>>>>>>>> on the hosting infrastructure. I think kubernetes/mesos +
> > >>>>>>>>>> docker can support all the scenarios I discuss below.
> > >>>>>>>>>>
> > >>>>>>>>>> I had been thinking of a hybrid approach - reuse some
> > >>>>>>>>>> instances and don't reuse others. Some tests require isolation
> > >>>>>>>>>> from other tests (eg. performance benchmarking), while others
> > >>>>>>>>>> can easily re-use the same database/data store instance over
> > >>>>>>>>>> time, provided they are written in the correct manner (eg. a
> > >>>>>>>>>> simple read or write correctness integration test)
> > >>>>>>>>>>
> > >>>>>>>>>> To me, the question of whether to use one instance over time
> > >>>>>>>>>> for a test vs spin up an instance for each test comes down to
> > >>>>>>>>>> a trade off between these factors:
> > >>>>>>>>>> 1. Flakiness of spin-up of an instance - if it's super flaky,
> > >>>>>>>>>> we'll want to keep more instances up and running rather than
> > >>>>>>>>>> bring them up/down. (this may also vary by the data store in
> > >>>>>>>>>> question)
> > >>>>>>>>>> 2. Frequency of testing - if we are running tests every 5
> > >>>>>>>>>> minutes, it may be wasteful to bring machines up/down every
> > >>>>>>>>>> time. If we run tests once a day or week, it seems wasteful to
> > >>>>>>>>>> keep the machines up the whole time.
> > >>>>>>>>>> 3. Isolation requirements - If tests must be isolated, it
> > >>>>>>>>>> means we either have to bring up the instances for each test,
> > >>>>>>>>>> or we have to have some sort of signaling mechanism to
> > >>>>>>>>>> indicate that a given instance is in use. I strongly favor
> > >>>>>>>>>> bringing up an instance per test.
> > >>>>>>>>>> 4. Number/size of containers - if we need a large number of
> > >>>>>>>>>> machines for a particular test, keeping them running all the
> > >>>>>>>>>> time will use more resources.
> > >>>>>>>>>>
> > >>>>>>>>>> The major unknown to me is how flaky it'll be to spin these
> > >>>>>>>>>> up. I'm hopeful/assuming they'll be pretty stable to bring up,
> > >>>>>>>>>> but I think the best way to test that is to start doing it.
> > >>>>>>>>>>
> > >>>>>>>>>> I suspect the sweet spot is the following: have a set of very
> > >>>>>>>>>> small data store instances that stay up to support
> > >>>>>>>>>> small-data-size post-commit end to end tests (post-commits run
> > >>>>>>>>>> frequently and the data size means the instances would not use
> > >>>>>>>>>> many resources), combined with the ability to spin up larger
> > >>>>>>>>>> instances for once a day/week performance benchmarks (these
> > >>>>>>>>>> use up more resources and are used less frequently.) That's
> > >>>>>>>>>> the mix I'll propose in my docs on testing IO transforms. If
> > >>>>>>>>>> spinning up new instances is cheap/non-flaky, I'd be fine with
> > >>>>>>>>>> the idea of spinning up instances for each test.
> > >>>>>>>>>>
> > >>>>>>>>>> Management ("what's the overhead of managing such a
> > >>>>>>>>>> deployment")
> > >>>>>>>>>> --------------------
> > >>>>>>>>>> Summary: I propose that anyone can contribute scripts for
> > >>>>>>>>>> setting up data store instances + integration/perf tests, but
> > >>>>>>>>>> if the community doesn't maintain a particular data store's
> > >>>>>>>>>> tests, we disable the tests and turn off the data store
> > >>>>>>>>>> instances.
> > >>>>>>>>>>
> > >>>>>>>>>> Management of these instances is a crucial question. First,
> > >>>>>>>>>> let's break down what tasks we'll need to do on a recurring
> > >>>>>>>>>> basis:
> > >>>>>>>>>> 1. Ongoing maintenance (update to new versions, both instance
> > >>>>>>>>>> & dependencies) - we don't want to have a lot of old versions
> > >>>>>>>>>> vulnerable to attacks/buggy
> > >>>>>>>>>> 2. Investigate breakages/regressions
> > >>>>>>>>>> (I'm betting there will be more things we'll discover - let me
> > >>>>>>>>>> know if you have suggestions)
> > >>>>>>>>>>
> > >>>>>>>>>> There are a couple of goals I see:
> > >>>>>>>>>> 1. We should only do sys admin work for things that give us a
> > >>>>>>>>>> lot of benefit. (ie, don't build IT/perf/data store set up
> > >>>>>>>>>> scripts for data stores without a large community)
> > >>>>>>>>>> 2. We should do as much as possible of testing via
> > >>>>>>>>>> in-memory/embedded testing (as you brought up).
> > >>>>>>>>>> 3. Reduce the amount of manual administration overhead
> > >>>>>>>>>>
> > >>>>>>>>>> As I discussed above, I think that integration
> > >>>>>>>>>> tests/performance benchmarks are costly things that we should
> > >>>>>>>>>> do only for the IO transforms with large amounts of community
> > >>>>>>>>>> support/usage. Thus, I propose that we limit the IO transforms
> > >>>>>>>>>> that get integration tests & performance benchmarks to those
> > >>>>>>>>>> that have community support for maintaining the data store
> > >>>>>>>>>> instances.
> > >>>>>>>>>>
> > >>>>>>>>>> We can enforce this organically using some simple rules:
> > >>>>>>>>>> 1. Investigating breakages/regressions: if a given
> > >>>>>>>>>> integration/perf test starts failing and no one investigates
> > >>>>>>>>>> it within a set period of time (a week?), we disable the tests
> > >>>>>>>>>> and shut off the data store instances if we have instances
> > >>>>>>>>>> running. When someone wants to step up and support it again,
> > >>>>>>>>>> they can fix the test, check it in, and re-enable the test.
> > >>>>>>>>>> 2. Ongoing maintenance: every N months, file a jira issue that
> > >>>>>>>>>> is just "is the IO Transform X data store up to date?" - if
> > >>>>>>>>>> the jira is not resolved in a set period of time (1 month?),
> > >>>>>>>>>> the perf/integration tests are disabled, and the data store
> > >>>>>>>>>> instances shut off.
> > >>>>>>>>>>
> > >>>>>>>>>> This is pretty flexible -
> > >>>>>>>>>> * If a particular person or organization wants to support an
> > >>>>>>>>>> IO transform, they can. If a group of people all organically
> > >>>>>>>>>> organize to keep the tests running, they can.
> > >>>>>>>>>> * It can be mostly automated - there's not a lot of central
> > >>>>>>>>>> organizing work that needs to be done.
> > >>>>>>>>>>
> > >>>>>>>>>> Exposing the information about what IO transforms currently
> > >>>>>>>>>> have running IT/perf benchmarks on the website will let users
> > >>>>>>>>>> know what IO transforms are well supported.
> > >>>>>>>>>>
> > >>>>>>>>>> I like this solution, but I also recognize this is a tricky
> > >>>>>>>>>> problem. This is something the community needs to be
> > >>>>>>>>>> supportive of, so I'm open to other thoughts.
> > >>>>>>>>>>
> > >>>>>>>>>> Simulating failures in real nodes ("programmatic tests to
> > >>>>>>>>>> simulate failure")
> > >>>>>>>>>> -----------------
> > >>>>>>>>>> Summary: 1) Focus our testing on the code in Beam 2) We should
> > >>>>>>>>>> encourage a design pattern separating out network/retry logic
> > >>>>>>>>>> from the main IO transform logic
> > >>>>>>>>>>
> > >>>>>>>>>> We *could* create instance failure in any container management
> > >>>>>>>>>> software - we can use their programmatic APIs to determine
> > >>>>>>>>>> which containers are running the instances, and ask them to
> > >>>>>>>>>> kill the container in question. A slow node would be trickier,
> > >>>>>>>>>> but I'm sure we could figure it out - for example, add a
> > >>>>>>>>>> network proxy that would delay responses.
> > >>>>>>>>>>
> > >>>>>>>>>> However, I would argue that this type of testing doesn't gain
> > >>>>>>>>>> us a lot, and is complicated to set up. I think it will be
> > >>>>>>>>>> easier to test network errors and retry behavior in unit tests
> > >>>>>>>>>> for the IO transforms.
> > >>>>>>>>>>
> > >>>>>>>>>> Part of the way to handle this is to separate out the read
> > >>>>>>>>>> code from the network code (eg. bigtable has BigtableService).
> > >>>>>>>>>> If you put the "handle errors/retry logic" code in a separate
> > >>>>>>>>>> MySourceService class, you can test MySourceService on the
> > >>>>>>>>>> wide variety of network errors/data store problems, and then
> > >>>>>>>>>> your main IO transform tests focus on the read behavior and
> > >>>>>>>>>> handling the small set of errors the MySourceService class
> > >>>>>>>>>> will return. (See the P.S. below for a sketch of this.)
> > >>>>>>>>>>
> > >>>>>>>>>> I also think we should focus on testing the IO Transform, not
> > >>>>>>>>>> the data store - if we kill a node in a data store, it's that
> > >>>>>>>>>> data store's problem, not beam's problem. As you were pointing
> > >>>>>>>>>> out, there are a *large* number of possible ways that a
> > >>>>>>>>>> particular data store can fail, and we would like to support
> > >>>>>>>>>> many different data stores. Rather than try to test that each
> > >>>>>>>>>> data store behaves well, we should ensure that we handle
> > >>>>>>>>>> generic/expected errors in a graceful manner.
> > >>>>>>>>>>
> > >>>>>>>>>> Ismaël had a couple other quick comments/questions, I'll
> > >>>>>>>>>> answer here -
> > >>>>>>>>>> We can use this to test other runners running on multiple
> > >>>>>>>>>> machines - I agree. This is also necessary for a good
> > >>>>>>>>>> performance benchmark test.
> > >>>>>>>>>>
> > >>>>>>>>>> "providing the test machines to mount the cluster" - we can
> > >>>>>>>>>> discuss this further, but one possible option is that google
> > >>>>>>>>>> may be willing to donate something to support this.
> > >>>>>>>>>>
> > >>>>>>>>>> "IO Consistency" - let's follow up on those questions in
> > >>>>>>>>>> another thread. That's as much about the public interface we
> > >>>>>>>>>> provide to users as anything else. I agree with your sentiment
> > >>>>>>>>>> that a user should be able to expect predictable behavior from
> > >>>>>>>>>> the different IO transforms.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for everyone's questions/comments - I really am excited
> > >>>>>>>>>> to see that people care about this :)
> > >>>>>>>>>>
> > >>>>>>>>>> Stephen
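> > >>>>>>>>>>
> > >>>>>>>>>> P.S. To illustrate the service-separation idea above in code -
> > >>>>>>>>>> BigtableService is the real precedent, everything else here is
> > >>>>>>>>>> an invented sketch:
> > >>>>>>>>>>
> > >>>>>>>>>>     import java.io.IOException;
> > >>>>>>>>>>     import java.util.Collections;
> > >>>>>>>>>>     import java.util.List;
> > >>>>>>>>>>
> > >>>>>>>>>>     // The service owns connections, retries and error handling.
> > >>>>>>>>>>     interface MySourceService {
> > >>>>>>>>>>       List<String> readRecords(String query) throws IOException;
> > >>>>>>>>>>     }
> > >>>>>>>>>>
> > >>>>>>>>>>     // Unit tests hand the transform a fake service that injects
> > >>>>>>>>>>     // the failures we care about - no real cluster needed.
> > >>>>>>>>>>     class FlakyServiceFake implements MySourceService {
> > >>>>>>>>>>       private int calls = 0;
> > >>>>>>>>>>       @Override
> > >>>>>>>>>>       public List<String> readRecords(String query)
> > >>>>>>>>>>           throws IOException {
> > >>>>>>>>>>         if (calls++ == 0) {
> > >>>>>>>>>>           throw new IOException("simulated network partition");
> > >>>>>>>>>>         }
> > >>>>>>>>>>         return Collections.singletonList("record-1");
> > >>>>>>>>>>       }
> > >>>>>>>>>>     }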
> > >>>>>>>>>>
> > >>>>>>>>>> On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía
> > >>>>>>>>>> <iemejia@gmail.com> wrote:
> > >>>>>
> > >>>>>> ​Hello,
> > >>>>>>>>>>>
> > >>>>>>>>>>> @Stephen Thanks for your proposal, it is really interesting,
> I
> > >>>>>>>>>>> would
> > >>>>>>>>>>> really
> > >>>>>>>>>>> like to help with this. I have never played with Kubernetes
> but
> > >>>>>>>>>>> this
> > >>>>>>>>>>> seems
> > >>>>>>>>>>> a really nice chance to do something useful with it.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We (at Talend) are testing most of the IOs using simple
> > container
> > >>>>>>>>>>>
> > >>>>>>>>>> images
> > >>>>>>>>
> > >>>>>>>>> and in some particular cases ‘clusters’ of containers using
> > >>>>>>>>>>> docker-compose
> > >>>>>>>>>>> (a little bit like Amit’s (2) proposal). It would be really
> > >>>>>>>>>>> nice to
> > >>>>>>>>>>>
> > >>>>>>>>>> have
> > >>>>>>>>
> > >>>>>>>>> this at the Beam level, in particular to try to test more
> complex
> > >>>>>>>>>>> semantics, I don’t know how programmable kubernetes is to
> > achieve
> > >>>>>>>>>>> this for
> > >>>>>>>>>>> example:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Let’s think we have a cluster of Cassandra or Kafka nodes, I
> > >>>>>>>>>>> would
> > >>>>>>>>>>> like to
> > >>>>>>>>>>> have programmatic tests to simulate failure (e.g. kill a
> node),
> > >>>>>>>>>>> or
> > >>>>>>>>>>> simulate
> > >>>>>>>>>>> a really slow node, to ensure that the IO behaves as expected
> > >>>>>>>>>>> in the
> > >>>>>>>>>>> Beam
> > >>>>>>>>>>> pipeline for the given runner.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Another related idea is to improve IO consistency: Today the
> > >>>>>>>>>>> different IOs
> > >>>>>>>>>>> have small differences in their failure behavior, I really
> > >>>>>>>>>>> would like
> > >>>>>>>>>>> to be
> > >>>>>>>>>>> able to predict with more precision what will happen in case
> of
> > >>>>>>>>>>>
> > >>>>>>>>>> errors,
> > >>>>>>
> > >>>>>>> e.g. what is the correct behavior if I am writing to a Kafka
> > >>>>>>>>>>> node and
> > >>>>>>>>>>> there
> > >>>>>>>>>>> is a network partition, does the Kafka sink retries or no ?
> and
> > >>>>>>>>>>> what
> > >>>>>>>>>>> if it
> > >>>>>>>>>>> is the JdbcIO ?, will it work the same e.g. assuming
> > >>>>>>>>>>> checkpointing?
> > >>>>>>>>>>> Or do
> > >>>>>>>>>>> we guarantee exactly once writes somehow?, today I am not
> sure
> > >>>>>>>>>>> about
> > >>>>>>>>>>> what
> > >>>>>>>>>>> happens (or if the expected behavior depends on the runner),
> > >>>>>>>>>>> but well
> > >>>>>>>>>>> maybe
> > >>>>>>>>>>> it is just that I don’t know and we have tests to ensure
> this.
>>
>> Of course both are really hard problems, but I think with your proposal
>> we can try to tackle them, as well as the performance ones. And apart
>> from the data stores, I think it would also be really nice to be able
>> to test the runners in a distributed manner.
>>
>> So what is the next step? How do you imagine such integration tests?
>> Who can provide the test machines so we can mount the cluster?
>>
>> Maybe my ideas are a bit too far away for an initial setup, but it will
>> be really nice to start working on this.
>>
>> Ismael
>>
>> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsela33@gmail.com> wrote:
>>> Hi Stephen,
>>>
>>> I was wondering how we plan to use the data stores across executions.
>>> Clearly, it's best to set up a new instance (container) for every
>>> test, running a "standalone" store (say HBase/Cassandra for example),
>>> and once the test is done, tear down the instance. It should also be
>>> agnostic to the runtime environment (e.g., Docker on Kubernetes). I'm
>>> wondering, though, what's the overhead of managing such a deployment,
>>> which could become heavy and complicated as more IOs are supported and
>>> more test cases are introduced.
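>>>
>>> As a sketch of that per-test lifecycle (the image, container name and
>>> port below are just examples, and it only assumes the docker CLI is
>>> available on the test machine):
>>>
>>> import java.util.concurrent.TimeUnit;
>>>
>>> /** Sketch: start a throwaway store container, run the test, tear down. */
>>> public class PerTestContainer {
>>>
>>>   /** Runs a docker command and fails loudly if it does not succeed. */
>>>   static void docker(String... args) throws Exception {
>>>     String[] cmd = new String[args.length + 1];
>>>     cmd[0] = "docker";
>>>     System.arraycopy(args, 0, cmd, 1, args.length);
>>>     Process p = new ProcessBuilder(cmd).inheritIO().start();
>>>     if (!p.waitFor(120, TimeUnit.SECONDS) || p.exitValue() != 0) {
>>>       throw new IllegalStateException("docker command failed");
>>>     }
>>>   }
>>>
>>>   public static void main(String[] args) throws Exception {
>>>     docker("run", "-d", "--name", "io-test-store",
>>>         "-p", "9042:9042", "cassandra:3.0"); // example image/port
>>>     try {
>>>       // ... run the IO transform test against localhost:9042 here ...
>>>     } finally {
>>>       docker("rm", "-f", "io-test-store"); // teardown, even on failure
>>>     }
>>>   }
>>> }
>>>
>>> Multiplying that by every IO and every test case is exactly where the
>>> overhead question comes in.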
>>>
>>> Another way to go would be to have small clusters of different data
>>> stores and run against new "namespaces" (while lazily evicting old
>>> ones), but I think this is less likely, as maintaining a distributed
>>> instance (even a small one) for each data store sounds even more
>>> complex.
>>>
>>> A third approach would be to simply have an "embedded" in-memory
>>> instance of a data store as part of a test that runs against it (such
>>> as an embedded Kafka, though that is not a data store). This is
>>> probably the simplest solution in terms of orchestration, but it
>>> depends on having a proper "embedded" implementation for an IO.
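>>>
>>> A sketch of that pattern as a JUnit 4 rule (EmbeddedStore is a made-up
>>> interface here; each IO would plug in whatever embedded implementation
>>> exists, e.g. an in-process broker or a mini cluster):
>>>
>>> import org.junit.rules.ExternalResource;
>>>
>>> /** Sketch: JUnit rule managing a hypothetical embedded data store. */
>>> public class EmbeddedStoreRule extends ExternalResource {
>>>
>>>   /** Made-up interface over an embedded/in-memory store. */
>>>   public interface EmbeddedStore extends AutoCloseable {
>>>     void start() throws Exception;
>>>     String connectionString();
>>>   }
>>>
>>>   private final EmbeddedStore store;
>>>
>>>   public EmbeddedStoreRule(EmbeddedStore store) {
>>>     this.store = store;
>>>   }
>>>
>>>   @Override
>>>   protected void before() throws Exception {
>>>     store.start(); // spin up the in-memory instance for this test
>>>   }
>>>
>>>   @Override
>>>   protected void after() {
>>>     try {
>>>       store.close(); // tear it down even if the test failed
>>>     } catch (Exception e) {
>>>       throw new RuntimeException(e);
>>>     }
>>>   }
>>>
>>>   public String connectionString() {
>>>     return store.connectionString();
>>>   }
>>> }
>>>
>>> A test would then just declare the rule with @Rule and read the
>>> connection string - no external orchestration at all.
>>>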
>>> Does this make sense to you? Have you considered it?
>>>
>>> Thanks,
>>> Amit
>>>
>>> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <jb@nanthrax.net>
>>> wrote:
>>>> Hi Stephen,
>>>>
>>>> As already discussed a bit together, it sounds great! I especially
>>>> like it as both an integration test platform and good coverage for
>>>> IOs.
>>>>
>>>> I'm very late on this but, as said, I will share my Marathon JSON and
>>>> Mesos docker images with you.
>>>>
>>>> By the way, I started to experiment a bit with kubernetes and swarm,
>>>> but it's not yet complete. I will share what I have on the same github
>>>> repo.
>>>>
>>>> Thanks!
>>>> Regards
>>>> JB
>>>>
>>>> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
>>>>> Hi everyone!
>>>>>
>>>>> Currently we have a good set of unit tests for our IO Transforms -
>>>>> those tend to run against in-memory versions of the data stores.
>>>>> However, we'd like to further increase our test coverage to include
>>>>> running them against real instances of the data stores that the IO
>>>>> Transforms work against (e.g. cassandra, mongodb, kafka, etc…), which
>>>>> means we'll need to have real instances of various data stores.
>>>>>
>>>>> Additionally, if we want to do performance regression detection, it's
>>>>> important to have instances of the services that behave
>>>>> realistically, which isn't true of in-memory or dev versions of the
>>>>> services.
>>>>>
>>>>> Proposed solution
>>>>> -------------------------
>>>>> If we accept this proposal, we would create an infrastructure for
>>>>> running real instances of data stores inside of containers, using
>>>>> container management software like mesos/marathon, kubernetes, docker
>>>>> swarm, etc… to manage the instances.
>>>>>
>>>>> This would enable us to build integration tests and performance tests
>>>>> that run against those real instances (like those that Jason Kuster
>>>>> is proposing elsewhere.)
>>>>>
>>>>> Why do we need one centralized set of instances vs. just having
>>>>> various people host their own instances?
>>>>> -------------------------
>>>>> Reducing flakiness of tests is key. By not having dependencies from
>>>>> the core project on external services/instances of data stores, we
>>>>> have guaranteed access to the services, and the group can fix issues
>>>>> that arise. An exception would be something that has an ops team
>>>>> supporting it (e.g., AWS, Google Cloud or another professionally
>>>>> managed service) - those we trust will be stable.
>>>>>
>>>>> There may be a lot of different data stores needed - how will we
>>>>> maintain them?
>>>>> -------------------------
>>>>> It will take work above and beyond that of a normal set of unit tests
>>>>> to build and maintain integration/performance tests & their data
>>>>> store instances.
>>>>>
>>>>> Setup & maintenance of the data store containers and the data store
>>>>> instances on them must be automated. It also has to be as simple a
>>>>> setup as possible, and we should avoid hand-tweaking the containers -
>>>>> expecting checked-in scripts/dockerfiles is key.
>>>>>
>>>>> Aligned with the community ownership approach of Apache, as members
>>>>> of the community are excited to contribute & maintain those tests and
>>>>> the integration/performance tests, people will be able to step up and
>>>>> do that. If there is no longer support for maintaining a particular
>>>>> set of integration & performance tests and their data store
>>>>> instances, then we can disable those tests. We may document on the
>>>>> website which IO Transforms have current integration/performance
>>>>> tests, so users know what level of testing the various IO Transforms
>>>>> have.
>>>>>
>>>>> What about requirements for the container management software itself?
>>>>> -------------------------
>>>>> * We should have the data store instances themselves in Docker.
>>>>> Docker allows new instances to be spun up in a quick, reproducible
>>>>> way and is fairly platform independent. It has wide support from a
>>>>> variety of different container management services.
>>>>> * As little admin work as possible: crashing instances should be
>>>>> restarted, setup should be simple, and everything possible should be
>>>>> scripted/scriptable (see the readiness-probe sketch after this list).
>>>>> * Logs and test output should be on a publicly available website,
>>>>> without needing to log into the test execution machine. Centralized
>>>>> capture of monitoring info/logs from instances running in the
>>>>> containers would support this. Ideally, this would just be supported
>>>>> by the container software out of the box.
>>>>> * It'd be useful to have good persistent volume support in the
>>>>> container management software so that databases don't have to reload
>>>>> large data sets every time.
>>>>> * The containers may be a place to execute runners themselves if we
>>>>> need larger runner instances, so it should play well with Spark,
>>>>> Flink, etc…
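>>>>>
>>>>> As a readiness-probe sketch for the "restarted/scriptable" bullet
>>>>> above (the host and port are placeholders for whatever store the
>>>>> harness is waiting on):
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.net.InetSocketAddress;
>>>>> import java.net.Socket;
>>>>>
>>>>> /** Sketch: wait until a (re)started instance accepts connections. */
>>>>> public class ReadinessProbe {
>>>>>
>>>>>   /** Polls host:port until it is connectable or the deadline passes. */
>>>>>   public static void awaitReady(String host, int port, long timeoutMs)
>>>>>       throws InterruptedException {
>>>>>     long deadline = System.currentTimeMillis() + timeoutMs;
>>>>>     while (System.currentTimeMillis() < deadline) {
>>>>>       try (Socket socket = new Socket()) {
>>>>>         socket.connect(new InetSocketAddress(host, port), 1000);
>>>>>         return; // port is open, instance is up
>>>>>       } catch (IOException notYet) {
>>>>>         Thread.sleep(1000); // back off and retry
>>>>>       }
>>>>>     }
>>>>>     throw new IllegalStateException(host + ":" + port + " not ready");
>>>>>   }
>>>>>
>>>>>   public static void main(String[] args) throws InterruptedException {
>>>>>     awaitReady("localhost", 9042, 120_000); // placeholder endpoint
>>>>>   }
>>>>> }
>>>>>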
>>>>> As I discussed earlier on the mailing list, it looks like hosting
>>>>> docker containers on kubernetes, docker swarm or mesos+marathon would
>>>>> be a good solution.
>>>>>
>>>>> Thanks,
>>>>> Stephen Sisk
>>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbonofre@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com