Posted to dev@beam.apache.org by Stephen Sisk <si...@google.com.INVALID> on 2017/01/18 00:27:17 UTC

IO Integration tests - concrete proposal

Hi all!

As I've discussed previously on this list[1], ensuring that we have
high-quality IO transforms is important to Beam. We want to do this without
adding too much burden on developers wanting to contribute. Below I have a
concrete proposal for what an IO integration test would look like and an
example integration test[4] that meets those requirements.

Proposal: we should require that an IO transform includes a passing
integration test showing the IO can connect to a real instance of the data
store. We still want/expect comprehensive unit tests on an IO transform,
but we would allow check-ins with just some unit tests in the presence of
an IT.

To support that, we'll require the following pieces associated with an IT:

1. A Dockerfile that can be used to create a running instance of the data
store. We've previously discussed on this list that we would use Docker
images running inside Kubernetes or Mesos[2], and I'd prefer having a
Kubernetes/Mesos script to start a given data store, but for a
single-instance data store, we can take a Dockerfile and use it to create a
simple Kubernetes/Mesos app. If you have questions about how maintaining the
containers long term would work, see [2], where I discussed a detailed plan.

2. Code to load test data into the data store created by #1. This needs to be
self-contained. For now, the easiest way to do this would be to have the code
inside the IT.

3. The IT itself. I propose keeping this inside the same module as the IO
transform, since having all the IO transform ITs in one module could
cause conflicts between different data stores' dependencies.
Integration tests will need connection information pointing to the data
store they are testing. As discussed previously on this list[3], they should
receive that connection information via TestPipelineOptions.
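In Beam itself the connection information in piece 3 would come in through a
TestPipelineOptions sub-interface (the JdbcIOIT example below uses
PostgresTestOptions for this). As a rough, framework-free sketch of the idea -
the option names here are illustrative, not the real ones:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Framework-free sketch of how an IT can receive data store connection
 * info from command-line options. In Beam this would be a
 * TestPipelineOptions sub-interface; option names here are hypothetical.
 */
public class ItConnectionOptions {
  private final Map<String, String> values = new HashMap<>();

  /** Parses args of the form --key=value into an options map. */
  public static ItConnectionOptions fromArgs(String[] args) {
    ItConnectionOptions opts = new ItConnectionOptions();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (arg.startsWith("--") && eq > 2) {
        opts.values.put(arg.substring(2, eq), arg.substring(eq + 1));
      }
    }
    return opts;
  }

  public String get(String key, String defaultValue) {
    return values.getOrDefault(key, defaultValue);
  }

  /** Builds the JDBC URL the test would connect with. */
  public String jdbcUrl() {
    return "jdbc:postgresql://" + get("postgresServerName", "localhost")
        + ":" + get("postgresPort", "5432")
        + "/" + get("postgresDatabaseName", "beam_test");
  }

  public static void main(String[] args) {
    ItConnectionOptions opts = fromArgs(new String[] {
        "--postgresServerName=10.0.0.1", "--postgresPort=5432"});
    System.out.println(opts.jdbcUrl());
  }
}
```

In the real IT, these values would be supplied on the command line when
pointing the test at whatever instance Kubernetes/Mesos brought up.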

I'd like to get something up and running soon so people checking in new IO
transforms can start taking advantage of an IT framework. Thus, there are a
couple of simplifying assumptions in this plan. Pieces of the plan that I
anticipate will evolve:

1. The test data load script - we would like to write these scripts in a
uniform way, and especially to ensure that the test data is cleaned up after
the tests run.

2. Spinning up/down instances - for now, we'd likely need to do this
manually. It'd be good to get an automated process for this. That's
especially critical for performance tests with multiple nodes - there's no
need to keep instances running for that.
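One simplifying convention that helps with both of these evolving pieces is to
namespace each run's test data, so re-runs cannot collide and cleanup is a
single targeted drop. A minimal sketch - the naming scheme here is my own
illustration, not a project convention:

```java
import java.util.UUID;

/**
 * Sketch of per-run namespacing for IT data. Each run writes into a
 * uniquely named table, so repeated runs cannot interfere with each
 * other, and cleanup only needs to drop that one table.
 */
public class ItNamespace {
  /** e.g. beam_it_write_3fa2c81b04d1 - unique for this test run. */
  public static String uniqueTableName(String base) {
    String suffix = UUID.randomUUID().toString().replace("-", "").substring(0, 12);
    return base + "_" + suffix;
  }

  /** Teardown SQL; IF EXISTS keeps it safe even if setup failed early. */
  public static String dropStatement(String table) {
    return "DROP TABLE IF EXISTS " + table;
  }

  public static void main(String[] args) {
    String table = uniqueTableName("beam_it_write");
    System.out.println(table);
    System.out.println(dropStatement(table));
  }
}
```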

Integrating more closely with PKB (PerfKit Benchmarker) would be a good way to
do both of these things, but first let's focus on getting some basic ITs
running.

As a concrete example of this proposal, I've written a JDBC IO IT [4].
JdbcIOTest already did a lot of test setup, so I heavily reused it. The
key pieces:

* The integration test is in JdbcIOIT.

* JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions. We
may move the TestOptions files into a common place so they can be shared
between tests.

* Test data is created/cleaned up inside of the IT.

* kubernetes/mesos scripts - I have provided examples of both under the
"jdbc/src/test/resources" directory, but I'd like us to decide as a project
which container orchestration service we want to use - I'll send mail about
that shortly.
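For the "test data created/cleaned up inside the IT" piece, one property worth
keeping is determinism: if the rows are a pure function of their index, a read
test can recompute exactly what was loaded and assert against it. A hedged
sketch - the schema and value scheme are illustrative, not what JdbcIOIT
actually uses:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of deterministic in-IT test data. Each row is a pure function
 * of the row index, so assertions can recompute the expected contents
 * instead of reading them back from elsewhere.
 */
public class ItTestData {
  public static String createStatement(String table) {
    return "CREATE TABLE " + table + " (id INT, name VARCHAR(32))";
  }

  /** One INSERT per row; name cycles through ten deterministic values. */
  public static List<String> insertStatements(String table, int rowCount) {
    List<String> statements = new ArrayList<>();
    for (int i = 0; i < rowCount; i++) {
      statements.add(
          "INSERT INTO " + table + " VALUES (" + i + ", 'name_" + (i % 10) + "')");
    }
    return statements;
  }

  public static void main(String[] args) {
    for (String s : insertStatements("beam_it_read", 3)) {
      System.out.println(s);
    }
  }
}
```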

thanks!
Stephen

[1] Integration Testing Sources
https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E

[2] Container Orchestration software for hosting data stores
https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E

[3] Some Thoughts on IO Integration Tests
https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E

[4] JDBC IO IT using Postgres
https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - this has
not been reviewed yet, so it may contain code errors, but it does run & pass :)

Re: IO Integration tests - concrete proposal

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Stephen

Yup, it sounds good. My proposal is just to document the best practices for IO a bit.

Thanks !
Regards
JB


Re: IO Integration tests - concrete proposal

Posted by Stephen Sisk <si...@google.com.INVALID>.
hi JB!

"IO Writing Guide" sounds like BEAM-1025 (User guide - "How to create Beam
IO Transforms"), which I've been working on. Let me pull together the stuff
I've been working on into a draft that folks can take a look at. I had an
earlier draft that was more focused on sources/sinks, but since we're moving
away from those, I started a rewrite. I'll aim to share a draft by the end of
the week.

There's also a section about fakes in the testing doc:
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.cykbne9o4iv


Sorry the testing doc and the "how to create" user guide have sat in draft
form for a while; I've wanted to finish up the integration testing environment
for IOs first.

S

On Wed, Jan 25, 2017 at 8:52 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

Hi

It's what I mentioned in a previous email, yup. It should refer to an "IO
Writing Guide" describing the purpose of the service interface, fake/mock, ...

I will tackle that in a PR.

Regards
JB

On Jan 25, 2017, 09:54, at 09:54, Etienne Chauchot <ec...@gmail.com>
wrote:
>Hey Stephen,
>
>That seems perfect!
>
>Another thing, more about software design: maybe you could add in the
>guide comments on what has been discussed on the ML about standardizing
>the use of:
>
>- IOService interface in UT and IT,
>
>- implementations EmbeddedIOService and MockIOService for UT
>
>- implementation RealIOService for IT (name proposal)
>
>if we all agree on these points. Maybe it requires some more
>discussion (methods in the interface, whether almost-passthrough
>implementations - EmbeddedIOService, RealIOService - are needed, ...)
>
>Etienne
>
>
>On 24/01/2017 at 06:47, Stephen Sisk wrote:
>> hey,
>>
>> thanks - these are good questions/thoughts.
>>
>>> I am more reserved on that one regarding flakiness. IMHO, it is
>better to
>> clean in all cases.
>>
>> I strongly agree that we should attempt to clean in each case, and
>the
>> system should support that. I should have stated that more firmly. As
>I
>> think about it more, you're also right that we should just not try to
>do
>> the data loading inside of the test. I amended the guidelines based
>on your
>> comments and put them in the draft "Testing IO transforms in Apache
>Beam"
>> doc that I've been working on [1].
>>
>> Here's that snippet:
>> """
>>
>> For both types of tests (integration and performance), you'll need to
>have
>> scripts that set up your test data - they will be run independent of
>the
>> tests themselves.
>>
>> The Integration and Perf Tests themselves:
>>
>> 1. Can assume the data load script has been run before the test
>>
>> 2. Must work if they are run multiple times without the data load
>script
>> being run in between (ie, they should clean up after themselves or
>use
>> namespacing such that tests don't interfere with one another)
>>
>> 3. Read tests must not load data or clean data
>>
>> 4. Write tests must use another storage location than read tests
>(using
>> namespace/table names/etc.. for example) and if possible clean it
>after
>> each test.
>> """
>>
>> Any other comments?
>>
>> Stephen
>>
>> [1]
>>
>
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>
>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot
><ec...@gmail.com>
>> wrote:
>>
>> Hi Stephen,
>>
>> My comments are inline
>>
>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>> I definitely agree that sharing resources between tests is more
>efficient.
>>>
>>> Etienne - do you think it's necessary to separate the IT from the
>data
>>> loading script?
>> Actually, I see separation between IT and loading script more as an
>> improvement (time- and resource-effective), not as a necessity. Indeed,
>> for now, for example, loading in ES IT is done within the IT (see
>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>
>>> The postgres/JdbcIOIT can use the natural namespacing of
>>> tables and I feel pretty comfortable that will work well over time.
>> You mean using the same table name with a different namespace? But IMHO,
>> it is still the "using another place" that I mentioned: read IT and write IT
>> could use the same table name in different namespaces.
>>>    You
>>> haven't explicitly mentioned it, but I'm assuming that elasticsearch
>>> doesn't allow such namespacing, so that's why you're having to do
>the
>>> separation?
>> Actually in ES, there is no namespace notion, but there is the index name.
>> The index is the entity that stores the documents and that gets split. And
>> there is the document type, which is more like a class definition for the
>> document. So basically, we could have the read IT using readIndex.docType
>> and the write IT using writeIndex.docType.
>>> I'm not trying to discourage separating data load from IT, just
>>> wondering whether it's truly necessary.
>> IMHO, it is more of an optimization, as I mentioned.
>>> I was trying to consolidate what we've discussed down to a few
>guidelines.
>>> I think those are that IO ITs:
>>> 1. Can assume the data load script has been run before the test
>(unless
>> the
>>> data load script is run by the test itself)
>> I agree
>>> 2. Must work if they are run multiple times without the data load
>script
>>> being run in between (ie, they should clean up after themselves or
>use
>>> namespacing such that tests don't interfere with one another)
>> Yes, sure
>>> 3. Tests that generate large amounts of data will attempt to clean
>up
>> after
>>> themselves. (ie, if you just write 100 rows, don't worry about it -
>if you
>>> write 5 GB of data, you'd need to clean up.) We will not assume this
>will
>>> always succeed in cleaning up, but my assumption is that if a
>particular
>>> data store gets into a bad state, we'll just destroy/recreate that
>>> particular data store.
>> I am more reserved on that one regarding flakiness. IMHO, it is
>better
>> to clean in all cases. I mentioned in a thread that sharding in the
>> datastore might change depending on data volume (it is not the case
>for
>> ES because the sharding is defined by configuration) or a
>> shard/partition in the datastore can become so big that it will be
>split
>> more by the IO. Imagine that a test that writes 100 rows does not do
>> cleanup and is run 1,000 times, then the storage entity becomes
>bigger
>> and bigger and it might then be split into more bundles than asserted
>in
>> split tests (either by decision of the datastore or because
>> desiredBundleSize is small)
>>> If the tests follow those assumptions, then that should support all
>the
>>> scenarios I can think of: running data store create + data load
>script
>>> occasionally (say, once a week or month) all the way up to running
>them
>>> once per test run (if we decided to go that far.)
>> Yes, but do we choose to enforce a standard way of coding integration
>> tests such as
>> - loading data is done by an exterior loading script
>> - read tests: do not load data,  do not clean data
>> - write tests: use another storage place than read tests (using
>> namespace for example) and clean it after each test.
>> ?
>>
>> Etienne
>>> S
>>>
>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot
><ec...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Yes, thanks all for these clarifications about testing architecture.
>>>
>>> I agree that points 1 and 2 should be shared between tests as much as
>>> possible. Especially sharing data loading between tests is more
>>> time-effective and resource-effective: tests that need data
>(testRead,
>>> testSplit, ...) will save the loading time, the wait for
>asynchronous
>>> indexation and cleaning time. Just a small comment:
>>>
>>> If we share the data loading between tests, then tests that expect
>an
>>> empty dataset (testWrite, ...), obviously cannot clear the shared
>dataset.
>>>
>>> So they will need to write to a dedicated place (other than read
>tests)
>>> and clean it afterwards.
>>>
>>> I will update ElasticSearch read IT
>>>
>(https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>> so it does not do data loading/cleaning, and the write IT to use another
>>> location than the read IT
>>>
>>> Etienne
>>>
>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>> Hi guys,
>>>>
>>>> First, great e-mail, Stephen: a complete and detailed proposal.
>>>>
>>>> Lukasz raised a good point: it makes sense to be able to leverage
>the
>>>> same "bootstrap" script.
>>>>
>>>> We discussed providing the following in each IO:
>>>> 1. code to load data (java, script, whatever)
>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script,
>...)
>>>> 3. actual integration tests
>>>>
>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>> integration tests for the Python SDK or for the Java SDK.
>>>>
>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>> assertions based on the loaded data, for instance).
>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>> or by Jenkins using a "description" of where the code and script
>are
>>>> located.
>>>>
>>>> So, I think that we can put 1 and 2 in the IO and use "descriptor"
>to
>>>> do the bootstrapping.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>> Since Docker containers can run a script on startup, can we embed the
>>>>> initial data set into that script/container build, so that the same
>>>>> Docker container and initial data set can be used across multiple ITs?
>>>>> For example, if Python and Java both have JdbcIO, it would be nice if
>>>>> they could leverage the same Docker container with the same data set to
>>>>> ensure the same pipeline produces the same results.
>>>>>
>>>>> This would be different from embedding the data in the specific IT
>>>>> implementation and would also create a coupling between ITs from
>>>>> potentially multiple languages.
>>>>>>
>>>>>> [2] Container Orchestration software for hosting data stores
>>>>>>
>https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
>>>>>> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [3] Some Thoughts on IO Integration Tests
>>>>>>
>https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0
>>>>>> 658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> [4] JDBC IO IT using postgres
>>>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc -
>>>>>> have not
>>>>>> been reviewed yet, so may contain code errors, but it does run &
>>>>>> pass :)
>>>>>>
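To make point 3 above concrete: the connection information an IT receives via TestPipelineOptions ultimately has to become a JDBC URL. The tiny helper below sketches that step for the postgres case; the class and method names here are illustrative assumptions, not the actual PostgresTestOptions API in the branch.

```java
// Illustrative only: turning test-option values (host/port/database) into
// the JDBC URL a JdbcIOIT-style test would hand to the driver.
// The class and method names are assumptions, not Beam's actual API.
public class PostgresOptionsSketch {
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // Values that would normally arrive via TestPipelineOptions flags,
        // e.g. --postgresServerName, --postgresDatabaseName (names assumed).
        System.out.println(jdbcUrl("postgres-it-host", 5432, "beam_it_db"));
        // prints jdbc:postgresql://postgres-it-host:5432/beam_it_db
    }
}
```

Keeping the option-to-URL translation in one place like this is what makes it easy to later move the TestOptions files into a shared module, as suggested above.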

Re: IO Integration tests - concrete proposal

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi

It's what I mentioned in a previous email, yup. It should refer to an "IO Writing Guide" describing the purpose of the service interface, fakes/mocks, ...

I will tackle that in a PR.

Regards
JB

On Jan 25, 2017, at 09:54, Etienne Chauchot <ec...@gmail.com> wrote:
>Hey Stephen,
>
>That seems perfect!
>
>Another thing, more about software design: maybe you could add in the
>guide comments what has been discussed on the ML about making standard
>the use of:
>
>- an IOService interface in UT and IT,
>
>- implementations EmbeddedIOService and MockIOService for UT
>
>- an implementation RealIOService for IT (name proposal)
>
>if we all have an agreement on these points. Maybe it requires some
>more discussion (methods in the interface, whether almost-passthrough
>implementations - EmbeddedIOService, RealIOService - are needed, ...)
>
>Etienne
>
>
>On 24/01/2017 at 06:47, Stephen Sisk wrote:
>> hey,
>>
>> thanks - these are good questions/thoughts.
>>
>>> I am more reserved on that one regarding flakiness. IMHO, it is
>>> better to clean in all cases.
>>
>> I strongly agree that we should attempt to clean in each case, and the
>> system should support that. I should have stated that more firmly. As I
>> think about it more, you're also right that we should just not try to
>> do the data loading inside of the test. I amended the guidelines based
>> on your comments and put them in the draft "Testing IO transforms in
>> Apache Beam" doc that I've been working on [1].
>>
>> Here's that snippet:
>> """
>>
>> For both types of tests (integration and performance), you'll need to
>> have scripts that set up your test data - they will be run independent
>> of the tests themselves.
>>
>> The Integration and Perf Tests themselves:
>>
>> 1. Can assume the data load script has been run before the test
>>
>> 2. Must work if they are run multiple times without the data load
>> script being run in between (ie, they should clean up after themselves
>> or use namespacing such that tests don't interfere with one another)
>>
>> 3. Read tests must not load data or clean data
>>
>> 4. Write tests must use another storage location than read tests
>> (using namespaces/table names/etc. for example) and if possible clean
>> it after each test.
>> """
>>
>> Any other comments?
>>
>> Stephen
>>
>> [1]
>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>
>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <ec...@gmail.com>
>> wrote:
>>
>> Hi Stephen,
>>
>> My comments are inline
>>
>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>> I definitely agree that sharing resources between tests is more
>>> efficient.
>>>
>>> Etienne - do you think it's necessary to separate the IT from the
>>> data loading script?
>> Actually, I see separation between IT and loading script more as an
>> improvement (time- and resource-effective) than as a necessity. Indeed,
>> for now, for example, loading in the ES IT is done within the IT (see
>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>
>>> The postgres/JdbcIOIT can use the natural namespacing of
>>> tables and I feel pretty comfortable that will work well over time.
>> You mean using the same table name with a different namespace? But
>> IMHO, it is still the "using another place" that I mentioned; read IT
>> and write IT could use the same table name in different namespaces.
>>> You
>>> haven't explicitly mentioned it, but I'm assuming that elasticsearch
>>> doesn't allow such namespacing, so that's why you're having to do the
>>> separation?
>> Actually in ES, there is no namespace notion, but there is the index
>> name. The index is the entity that stores documents and that gets
>> split. And there is the document type, which is more like a class
>> definition for the document. So basically, we could have the read IT
>> using readIndex.docType and the write IT using writeIndex.docType.
>>> I'm not trying to discourage separating data load from IT, just
>>> wondering whether it's truly necessary.
>> IMHO, it is more of an optimization, as I mentioned.
>>> I was trying to consolidate what we've discussed down to a few
>>> guidelines. I think those are that IO ITs:
>>> 1. Can assume the data load script has been run before the test
>>> (unless the data load script is run by the test itself)
>> I agree
>>> 2. Must work if they are run multiple times without the data load
>>> script being run in between (ie, they should clean up after themselves
>>> or use namespacing such that tests don't interfere with one another)
>> Yes, sure
>>> 3. Tests that generate large amounts of data will attempt to clean up
>>> after themselves. (ie, if you just write 100 rows, don't worry about
>>> it - if you write 5 GB of data, you'd need to clean up.) We will not
>>> assume this will always succeed in cleaning up, but my assumption is
>>> that if a particular data store gets into a bad state, we'll just
>>> destroy/recreate that particular data store.
>> I am more reserved on that one regarding flakiness. IMHO, it is better
>> to clean in all cases. I mentioned in a thread that sharding in the
>> datastore might change depending on data volume (it is not the case
>> for ES because the sharding is defined by configuration), or a
>> shard/partition in the datastore can become so big that it will be
>> split more by the IO. Imagine that a test that writes 100 rows does
>> not do cleanup and is run 1 000 times; then the storage entity becomes
>> bigger and bigger, and it might then be split into more bundles than
>> asserted in split tests (either by decision of the datastore or
>> because desiredBundleSize is small)
>>> If the tests follow those assumptions, then that should support all
>>> the scenarios I can think of: running data store create + data load
>>> script occasionally (say, once a week or month) all the way up to
>>> running them once per test run (if we decided to go that far.)
>> Yes, but do we choose to enforce a standard way of coding integration
>> tests such as:
>> - loading data is done by an exterior loading script
>> - read tests: do not load data, do not clean data
>> - write tests: use another storage place than read tests (using
>> namespaces for example) and clean it after each test.
>> ?
>>
>> Etienne
>>> S
>>>
>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <ec...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Yes, thanks all for these clarifications about testing architecture.
>>>
>>> I agree that points 1 and 2 should be shared between tests as much as
>>> possible. Especially sharing data loading between tests is more
>>> time-effective and resource-effective: tests that need data (testRead,
>>> testSplit, ...) will save the loading time, the wait for asynchronous
>>> indexation and the cleaning time. Just a small comment:
>>>
>>> If we share the data loading between tests, then tests that expect an
>>> empty dataset (testWrite, ...) obviously cannot clear the shared
>>> dataset.
>>>
>>> So they will need to write to a dedicated place (other than read
>>> tests) and clean it afterwards.
>>>
>>> I will update the ElasticSearch read IT
>>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>> to not do data loading/cleaning, and the write IT to use another
>>> location than the read IT
>>>
>>> Etienne
>>>
>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>> Hi guys,
>>>>
>>>> First, great e-mail Stephen: complete and detailed proposal.
>>>>
>>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>>> same "bootstrap" script.
>>>>
>>>> We discussed providing the following in each IO:
>>>> 1. code to load data (java, script, whatever)
>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>>> 3. actual integration tests
>>>>
>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>> integration tests for the Python or the Java SDK.
>>>>
>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>> assertions based on the loaded data, for instance).
>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>> or by Jenkins using a "description" of where the code and script are
>>>> located.
>>>>
>>>> So, I think that we can put 1 and 2 in the IO and use a "descriptor"
>>>> to do the bootstrapping.
>>>>
>>>> Regards
>>>> JB
>>>>
>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>> Since docker containers can run a script on startup, can we embed
>>>>> the initial data set into that script/container build so that the
>>>>> same docker container and initial data set can be used across
>>>>> multiple ITs. For example, if Python and Java both have JdbcIO, it
>>>>> would be nice if they could leverage the same docker container with
>>>>> the same data set to ensure the same pipeline produces the same
>>>>> results?
>>>>>
>>>>> This would be different from embedding the data in the specific IT
>>>>> implementation and would also create a coupling between ITs from
>>>>> potentially multiple languages.
>>>>>
>>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk
>>>>> <si...@google.com.invalid> wrote:

Re: IO Integration tests - concrete proposal

Posted by Etienne Chauchot <ec...@gmail.com>.
Hey Stephen,

That seems perfect!

Another thing, more about software design: maybe you could add in the 
guide comments what has been discussed on the ML about making standard 
the use of:

- IOService interface in UT and IT,

- implementations EmbeddedIOService and MockIOService for UT

- implementation RealIOService for IT (name proposal)

if we all have an agreement on these points. Maybe it requires some more 
discussion (methods in the interface, whether almost-passthrough 
implementations - EmbeddedIOService, RealIOService - are needed, ...)
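In outline, the service-interface pattern proposed above might look like the following sketch. The names (IOService, EmbeddedIOService, RealIOService) are the ones proposed in this thread, not an existing Beam API, and the two methods are placeholders pending the discussion of what the interface should actually contain:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed pattern: tests talk to the data store only through
// IOService; UTs bind an embedded/mock implementation, ITs a real one.
// All names follow this thread's proposal; nothing here exists in Beam.
interface IOService {
    void write(String id, String value);
    String read(String id);
}

// In-memory stand-in a unit test could use instead of a live data store.
class EmbeddedIOService implements IOService {
    private final Map<String, String> store = new HashMap<>();
    public void write(String id, String value) { store.put(id, value); }
    public String read(String id) { return store.get(id); }
}

public class IOServiceSketch {
    public static void main(String[] args) {
        // A RealIOService would instead open a connection to the data store.
        IOService service = new EmbeddedIOService();
        service.write("k1", "v1");
        System.out.println(service.read("k1")); // prints v1
    }
}
```

Whether the real/embedded implementations are worth their near-passthrough nature is exactly the open question raised above.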

Etienne


On 24/01/2017 at 06:47, Stephen Sisk wrote:
> hey,
>
> thanks - these are good questions/thoughts.
>
>> I am more reserved on that one regarding flakiness. IMHO, it is better to
>> clean in all cases.
>
> I strongly agree that we should attempt to clean in each case, and the
> system should support that. I should have stated that more firmly. As I
> think about it more, you're also right that we should just not try to do
> the data loading inside of the test. I amended the guidelines based on your
> comments and put them in the draft "Testing IO transforms in Apache Beam"
> doc that I've been working on [1].
>
> Here's that snippet:
> """
>
> For both types of tests (integration and performance), you'll need to have
> scripts that set up your test data - they will be run independent of the
> tests themselves.
>
> The Integration and Perf Tests themselves:
>
> 1. Can assume the data load script has been run before the test
>
> 2. Must work if they are run multiple times without the data load script
> being run in between (ie, they should clean up after themselves or use
> namespacing such that tests don't interfere with one another)
>
> 3. Read tests must not load data or clean data
>
> 4. Write tests must use another storage location than read tests (using
> namespace/table names/etc.. for example) and if possible clean it after
> each test.
> """
>
> Any other comments?
>
> Stephen
>
> [1]
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>
> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <ec...@gmail.com>
> wrote:
>
> Hi Stephen,
>
> My comments are inline
>
> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>> I definitely agree that sharing resources between tests is more efficient.
>>
>> Etienne - do you think it's necessary to separate the IT from the data
>> loading script?
> Actually, I see separation between IT and loading script more as an
> improvement (time and resource effective) not as a necessity. Indeed,
> for now, for example, loading in ES IT is done within the IT (see
> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>
>> The postgres/JdbcIOIT can use the natural namespacing of
>> tables and I feel pretty comfortable that will work well over time.
> You mean using the same table name with different namespace? But IMHO,
> it is still "using another place" that I mentioned, read IT and write IT
> could use same table name in different namespaces.
>>    You
>> haven't explicitly mentioned it, but I'm assuming that elasticsearch
>> doesn't allow such namespacing, so that's why you're having to do the
>> separation?
> Actually in ES, there is no namespace notion, but there is the index
> name. The index is the entity that stores documents and that gets split.
> And there is the document type, which is more like a class definition
> for the document. So basically, we could have the read IT using
> readIndex.docType and the write IT using writeIndex.docType.
>> I'm not trying to discourage separating data load from IT, just
>> wondering whether it's truly necessary.
> IMHO, it is more of an optimization, as I mentioned.
>> I was trying to consolidate what we've discussed down to a few guidelines.
>> I think those are that IO ITs:
>> 1. Can assume the data load script has been run before the test (unless
> the
>> data load script is run by the test itself)
> I Agree
>> 2. Must work if they are run multiple times without the data load script
>> being run in between (ie, they should clean up after themselves or use
>> namespacing such that tests don't interfere with one another)
> Yes, sure
>> 3. Tests that generate large amounts of data will attempt to clean up
> after
>> themselves. (ie, if you just write 100 rows, don't worry about it - if you
> write 5 GB of data, you'd need to clean up.) We will not assume this will
>> always succeed in cleaning up, but my assumption is that if a particular
>> data store gets into a bad state, we'll just destroy/recreate that
>> particular data store.
> I am more reserved on that one regarding flakiness. IMHO, it is better
> to clean in all cases. I mentioned in a thread that sharding in the
> datastore might change depending on data volume (it is not the case for
> ES because the sharding is defined by configuration) or a
> shard/partition in the datastore can become so big that it will be split
> more by the IO. Imagine that a test that writes 100 rows does not do
> cleanup and is run 1 000 times, then the storage entity becomes bigger
> and bigger and it might then be split into more bundles than asserted in
> split tests (either by decision of the datastore or because
> desiredBundleSize is small)
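The bundle-count concern above can be sketched numerically: a bounded source of some total size split with a given desiredBundleSize yields roughly ceil(size / desiredBundleSize) bundles, so uncleaned data accumulating across runs silently changes the count a split test asserts on. The per-row size and desiredBundleSize below are made-up numbers, purely for illustration:

```java
// Leftover test data breaks split assertions: totalBytes split with
// desiredBundleSizeBytes yields ceil(totalBytes / desiredBundleSizeBytes)
// bundles, and that quotient grows with every uncleaned run. Sizes made up.
public class BundleCountSketch {
    static long bundles(long totalBytes, long desiredBundleSizeBytes) {
        // integer ceiling division
        return (totalBytes + desiredBundleSizeBytes - 1) / desiredBundleSizeBytes;
    }

    public static void main(String[] args) {
        long rowBytes = 1_000;   // assume ~1 KB per row
        long desired = 50_000;   // assume a 50 KB desiredBundleSize
        System.out.println(bundles(100 * rowBytes, desired));          // one run of 100 rows -> 2
        System.out.println(bundles(1_000 * 100 * rowBytes, desired));  // 1 000 uncleaned runs -> 2000
    }
}
```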
>> If the tests follow those assumptions, then that should support all the
>> scenarios I can think of: running data store create + data load script
>> occasionally (say, once a week or month) all the way up to running them
>> once per test run (if we decided to go that far.)
> Yes, but do we choose to enforce a standard way of coding integration
> tests such as
> - loading data is done by an exterior loading script
> - read tests: do not load data, do not clean data
> - write tests: use another storage place than read tests (using
> namespace for example) and clean it after each test.
> ?
>
> Etienne
>> S
>>
>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <ec...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> Yes, thanks all for these clarifications about testing architecture.
>>
>> I agree that point 1 and 2 should be shared between tests as much as
>> possible. Especially sharing data loading between tests is more
>> time-effective and resource-effective: tests that need data (testRead,
>> testSplit, ...) will save the loading time, the wait for asynchronous
>> indexation and cleaning time. Just a small comment:
>>
>> If we share the data loading between tests, then tests that expect an
>> empty dataset (testWrite, ...), obviously cannot clear the shared dataset.
>>
>> So they will need to write to a dedicated place (other than read tests)
>> and clean it afterwards.
>>
>> I will update ElasticSearch read IT
>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>> to not do data loading/cleaning and write IT to use another location
>> than read IT
>>
>> Etienne
>>
>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>> Hi guys,
>>>
>>> First, great e-mail Stephen: complete and detailed proposal.
>>>
>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>> same "bootstrap" script.
>>>
>>> We discussed providing the following in each IO:
>>> 1. code to load data (java, script, whatever)
>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>> 3. actual integration tests
>>>
>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>> integration tests for the Python or the Java SDK.
>>>
>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>> assertions based on the loaded data, for instance).
>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>> or by Jenkins using a "description" of where the code and script are
>>> located.
>>>
>>> So, I think that we can put 1 and 2 in the IO and use "descriptor" to
>>> do the bootstrapping.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>> Since docker containers can run a script on startup, can we embed the
>>>> initial data set into that script/container build so that the same
>>>> docker
>>>> container and initial data set can be used across multiple ITs. For
>>>> example, if Python and Java both have JdbcIO, it would be nice if they
>>>> could leverage the same docker container with the same data set to
>>>> ensure
>>>> the same pipeline produces the same results?
>>>>
>>>> This would be different from embedding the data in the specific IT
>>>> implementation and would also create a coupling between ITs from
>>>> potentially multiple languages.
>>>>
>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <si...@google.com.invalid>
>>>> wrote:
>>>>


Re: IO Integration tests - concrete proposal

Posted by Stephen Sisk <si...@google.com.INVALID>.
hey,

thanks - these are good questions/thoughts.

> I am more reserved on that one regarding flakiness. IMHO, it is better to
> clean in all cases.

I strongly agree that we should attempt to clean in each case, and the
system should support that. I should have stated that more firmly. As I
think about it more, you're also right that we should just not try to do
the data loading inside of the test. I amended the guidelines based on your
comments and put them in the draft "Testing IO transforms in Apache Beam"
doc that I've been working on [1].

Here's that snippet:
"""

For both types of tests (integration and performance), you'll need to have
scripts that set up your test data - they will be run independent of the
tests themselves.

The Integration and Perf Tests themselves:

1. Can assume the data load script has been run before the test

2. Must work if they are run multiple times without the data load script
being run in between (ie, they should clean up after themselves or use
namespacing such that tests don't interfere with one another)

3. Read tests must not load data or clean data

4. Write tests must use another storage location than read tests (using
namespace/table names/etc.. for example) and if possible clean it after
each test.
"""

Any other comments?

Stephen

[1]
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m

On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <ec...@gmail.com>
wrote:

Hi Stephen,

My comments are inline

On 19/01/2017 at 20:32, Stephen Sisk wrote:
> I definitely agree that sharing resources between tests is more efficient.
>
> Etienne - do you think it's necessary to separate the IT from the data
> loading script?
Actually, I see separation between the IT and the loading script more as
an improvement (time- and resource-effective) than as a necessity. Indeed,
for now, for example, loading in ES IT is done within the IT (see
https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)

> The postgres/JdbcIOIT can use the natural namespacing of
> tables and I feel pretty comfortable that will work well over time.
You mean using the same table name with a different namespace? But IMHO,
it is still "using another place" that I mentioned, read IT and write IT
could use same table name in different namespaces.
>   You
> haven't explicitly mentioned it, but I'm assuming that elasticsearch
> doesn't allow such namespacing, so that's why you're having to do the
> separation?
Actually, in ES there is no notion of a namespace, but there is an index
name. The index is the entity that stores the documents and that gets
split into shards. And there is the document type, which is more like a
class definition for the document. So basically, we could have the read
IT use readIndex.docType and the write IT use writeIndex.docType.
> I'm not trying to discourage separating data load from IT, just
> wondering whether it's truly necessary.
IMHO, it is more of an optimization, as I mentioned.
>
> I was trying to consolidate what we've discussed down to a few guidelines.
> I think those are that IO ITs:
> 1. Can assume the data load script has been run before the test (unless
the
> data load script is run by the test itself)
I Agree
> 2. Must work if they are run multiple times without the data load script
> being run in between (ie, they should clean up after themselves or use
> namespacing such that tests don't interfere with one another)
Yes, sure
> 3. Tests that generate large amounts of data will attempt to clean up
after
> themselves. (ie, if you just write 100 rows, don't worry about it - if you
> write 5 GB of data, you'd need to clean up.) We will not assume this will
> always succeed in cleaning up, but my assumption is that if a particular
> data store gets into a bad state, we'll just destroy/recreate that
> particular data store.
I am more reserved on that one regarding flakiness. IMHO, it is better
to clean in all cases. I mentioned in a thread that sharding in the
datastore might change depending on data volume (it is not the case for
ES because the sharding is defined by configuration), or a
shard/partition in the datastore can become so big that it will be split
further by the IO. Imagine that a test that writes 100 rows does no
cleanup and is run 1,000 times: the storage entity becomes bigger and
bigger, and it might then be split into more bundles than asserted in
split tests (either by decision of the datastore or because
desiredBundleSize is small).
>
> If the tests follow those assumptions, then that should support all the
> scenarios I can think of: running data store create + data load script
> occasionally (say, once a week or month) all the way up to running them
> once per test run (if we decided to go that far.)
Yes, but do we choose to enforce a standard way of coding integration
tests, such as:
- loading data is done by an external loading script
- read tests: do not load data, do not clean data
- write tests: use a storage place other than the read tests' (using a
namespace, for example) and clean it after each test.
?

Etienne
>
> S
>
> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <ec...@gmail.com>
> wrote:
>
> Hi,
>
> Yes, thanks all for these clarifications about testing architecture.
>
> I agree that point 1 and 2 should be shared between tests as much as
> possible. Especially sharing data loading between tests is more
> time-effective and resource-effective: tests that need data (testRead,
> testSplit, ...) will save the loading time, the wait for asynchronous
> indexation and cleaning time. Just a small comment:
>
> If we share the data loading between tests, then tests that expect an
> empty dataset (testWrite, ...) obviously cannot clear the shared dataset.
>
> So they will need to write to a dedicated place (other than read tests)
> and clean it afterwards.
>
> I will update ElasticSearch read IT
> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
> to not do data loading/cleaning, and the write IT to use a different
> location than the read IT
>
> Etienne
>
On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>> Hi guys,
>>
>> First, great e-mail Stephen: complete and detailed proposal.
>>
>> Lukasz raised a good point: it makes sense to be able to leverage the
>> same "bootstrap" script.
>>
>> We discussed about providing the following in each IO:
>> 1. code to load data (java, script, whatever)
>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>> 3. actual integration tests
>>
>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>> integration tests for the Python SDK or for the Java SDK.
>>
>> However, 3 may depend on 1 and 2 (the integration tests perform some
>> assertion based on the loaded data for instance).
>> Today, correct me if I'm  wrong, but 1 and 2 will be executed by hand
>> or by Jenkins using a "description" of where the code and script are
>> located.
>>
>> So, I think that we can put 1 and 2 in the IO and use "descriptor" to
>> do the bootstrapping.
>>
>> Regards
>> JB
>>
>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>> Since docker containers can run a script on startup, can we embed the
>>> initial data set into that script/container build so that the same
>>> docker
>>> container and initial data set can be used across multiple ITs. For
>>> example, if Python and Java both have JdbcIO, it would be nice if they
>>> could leverage the same docker container with the same data set to
>>> ensure
>>> the same pipeline produces the same results?
>>>
>>> This would be different from embedding the data in the specific IT
>>> implementation and would also create a coupling between ITs from
>>> potentially multiple languages.
>>>
>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <si...@google.com.invalid>
>>> wrote:
>>>
>>>> Hi all!
>>>>
>>>> As I've discussed previously on this list[1], ensuring that we have
>>>> high
>>>> quality IO Transforms is important to beam. We want to do this without
>>>> adding too much burden on developers wanting to contribute. Below I
>>>> have a
>>>> concrete proposal for what an IO integration test would look like
>>>> and an
>>>> example integration test[4] that meets those requirements.
>>>>
>>>> Proposal: we should require that an IO transform includes a passing
>>>> integration test showing the IO can connect to a real instance of the
>>>> data
>>>> store. We still want/expect comprehensive unit tests on an IO
>>>> transform,
>>>> but we would allow check-ins with just some unit tests in the
>>>> presence of
>>>> an IT.
>>>>
>>>> To support that, we'll require the following pieces associated with
>>>> an IT:
>>>>
>>>> 1. Dockerfile that can be used to create a running instance of the data
>>>> store. We've previously discussed on this list that we would use docker
>>>> images running inside kubernetes or mesos[2], and I'd prefer having a
>>>> kubernetes/mesos script to start a given data store, but for a single
>>>> instance data store, we can take a dockerfile and use it to create a
>>>> simple
>>>> kubernetes/mesos app. If you have questions about how maintaining the
>>>> containers long term would work, check [2] as I discussed a detailed
>>>> plan
>>>> there.
>>>>
>>>> 2. Code to load test data on the data store created by #1. Needs to
>>>> be self
>>>> contained. For now, the easiest way to do this would be to have code
>>>> inside
>>>> of the IT.
>>>>
>>>> 3. The IT. I propose keeping this inside of the same module as the IO
>>>> transform itself since having all the IO transform ITs in one module
>>>> would
>>>> mean there may be conflicts between different data stores'
>>>> dependencies.
>>>> Integration tests will need connection information pointing to the data
>>>> store it is testing. As discussed previously on this list[3], it should
>>>> receive that connection information via TestPipelineOptions.
>>>>
>>>> I'd like to get something up and running soon so people checking in
>>>> new IO
>>>> transforms can start taking advantage of an IT framework. Thus,
>>>> there are a
>>>> couple simplifying assumptions in this plan. Pieces of the plan that I
>>>> anticipate will evolve:
>>>>
>>>> 1. The test data load script - we would like to write these in a
>>>> uniform
>>>> way and especially ensure that the test data is cleaned up after the
>>>> tests
>>>> run.
>>>>
>>>> 2. Spinning up/down instances - for now, we'd likely need to do this
>>>> manually. It'd be good to get an automated process for this. That's
>>>> especially critical for performance tests with multiple nodes -
>>>> there's no
>>>> need to keep instances running for that.
>>>>
>>>> Integrating closer with PKB would be a good way to do both of these
>>>> things,
>>>> but first let's focus on getting some basic ITs running.
>>>>
>>>> As a concrete example of this proposal, I've written JDBC IO IT [4].
>>>> JdbcIOTest already did a lot of test setup, so I heavily re-used it.
>>>> The
>>>> key pieces:
>>>>
>>>> * The integration test is in JdbcIOIT.
>>>>
>>>> * JdbcIOIT reads the TestPipelineOptions defined in
>>>> PostgresTestOptions. We
>>>> may move the TestOptions files into a common place so they can be
>>>> shared
>>>> between tests.
>>>>
>>>> * Test data is created/cleaned up inside of the IT.
>>>>
>>>> * kubernetes/mesos scripts - I have provided examples of both under the
>>>> "jdbc/src/test/resources" directory, but I'd like us to decide as a
>>>> project
>>>> which container orchestration service we want to use - I'll send
>>>> mail about
>>>> that shortly.
>>>>
>>>> thanks!
>>>> Stephen
>>>>
>>>> [1] Integration Testing Sources
>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071a
>>>> a6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>
>>>> [2] Container Orchestration software for hosting data stores
>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
>>>> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>
>>>> [3] Some Thoughts on IO Integration Tests
>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0
>>>> 658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>
>>>> [4] JDBC IO IT using postgres
>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc -
>>>> have not
>>>> been reviewed yet, so may contain code errors, but it does run &
>>>> pass :)
>>>>
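The proposal's point about passing connection information via TestPipelineOptions can be approximated in plain Java for illustration. The flag names and defaults below are hypothetical; in Beam itself these would be getters/setters on a TestPipelineOptions sub-interface (like the PostgresTestOptions mentioned above) rather than hand-rolled parsing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a PostgresTestOptions-style options holder.
class ConnectionOptions {
    final String host;
    final int port;
    final String database;

    ConnectionOptions(String host, int port, String database) {
        this.host = host;
        this.port = port;
        this.database = database;
    }

    // Parses flags of the form --postgresHost=..., mirroring how pipeline
    // options arrive on the command line. Malformed flags are not handled.
    static ConnectionOptions fromArgs(String[] args) {
        Map<String, String> kv = new HashMap<>();
        for (String arg : args) {
            String[] parts = arg.substring(2).split("=", 2); // drop leading "--"
            kv.put(parts[0], parts[1]);
        }
        return new ConnectionOptions(
            kv.getOrDefault("postgresHost", "localhost"),
            Integer.parseInt(kv.getOrDefault("postgresPort", "5432")),
            kv.getOrDefault("postgresDatabase", "beam_test"));
    }

    // The IT would hand this URL to the JDBC transform under test.
    String jdbcUrl() {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }
}
```

The point of the indirection is that the same IT can run against a local docker container or a shared kubernetes-hosted instance just by changing the flags.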

Re: IO Integration tests - concrete proposal

Posted by Etienne Chauchot <ec...@gmail.com>.
Hi Stephen,

My comments are inline

On 19/01/2017 at 20:32, Stephen Sisk wrote:
> I definitely agree that sharing resources between tests is more efficient.
>
> Etienne - do you think it's necessary to separate the IT from the data
> loading script?
Actually, I see separation between the IT and the loading script more as
an improvement (time- and resource-effective) than as a necessity. Indeed,
for now, for example, loading in ES IT is done within the IT (see 
https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)

> The postgres/JdbcIOIT can use the natural namespacing of
> tables and I feel pretty comfortable that will work well over time.
You mean using the same table name with a different namespace? But IMHO,
it is still "using another place" that I mentioned, read IT and write IT 
could use same table name in different namespaces.
>   You
> haven't explicitly mentioned it, but I'm assuming that elasticsearch
> doesn't allow such namespacing, so that's why you're having to do the
> separation?
Actually, in ES there is no notion of a namespace, but there is an index
name. The index is the entity that stores the documents and that gets
split into shards. And there is the document type, which is more like a
class definition for the document. So basically, we could have the read
IT use readIndex.docType and the write IT use writeIndex.docType.
> I'm not trying to discourage separating data load from IT, just
> wondering whether it's truly necessary.
IMHO, it is more of an optimization, as I mentioned.
>
> I was trying to consolidate what we've discussed down to a few guidelines.
> I think those are that IO ITs:
> 1. Can assume the data load script has been run before the test (unless the
> data load script is run by the test itself)
I Agree
> 2. Must work if they are run multiple times without the data load script
> being run in between (ie, they should clean up after themselves or use
> namespacing such that tests don't interfere with one another)
Yes, sure
> 3. Tests that generate large amounts of data will attempt to clean up after
> themselves. (ie, if you just write 100 rows, don't worry about it - if you
> write 5 gb of data, you'd need to clean up.) We will not assume this will
> always succeed in cleaning up, but my assumption is that if a particular
> data store gets into a bad state, we'll just destroy/recreate that
> particular data store.
I am more reserved on that one regarding flakiness. IMHO, it is better
to clean in all cases. I mentioned in a thread that sharding in the
datastore might change depending on data volume (it is not the case for
ES because the sharding is defined by configuration), or a
shard/partition in the datastore can become so big that it will be split
further by the IO. Imagine that a test that writes 100 rows does no
cleanup and is run 1,000 times: the storage entity becomes bigger and
bigger, and it might then be split into more bundles than asserted in
split tests (either by decision of the datastore or because
desiredBundleSize is small).
>
> If the tests follow those assumptions, then that should support all the
> scenarios I can think of: running data store create + data load script
> occasionally (say, once a week or month) all the way up to running them
> once per test run (if we decided to go that far.)
Yes, but do we choose to enforce a standard way of coding integration
tests, such as:
- loading data is done by an external loading script
- read tests: do not load data, do not clean data
- write tests: use a storage place other than the read tests' (using a
namespace, for example) and clean it after each test.
?

Etienne
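The read/write separation described above can boil down to a naming convention. The index and docType names here are invented for illustration:

```java
// Sketch of the Elasticsearch index separation discussed above: read ITs and
// write ITs target distinct indices, so a write test can wipe its own index
// without disturbing the pre-loaded read data. Names are hypothetical.
class EsTestTargets {
    static final String DOC_TYPE = "test_doc";

    // Returns the "index/docType" target for a read or a write test.
    static String target(boolean isWriteTest) {
        String index = isWriteTest ? "beam_write_index" : "beam_read_index";
        return index + "/" + DOC_TYPE;
    }
}
```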


Re: IO Integration tests - concrete proposal

Posted by Stephen Sisk <si...@google.com.INVALID>.
I definitely agree that sharing resources between tests is more efficient.

Etienne - do you think it's necessary to separate the IT from the data
loading script? The postgres/JdbcIOIT can use the natural namespacing of
tables and I feel pretty comfortable that will work well over time. You
haven't explicitly mentioned it, but I'm assuming that elasticsearch
doesn't allow such namespacing, so that's why you're having to do the
separation? I'm not trying to discourage separating data load from IT, just
wondering whether it's truly necessary.

I was trying to consolidate what we've discussed down to a few guidelines.
I think those are that IO ITs:
1. Can assume the data load script has been run before the test (unless the
data load script is run by the test itself)
2. Must work if they are run multiple times without the data load script
being run in between (ie, they should clean up after themselves or use
namespacing such that tests don't interfere with one another)
3. Tests that generate large amounts of data will attempt to clean up after
themselves. (ie, if you just write 100 rows, don't worry about it - if you
write 5 GB of data, you'd need to clean up.) We will not assume this will
always succeed in cleaning up, but my assumption is that if a particular
data store gets into a bad state, we'll just destroy/recreate that
particular data store.

If the tests follow those assumptions, then that should support all the
scenarios I can think of: running data store create + data load script
occasionally (say, once a week or month) all the way up to running them
once per test run (if we decided to go that far.)

S
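As a sketch of what the "data store create + data load" step might emit for the Postgres case — the table layout is invented, and a real load script would execute each statement over a java.sql connection to the instance the kubernetes script started:

```java
import java.util.ArrayList;
import java.util.List;

// Builds the SQL a standalone Postgres load script could run. Kept as pure
// string generation so the shape of the step is visible without a live DB.
class DataLoadSketch {
    static List<String> loadStatements(String table, int rows) {
        List<String> sql = new ArrayList<>();
        sql.add("CREATE TABLE IF NOT EXISTS " + table
            + " (id INT PRIMARY KEY, name VARCHAR(64))");
        for (int i = 0; i < rows; i++) {
            sql.add("INSERT INTO " + table + " VALUES (" + i + ", 'name" + i + "')");
        }
        return sql;
    }
}
```

Because the statement list is deterministic, running the script weekly or once per test run produces the same starting state, which is what lets the ITs assume rule 1 above.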


Re: IO Integration tests - concrete proposal

Posted by Etienne Chauchot <ec...@gmail.com>.
Hi,

Yes, thanks all for these clarifications about testing architecture.

I agree that point 1 and 2 should be shared between tests as much as 
possible. Especially sharing data loading between tests is more 
time-effective and resource-effective: tests that need data (testRead, 
testSplit, ...) will save the loading time, the wait for asynchronous 
indexation and cleaning time. Just a small comment:

If we share the data loading between tests, then tests that expect an
empty dataset (testWrite, ...) obviously cannot clear the shared dataset.

So they will need to write to a dedicated place (other than read tests) 
and clean it afterwards.

I will update ElasticSearch read IT 
(https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT) 
to not do data loading/cleaning, and the write IT to use a different
location than the read IT.

Etienne

Le 18/01/2017 � 13:47, Jean-Baptiste Onofr� a �crit :
> Hi guys,
>
> Firs, great e-mail Stephen: complete and detailed proposal.
>
> Lukasz raised a good point: it makes sense to be able to leverage the 
> same "bootstrap" script.
>
> We discussed about providing the following in each IO:
> 1. code to load data (java, script, whatever)
> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
> 3. actual integration tests
>
> Only 3 is specific to the IO: 1 and 2 can be the same either if we run 
> integration tests for Python or integration tests for Java SDKs.
>
> However, 3 may depend on 1 and 2 (the integration tests perform 
> assertions based on the loaded data, for instance).
> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand 
> or by Jenkins using a "description" of where the code and script are 
> located.
>
> So, I think that we can put 1 and 2 in the IO and use a "descriptor" 
> to do the bootstrapping.
>
> Regards
> JB
>
> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>> Since docker containers can run a script on startup, can we embed the
>> initial data set into that script/container build so that the same 
>> docker
>> container and initial data set can be used across multiple ITs. For
>> example, if Python and Java both have JdbcIO, it would be nice if they
>> could leverage the same docker container with the same data set to 
>> ensure
>> the same pipeline produces the same results?
>>
>> This would be different from embedding the data in the specific IT
>> implementation and would also create a coupling between ITs from
>> potentially multiple languages.


Re: IO Integration tests - concrete proposal

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi guys,

First, great e-mail Stephen: a complete and detailed proposal.

Lukasz raised a good point: it makes sense to be able to leverage the 
same "bootstrap" script.

We discussed providing the following in each IO:
1. code to load data (java, script, whatever)
2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
3. actual integration tests

Only 3 is specific to the IO: 1 and 2 can be the same whether we run 
integration tests for the Python SDK or for the Java SDK.

However, 3 may depend on 1 and 2 (the integration tests perform 
assertions based on the loaded data, for instance).
Today, correct me if I'm wrong, but 1 and 2 will be executed by hand or 
by Jenkins using a "description" of where the code and script are located.

So, I think that we can put 1 and 2 in the IO and use a "descriptor" to 
do the bootstrapping.
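To make the "descriptor" idea concrete, here is one possible shape for 
it (the file format, field names, and paths below are entirely invented 
for illustration, not an agreed convention): a small properties file per 
IO that tells Jenkins where the three pieces live, plus a few lines of 
Java to read it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical descriptor locating the three pieces of an IO's test setup. */
public class IOTestDescriptor {
    final String bootstrapScript; // #2: dockerfile / kubernetes script
    final String dataLoadClass;   // #1: code that loads the test data
    final String integrationTest; // #3: the IT class itself

    IOTestDescriptor(String bootstrapScript, String dataLoadClass,
                     String integrationTest) {
        this.bootstrapScript = bootstrapScript;
        this.dataLoadClass = dataLoadClass;
        this.integrationTest = integrationTest;
    }

    /** Parses a java.util.Properties-style descriptor. */
    static IOTestDescriptor parse(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return new IOTestDescriptor(
            p.getProperty("bootstrapScript"),
            p.getProperty("dataLoadClass"),
            p.getProperty("integrationTest"));
    }

    public static void main(String[] args) throws IOException {
        // Example descriptor for a JDBC IO (paths/names are made up).
        String jdbcDescriptor =
            "bootstrapScript=src/test/resources/kubernetes/postgres.yml\n"
          + "dataLoadClass=org.apache.beam.sdk.io.jdbc.JdbcTestDataSetup\n"
          + "integrationTest=org.apache.beam.sdk.io.jdbc.JdbcIOIT\n";
        IOTestDescriptor d = parse(jdbcDescriptor);
        System.out.println(d.bootstrapScript);
    }
}
```

With something like this, pieces 1 and 2 stay inside the IO's module and 
a generic Jenkins job can find them without per-IO wiring.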

Regards
JB

On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
> Since docker containers can run a script on startup, can we embed the
> initial data set into that script/container build so that the same docker
> container and initial data set can be used across multiple ITs. For
> example, if Python and Java both have JdbcIO, it would be nice if they
> could leverage the same docker container with the same data set to ensure
> the same pipeline produces the same results?
>
> This would be different from embedding the data in the specific IT
> implementation and would also create a coupling between ITs from
> potentially multiple languages.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: IO Integration tests - concrete proposal

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
Since docker containers can run a script on startup, can we embed the
initial data set into that script/container build so that the same docker
container and initial data set can be used across multiple ITs. For
example, if Python and Java both have JdbcIO, it would be nice if they
could leverage the same docker container with the same data set to ensure
the same pipeline produces the same results?

This would be different from embedding the data in the specific IT
implementation and would also create a coupling between ITs from
potentially multiple languages.


Re: IO Integration tests - concrete proposal

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
In addition to Stephen's e-mail, I would like to add some points for IO 
contributors about unit tests, in addition to integration tests (it's 
likely indirectly related ;)).

Depending on the backend, it can be difficult to write unit tests using 
a running service for some IOs.
For instance, I started the implementation of RedisIO and RabbitMqIO. 
It's not really possible to embed Redis or RabbitMQ in the Java unit 
test, as Redis or RabbitMQ depend on the running system (erlang, ruby, ...).
The problem can also happen if the actual service is not easy to 
bootstrap or returns unpredictable results (for instance, I'm working on 
Facebook and Twitter IOs, where the tweets returned during a test might 
change ;)).

In this situation, the unit tests have to be implemented using a "fake" 
service or a mock.

It means that the IO should provide:
- an IOService interface describing the interaction/behavior of the backend
- a Fake/Mock IOService used in the unit tests
- an IOServiceImpl actually using the backend, used both in the IO code 
and in the integration test code.

Dan implemented this in BigTable IO.
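A minimal sketch of that three-piece pattern (the names, such as 
KVService, are invented here purely for illustration; BigTable IO is the 
place to look for a real implementation):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the service-interface pattern for testing an IO without a backend. */
public class IOServiceSketch {

    /** Describes the interaction with the backend; the transform codes against this. */
    interface KVService {
        void put(String key, String value);
        String get(String key);
    }

    /** Fake used by unit tests: no running backend process required. */
    static class FakeKVService implements KVService {
        private final Map<String, String> data = new HashMap<>();
        @Override public void put(String key, String value) { data.put(key, value); }
        @Override public String get(String key) { return data.get(key); }
    }

    /** The real implementation would wrap the actual client and be used by the IT. */
    static class KVServiceImpl implements KVService {
        @Override public void put(String key, String value) {
            throw new UnsupportedOperationException("would call the real backend client");
        }
        @Override public String get(String key) {
            throw new UnsupportedOperationException("would call the real backend client");
        }
    }

    /** Stand-in for the part of the transform that talks to the backend. */
    static String roundTrip(KVService service, String key, String value) {
        service.put(key, value);
        return service.get(key);
    }

    public static void main(String[] args) {
        // Unit test path: exercise the transform logic against the fake.
        System.out.println(roundTrip(new FakeKVService(), "user:1", "etienne"));
    }
}
```

The unit tests inject FakeKVService; the integration test injects 
KVServiceImpl against the containerized backend, so both exercise the 
same transform code.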

I started to do the same in RedisIO, RabbitMqIO and others, along with 
the integration test approach proposed by Stephen.

My $0.01 ;)

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com