Posted to user@flink.apache.org by Juan Rodríguez Hortalá <ju...@gmail.com> on 2019/07/23 05:11:49 UTC

Execution environments for testing: local vs collection vs mini cluster

Hi,

In
https://ci.apache.org/projects/flink/flink-docs-stable/dev/local_execution.html
and
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/runtime/minicluster/MiniCluster.html
I see there are 3 ways to create an execution environment for testing:

   - StreamExecutionEnvironment.createLocalEnvironment and
   ExecutionEnvironment.createLocalEnvironment create an execution environment
   running on a single JVM using different threads.
   - CollectionEnvironment runs on a single JVM on a single thread.
   - I haven't found much documentation on the Mini Cluster, but it
   sounds similar to the Hadoop MiniCluster
   <https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CLIMiniCluster.html>.
   If that is the case, then it would run on many local JVMs, each of them
   running multiple threads.

Am I correct about the Mini Cluster? Is there any additional documentation
about it? I discovered it while looking at the source code of AbstractTestBase,
which is mentioned in
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing.
Also, it looks like launching the mini cluster registers it somewhere, so
subsequent calls to `StreamExecutionEnvironment.getExecutionEnvironment`
return an environment that uses the mini cluster. Is that performed by
`executionEnvironment.setAsContext()` in
https://github.com/apache/flink/blob/master/flink-test-utils-parent/flink-test-utils/src/main/java/org/apache/flink/test/util/MiniClusterWithClientResource.java#L56
? Is that execution environment registration process documented anywhere?

Which test execution environment is recommended for each test use case? For
example, I don't see why I would use CollectionEnvironment when I have the
local environment available and running on several threads. What is a good
use case for CollectionEnvironment?

Are all these 3 environments supported equally, or are any of them
expected to be deprecated?

Are there any additional execution environments that could be useful for
testing on a single host?

Thanks,

Juan

Re: Execution environments for testing: local vs collection vs mini cluster

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.

Hi,

Thanks for your answer. I hadn't noticed that the collection environment
only works for the batch API. It's also nice to know that the mini cluster
is more of an internal tool. The local execution environments for batch and
streaming are working very well for me; I was just curious. Thanks for the
clarifications.

Greetings,

Juan


On Fri, Jul 26, 2019 at 1:32 AM Biao Liu <mm...@gmail.com> wrote:

> Hi Juan,
>
> Sorry for the late reply.
>
> 1. The environments for data stream and data set are not the same. An obvious
> difference is that the data stream environment always has a "Stream" prefix.
> For example, StreamExecutionEnvironment is for data stream, while
> ExecutionEnvironment and CollectionEnvironment are for data set.
>
> You could use "StreamExecutionEnvironment.createLocalEnvironment" to run
> or test a data stream job. Use ExecutionEnvironment.createLocalEnvironment
> or CollectionEnvironment to run or test a data set job.
>
> Actually you could also use
> StreamExecutionEnvironment.getExecutionEnvironment
> or ExecutionEnvironment.getExecutionEnvironment, because they choose the
> local environment automatically if you are running the job standalone (in an
> IDE or by executing the main method directly).
>
> 2. Regarding MiniCluster, IMO it's a bit internal. The MiniCluster runs
> as the backend behind the local environment. I think there is a subtle
> difference in positioning between Flink's mini cluster and Hadoop's.
>
> 3. I will try to answer your questions below.
>
> > Which test execution environment is recommended for each test use case?
> It depends on which mode you are testing, data stream or data set.
>
> > For example I don't see why I would use CollectionEnvironment when I
> > have the local environment available and running on several threads, what
> > is a good use case for CollectionEnvironment?
> In the official document, it says "CollectionEnvironment is a low-overhead
> approach for executing Flink programs". As I don't have much experience with
> data set, I just checked the relevant code. The CollectionEnvironment does
> not seem to start a mini cluster. I believe it executes the job in a lighter
> way. BTW, there is no equivalent environment for data stream.
>
> > Are all these 3 environments supported equally, or are any of them
> > expected to be deprecated?
> They are not the same, as mentioned above.
> If a class is deprecated, it is marked with the "Deprecated" annotation.
>
> > Are there any additional execution environments that could be useful for
> > testing on a single host?
> I would suggest following the official documents [1][2] which you have
> discovered, even though there might be other ways which seem to be
> equivalent, because if you depend on some internal implementation, it might
> change over time without any notice.
>
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing
> [2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/local_execution.html

Re: Execution environments for testing: local vs collection vs mini cluster

Posted by Biao Liu <mm...@gmail.com>.

Hi Juan,

Sorry for the late reply.

1. The environments for data stream and data set are not the same. An obvious
difference is that the data stream environment always has a "Stream" prefix.
For example, StreamExecutionEnvironment is for data stream, while
ExecutionEnvironment and CollectionEnvironment are for data set.

You could use "StreamExecutionEnvironment.createLocalEnvironment" to run or
test a data stream job. Use ExecutionEnvironment.createLocalEnvironment or
CollectionEnvironment to run or test a data set job.

Actually you could also use
StreamExecutionEnvironment.getExecutionEnvironment
or ExecutionEnvironment.getExecutionEnvironment, because they choose the
local environment automatically if you are running the job standalone (in an
IDE or by executing the main method directly).
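
The automatic fallback described above, together with the registration Juan
asked about (`setAsContext()`), can be pictured as a static factory lookup.
The sketch below is a simplified, stand-alone model and not Flink's code:
the class ContextEnvironmentSketch, the Env type and the backend strings are
hypothetical names chosen for illustration, and the real classes may differ
between Flink versions.

```java
import java.util.function.Supplier;

/** Simplified stand-alone model of a "context environment" registry. Not Flink code. */
final class ContextEnvironmentSketch {

    /** Hypothetical stand-in for an execution environment; it only records its backend. */
    static final class Env {
        final String backend;

        Env(String backend) {
            this.backend = backend;
        }
    }

    // Factory registered by a test harness; null means nothing has been registered.
    private static Supplier<Env> contextFactory;

    /** Models a test resource registering its environment as the context. */
    static void setAsContext(Supplier<Env> factory) {
        contextFactory = factory;
    }

    /** Models the test resource removing its registration again. */
    static void unsetAsContext() {
        contextFactory = null;
    }

    /** Models getExecutionEnvironment(): registered factory first, local fallback otherwise. */
    static Env getExecutionEnvironment() {
        return contextFactory != null ? contextFactory.get() : new Env("local");
    }

    public static void main(String[] args) {
        System.out.println(getExecutionEnvironment().backend); // nothing registered yet
        setAsContext(() -> new Env("mini-cluster"));
        System.out.println(getExecutionEnvironment().backend); // registered factory is used
        unsetAsContext();
        System.out.println(getExecutionEnvironment().backend); // fallback applies again
    }
}
```

Under this model, job code that only ever calls getExecutionEnvironment()
needs no changes to run against a test cluster, which is presumably why a
test base class can swap the backend without touching the job.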

2. Regarding MiniCluster, IMO it's a bit internal. The MiniCluster runs
as the backend behind the local environment. I think there is a subtle
difference in positioning between Flink's mini cluster and Hadoop's.

3. I will try to answer your questions below.

> Which test execution environment is recommended for each test use case?
It depends on which mode you are testing, data stream or data set.

> For example I don't see why I would use CollectionEnvironment when I have
> the local environment available and running on several threads, what is a
> good use case for CollectionEnvironment?
In the official document, it says "CollectionEnvironment is a low-overhead
approach for executing Flink programs". As I don't have much experience with
data set, I just checked the relevant code. The CollectionEnvironment does not
seem to start a mini cluster. I believe it executes the job in a lighter way.
BTW, there is no equivalent environment for data stream.
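
As a rough analogy for the difference discussed here, the sketch below runs
the same function once directly on the calling thread over an in-memory list
("collection-style") and once on a small pool of worker threads in the same
JVM ("local-style"). It uses only the Java standard library; the class and
method names are hypothetical, and it is not how Flink implements either
environment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Analogy only: same-thread execution versus a worker pool inside one JVM. */
final class CollectionVsLocalSketch {

    /** "Collection-style": apply the function element by element on the calling thread. */
    static <I, O> List<O> runOnCaller(List<I> input, Function<I, O> fn) {
        List<O> out = new ArrayList<>(input.size());
        for (I element : input) {
            out.add(fn.apply(element));
        }
        return out;
    }

    /** "Local-style": hand each element to a pool of worker threads in the same JVM. */
    static <I, O> List<O> runOnWorkers(List<I> input, Function<I, O> fn, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I element : input) {
                Callable<O> task = () -> fn.apply(element);
                futures.add(pool.submit(task));
            }
            List<O> out = new ArrayList<>(input.size());
            for (Future<O> future : futures) {
                out.add(future.get()); // collecting in submission order keeps input order
            }
            return out;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker task failed", e);
        } finally {
            pool.shutdown(); // the pool is extra setup and teardown the caller-only path avoids
        }
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        System.out.println(runOnCaller(input, x -> x * 2));
        System.out.println(runOnWorkers(input, x -> x * 2, 2));
        // Thread names show where each element was processed.
        System.out.println(runOnCaller(input, x -> Thread.currentThread().getName()));
        System.out.println(runOnWorkers(input, x -> Thread.currentThread().getName(), 2));
    }
}
```

The caller-only path has no threads to start or stop, which is one plausible
reading of "low-overhead" for small test inputs.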

> Are all these 3 environments supported equally, or are any of them
> expected to be deprecated?
They are not the same, as mentioned above.
If a class is deprecated, it is marked with the "Deprecated" annotation.

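Because java.lang.Deprecated is retained at runtime, the annotation mentioned
above can also be checked by reflection instead of reading the source. The
sketch below is a minimal illustration with two hypothetical nested classes
(OldEnvironment and CurrentEnvironment); with Flink on the classpath, the
same isAnnotationPresent call could presumably be applied to the Flink class
in question.

```java
/** Minimal sketch: detect the @Deprecated annotation on a class via reflection. */
final class DeprecationCheckSketch {

    /** Hypothetical class standing in for an API that has been marked deprecated. */
    @Deprecated
    static final class OldEnvironment {}

    /** Hypothetical class standing in for an API that is still supported. */
    static final class CurrentEnvironment {}

    /** Returns true if the class itself carries the @Deprecated annotation. */
    static boolean isDeprecated(Class<?> type) {
        return type.isAnnotationPresent(Deprecated.class);
    }

    public static void main(String[] args) {
        System.out.println(isDeprecated(OldEnvironment.class));
        System.out.println(isDeprecated(CurrentEnvironment.class));
    }
}
```

Note that this only inspects the class-level annotation; individual methods
can be deprecated separately and would need their own check.
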
> Are there any additional execution environments that could be useful for
> testing on a single host?
I would suggest following the official documents [1][2] which you have
discovered, even though there might be other ways which seem to be
equivalent, because if you depend on some internal implementation, it might
change over time without any notice.


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/stream/testing.html#integration-testing
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/local_execution.html


On Tue, Jul 23, 2019 at 11:30 PM Juan Rodríguez Hortalá <
juan.rodriguez.hortala@gmail.com> wrote:

> Hi Biao,
>
> Thanks for your answer.
>
> 1. Integration tests for my project.
> 2. Both data stream and data sets
>

Re: Execution environments for testing: local vs collection vs mini cluster

Posted by Juan Rodríguez Hortalá <ju...@gmail.com>.

Hi Biao,

Thanks for your answer.

1. Integration tests for my project.
2. Both data stream and data sets



On Mon, Jul 22, 2019 at 11:44 PM Biao Liu <mm...@gmail.com> wrote:

> Hi Juan,
>
> I'm not sure what you really want. Before giving some suggestions, could
> you answer the questions below first?
>
> 1. Do you want to write a unit test (or integration test) case for your
> project or for Flink? Or do you just want to run your job locally?
> 2. Which mode do you want to test? DataStream or DataSet?

Re: Execution environments for testing: local vs collection vs mini cluster

Posted by Biao Liu <mm...@gmail.com>.

Hi Juan,

I'm not sure what you really want. Before giving some suggestions, could
you answer the questions below first?

1. Do you want to write a unit test (or integration test) case for your
project or for Flink? Or do you just want to run your job locally?
2. Which mode do you want to test? DataStream or DataSet?


