Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2014/11/21 09:33:37 UTC

Flink ProgramDriver

Hi guys,
I forgot to ask you whether there's a Flink utility that mimics the Hadoop
ProgramDriver class, which acts as a sort of registry of jobs. Is there
something similar?

Best,
Flavio

-- 

Flavio Pompermaier

Development Department
OKKAM Srl - www.okkam.it

Phone: +(39) 0461 283 702
Fax: +(39) 0461 186 6433
Email: pompermaier@okkam.it
Headquarters: Trento (Italy), via G.B. Trener 8
Registered office: Trento (Italy), via Segantini 23


Re: Flink ProgramDriver

Posted by Stephan Ewen <se...@apache.org>.
The execute() call on the Environment blocks. The future will hence not be
done until the execution is finished...
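For illustration, here is a minimal sketch of the ExecutorService pattern
suggested below: the blocking execute() runs in a background thread, and the
client polls the returned Future for completion. The JobManager host, port,
jar path, and the job body are placeholder assumptions, not taken from this
thread.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BackgroundJobRunner {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // execute() blocks, so run the whole job in a background thread.
        Future<JobExecutionResult> future = pool.submit(new Callable<JobExecutionResult>() {
            @Override
            public JobExecutionResult call() throws Exception {
                ExecutionEnvironment ee = ExecutionEnvironment
                        .createRemoteEnvironment("jobmanager-host", 6123, "/path/to/jobs.jar");
                ee.readTextFile("hdfs:///input").writeAsText("hdfs:///output");
                return ee.execute("job 1"); // returns only when the job is done
            }
        });

        // The future completes only once the blocking execute() has returned.
        while (!future.isDone()) {
            Thread.sleep(1000); // poll, or do other client work here
        }
        JobExecutionResult result = future.get(); // rethrows if the job failed
        System.out.println("Net runtime (ms): " + result.getNetRuntime());
        pool.shutdown();
    }
}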

On Tue, Nov 25, 2014 at 7:00 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Sounds good to me..how do you check for completion from java code?
>
> On Tue, Nov 25, 2014 at 6:56 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Hi!
>>
>> 1) The Remote Executor will automatically transfer the jar, if needed.
>>
>> 2) Background execution is not supported out of the box. I would go for a
>> Java ExecutorService with a FutureTask to kick off tasks in a background
>> thread and allow checking for completion.
>>
>> Stephan
>>
>>
>> On Tue, Nov 25, 2014 at 6:41 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> Do I have to upload the jar from my application to the Flink Job manager
>>> every time?
>>> Do I have to wait for the job to finish? I'd like to start the job
>>> execution, get an id of it and then poll for its status..is that possible?
>>>
>>> On Tue, Nov 25, 2014 at 6:04 PM, Robert Metzger <rm...@apache.org>
>>> wrote:
>>>
>>>> Cool.
>>>>
>>>> So you have basically two options:
>>>> a) use the bin/flink run tool.
>>>> This tool is meant for users to submit a job once. To use that, upload
>>>> the jar to any location in the file system (not HDFS).
>>>> Use ./bin/flink run <pathToJar> -c classNameOfJobYouWantToRun
>>>> <JobArguments> to run the job.
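For example, an invocation following that template might look like this (the
jar path, class name, and job arguments are hypothetical):

./bin/flink run /opt/jobs/my-flink-jobs.jar -c com.example.jobs.JobOne hdfs:///input hdfs:///output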
>>>>
>>>> b) use the RemoteExecutor.
>>>> For using the RemoteExecutor, you don't need to put your jar file
>>>> anywhere in your cluster.
>>>> The only thing you need is the jar file somewhere where the Java
>>>> Application can access it.
>>>> Inside this Java Application, you have something like:
>>>>
>>>> runJobOne(ExecutionEnvironment ee) {
>>>>  ee.readFile( ... );
>>>>  ...
>>>>   ee.execute("job 1");
>>>> }
>>>>
>>>> runJobTwo(Exe ..) {
>>>>  ...
>>>> }
>>>>
>>>>
>>>> main() {
>>>>  ExecutionEnvironment  ee = new Remote execution environment ..
>>>>
>>>>  if(something) {
>>>>      runJobOne(ee);
>>>>  } else if(something else) {
>>>>     runJobTwo(ee);
>>>>  } ...
>>>> }
>>>>
>>>>
>>>> The object returned by the ExecutionEnvironment.execute() call also
>>>> contains information about the final status of the program (failed etc.).
>>>>
>>>> I hope that helps.
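Fleshed out into compilable form, the sketch above might look like the
following. The class name, JobManager host, port, jar path, and the job
bodies are illustrative placeholders, not from this thread.

import org.apache.flink.api.java.ExecutionEnvironment;

public class JobLauncher {

    static void runJobOne(ExecutionEnvironment ee) throws Exception {
        ee.readTextFile("hdfs:///input-one").writeAsText("hdfs:///output-one");
        ee.execute("job 1");
    }

    static void runJobTwo(ExecutionEnvironment ee) throws Exception {
        ee.readTextFile("hdfs:///input-two").writeAsText("hdfs:///output-two");
        ee.execute("job 2");
    }

    public static void main(String[] args) throws Exception {
        // The jar only has to be readable by this client; the remote
        // environment ships it to the cluster when a job is executed.
        ExecutionEnvironment ee = ExecutionEnvironment
                .createRemoteEnvironment("jobmanager-host", 6123, "/path/to/jobs.jar");

        if ("one".equals(args[0])) {
            runJobOne(ee);
        } else if ("two".equals(args[0])) {
            runJobTwo(ee);
        }
    }
}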
>>>>
>>>> On Tue, Nov 25, 2014 at 5:30 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> See inline
>>>>>
>>>>> On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> maybe we need to go a step back because I did not yet fully
>>>>>> understand what you want to do.
>>>>>>
>>>>>> My understanding so far is the following:
>>>>>> - You have a set of jobs that you've written for Flink
>>>>>>
>>>>>
>>>>> Yes, and they are all in the same jar (that I want to put in the
>>>>> cluster somehow)
>>>>>
>>>>> - You have a cluster with Flink running
>>>>>>
>>>>>
>>>>> Yes!
>>>>>
>>>>>
>>>>>> - You have an external client, which is a Java Application that is
>>>>>> controlling when and how the different jobs are launched. The client is
>>>>>> running basically 24/7 or started by a cronjob.
>>>>>>
>>>>>
>>>>> I have a Java application somewhere that triggers the execution of one
>>>>> of the available jobs in the jar (so I need to pass also the necessary
>>>>> arguments required by each job) and then monitor if the job has been put
>>>>> into a running state and its status (running/failed/finished and percentage
>>>>> would be awesome).
>>>>> I don't think RemoteExecutor is enough..am I wrong?
>>>>>
>>>>>
>>>>>> Correct me if these assumptions are wrong. If they are true, the
>>>>>> RemoteExecutor is probably what you are looking for. Otherwise, we have to
>>>>>> find another solution.
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Hi Robert,
>>>>>>> I tried to look at the RemoteExecutor but I can't understand what
>>>>>>> are the exact steps to:
>>>>>>> 1 - (upload if necessary and) register a jar containing multiple
>>>>>>> main methods (one for each job)
>>>>>>> 2 - start the execution of a job from a client
>>>>>>> 3 - monitor the execution of the job
>>>>>>>
>>>>>>> Could you give me the exact java commands/snippets to do that?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rmetzger@apache.org
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> +1 for providing some utilities/tools for application developers.
>>>>>>>> This could include something like an application registry. I also
>>>>>>>> think that almost every user needs something to parse command line
>>>>>>>> arguments (including default values and comprehensive error messages).
>>>>>>>> We should also see if we can document and properly expose the
>>>>>>>> FileSystem abstraction to Flink app programmers. Users sometimes need to
>>>>>>>> manipulate files directly.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regarding your second question:
>>>>>>>> For deploying a jar on your cluster, you can use the "bin/flink run
>>>>>>>> <JAR FILE>" command.
>>>>>>>> For starting a Job from an external client you can use the
>>>>>>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>>>>>>> that). Here is some documentation on that:
>>>>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> That was exactly what I was looking for. In my case it is not a
>>>>>>>>> problem to use the Hadoop version because I work on Hadoop. Don't you think
>>>>>>>>> it could be useful to add a Flink ProgramDriver so that you can use it both
>>>>>>>>> for Hadoop and native Flink jobs?
>>>>>>>>>
>>>>>>>>> Now that I understood how to bundle together a bunch of jobs, my
>>>>>>>>> next objective will be to deploy the jar on the cluster (similarly to what
>>>>>>>>> the webclient does) and then start the jobs from my external client (which
>>>>>>>>> in theory just needs to know the jar name and the parameters to pass to
>>>>>>>>> every job it wants to call). Do you have an example of that?
>>>>>>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Are you looking for something like
>>>>>>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>>>>>>> ?
>>>>>>>>>>
>>>>>>>>>> You should be able to use the Hadoop ProgramDriver directly, see
>>>>>>>>>> for example here:
>>>>>>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>>>>>>
>>>>>>>>>> If you don't want to introduce a Hadoop dependency in your
>>>>>>>>>> project, you can just copy-paste ProgramDriver, it does not have any
>>>>>>>>>> dependencies to Hadoop classes. That class just accumulates <String,Class>
>>>>>>>>>> pairs (simplifying a bit) and calls the main method of the corresponding
>>>>>>>>>> class.
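As a rough sketch of what such a copy-pasted registry boils down to, here is
a simplified stand-in written for illustration (not the actual Hadoop class;
the class and method names are made up): it maps program names to classes and
reflectively calls the chosen class's main() with the remaining arguments.

import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleProgramDriver {
    private final Map<String, Class<?>> programs = new LinkedHashMap<String, Class<?>>();

    public void addClass(String name, Class<?> mainClass) {
        programs.put(name, mainClass);
    }

    // Picks the program by its registered name and calls its main()
    // with the remaining arguments.
    public void run(String[] args) throws Exception {
        if (args.length == 0 || !programs.containsKey(args[0])) {
            System.err.println("Valid program names are: " + programs.keySet());
            return;
        }
        String[] programArgs = new String[args.length - 1];
        System.arraycopy(args, 1, programArgs, 0, programArgs.length);
        Method main = programs.get(args[0]).getMethod("main", String[].class);
        main.invoke(null, (Object) programArgs);
    }
}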
>>>>>>>>>>
>>>>>>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not sure I get exactly what this is, but packaging multiple
>>>>>>>>>>> examples in one program is well possible. You can have arbitrary control
>>>>>>>>>>> flow in the main() method.
>>>>>>>>>>>
>>>>>>>>>>> Should be well possible to do something like that hadoop
>>>>>>>>>>> examples setup...
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> That was something I used to do with hadoop and it's
>>>>>>>>>>>> comfortable when testing stuff (so it is not so important).
>>>>>>>>>>>> For an example see what happens when you run the old "hadoop
>>>>>>>>>>>> jar hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>>>>>>> invocation of that job.
>>>>>>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>>>>>>> able to start the one I need from an external program.
>>>>>>>>>>>>
>>>>>>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>>>>>>> manage the job execution? That would be very useful..
>>>>>>>>>>>> Is the Client interface the only one that allows something
>>>>>>>>>>>> similar right now?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <sewen@apache.org
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I am not sure exactly what you need there. In Flink you can
>>>>>>>>>>>>> write more than one program in the same program ;-) You can define complex
>>>>>>>>>>>>> flows and execute arbitrarily at intermediate points:
>>>>>>>>>>>>>
>>>>>>>>>>>>> main() {
>>>>>>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>>>>>>   env.execute();
>>>>>>>>>>>>>
>>>>>>>>>>>>>   env.readTheNextThing().doSomething();
>>>>>>>>>>>>>   env.execute();
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can also just "save" a program and keep it for later
>>>>>>>>>>>>> execution:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>>>>>>
>>>>>>>>>>>>> at a later point you can start that plan: new
>>>>>>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
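Assembled into a compilable sketch of that deferred execution (host, port,
and paths are placeholders; in the 0.7 API the method on RemoteExecutor is
executePlan(...), and jars with UDF classes can be passed to its constructor
if the program needs them):

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.Plan;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.client.RemoteExecutor;

public class DeferredPlan {
    public static void main(String[] args) throws Exception {
        // Build the program, but do not call env.execute().
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.readTextFile("hdfs:///input").writeAsText("hdfs:///output");
        Plan plan = env.createProgramPlan();

        // ... later, ship the saved plan to the cluster.
        RemoteExecutor executor = new RemoteExecutor("jobmanager-host", 6123);
        JobExecutionResult result = executor.executePlan(plan);
        System.out.println("Net runtime (ms): " + result.getNetRuntime());
    }
}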
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stephan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any help on this? :(
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate
>>>>>>>>>>>>>>> the Hadoop ProgramDriver class that acts somehow like a registry of jobs.
>>>>>>>>>>>>>>> Is there something similar?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Flavio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
Sounds good to me..how do you check for completion from java code?

On Tue, Nov 25, 2014 at 6:56 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> 1) The Remote Executor will automatically transfer the jar, if needed.
>
> 2) Background execution is not supported out of the box. I would go for a
> Java ExecutorService with a FutureTask to kick of tasks in a background
> thread and allow to check for completion.
>
> Stephan
>
>
> On Tue, Nov 25, 2014 at 6:41 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Do I have to upload the jar from my application to the Flink Job manager
>> every time?
>> Do I have to wait the job to finish? I'd like to start the job execution,
>> get an id of it and then poll for its status..is that possible?
>>
>> On Tue, Nov 25, 2014 at 6:04 PM, Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Cool.
>>>
>>> So you have basically two options:
>>> a) use the bin/flink run tool.
>>> This tool is meant for users to submit a job once. To use that, upload
>>> the jar to any location in the file system (not HDFS).
>>> use ./bin/flink run <pathToJar> -c classNameOfJobYouWantToRun
>>> <JobArguments>
>>> to run the job.
>>>
>>> b) use the RemoteExecutor.
>>> For using the remove Executor, you don't need to put your jar file
>>> anywhere in your cluster.
>>> The only thing you need is the jar file somewhere were the Java
>>> Application can access it.
>>> Inside this Java Application, you have something like:
>>>
>>> runJobOne(ExecutionEnvironment ee) {
>>>  ee.readFile( ... );
>>>  ...
>>>   ee.execute("job 1");
>>> }
>>>
>>> runJobTwo(Exe ..) {
>>>  ...
>>> }
>>>
>>>
>>> main() {
>>>  ExecutionEnvironment  ee = new Remote execution environment ..
>>>
>>>  if(something) {
>>>      runJobOne(ee);
>>>  } else if(something else) {
>>>     runJobTwo(ee);
>>>  } ...
>>> }
>>>
>>>
>>> The object returned by the ExecutionEnvironment.execute() call also
>>> contains information about the final status of the program (failed etc.).
>>>
>>> I hope that helps.
>>>
>>> On Tue, Nov 25, 2014 at 5:30 PM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> See inline
>>>>
>>>> On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> maybe we need to go a step back because I did not yet fully understand
>>>>> what you want to do.
>>>>>
>>>>> My understanding so far is the following:
>>>>> - You have a set of jobs that you've written for Flink
>>>>>
>>>>
>>>> Yes, and they are all in the same jar (that I want to put in the
>>>> cluster somehow)
>>>>
>>>> - You have a cluster with Flink running
>>>>>
>>>>
>>>> Yes!
>>>>
>>>>
>>>>> - You have an external client, which is a Java Application that is
>>>>> controlling when and how the different jobs are launched. The client is
>>>>> running basically 24/7 or started by a cronjob.
>>>>>
>>>>
>>>> I have a Java application somewhere that triggers the execution of one
>>>> of the available jobs in the jar (so I need to pass also the necessary
>>>> arguments required by each job) and then monitor if the job has been put
>>>> into a running state and its status (running/failed/finished and percentage
>>>> would be awesome).
>>>> I don't think RemoteExecutor is enough..am I wrong?
>>>>
>>>>
>>>>> Correct me if these assumptions are wrong. If they are true, the
>>>>> RemoteExecutor is probably what you are looking for. Otherwise, we have to
>>>>> find another solution.
>>>>>
>>>>>
>>>>> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Hi Robert,
>>>>>> I tried to look at the RemoteExecutor but I can't understand what are
>>>>>> the exact steps to:
>>>>>> 1 - (upload if necessary and) register a jar containing multiple main
>>>>>> methods (one for each job)
>>>>>> 2 - start the execution of a job from a client
>>>>>> 3 - monitor the execution of the job
>>>>>>
>>>>>> Could you give me the exact java commands/snippets to do that?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 for providing some utilities/tools for application developers.
>>>>>>> This could include something like an application registry. I also
>>>>>>> think that almost every user needs something to parse command line
>>>>>>> arguments (including default values and comprehensive error messages).
>>>>>>> We should also see if we can document and properly expose the
>>>>>>> FileSystem abstraction to Flink app programmers. Users sometimes need to do
>>>>>>> manipulate files directly.
>>>>>>>
>>>>>>>
>>>>>>> Regarding your second question:
>>>>>>> For deploying a jar on your cluster, you can use the "bin/flink run
>>>>>>> <JAR FILE>" command.
>>>>>>> For starting a Job from an external client you can use the
>>>>>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>>>>>> that). Here is some documentation on that:
>>>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>
>>>>>>>> That was exactly what I was looking for. In my case it is not a
>>>>>>>> problem to use hadoop version because I work on Hadoop. Don't you think it
>>>>>>>> could be useful to add a Flink ProgramDriver so that you can use it both
>>>>>>>> for hadoop and native-flink jobs?
>>>>>>>>
>>>>>>>> Now that I understood how to bundle together a bunch of jobs, my
>>>>>>>> next objective will be to deploy the jar on the cluster (similarity to what
>>>>>>>> tge webclient does) and then start the jobs from my external client (which
>>>>>>>> in theory just need to know the jar name and the parameters to pass to
>>>>>>>> every job it wants to call). Do you have an example of that?
>>>>>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Are you looking for something like
>>>>>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>>>>>> ?
>>>>>>>>>
>>>>>>>>> You should be able to use the Hadoop ProgramDriver directly, see
>>>>>>>>> for example here:
>>>>>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>>>>>
>>>>>>>>> If you don't want to introduce a Hadoop dependency in your
>>>>>>>>> project, you can just copy-paste ProgramDriver, it does not have any
>>>>>>>>> dependencies to Hadoop classes. That class just accumulates <String,Class>
>>>>>>>>> pairs (simplifying a bit) and calls the main method of the corresponding
>>>>>>>>> class.
>>>>>>>>>
>>>>>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Not sure I get exactly what this is, but packaging multiple
>>>>>>>>>> examples in one program is well possible. You can have arbitrary control
>>>>>>>>>> flow in the main() method.
>>>>>>>>>>
>>>>>>>>>> Should be well possible to do something like that hadoop examples
>>>>>>>>>> setup...
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> That was something I used to do with hadoop and it's comfortable
>>>>>>>>>>> when testing stuff (so it is not so important).
>>>>>>>>>>> For an example see what happens when you run the old "hadoop jar
>>>>>>>>>>> hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>>>>>> invokation of that job.
>>>>>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>>>>>> able to start the one I need from an external program.
>>>>>>>>>>>
>>>>>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>>>>>> manage the job execution? That would be very useful..
>>>>>>>>>>> Is the Client interface the only one that allow something
>>>>>>>>>>> similar right now?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am not sure exactly what you need there. In Flink you can
>>>>>>>>>>>> write more than one program in the same program ;-) You can define complex
>>>>>>>>>>>> flows and execute arbitrarily at intermediate points:
>>>>>>>>>>>>
>>>>>>>>>>>> main() {
>>>>>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>>>>>
>>>>>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>>>>>   env.execute();
>>>>>>>>>>>>
>>>>>>>>>>>>   env.readTheNextThing().do()Something();
>>>>>>>>>>>>   env.execute();
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> You can also just "save" a program and keep it for later
>>>>>>>>>>>> execution:
>>>>>>>>>>>>
>>>>>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>>>>>
>>>>>>>>>>>> at a later point you can start that plan: new
>>>>>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Stephan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Any help on this? :(
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate
>>>>>>>>>>>>>> the Hadoop ProgramDriver class that acts somehow like a registry of jobs.
>>>>>>>>>>>>>> Is there something similar?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: Flink ProgramDriver

Posted by Stephan Ewen <se...@apache.org>.
Hi!

1) The Remote Executor will automatically transfer the jar, if needed.

2) Background execution is not supported out of the box. I would go for a
Java ExecutorService with a FutureTask to kick of tasks in a background
thread and allow to check for completion.

Stephan


On Tue, Nov 25, 2014 at 6:41 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Do I have to upload the jar from my application to the Flink Job manager
> every time?
> Do I have to wait the job to finish? I'd like to start the job execution,
> get an id of it and then poll for its status..is that possible?
>
> On Tue, Nov 25, 2014 at 6:04 PM, Robert Metzger <rm...@apache.org>
> wrote:
>
>> Cool.
>>
>> So you have basically two options:
>> a) use the bin/flink run tool.
>> This tool is meant for users to submit a job once. To use that, upload
>> the jar to any location in the file system (not HDFS).
>> use ./bin/flink run <pathToJar> -c classNameOfJobYouWantToRun
>> <JobArguments>
>> to run the job.
>>
>> b) use the RemoteExecutor.
>> For using the remove Executor, you don't need to put your jar file
>> anywhere in your cluster.
>> The only thing you need is the jar file somewhere were the Java
>> Application can access it.
>> Inside this Java Application, you have something like:
>>
>> runJobOne(ExecutionEnvironment ee) {
>>  ee.readFile( ... );
>>  ...
>>   ee.execute("job 1");
>> }
>>
>> runJobTwo(Exe ..) {
>>  ...
>> }
>>
>>
>> main() {
>>  ExecutionEnvironment  ee = new Remote execution environment ..
>>
>>  if(something) {
>>      runJobOne(ee);
>>  } else if(something else) {
>>     runJobTwo(ee);
>>  } ...
>> }
>>
>>
>> The object returned by the ExecutionEnvironment.execute() call also
>> contains information about the final status of the program (failed etc.).
>>
>> I hope that helps.
>>
>> On Tue, Nov 25, 2014 at 5:30 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> See inline
>>>
>>> On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> maybe we need to go a step back because I did not yet fully understand
>>>> what you want to do.
>>>>
>>>> My understanding so far is the following:
>>>> - You have a set of jobs that you've written for Flink
>>>>
>>>
>>> Yes, and they are all in the same jar (that I want to put in the cluster
>>> somehow)
>>>
>>> - You have a cluster with Flink running
>>>>
>>>
>>> Yes!
>>>
>>>
>>>> - You have an external client, which is a Java Application that is
>>>> controlling when and how the different jobs are launched. The client is
>>>> running basically 24/7 or started by a cronjob.
>>>>
>>>
>>> I have a Java application somewhere that triggers the execution of one
>>> of the available jobs in the jar (so I need to pass also the necessary
>>> arguments required by each job) and then monitor if the job has been put
>>> into a running state and its status (running/failed/finished and percentage
>>> would be awesome).
>>> I don't think RemoteExecutor is enough..am I wrong?
>>>
>>>
>>>> Correct me if these assumptions are wrong. If they are true, the
>>>> RemoteExecutor is probably what you are looking for. Otherwise, we have to
>>>> find another solution.
>>>>
>>>>
>>>> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi Robert,
>>>>> I tried to look at the RemoteExecutor but I can't understand what are
>>>>> the exact steps to:
>>>>> 1 - (upload if necessary and) register a jar containing multiple main
>>>>> methods (one for each job)
>>>>> 2 - start the execution of a job from a client
>>>>> 3 - monitor the execution of the job
>>>>>
>>>>> Could you give me the exact java commands/snippets to do that?
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> +1 for providing some utilities/tools for application developers.
>>>>>> This could include something like an application registry. I also
>>>>>> think that almost every user needs something to parse command line
>>>>>> arguments (including default values and comprehensive error messages).
>>>>>> We should also see if we can document and properly expose the
>>>>>> FileSystem abstraction to Flink app programmers. Users sometimes need to do
>>>>>> manipulate files directly.
>>>>>>
>>>>>>
>>>>>> Regarding your second question:
>>>>>> For deploying a jar on your cluster, you can use the "bin/flink run
>>>>>> <JAR FILE>" command.
>>>>>> For starting a Job from an external client you can use the
>>>>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>>>>> that). Here is some documentation on that:
>>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> That was exactly what I was looking for. In my case it is not a
>>>>>>> problem to use hadoop version because I work on Hadoop. Don't you think it
>>>>>>> could be useful to add a Flink ProgramDriver so that you can use it both
>>>>>>> for hadoop and native-flink jobs?
>>>>>>>
>>>>>>> Now that I understood how to bundle together a bunch of jobs, my
>>>>>>> next objective will be to deploy the jar on the cluster (similarity to what
>>>>>>> tge webclient does) and then start the jobs from my external client (which
>>>>>>> in theory just need to know the jar name and the parameters to pass to
>>>>>>> every job it wants to call). Do you have an example of that?
>>>>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are you looking for something like
>>>>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>>>>> ?
>>>>>>>>
>>>>>>>> You should be able to use the Hadoop ProgramDriver directly, see
>>>>>>>> for example here:
>>>>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>>>>
>>>>>>>> If you don't want to introduce a Hadoop dependency in your project,
>>>>>>>> you can just copy-paste ProgramDriver, it does not have any dependencies to
>>>>>>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>>>>>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>>>>>>
>>>>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Not sure I get exactly what this is, but packaging multiple
>>>>>>>>> examples in one program is well possible. You can have arbitrary control
>>>>>>>>> flow in the main() method.
>>>>>>>>>
>>>>>>>>> Should be well possible to do something like that hadoop examples
>>>>>>>>> setup...
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>
>>>>>>>>>> That was something I used to do with hadoop and it's comfortable
>>>>>>>>>> when testing stuff (so it is not so important).
>>>>>>>>>> For an example see what happens when you run the old "hadoop jar
>>>>>>>>>> hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>>>>> invokation of that job.
>>>>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>>>>> able to start the one I need from an external program.
>>>>>>>>>>
>>>>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>>>>> manage the job execution? That would be very useful..
>>>>>>>>>> Is the Client interface the only one that allow something
>>>>>>>>>> similar right now?
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am not sure exactly what you need there. In Flink you can
>>>>>>>>>>> write more than one program in the same program ;-) You can define complex
>>>>>>>>>>> flows and execute arbitrarily at intermediate points:
>>>>>>>>>>>
>>>>>>>>>>> main() {
>>>>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>>>>
>>>>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>>>>   env.execute();
>>>>>>>>>>>
>>>>>>>>>>>   env.readTheNextThing().do()Something();
>>>>>>>>>>>   env.execute();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You can also just "save" a program and keep it for later
>>>>>>>>>>> execution:
>>>>>>>>>>>
>>>>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>>>>
>>>>>>>>>>> at a later point you can start that plan: new
>>>>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Stephan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Any help on this? :(
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>>>>>>> there something similar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Flavio
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>
>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
Do I have to upload the jar from my application to the Flink Job manager
every time?
Do I have to wait the job to finish? I'd like to start the job execution,
get an id of it and then poll for its status..is that possible?

On Tue, Nov 25, 2014 at 6:04 PM, Robert Metzger <rm...@apache.org> wrote:

> Cool.
>
> So you have basically two options:
> a) use the bin/flink run tool.
> This tool is meant for users to submit a job once. To use that, upload the
> jar to any location in the file system (not HDFS).
> use ./bin/flink run <pathToJar> -c classNameOfJobYouWantToRun
> <JobArguments>
> to run the job.
>
> b) use the RemoteExecutor.
> For using the remove Executor, you don't need to put your jar file
> anywhere in your cluster.
> The only thing you need is the jar file somewhere were the Java
> Application can access it.
> Inside this Java Application, you have something like:
>
> runJobOne(ExecutionEnvironment ee) {
>  ee.readFile( ... );
>  ...
>   ee.execute("job 1");
> }
>
> runJobTwo(Exe ..) {
>  ...
> }
>
>
> main() {
>  ExecutionEnvironment  ee = new Remote execution environment ..
>
>  if(something) {
>      runJobOne(ee);
>  } else if(something else) {
>     runJobTwo(ee);
>  } ...
> }
>
>
> The object returned by the ExecutionEnvironment.execute() call also
> contains information about the final status of the program (failed etc.).
>
> I hope that helps.
>
> On Tue, Nov 25, 2014 at 5:30 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> See inline
>>
>> On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> Hey,
>>>
>>> maybe we need to go a step back because I did not yet fully understand
>>> what you want to do.
>>>
>>> My understanding so far is the following:
>>> - You have a set of jobs that you've written for Flink
>>>
>>
>> Yes, and they are all in the same jar (that I want to put in the cluster
>> somehow)
>>
>> - You have a cluster with Flink running
>>>
>>
>> Yes!
>>
>>
>>> - You have an external client, which is a Java Application that is
>>> controlling when and how the different jobs are launched. The client is
>>> running basically 24/7 or started by a cronjob.
>>>
>>
>> I have a Java application somewhere that triggers the execution of one of
>> the available jobs in the jar (so I need to pass also the necessary
>> arguments required by each job) and then monitor if the job has been put
>> into a running state and its status (running/failed/finished and percentage
>> would be awesome).
>> I don't think RemoteExecutor is enough..am I wrong?
>>
>>
>>> Correct me if these assumptions are wrong. If they are true, the
>>> RemoteExecutor is probably what you are looking for. Otherwise, we have to
>>> find another solution.
>>>
>>>
>>> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> Hi Robert,
>>>> I tried to look at the RemoteExecutor but I can't understand what are
>>>> the exact steps to:
>>>> 1 - (upload if necessary and) register a jar containing multiple main
>>>> methods (one for each job)
>>>> 2 - start the execution of a job from a client
>>>> 3 - monitor the execution of the job
>>>>
>>>> Could you give me the exact java commands/snippets to do that?
>>>>
>>>>
>>>>
>>>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
>>>> wrote:
>>>>
>>>>> +1 for providing some utilities/tools for application developers.
>>>>> This could include something like an application registry. I also
>>>>> think that almost every user needs something to parse command line
>>>>> arguments (including default values and comprehensive error messages).
>>>>> We should also see if we can document and properly expose the
>>>>> FileSystem abstraction to Flink app programmers. Users sometimes need to do
>>>>> manipulate files directly.
>>>>>
>>>>>
>>>>> Regarding your second question:
>>>>> For deploying a jar on your cluster, you can use the "bin/flink run
>>>>> <JAR FILE>" command.
>>>>> For starting a Job from an external client you can use the
>>>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>>>> that). Here is some documentation on that:
>>>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> That was exactly what I was looking for. In my case it is not a
>>>>>> problem to use hadoop version because I work on Hadoop. Don't you think it
>>>>>> could be useful to add a Flink ProgramDriver so that you can use it both
>>>>>> for hadoop and native-flink jobs?
>>>>>>
>>>>>> Now that I understood how to bundle together a bunch of jobs, my next
>>>>>> objective will be to deploy the jar on the cluster (similarity to what tge
>>>>>> webclient does) and then start the jobs from my external client (which in
>>>>>> theory just need to know the jar name and the parameters to pass to every
>>>>>> job it wants to call). Do you have an example of that?
>>>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Are you looking for something like
>>>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>>>> ?
>>>>>>>
>>>>>>> You should be able to use the Hadoop ProgramDriver directly, see for
>>>>>>> example here:
>>>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>>>
>>>>>>> If you don't want to introduce a Hadoop dependency in your project,
>>>>>>> you can just copy-paste ProgramDriver, it does not have any dependencies to
>>>>>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>>>>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>>>>>
>>>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Not sure I get exactly what this is, but packaging multiple
>>>>>>>> examples in one program is well possible. You can have arbitrary control
>>>>>>>> flow in the main() method.
>>>>>>>>
>>>>>>>> Should be well possible to do something like that hadoop examples
>>>>>>>> setup...
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> That was something I used to do with hadoop and it's comfortable
>>>>>>>>> when testing stuff (so it is not so important).
>>>>>>>>> For an example see what happens when you run the old "hadoop jar
>>>>>>>>> hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>>>> invokation of that job.
>>>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>>>> able to start the one I need from an external program.
>>>>>>>>>
>>>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>>>> manage the job execution? That would be very useful..
>>>>>>>>> Is the Client interface the only one that allow something similar
>>>>>>>>> right now?
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am not sure exactly what you need there. In Flink you can write
>>>>>>>>>> more than one program in the same program ;-) You can define complex flows
>>>>>>>>>> and execute arbitrarily at intermediate points:
>>>>>>>>>>
>>>>>>>>>> main() {
>>>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>>>
>>>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>>>   env.execute();
>>>>>>>>>>
>>>>>>>>>>   env.readTheNextThing().do()Something();
>>>>>>>>>>   env.execute();
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You can also just "save" a program and keep it for later
>>>>>>>>>> execution:
>>>>>>>>>>
>>>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>>>
>>>>>>>>>> at a later point you can start that plan: new
>>>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Stephan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> Any help on this? :(
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>>>>>> there something similar?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Flavio
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>

Re: Flink ProgramDriver

Posted by Robert Metzger <rm...@apache.org>.
Cool.

So you have basically two options:
a) use the bin/flink run tool.
This tool is meant for users to submit a job once. To use that, upload the
jar to any location in the file system (not HDFS).
use ./bin/flink run <pathToJar> -c classNameOfJobYouWantToRun <JobArguments>
to run the job.

b) use the RemoteExecutor.
For using the remove Executor, you don't need to put your jar file anywhere
in your cluster.
The only thing you need is the jar file somewhere were the Java Application
can access it.
Inside this Java Application, you have something like:

runJobOne(ExecutionEnvironment ee) {
 ee.readFile( ... );
 ...
  ee.execute("job 1");
}

runJobTwo(Exe ..) {
 ...
}


main() {
 ExecutionEnvironment  ee = new Remote execution environment ..

 if(something) {
     runJobOne(ee);
 } else if(something else) {
    runJobTwo(ee);
 } ...
}


The object returned by the ExecutionEnvironment.execute() call also
contains information about the final status of the program (failed etc.).

I hope that helps.

On Tue, Nov 25, 2014 at 5:30 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> See inline
>
> On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org>
> wrote:
>
>> Hey,
>>
>> maybe we need to go a step back because I did not yet fully understand
>> what you want to do.
>>
>> My understanding so far is the following:
>> - You have a set of jobs that you've written for Flink
>>
>
> Yes, and they are all in the same jar (that I want to put in the cluster
> somehow)
>
> - You have a cluster with Flink running
>>
>
> Yes!
>
>
>> - You have an external client, which is a Java Application that is
>> controlling when and how the different jobs are launched. The client is
>> running basically 24/7 or started by a cronjob.
>>
>
> I have a Java application somewhere that triggers the execution of one of
> the available jobs in the jar (so I need to pass also the necessary
> arguments required by each job) and then monitor if the job has been put
> into a running state and its status (running/failed/finished and percentage
> would be awesome).
> I don't think RemoteExecutor is enough..am I wrong?
>
>
>> Correct me if these assumptions are wrong. If they are true, the
>> RemoteExecutor is probably what you are looking for. Otherwise, we have to
>> find another solution.
>>
>>
>> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> Hi Robert,
>>> I tried to look at the RemoteExecutor but I can't understand what are
>>> the exact steps to:
>>> 1 - (upload if necessary and) register a jar containing multiple main
>>> methods (one for each job)
>>> 2 - start the execution of a job from a client
>>> 3 - monitor the execution of the job
>>>
>>> Could you give me the exact java commands/snippets to do that?
>>>
>>>
>>>
>>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
>>> wrote:
>>>
>>>> +1 for providing some utilities/tools for application developers.
>>>> This could include something like an application registry. I also think
>>>> that almost every user needs something to parse command line arguments
>>>> (including default values and comprehensive error messages).
>>>> We should also see if we can document and properly expose the
>>>> FileSystem abstraction to Flink app programmers. Users sometimes need to do
>>>> manipulate files directly.
>>>>
>>>>
>>>> Regarding your second question:
>>>> For deploying a jar on your cluster, you can use the "bin/flink run
>>>> <JAR FILE>" command.
>>>> For starting a Job from an external client you can use the
>>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>>> that). Here is some documentation on that:
>>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> That was exactly what I was looking for. In my case it is not a
>>>>> problem to use hadoop version because I work on Hadoop. Don't you think it
>>>>> could be useful to add a Flink ProgramDriver so that you can use it both
>>>>> for hadoop and native-flink jobs?
>>>>>
>>>>> Now that I understood how to bundle together a bunch of jobs, my next
>>>>> objective will be to deploy the jar on the cluster (similarity to what tge
>>>>> webclient does) and then start the jobs from my external client (which in
>>>>> theory just need to know the jar name and the parameters to pass to every
>>>>> job it wants to call). Do you have an example of that?
>>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:
>>>>>
>>>>>> Are you looking for something like
>>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>>> ?
>>>>>>
>>>>>> You should be able to use the Hadoop ProgramDriver directly, see for
>>>>>> example here:
>>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>>
>>>>>> If you don't want to introduce a Hadoop dependency in your project,
>>>>>> you can just copy-paste ProgramDriver, it does not have any dependencies to
>>>>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>>>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>>>>
>>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Not sure I get exactly what this is, but packaging multiple examples
>>>>>>> in one program is well possible. You can have arbitrary control flow in the
>>>>>>> main() method.
>>>>>>>
>>>>>>> Should be well possible to do something like that hadoop examples
>>>>>>> setup...
>>>>>>>
>>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>
>>>>>>>> That was something I used to do with hadoop and it's comfortable
>>>>>>>> when testing stuff (so it is not so important).
>>>>>>>> For an example see what happens when you run the old "hadoop jar
>>>>>>>> hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>>> invokation of that job.
>>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>>> able to start the one I need from an external program.
>>>>>>>>
>>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>>> manage the job execution? That would be very useful..
>>>>>>>> Is the Client interface the only one that allow something similar
>>>>>>>> right now?
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am not sure exactly what you need there. In Flink you can write
>>>>>>>>> more than one program in the same program ;-) You can define complex flows
>>>>>>>>> and execute arbitrarily at intermediate points:
>>>>>>>>>
>>>>>>>>> main() {
>>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>>
>>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>>   env.execute();
>>>>>>>>>
>>>>>>>>>   env.readTheNextThing().do()Something();
>>>>>>>>>   env.execute();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can also just "save" a program and keep it for later execution:
>>>>>>>>>
>>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>>
>>>>>>>>> at a later point you can start that plan: new
>>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Stephan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>
>>>>>>>>>> Any help on this? :(
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>>>>> there something similar?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Flavio
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
See inline

On Tue, Nov 25, 2014 at 3:37 PM, Robert Metzger <rm...@apache.org> wrote:

> Hey,
>
> maybe we need to go a step back because I did not yet fully understand
> what you want to do.
>
> My understanding so far is the following:
> - You have a set of jobs that you've written for Flink
>

Yes, and they are all in the same jar (that I want to put in the cluster
somehow)

- You have a cluster with Flink running
>

Yes!


> - You have an external client, which is a Java Application that is
> controlling when and how the different jobs are launched. The client is
> running basically 24/7 or started by a cronjob.
>

I have a Java application somewhere that triggers the execution of one of
the available jobs in the jar (so I need to pass also the necessary
arguments required by each job) and then monitor if the job has been put
into a running state and its status (running/failed/finished and percentage
would be awesome).
I don't think RemoteExecutor is enough..am I wrong?


> Correct me if these assumptions are wrong. If they are true, the
> RemoteExecutor is probably what you are looking for. Otherwise, we have to
> find another solution.
>
>
> On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Hi Robert,
>> I tried to look at the RemoteExecutor but I can't understand what are the
>> exact steps to:
>> 1 - (upload if necessary and) register a jar containing multiple main
>> methods (one for each job)
>> 2 - start the execution of a job from a client
>> 3 - monitor the execution of the job
>>
>> Could you give me the exact java commands/snippets to do that?
>>
>>
>>
>> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
>> wrote:
>>
>>> +1 for providing some utilities/tools for application developers.
>>> This could include something like an application registry. I also think
>>> that almost every user needs something to parse command line arguments
>>> (including default values and comprehensive error messages).
>>> We should also see if we can document and properly expose the FileSystem
>>> abstraction to Flink app programmers. Users sometimes need to do manipulate
>>> files directly.
>>>
>>>
>>> Regarding your second question:
>>> For deploying a jar on your cluster, you can use the "bin/flink run <JAR
>>> FILE>" command.
>>> For starting a Job from an external client you can use the
>>> RemoteExecutionEnvironment (you need to know the JobManager address for
>>> that). Here is some documentation on that:
>>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> That was exactly what I was looking for. In my case it is not a problem
>>>> to use hadoop version because I work on Hadoop. Don't you think it could be
>>>> useful to add a Flink ProgramDriver so that you can use it both for hadoop
>>>> and native-flink jobs?
>>>>
>>>> Now that I understood how to bundle together a bunch of jobs, my next
>>>> objective will be to deploy the jar on the cluster (similarity to what tge
>>>> webclient does) and then start the jobs from my external client (which in
>>>> theory just need to know the jar name and the parameters to pass to every
>>>> job it wants to call). Do you have an example of that?
>>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:
>>>>
>>>>> Are you looking for something like
>>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>>> ?
>>>>>
>>>>> You should be able to use the Hadoop ProgramDriver directly, see for
>>>>> example here:
>>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>>
>>>>> If you don't want to introduce a Hadoop dependency in your project,
>>>>> you can just copy-paste ProgramDriver, it does not have any dependencies to
>>>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>>>
>>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Not sure I get exactly what this is, but packaging multiple examples
>>>>>> in one program is well possible. You can have arbitrary control flow in the
>>>>>> main() method.
>>>>>>
>>>>>> Should be well possible to do something like that hadoop examples
>>>>>> setup...
>>>>>>
>>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> That was something I used to do with hadoop and it's comfortable
>>>>>>> when testing stuff (so it is not so important).
>>>>>>> For an example see what happens when you run the old "hadoop jar
>>>>>>> hadoop-mapreduce-examples.jar" command..it "drives" you to the correct
>>>>>>> invokation of that job.
>>>>>>> However, the important thing is that I'd like to keep existing
>>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>>> able to start the one I need from an external program.
>>>>>>>
>>>>>>> Could this be done with RemoteExecutor? Or is there any WS to
>>>>>>> manage the job execution? That would be very useful..
>>>>>>> Is the Client interface the only one that allow something similar
>>>>>>> right now?
>>>>>>>
>>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am not sure exactly what you need there. In Flink you can write
>>>>>>>> more than one program in the same program ;-) You can define complex flows
>>>>>>>> and execute arbitrarily at intermediate points:
>>>>>>>>
>>>>>>>> main() {
>>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>>
>>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>>   env.execute();
>>>>>>>>
>>>>>>>>   env.readTheNextThing().do()Something();
>>>>>>>>   env.execute();
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> You can also just "save" a program and keep it for later execution:
>>>>>>>>
>>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>>
>>>>>>>> at a later point you can start that plan: new
>>>>>>>> RemoteExecutor(master, 6123).execute(plan);
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> Any help on this? :(
>>>>>>>>>
>>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>>>> there something similar?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Flavio
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>

Re: Flink ProgramDriver

Posted by Robert Metzger <rm...@apache.org>.
Hey,

maybe we need to go a step back because I did not yet fully understand what
you want to do.

My understanding so far is the following:
- You have a set of jobs that you've written for Flink
- You have a cluster with Flink running
- You have an external client, which is a Java Application that is
controlling when and how the different jobs are launched. The client is
running basically 24/7 or started by a cronjob.

Correct me if these assumptions are wrong. If they are true, the
RemoteExecutor is probably what you are looking for. Otherwise, we have to
find another solution.


On Tue, Nov 25, 2014 at 2:56 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi Robert,
> I tried to look at the RemoteExecutor but I can't understand what are the
> exact steps to:
> 1 - (upload if necessary and) register a jar containing multiple main
> methods (one for each job)
> 2 - start the execution of a job from a client
> 3 - monitor the execution of the job
>
> Could you give me the exact java commands/snippets to do that?
>
>
>
> On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org>
> wrote:
>
>> +1 for providing some utilities/tools for application developers.
>> This could include something like an application registry. I also think
>> that almost every user needs something to parse command line arguments
>> (including default values and comprehensive error messages).
>> We should also see if we can document and properly expose the FileSystem
>> abstraction to Flink app programmers. Users sometimes need to do manipulate
>> files directly.
>>
>>
>> Regarding your second question:
>> For deploying a jar on your cluster, you can use the "bin/flink run <JAR
>> FILE>" command.
>> For starting a Job from an external client you can use the
>> RemoteExecutionEnvironment (you need to know the JobManager address for
>> that). Here is some documentation on that:
>> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> That was exactly what I was looking for. In my case it is not a problem
>>> to use the Hadoop version because I work on Hadoop. Don't you think it could be
>>> useful to add a Flink ProgramDriver so that you can use it both for hadoop
>>> and native-flink jobs?
>>>
>>> Now that I understood how to bundle together a bunch of jobs, my next
>>> objective will be to deploy the jar on the cluster (similar to what the
>>> webclient does) and then start the jobs from my external client (which in
>>> theory just needs to know the jar name and the parameters to pass to every
>>> job it wants to call). Do you have an example of that?
>>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:
>>>
>>>> Are you looking for something like
>>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>>> ?
>>>>
>>>> You should be able to use the Hadoop ProgramDriver directly, see for
>>>> example here:
>>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>>
>>>> If you don't want to introduce a Hadoop dependency in your project, you
>>>> can just copy-paste ProgramDriver; it does not have any dependencies on
>>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>>
>>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> Not sure I get exactly what this is, but packaging multiple examples
>>>>> in one program is entirely possible. You can have arbitrary control flow in the
>>>>> main() method.
>>>>>
>>>>> Should be entirely possible to do something like the Hadoop examples
>>>>> setup...
>>>>>
>>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> That was something I used to do with Hadoop, and it's convenient when
>>>>>> testing stuff (so it is not so important).
>>>>>> For an example, see what happens when you run the old "hadoop jar
>>>>>> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
>>>>>> invocation of that job.
>>>>>> However, the important thing is that I'd like to keep existing
>>>>>> related jobs somewhere (like a repository of jobs), deploy them and then be
>>>>>> able to start the one I need from an external program.
>>>>>>
>>>>>> Could this be done with RemoteExecutor? Or is there any WS to manage
>>>>>> the job execution? That would be very useful..
>>>>>> Is the Client interface the only one that allows something similar
>>>>>> right now?
>>>>>>
>>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I am not sure exactly what you need there. In Flink you can write
>>>>>>> more than one program in the same program ;-) You can define complex flows
>>>>>>> and execute arbitrarily at intermediate points:
>>>>>>>
>>>>>>> main() {
>>>>>>>   ExecutionEnvironment env = ...;
>>>>>>>
>>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>>   env.execute();
>>>>>>>
>>>>>>>   env.readTheNextThing().doSomething();
>>>>>>>   env.execute();
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> You can also just "save" a program and keep it for later execution:
>>>>>>>
>>>>>>> Plan plan = env.createProgramPlan();
>>>>>>>
>>>>>>> at a later point you can start that plan: new RemoteExecutor(master,
>>>>>>> 6123).execute(plan);
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>
>>>>>>>> Any help on this? :(
>>>>>>>>
>>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>>> there something similar?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Robert,
I tried to look at the RemoteExecutor but I can't understand what the
exact steps are to:
1 - (upload if necessary and) register a jar containing multiple main
methods (one for each job)
2 - start the execution of a job from a client
3 - monitor the execution of the job

Could you give me the exact Java commands/snippets to do that?



On Sun, Nov 23, 2014 at 8:26 PM, Robert Metzger <rm...@apache.org> wrote:

> +1 for providing some utilities/tools for application developers.
> This could include something like an application registry. I also think
> that almost every user needs something to parse command line arguments
> (including default values and comprehensive error messages).
> We should also see if we can document and properly expose the FileSystem
> abstraction to Flink app programmers. Users sometimes need to manipulate
> files directly.
>
>
> Regarding your second question:
> For deploying a jar on your cluster, you can use the "bin/flink run <JAR
> FILE>" command.
> For starting a Job from an external client you can use the
> RemoteExecutionEnvironment (you need to know the JobManager address for
> that). Here is some documentation on that:
> http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
>
>
>
>
>
>
>
> On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> That was exactly what I was looking for. In my case it is not a problem
>> to use the Hadoop version because I work on Hadoop. Don't you think it could be
>> useful to add a Flink ProgramDriver so that you can use it both for hadoop
>> and native-flink jobs?
>>
>> Now that I understood how to bundle together a bunch of jobs, my next
>> objective will be to deploy the jar on the cluster (similar to what the
>> webclient does) and then start the jobs from my external client (which in
>> theory just needs to know the jar name and the parameters to pass to every
>> job it wants to call). Do you have an example of that?
>> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:
>>
>>> Are you looking for something like
>>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>>> ?
>>>
>>> You should be able to use the Hadoop ProgramDriver directly, see for
>>> example here:
>>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>>
>>> If you don't want to introduce a Hadoop dependency in your project, you
>>> can just copy-paste ProgramDriver; it does not have any dependencies on
>>> Hadoop classes. That class just accumulates <String,Class> pairs
>>> (simplifying a bit) and calls the main method of the corresponding class.
>>>
>>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Not sure I get exactly what this is, but packaging multiple examples in
>>>> one program is entirely possible. You can have arbitrary control flow in the
>>>> main() method.
>>>>
>>>> Should be entirely possible to do something like the Hadoop examples
>>>> setup...
>>>>
>>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> That was something I used to do with Hadoop, and it's convenient when
>>>>> testing stuff (so it is not so important).
>>>>> For an example, see what happens when you run the old "hadoop jar
>>>>> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
>>>>> invocation of that job.
>>>>> However, the important thing is that I'd like to keep existing related
>>>>> jobs somewhere (like a repository of jobs), deploy them and then be able to
>>>>> start the one I need from an external program.
>>>>>
>>>>> Could this be done with RemoteExecutor? Or is there any WS to manage
>>>>> the job execution? That would be very useful..
>>>>> Is the Client interface the only one that allows something similar
>>>>> right now?
>>>>>
>>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I am not sure exactly what you need there. In Flink you can write
>>>>>> more than one program in the same program ;-) You can define complex flows
>>>>>> and execute arbitrarily at intermediate points:
>>>>>>
>>>>>> main() {
>>>>>>   ExecutionEnvironment env = ...;
>>>>>>
>>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>>   env.execute();
>>>>>>
>>>>>>   env.readTheNextThing().doSomething();
>>>>>>   env.execute();
>>>>>> }
>>>>>>
>>>>>>
>>>>>> You can also just "save" a program and keep it for later execution:
>>>>>>
>>>>>> Plan plan = env.createProgramPlan();
>>>>>>
>>>>>> at a later point you can start that plan: new RemoteExecutor(master,
>>>>>> 6123).execute(plan);
>>>>>>
>>>>>>
>>>>>>
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Any help on this? :(
>>>>>>>
>>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>>> there something similar?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>

Re: Flink ProgramDriver

Posted by Robert Metzger <rm...@apache.org>.
+1 for providing some utilities/tools for application developers.
This could include something like an application registry. I also think
that almost every user needs something to parse command line arguments
(including default values and comprehensive error messages).
We should also see if we can document and properly expose the FileSystem
abstraction to Flink app programmers. Users sometimes need to manipulate
files directly.


Regarding your second question:
For deploying a jar on your cluster, you can use the "bin/flink run <JAR
FILE>" command.
For starting a Job from an external client you can use the
RemoteExecutionEnvironment (you need to know the JobManager address for
that). Here is some documentation on that:
http://flink.incubator.apache.org/docs/0.7-incubating/cluster_execution.html#remote-environment
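
For reference, here is a minimal sketch of such an external client against
the 0.7-era API from the docs above; the JobManager host, port, jar path and
HDFS paths are made up, and the readTextFile/writeAsText pipeline is only a
stand-in for a real job:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ExternalClient {

    public static void main(String[] args) throws Exception {
        // Hypothetical JobManager address; the listed jar is shipped to the
        // cluster for this execution, so no separate upload step is needed.
        ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 6123, "/path/to/my-jobs.jar");

        DataSet<String> lines = env.readTextFile("hdfs:///input/data.txt");
        lines.writeAsText("hdfs:///output/result");

        env.execute("my remote job"); // blocks until the job finishes
    }
}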







On Sat, Nov 22, 2014 at 9:06 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> That was exactly what I was looking for. In my case it is not a problem to
> use the Hadoop version because I work on Hadoop. Don't you think it could be
> useful to add a Flink ProgramDriver so that you can use it both for hadoop
> and native-flink jobs?
>
> Now that I understood how to bundle together a bunch of jobs, my next
> objective will be to deploy the jar on the cluster (similar to what the
> webclient does) and then start the jobs from my external client (which in
> theory just needs to know the jar name and the parameters to pass to every
> job it wants to call). Do you have an example of that?
> On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:
>
>> Are you looking for something like
>> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
>> ?
>>
>> You should be able to use the Hadoop ProgramDriver directly, see for
>> example here:
>> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>>
>> If you don't want to introduce a Hadoop dependency in your project, you
>> can just copy-paste ProgramDriver; it does not have any dependencies on
>> Hadoop classes. That class just accumulates <String,Class> pairs
>> (simplifying a bit) and calls the main method of the corresponding class.
>>
>> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Not sure I get exactly what this is, but packaging multiple examples in
>>> one program is entirely possible. You can have arbitrary control flow in the
>>> main() method.
>>>
>>> Should be entirely possible to do something like the Hadoop examples
>>> setup...
>>>
>>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> That was something I used to do with Hadoop, and it's convenient when
>>>> testing stuff (so it is not so important).
>>>> For an example, see what happens when you run the old "hadoop jar
>>>> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
>>>> invocation of that job.
>>>> However, the important thing is that I'd like to keep existing related
>>>> jobs somewhere (like a repository of jobs), deploy them and then be able to
>>>> start the one I need from an external program.
>>>>
>>>> Could this be done with RemoteExecutor? Or is there any WS to manage
>>>> the job execution? That would be very useful..
>>>> Is the Client interface the only one that allows something similar
>>>> right now?
>>>>
>>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> I am not sure exactly what you need there. In Flink you can write more
>>>>> than one program in the same program ;-) You can define complex flows and
>>>>> execute arbitrarily at intermediate points:
>>>>>
>>>>> main() {
>>>>>   ExecutionEnvironment env = ...;
>>>>>
>>>>>   env.readSomething().map().join(...).and().so().on();
>>>>>   env.execute();
>>>>>
>>>>>   env.readTheNextThing().doSomething();
>>>>>   env.execute();
>>>>> }
>>>>>
>>>>>
>>>>> You can also just "save" a program and keep it for later execution:
>>>>>
>>>>> Plan plan = env.createProgramPlan();
>>>>>
>>>>> at a later point you can start that plan: new RemoteExecutor(master,
>>>>> 6123).execute(plan);
>>>>>
>>>>>
>>>>>
>>>>> Stephan
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Any help on this? :(
>>>>>>
>>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>>> pompermaier@okkam.it> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>> I forgot to ask you if there's a Flink utility to simulate the
>>>>>>> Hadoop ProgramDriver class that acts somehow like a registry of jobs. Is
>>>>>>> there something similar?
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>>
>>>>>>
>>>>
>>>
>>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
That was exactly what I was looking for. In my case it is not a problem to
use the Hadoop version because I work on Hadoop. Don't you think it could be
useful to add a Flink ProgramDriver so that you can use it both for hadoop
and native-flink jobs?

Now that I understood how to bundle together a bunch of jobs, my next
objective will be to deploy the jar on the cluster (similar to what the
webclient does) and then start the jobs from my external client (which in
theory just needs to know the jar name and the parameters to pass to every
job it wants to call). Do you have an example of that?
On Nov 22, 2014 6:11 PM, "Kostas Tzoumas" <kt...@apache.org> wrote:

> Are you looking for something like
> https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
> ?
>
> You should be able to use the Hadoop ProgramDriver directly, see for
> example here:
> https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java
>
> If you don't want to introduce a Hadoop dependency in your project, you
> can just copy-paste ProgramDriver; it does not have any dependencies on
> Hadoop classes. That class just accumulates <String,Class> pairs
> (simplifying a bit) and calls the main method of the corresponding class.
>
> On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> Not sure I get exactly what this is, but packaging multiple examples in
>> one program is entirely possible. You can have arbitrary control flow in the
>> main() method.
>>
>> Should be entirely possible to do something like the Hadoop examples setup...
>>
>> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> That was something I used to do with Hadoop, and it's convenient when
>>> testing stuff (so it is not so important).
>>> For an example, see what happens when you run the old "hadoop jar
>>> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
>>> invocation of that job.
>>> However, the important thing is that I'd like to keep existing related
>>> jobs somewhere (like a repository of jobs), deploy them and then be able to
>>> start the one I need from an external program.
>>>
>>> Could this be done with RemoteExecutor? Or is there any WS to manage
>>> the job execution? That would be very useful..
>>> Is the Client interface the only one that allows something similar right
>>> now?
>>>
>>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> I am not sure exactly what you need there. In Flink you can write more
>>>> than one program in the same program ;-) You can define complex flows and
>>>> execute arbitrarily at intermediate points:
>>>>
>>>> main() {
>>>>   ExecutionEnvironment env = ...;
>>>>
>>>>   env.readSomething().map().join(...).and().so().on();
>>>>   env.execute();
>>>>
>>>>   env.readTheNextThing().doSomething();
>>>>   env.execute();
>>>> }
>>>>
>>>>
>>>> You can also just "save" a program and keep it for later execution:
>>>>
>>>> Plan plan = env.createProgramPlan();
>>>>
>>>> at a later point you can start that plan: new RemoteExecutor(master,
>>>> 6123).execute(plan);
>>>>
>>>>
>>>>
>>>> Stephan
>>>>
>>>>
>>>>
>>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Any help on this? :(
>>>>>
>>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>>> pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>> I forgot to ask you if there's a Flink utility to simulate the Hadoop
>>>>>> ProgramDriver class that acts somehow like a registry of jobs. Is there
>>>>>> something similar?
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>
>>>
>>
>

Re: Flink ProgramDriver

Posted by Kostas Tzoumas <kt...@apache.org>.
Are you looking for something like
https://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/util/ProgramDriver.html
?

You should be able to use the Hadoop ProgramDriver directly, see for
example here:
https://github.com/ktzoumas/incubator-flink/blob/tez_support/flink-addons/flink-tez/src/main/java/org/apache/flink/tez/examples/ExampleDriver.java

If you don't want to introduce a Hadoop dependency in your project, you can
just copy-paste ProgramDriver; it does not have any dependencies on Hadoop
classes. That class just accumulates <String,Class> pairs (simplifying a
bit) and calls the main method of the corresponding class.
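
For the curious, a copy-pasted driver boils down to something like the
following minimal sketch (class and method names are illustrative, not the
actual Hadoop source):

import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Minimal job registry in the spirit of Hadoop's ProgramDriver: it maps a
// name to a class and invokes that class's main() method via reflection.
public class SimpleJobDriver {

    private final Map<String, Class<?>> jobs = new TreeMap<String, Class<?>>();

    public void addClass(String name, Class<?> jobClass) {
        jobs.put(name, jobClass);
    }

    public void run(String[] args) throws Exception {
        if (args.length == 0 || !jobs.containsKey(args[0])) {
            System.err.println("Valid job names are: " + jobs.keySet());
            return;
        }
        // Strip the job name and pass the remaining args to the job's main().
        String[] jobArgs = new String[args.length - 1];
        System.arraycopy(args, 1, jobArgs, 0, jobArgs.length);

        Method main = jobs.get(args[0]).getMethod("main", String[].class);
        main.invoke(null, (Object) jobArgs);
    }
}

Pointing a fat jar's Main-Class at a small main() that fills such a driver
and calls run(args) gives you the familiar "hadoop jar examples.jar
<jobname> <args>" experience for Flink programs as well.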

On Sat, Nov 22, 2014 at 5:34 PM, Stephan Ewen <se...@apache.org> wrote:

> Not sure I get exactly what this is, but packaging multiple examples in
> one program is entirely possible. You can have arbitrary control flow in the
> main() method.
>
> Should be entirely possible to do something like the Hadoop examples setup...
>
> On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> That was something I used to do with Hadoop, and it's convenient when
>> testing stuff (so it is not so important).
>> For an example, see what happens when you run the old "hadoop jar
>> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
>> invocation of that job.
>> However, the important thing is that I'd like to keep existing related
>> jobs somewhere (like a repository of jobs), deploy them and then be able to
>> start the one I need from an external program.
>>
>> Could this be done with RemoteExecutor? Or is there any WS to manage the
>> job execution? That would be very useful..
>> Is the Client interface the only one that allows something similar right
>> now?
>>
>> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> I am not sure exactly what you need there. In Flink you can write more
>>> than one program in the same program ;-) You can define complex flows and
>>> execute arbitrarily at intermediate points:
>>>
>>> main() {
>>>   ExecutionEnvironment env = ...;
>>>
>>>   env.readSomething().map().join(...).and().so().on();
>>>   env.execute();
>>>
>>>   env.readTheNextThing().doSomething();
>>>   env.execute();
>>> }
>>>
>>>
>>> You can also just "save" a program and keep it for later execution:
>>>
>>> Plan plan = env.createProgramPlan();
>>>
>>> at a later point you can start that plan: new RemoteExecutor(master,
>>> 6123).execute(plan);
>>>
>>>
>>>
>>> Stephan
>>>
>>>
>>>
>>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> Any help on this? :(
>>>>
>>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi guys,
>>>>> I forgot to ask you if there's a Flink utility to simulate the Hadoop
>>>>> ProgramDriver class that acts somehow like a registry of jobs. Is there
>>>>> something similar?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>
>>
>

Re: Flink ProgramDriver

Posted by Stephan Ewen <se...@apache.org>.
Not sure I get exactly what this is, but packaging multiple examples in one
program is entirely possible. You can have arbitrary control flow in the main()
method.

Should be entirely possible to do something like the Hadoop examples setup...
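
As a concrete (made-up) illustration of that control flow, a single main()
can dispatch to different pipelines based on the first argument, much like
the Hadoop examples jar; the readTextFile/writeAsText bodies below are only
placeholders for real data flows:

import org.apache.flink.api.java.ExecutionEnvironment;

public class MultiJobMain {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        if (args.length > 2 && "jobA".equals(args[0])) {
            // stand-in for the first real pipeline
            env.readTextFile(args[1]).writeAsText(args[2]);
            env.execute("job A");
        } else if (args.length > 2 && "jobB".equals(args[0])) {
            // stand-in for the second real pipeline
            env.readTextFile(args[1]).writeAsText(args[2]);
            env.execute("job B");
        } else {
            System.err.println("Usage: (jobA|jobB) <input> <output>");
        }
    }
}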

On Fri, Nov 21, 2014 at 7:02 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> That was something I used to do with Hadoop, and it's convenient when
> testing stuff (so it is not so important).
> For an example, see what happens when you run the old "hadoop jar
> hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
> invocation of that job.
> However, the important thing is that I'd like to keep existing related
> jobs somewhere (like a repository of jobs), deploy them and then be able to
> start the one I need from an external program.
>
> Could this be done with RemoteExecutor? Or is there any WS to manage the
> job execution? That would be very useful..
> Is the Client interface the only one that allows something similar right
> now?
>
> On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> I am not sure exactly what you need there. In Flink you can write more
>> than one program in the same program ;-) You can define complex flows and
>> execute arbitrarily at intermediate points:
>>
>> main() {
>>   ExecutionEnvironment env = ...;
>>
>>   env.readSomething().map().join(...).and().so().on();
>>   env.execute();
>>
>>   env.readTheNextThing().doSomething();
>>   env.execute();
>> }
>>
>>
>> You can also just "save" a program and keep it for later execution:
>>
>> Plan plan = env.createProgramPlan();
>>
>> at a later point you can start that plan: new RemoteExecutor(master,
>> 6123).execute(plan);
>>
>>
>>
>> Stephan
>>
>>
>>
>> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> Any help on this? :(
>>>
>>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <
>>> pompermaier@okkam.it> wrote:
>>>
>>>> Hi guys,
>>>> I forgot to ask you if there's a Flink utility to simulate the Hadoop
>>>> ProgramDriver class that acts somehow like a registry of jobs. Is there
>>>> something similar?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
That was something I used to do with Hadoop, and it's convenient when
testing stuff (so it is not so important).
For an example, see what happens when you run the old "hadoop jar
hadoop-mapreduce-examples.jar" command: it "drives" you to the correct
invocation of that job.
However, the important thing is that I'd like to keep existing related jobs
somewhere (like a repository of jobs), deploy them and then be able to
start the one I need from an external program.

Could this be done with RemoteExecutor? Or is there any WS to manage the
job execution? That would be very useful..
Is the Client interface the only one that allows something similar right now?

On Fri, Nov 21, 2014 at 6:19 PM, Stephan Ewen <se...@apache.org> wrote:

> I am not sure exactly what you need there. In Flink you can write more
> than one program in the same program ;-) You can define complex flows and
> execute arbitrarily at intermediate points:
>
> main() {
>   ExecutionEnvironment env = ...;
>
>   env.readSomething().map().join(...).and().so().on();
>   env.execute();
>
>   env.readTheNextThing().doSomething();
>   env.execute();
> }
>
>
> You can also just "save" a program and keep it for later execution:
>
> Plan plan = env.createProgramPlan();
>
> at a later point you can start that plan: new RemoteExecutor(master,
> 6123).execute(plan);
>
>
>
> Stephan
>
>
>
> On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Any help on this? :(
>>
>> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>
>>> Hi guys,
>>> I forgot to ask you if there's a Flink utility to simulate the Hadoop
>>> ProgramDriver class that acts somehow like a registry of jobs. Is there
>>> something similar?
>>>
>>> Best,
>>> Flavio
>>>
>>

Re: Flink ProgramDriver

Posted by Stephan Ewen <se...@apache.org>.
I am not sure exactly what you need there. In Flink you can write more than
one program in the same program ;-) You can define complex flows and
execute arbitrarily at intermediate points:

main() {
  ExecutionEnvironment env = ...;

  env.readSomething().map().join(...).and().so().on();
  env.execute();

  env.readTheNextThing().doSomething();
  env.execute();
}


You can also just "save" a program and keep it for later execution:

Plan plan = env.createProgramPlan();

at a later point you can start that plan: new RemoteExecutor(master,
6123).execute(plan);
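
Pulling the two ideas together, a rough end-to-end sketch (RemoteExecutor
lives in flink-clients; depending on the version the method may be named
executePlan(Plan) rather than execute(plan), and the host name below is
made up):

import org.apache.flink.api.common.Plan;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.client.RemoteExecutor;

public class DeferredExecution {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Define a data flow but do not run it yet.
        env.readTextFile("hdfs:///input/data.txt")
           .writeAsText("hdfs:///output/result");

        // Capture the program as a Plan instead of calling env.execute().
        Plan plan = env.createProgramPlan();

        // ... later: ship the plan to a running JobManager for execution.
        new RemoteExecutor("jobmanager-host", 6123).executePlan(plan);
    }
}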



Stephan



On Fri, Nov 21, 2014 at 5:49 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Any help on this? :(
>
> On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Hi guys,
>> I forgot to ask you if there's a Flink utility to simulate the Hadoop
>> ProgramDriver class that acts somehow like a registry of jobs. Is there
>> something similar?
>>
>> Best,
>> Flavio
>>
>

Re: Flink ProgramDriver

Posted by Flavio Pompermaier <po...@okkam.it>.
Any help on this? :(

On Fri, Nov 21, 2014 at 9:33 AM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi guys,
> I forgot to ask you if there's a Flink utility to simulate the Hadoop
> ProgramDriver class that acts somehow like a registry of jobs. Is there
> something similar?
>
> Best,
> Flavio
>