Posted to dev@flink.apache.org by Arjun Rao <sp...@gmail.com> on 2016/01/07 05:40:10 UTC

New to Apache Flink

Hi,

I am new to Apache Flink and I really like the look of the API. I have been
working with Storm for the past year and have questions about the
DataStream API, among others.

1. What are the interactions of the actor system in the Flink ecosystem?
Where can I find more information?
2. Is there a Flink web UI in embedded mode? (Not on a cluster.)
3. Why is there a minimum poll time of 50 ms for the DataStream API?
4. Is it possible to log each DataStream to a separate file? (i.e., not all
in the same task manager log.)
5. I checked out the master from git, but I am unable to build the project
in IntelliJ due to compilation errors in flink-staging. The root cause is
the lack of "generated" Avro files such as:

import org.apache.flink.api.io.avro.generated.Address;
import org.apache.flink.api.io.avro.generated.Colors;
import org.apache.flink.api.io.avro.generated.Fixed16;
import org.apache.flink.api.io.avro.generated.User;

These files are not found. I have the Avro plugin installed for
IntelliJ. What else do I need to do to make the project build/compile?

Appreciate the help!

Best,

Arjun

Re: New to Apache Flink

Posted by Stephan Ewen <se...@apache.org>.
Actually, the latest master should have this fix. The Dashboard ignores the
log file if it is not found...

On Tue, Jan 12, 2016 at 10:32 AM, Fabian Hueske <fh...@gmail.com> wrote:


Re: New to Apache Flink

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Arjun,

Yes, it is also possible to start the web dashboard when running jobs from
the IDE. It is a bit hacky, though...

You can do it by creating a LocalStreamEnvironment as follows:

val config = new Configuration()
config.setBoolean(ConfigConstants.LOCAL_START_WEBSERVER, true)
config.setString(ConfigConstants.JOB_MANAGER_WEB_LOG_PATH_KEY, "/path/to/some/random/text/file")
new LocalStreamEnvironment(config)

The dashboard will try to access the job manager log file and terminate if
it cannot be found.
Since the JM log file is not present when running from the IDE, we need to
point it to some other (text) file.
I hope this will be fixed soon.
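
A fuller, self-contained sketch of the same workaround might look like the
following. The config keys, the placeholder log-file path, and the
LocalStreamEnvironment constructor are the ones from the snippet above;
wrapping the Java environment in the Scala StreamExecutionEnvironment and
the toy pipeline are illustrative assumptions, not something confirmed here:

import org.apache.flink.configuration.{ConfigConstants, Configuration}
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment
import org.apache.flink.streaming.api.scala._

object LocalDashboardSketch {
  def main(args: Array[String]): Unit = {
    val config = new Configuration()
    // Start the web dashboard alongside the local (in-IDE) execution environment.
    config.setBoolean(ConfigConstants.LOCAL_START_WEBSERVER, true)
    // Any readable text file works here; it only stands in for the missing JobManager log.
    config.setString(ConfigConstants.JOB_MANAGER_WEB_LOG_PATH_KEY, "/path/to/some/random/text/file")

    // Wrap the Java LocalStreamEnvironment in the Scala API environment.
    val env = new StreamExecutionEnvironment(new LocalStreamEnvironment(config))

    // A trivial pipeline so that there is something to watch in the dashboard.
    env.fromElements(1, 2, 3, 4, 5)
      .map(_ * 2)
      .print()

    env.execute("local job with web dashboard")
  }
}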

Best, Fabian



2016-01-12 2:27 GMT+01:00 Arjun Rao <sp...@gmail.com>:


Re: New to Apache Flink

Posted by Arjun Rao <sp...@gmail.com>.
Thanks for the replies and help. Stephan, the Maven shortcut worked like a
charm :).

- As for the 50 ms window duration: when I was running the WindowWordCount
example with a duration of 5 ms, I encountered this error stack trace:

Exception in thread "main" java.lang.IllegalArgumentException: Window length must be at least 50 msecs
    at org.apache.flink.streaming.runtime.operators.windowing.AbstractAlignedProcessingTimeWindowOperator.<init>(AbstractAlignedProcessingTimeWindowOperator.java:84)
    at org.apache.flink.streaming.runtime.operators.windowing.AggregatingProcessingTimeWindowOperator.<init>(AggregatingProcessingTimeWindowOperator.java:40)
    at org.apache.flink.streaming.api.datastream.WindowedStream.createFastTimeOperatorIfValid(WindowedStream.java:616)
    at org.apache.flink.streaming.api.datastream.WindowedStream.reduce(WindowedStream.java:146)
    at org.apache.flink.streaming.api.datastream.WindowedStream.aggregate(WindowedStream.java:556)
    at org.apache.flink.streaming.api.datastream.WindowedStream.sum(WindowedStream.java:373)
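
For reference, a minimal WindowWordCount-style sketch that stays at or above
the 50 ms minimum could look like this; it assumes a Scala DataStream API
where KeyedStream.timeWindow and Time.milliseconds are available, and the
250 ms window size, host, and port are arbitrary illustrations:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowWordCountSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(0)
      // The aligned processing-time window operator rejects anything below
      // 50 ms with the IllegalArgumentException shown above.
      .timeWindow(Time.milliseconds(250))
      .sum(1)
      .print()

    env.execute("WindowWordCount sketch")
  }
}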

- As for logging separate datastreams individually, the reason I ask is for
debugging purposes. Let us say that I have 5 different streaming jobs and 1
taskmanager. Run out of the box, all 5 jobs will log to the same
taskmanager log. My question is: is there a way to configure the logging
framework (or some other component, perhaps) in such a way that the
different jobs log separately to their own files, so that it is easier to
debug?

- With respect to web page display in embedded mode, I apologize for the
confusion in terminology. When I say embedded, I mean running the Flink
jobs through my IDE, with the Flink jars on my local classpath. Is there a
way to look at the web page in this mode, without having to deploy the
application jar to the Flink cluster? Especially for the DataStream mode.

Thanks!

Best,
Arjun

On Thu, Jan 7, 2016 at 10:37 AM, Stephan Ewen <se...@apache.org> wrote:


Re: New to Apache Flink

Posted by Stephan Ewen <se...@apache.org>.
For the generated classes like
"org.apache.flink.api.io.avro.generated.Address;",
IntelliJ has a shortcut: Just right-click on the project, select "Maven ->
generate sources and update folders".

That should do the trick...

On Thu, Jan 7, 2016 at 11:55 AM, Till Rohrmann <tr...@apache.org> wrote:


Re: New to Apache Flink

Posted by Till Rohrmann <tr...@apache.org>.
Hi Arjun,

welcome to the Flink community :-)

On Thu, Jan 7, 2016 at 5:40 AM, Arjun Rao <sp...@gmail.com> wrote:

> Hi,
>
> I am new to Apache Flink and I really like the look of the API. I have been
> working with Storm for the past year and have questions about the
> DataStream API among others.
>
> 1. What are the interactions of the actor system in the flink ecosystem?
> Where can I find more information?
>

Actors are used internally by the system for the communication between the
different distributed components (Client, JobManager and TaskManager).
Thus, they are used, for example, to submit jobs, send job updates and
deliver the execution result. However, the actual data transfer of your job
is done using Netty.


> 2. Is there a Flink web UI in embedded mode? ( Not on cluster )
>

What do you mean by embedded mode? When you start Flink locally, you
will also have a web UI.


> 3. Why is there a poll time min of 50ms for the Datastream API?
>

Where do you see the poll time of 50 ms in the DataStream API? In order to
make the data transfer more efficient, data records are grouped together
and shipped together. The size of this group is defined by the network
buffer size. Whenever such a buffer is full, it will be sent. However, to
avoid records being held back indefinitely, there is also a buffer timeout
after which the elements of the buffer are sent even if the buffer is not
yet full.
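
A minimal sketch, assuming the Scala DataStream API, of where this buffer
timeout is configured; the 100 ms value and the toy pipeline are just
illustrations:

import org.apache.flink.streaming.api.scala._

object BufferTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Ship partially filled network buffers after at most 100 ms instead of
    // waiting for them to fill up, trading some throughput for lower latency.
    env.setBufferTimeout(100)

    env.fromElements("to", "be", "or", "not", "to", "be")
      .map(_.toUpperCase)
      .print()

    env.execute("buffer timeout sketch")
  }
}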


> 4. Is it possible to log each DataStream to a separate file? (i.e., not all
> in the same task manager log.)
>

What do you mean by logging different DataStreams to separate files? At the
moment, all logging that happens on a TaskManager is written to that
TaskManager's log file.
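
One common workaround, sketched below under assumptions not confirmed in
this thread, is to have each job's user functions log through a dedicated
logger name and then route those names to separate files in the
TaskManager's log4j/logback configuration (with additivity disabled for
those loggers); the logger name "streaming.jobs.jobA" is purely
illustrative:

import org.apache.flink.api.common.functions.RichMapFunction
import org.slf4j.LoggerFactory

// User function for one particular job. It logs through its own logger name,
// which the TaskManager's logging configuration can send to a dedicated file
// instead of mixing it into the shared taskmanager log.
class JobATracingMapper extends RichMapFunction[String, String] {
  // Lazy and transient so the logger is re-created on the TaskManager after
  // the function is deserialized, rather than being serialized with it.
  @transient private lazy val log = LoggerFactory.getLogger("streaming.jobs.jobA")

  override def map(value: String): String = {
    log.info("processing record: {}", value)
    value
  }
}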


> 5. I checked out the master from git, but I am unable to build the project
> in Intellij, due to compilation errors in flink-staging. The root cause is
> lack of "generated" files of avro such as
>
> import org.apache.flink.api.io.avro.generated.Address;
> import org.apache.flink.api.io.avro.generated.Colors;
> import org.apache.flink.api.io.avro.generated.Fixed16;
> import org.apache.flink.api.io.avro.generated.User;
>
> These files are not being found. I have the avro plugin installed for
> Intellij. What else do I need to do to make the project build/compile?
>

I think the first time, you have to execute `mvn clean package -DskipTests
-Dmaven.javadoc.skip=true` on the command line in order to generate the
missing files.


>
> Appreciate the help!
>
> Best,
>
> Arjun
>

Cheers,
Till