Posted to dev@flink.apache.org by Martin Neumann <mn...@sics.se> on 2016/02/09 15:35:35 UTC

User Feedback

During this year's FOSDEM, Martin Junghans and I sat together and gathered
some feedback for the Flink project. It is based on our personal experience
as well as the feedback and questions from people we taught the system to.
This is going to be a longer email, so I have split things into
categories:


*Website and Documentation:*

   1. *Outdated Google search results*: Google searches lead to outdated
   website versions (e.g. “flink transformations” or “flink iterations”
   return the 0.7 version of the corresponding pages).
   2. *Invalid links on website:* Links are confusing / broken (e.g. the
   Gelly / ML links on the start page lead to the top of the feature page,
   which starts with streaming). *-> maybe this can be validated
   automatically? (see the sketch after this list)*
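
To make the automatic validation idea concrete, here is a minimal,
hypothetical Java sketch that HEAD-requests a list of links and reports
the broken ones. The class name and URLs are made up for illustration; a
real check would extract the links from the rendered pages and run as part
of the website build:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class LinkChecker {
    public static void main(String[] args) throws Exception {
        // Hypothetical list of links to verify; in practice these would be
        // scraped from the site's HTML.
        List<String> links = Arrays.asList(
                "https://flink.apache.org/",
                "https://flink.apache.org/no-such-page.html");

        for (String link : links) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(link).openConnection();
            conn.setRequestMethod("HEAD");  // the status code is enough
            int status = conn.getResponseCode();
            if (status >= 400) {
                System.out.println("BROKEN (" + status + "): " + link);
            }
            conn.disconnect();
        }
    }
}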


*Batch API:*

   1. *.reduceGroup(GroupReduceFunction) and
   .combineGroup(GroupCombineFunction):* In other operators such as
   .flatMap(FlatMapFunction), the name of the method matches the name of the
   function class. This structure is quite convenient for new users since
   they can make use of the autocompletion features of the IDE: start typing
   the method call and you get the correct class. This does not work for
   .reduceGroup() and .combineGroup() since the names are switched around. *->
   maybe the functions can be renamed*
   2. *.print() and env.execute():* Often .print() is used for debugging
   and developing programs, replacing regular data sinks. Such a project will
   not run until the env.execute() is removed. It's very easy to forget to
   add it back in once you change the .print() back to a proper sink. The
   project will now compile fine but will not produce any output since
   .execute() is missing. This is a very difficult bug to find, especially
   since there is no warning or error when running the job. It’s common that
   people use more than one .print() statement during debugging and
   development. This can lead to confusion since each .print() forces the
   program to execute, so the execution behavior is different than without
   the print. This is especially important if the program contains
   non-deterministic data generation (like generating IDs). In the streaming
   API, .print() does not require removing .execute(); as a result, the
   behavior of the two interfaces is inconsistent. (See the first sketch
   after this list.)
   3. *calling new when applying an operator, e.g. .reduceGroup(new
   GroupReduceFunction()):* Some of the people I taught the APIs to were
   confused by this. They knew it was a distributed system and they were
   wondering where the constructor would actually be called. They expected
   to hand a class to the method that would be initialized on each of the
   worker nodes. *-> maybe have a section about this in the documentation
   (see the second sketch after this list)*
   4. *.project() loses type information / does not support .returns(..):* The
   project transformation currently loses type information, which affects
   chained calls with other transformations. One workaround is the definition
   of an intermediate dataset. However, to be consistent with other operators,
   project should support .returns() to define type information if needed.
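
To make the pitfall in item 2 concrete, here is a minimal sketch (class,
paths, and job names are made up for illustration). The eager .print() sink
triggers execution by itself, while the lazy file sink needs env.execute():

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem;

public class PrintPitfall {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> data = env.fromElements("a", "b", "c");

        // Debugging: print() is an eager sink and triggers execution itself.
        // Leaving an env.execute() after it fails, since no new sinks are
        // defined at that point.
        data.print();

        // Production: writeAsText() is a lazy sink -- without env.execute()
        // the program compiles and runs but never produces any output.
        // data.writeAsText("/tmp/out", FileSystem.WriteMode.OVERWRITE);
        // env.execute("print pitfall example");
    }
}

And for item 3, a sketch of what actually happens with new: the function
object is constructed once on the client while the program is assembled,
then serialized and shipped to the workers. A rich function's open() method
is the hook that runs once per parallel instance on the worker side (the
class below is hypothetical):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Constructed once on the client; Flink serializes the object and ships a
// copy to every parallel worker.
public class InitAwareMap extends RichMapFunction<String, String> {

    private transient StringBuilder buffer;  // worker-side, non-serializable state

    @Override
    public void open(Configuration parameters) {
        // Runs on the worker, once per parallel instance, after
        // deserialization -- the place for worker-local initialization.
        buffer = new StringBuilder();
    }

    @Override
    public String map(String value) {
        buffer.setLength(0);
        return buffer.append(">> ").append(value).toString();
    }
}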


*Stream API:*

   1. *.keyBy():* Currently .keyBy() creates a KeyedDataStream, but every
   operator that consumes a KeyedDataStream produces a DataStream. This means
   it is not possible to write a program that uses a keyBy() followed by a
   sequence of transformations for each key without having to reapply keyBy()
   after each of those operators. (This was a common problem in my work for
   Ericsson and Spotify; see the first sketch after this list.)
   2. *split() operator with multiple output types:* It's common to have to
   split a single stream into different streams. For example, a stream
   containing different system events might need to be broken into a stream
   per event type. The current split() operator requires all outputs to have
   the same data type. In cases where there is no direct type hierarchy, the
   user needs to implement a wrapper type to make use of this function. An
   operator similar to split that allows output streams to have different
   types would greatly simplify those use cases (see the second sketch after
   this list).
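
A minimal sketch of the keyBy() issue from item 1 (the tuples and the
AddOne function are made up for illustration): every map() on the keyed
stream yields a plain DataStream, so the key has to be re-applied before
each further per-key step:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> events = env.fromElements(
                new Tuple2<>("user-1", 1L), new Tuple2<>("user-2", 2L));

        events
                .keyBy(0)           // KeyedStream, keyed on field 0
                .map(new AddOne())  // ...but map() returns a plain DataStream,
                .keyBy(0)           // so the key has to be re-applied
                .map(new AddOne())  // before every further per-key step
                .print();

        env.execute("keyBy example");
    }

    // Stands in for any per-key processing step.
    private static class AddOne
            implements MapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
        @Override
        public Tuple2<String, Long> map(Tuple2<String, Long> value) {
            return new Tuple2<>(value.f0, value.f1 + 1);
        }
    }
}

And a sketch of the wrapper-type workaround from item 2, assuming
hypothetical event kinds; all branches of split() must share the single
EventWrapper type, which is exactly the boilerplate a multi-type operator
would remove:

import java.util.Collections;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;

// Common envelope: split() demands one output type, so unrelated events are
// wrapped here and down-cast again after selection.
public class EventWrapper {
    public String kind;     // e.g. "login" or "click"
    public Object payload;  // the actual event object

    public static SplitStream<EventWrapper> splitByKind(DataStream<EventWrapper> events) {
        return events.split(new OutputSelector<EventWrapper>() {
            @Override
            public Iterable<String> select(EventWrapper value) {
                return Collections.singletonList(value.kind);
            }
        });
    }
}

Consumers would then call splitByKind(events).select("login") and cast each
payload back to its concrete type.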


cheers Martin

RE: User Feedback

Posted by Ken Krugler <kk...@transpac.com>.
> From: Vasiliki Kalavri
> Sent: February 9, 2016 10:54:51am PST
> To: dev@flink.apache.org
> Cc: Martin Junghanns
> Subject: Re: User Feedback
> 
> Hi Martin,
> 
> thank you for the feedback. Let me try to answer some of your concerns.
> 
> 
> On 9 February 2016 at 15:35, Martin Neumann <mn...@sics.se> wrote:
> 
>> During this year's FOSDEM, Martin Junghans and I sat together and gathered
>> some feedback for the Flink project. It is based on our personal experience
>> as well as the feedback and questions from people we taught the system to.
>> This is going to be a longer email, so I have split things into
>> categories:
>> 
>> 
>> *Website and Documentation:*
>> 
>>   1. *Outdated Google search results*: Google searches lead to outdated
>>   website versions (e.g. “flink transformations” or “flink iterations”
>>   return the 0.7 version of the corresponding pages).
>> 
> 
> I'm not sure we can do much about this. I would suggest searching in the
> documentation instead of relying on Google.

This issue (Google finding out-of-date documentation) impacts many open source projects.

And everyone does in fact use Google :)

Wouldn't adding a sitemap help here?

Regards,

-- Ken

[snip]

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: User Feedback

Posted by Stephan Ewen <se...@apache.org>.
I can elaborate on the project(...) method:

".returns()" is there to supply TypeInformation in cases where the system
cannot determine it. In the case of "project()", the system can perfectly
determine the output type info from the input and the projection.

For just getting a typed result, I would use Java's generic method syntax;
that way you can get around defining an intermediate variable:

DataSet<Tuple3<Long, String, Integer>> input = ...;

Tuple2<Long, Integer> aTuple =
    input.<Tuple2<Long, Integer>>project(0, 2).collect().get(0);
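
For comparison, the intermediate-variable workaround mentioned in the
original mail would look like this (same input as above); the assignment
target supplies project()'s type argument, so no generic-method syntax is
needed:

// Naming the projected DataSet lets the compiler infer the target tuple
// type of project() from the assignment.
DataSet<Tuple2<Long, Integer>> projected = input.project(0, 2);
Tuple2<Long, Integer> aTuple = projected.collect().get(0);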


Greetings,
Stephan


On Tue, Feb 9, 2016 at 7:54 PM, Vasiliki Kalavri <va...@gmail.com>
wrote:

> Hi Martin,
>
> thank you for the feedback. Let me try to answer some of your concerns.
>
> [snip]
>
> >    4. *.project() loses type information / does not support .returns(..):* The
> >    project transformation currently loses type information, which affects
> >    chained calls with other transformations. One workaround is the definition
> >    of an intermediate dataset. However, to be consistent with other operators,
> >    project should support .returns() to define type information if needed.
> >
>
> I'm not sure _why_ this is the case. Maybe someone who knows more can
> clarify this one.
>
> [snip]
>
> Cheers,
> -Vasia.

Re: User Feedback

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hi Martin,

thank you for the feedback. Let me try to answer some of your concerns.


On 9 February 2016 at 15:35, Martin Neumann <mn...@sics.se> wrote:

> During this year's FOSDEM, Martin Junghans and I sat together and gathered
> some feedback for the Flink project. It is based on our personal experience
> as well as the feedback and questions from people we taught the system to.
> This is going to be a longer email, so I have split things into
> categories:
>
>
> *Website and Documentation:*
>
>    1. *Outdated Google search results*: Google searches lead to outdated
>    website versions (e.g. “flink transformations” or “flink iterations”
>    return the 0.7 version of the corresponding pages).
>

I'm not sure we can do much about this. I would suggest searching in the
documentation instead of relying on Google.
There is a search box on the top of all documentation pages.



>    2. *Invalid links on website:* Links are confusing / broken (e.g. the
>    Gelly / ML links on the start page lead to the top of the feature page,
>    which starts with streaming). *-> maybe this can be validated
>    automatically?*
>
>
That was a bug that was recently reported and fixed (see FLINK-3316). If you
find more of those, please report them by opening a JIRA issue or a pull
request.



>
> *Batch API:*
>
>    1. *.reduceGroup(GroupReduceFunction) and
>    .combineGroup(GroupCombineFunction):* In other operators such as
>    .flatMap(FlatMapFunction), the name of the method matches the name of the
>    function class. This structure is quite convenient for new users since
>    they can make use of the autocompletion features of the IDE: start typing
>    the method call and you get the correct class. This does not work for
>    .reduceGroup() and .combineGroup() since the names are switched around. *->
>    maybe the functions can be renamed*
>

I agree this might be strange for new users, but I think it will be much
more annoying for existing users if we change this. In my view, it's not an
important enough case to justify breaking the API.



>    2. *.print() and env.execute():* Often .print() is used for debugging
>    and developing programs, replacing regular data sinks. Such a project
>    will not run until the env.execute() is removed. It's very easy to forget
>    to add it back in once you change the .print() back to a proper sink. The
>    project will now compile fine but will not produce any output since
>    .execute() is missing. This is a very difficult bug to find, especially
>    since there is no warning or error when running the job. It’s common that
>    people use more than one .print() statement during debugging and
>    development. This can lead to confusion since each .print() forces the
>    program to execute, so the execution behavior is different than without
>    the print. This is especially important if the program contains
>    non-deterministic data generation (like generating IDs). In the streaming
>    API, .print() does not require removing .execute(); as a result, the
>    behavior of the two interfaces is inconsistent.
>

This is indeed an issue that many users find hard to get used to. We have
changed the behavior of print() a couple of times before and I'm not sure
it would be wise to do so again. Actually, once a user understands the
difference between eager and lazy sinks, I think it's quite easy to avoid
mistakes.



>    3. *calling new when applying an operator, e.g. .reduceGroup(new
>    GroupReduceFunction()):* Some of the people I taught the APIs to were
>    confused by this. They knew it was a distributed system and they were
>    wondering where the constructor would actually be called. They expected
>    to hand a class to the method that would be initialized on each of the
>    worker nodes. *-> maybe have a section about this in the documentation*
>

I'm not sure I understand the confusion with this one. The goal of
high-level APIs is to relieve users from having to think about
distribution. The only thing they need to understand is the
DataSet/DataStream abstractions and how to create transformations on them.


>    4. *.project() loses type information / does not support .returns(..):* The
>    project transformation currently loses type information, which affects
>    chained calls with other transformations. One workaround is the definition
>    of an intermediate dataset. However, to be consistent with other operators,
>    project should support .returns() to define type information if needed.
>
>
I'm not sure _why_ this is the case. Maybe someone who knows more can
clarify this one.



>
> *Stream API:*
>
>    1. *.keyBy(): *Currently .keyBy() creates a KeyedDataStream but every
>    operator that consumes a KeyedDataStream produces a DataStream. This
> means
>    it is not possible to create a program that uses a keyBy() followed by a
>    sequence of transformation for each key without having to reapply
> keyBy()
>    after each of those operators. (This was a common problem in my work for
>    Ericsson and Spotify)
>

I might be missing something here, but if you want to apply a
transformation on a keyed stream without changing the keys, isn't a map
transformation ​enough? Can you give an example of a case where you had
this problem?



>    2. *split() operator with multiple output types:* It's common to have
>    to split a single stream into different streams. For example, a stream
>    containing different system events might need to be broken into a stream
>    per event type. The current split() operator requires all outputs to have
>    the same data type. In cases where there is no direct type hierarchy, the
>    user needs to implement a wrapper type to make use of this function. An
>    operator similar to split that allows output streams to have different
>    types would greatly simplify those use cases.
>
>
> cheers Martin
>

Cheers,
-Vasia.