Posted to user@flink.apache.org by Wouter Zorgdrager <W....@tudelft.nl> on 2019/03/04 10:32:00 UTC

Using Flink in a university course

Hi all,

I'm working on a setup to use Apache Flink in an assignment for a Big Data
(bachelor) university course and I'm interested in your view on this. To
sketch the situation:
- more than 200 students follow this course
- students have to write some (simple) Flink applications using the
DataStream API; the focus is on writing the transformation code
- students need to write Scala code
- we provide a dataset and a template (Scala class) with function
signatures and a detailed description per application, e.g. (see the
sketch after this list):
def assignment_one(input: DataStream[Event]): DataStream[(String, Int)] = ???
- we provide some setup code, such as data parsing and setting up the
streaming environment
- assignments need to be auto-graded, based on correct results
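
To make this concrete, a minimal sketch of such a template (the Event
fields and names here are illustrative, not the actual course template):

import org.apache.flink.streaming.api.scala._

// Hypothetical event type; the real template defines the actual fields.
case class Event(name: String, value: Int)

object AssignmentOne {
  // Students replace ??? with their transformation logic.
  def assignment_one(input: DataStream[Event]): DataStream[(String, Int)] = ???
}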

In last year's course edition we approached this with a custom Docker
container. This container first compiled the students' code, ran all the
Flink applications against a different dataset, and then verified the output
against our solutions. This was turned into a grade and reported back to
the student. Although this was a working approach, I think we can do better.

I'm wondering if any of you have experience with using Apache Flink in a
university course (or have seen this somewhere), as well as with assessing
Flink code.

Thanks a lot!

Kind regards,
Wouter Zorgdrager

Re: Using Flink in a university course

Posted by Wouter Zorgdrager <W....@tudelft.nl>.
Hi all,

Thanks for the input. Much appreciated.

Regards,
Wouter

Re: Using Flink in a university course

Posted by Addison Higham <ad...@gmail.com>.
Hi there,

As far as a runtime for students, it seems like docker is your best bet.
However, you could have them instead package a jar using some interface
(for example, see
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/packaging.html,
which details the `Program` interface) and then execute it inside a custom
runner. That *might* result in something less prone to breakage as it would
need to conform to an interface, but it may require a fair amount of custom
code to reduce the boiler plate to build up a program plan as well as the
custom runner. The code for how flink loads a jar and turns it into
something it can execute is mostly encapsulated
in org.apache.flink.client.program.PackagedProgram, which might be a good
thing to read and understand if you go down this route.
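
A hypothetical sketch of this idea (the trait, the class names, and the
loading scheme are illustrative, not a Flink API): every submission
implements a fixed trait, and a grading runner loads it reflectively
from the submitted jar.

import org.apache.flink.streaming.api.scala._

// Illustrative course-provided event type.
case class Event(name: String, value: Int)

// The fixed contract every submitted jar must implement.
trait GradedAssignment {
  def assignment_one(input: DataStream[Event]): DataStream[(String, Int)]
}

object GradingRunner {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Load the student's implementation by its agreed-upon class name.
    val submission = Class.forName("student.Solution")
      .getDeclaredConstructor().newInstance()
      .asInstanceOf[GradedAssignment]
    val input = env.fromElements(Event("a", 1), Event("b", 2))
    submission.assignment_one(input).print()
    env.execute("grading-run")
  }
}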

If you want more insight, you could build some tooling to traverse the
underlying graph that the students build up in their data stream
application. For example, calling
`StreamExecutionEnvironment.getStreamGraph` after the data stream is built
will get a graph of the current job, which you can then traverse to see
which operators and edges are in use. This is very similar to the process
Flink uses to build the job DAG it renders in the UI. I am not sure what
you could do as an automated analysis, but the StreamGraph API is quite
low level and exposes a lot of information about the program.
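
A minimal sketch of that kind of traversal, assuming Flink 1.7's internal
StreamGraph API (it is internal, so it may change between versions):

import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

object GraphInspector {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Stand-in for a student's pipeline.
    env.fromElements(1, 2, 3).map(_ * 2).print()
    // List the operators of the job as built so far.
    env.getStreamGraph.getStreamNodes.asScala.foreach { node =>
      println(s"operator ${node.getId}: ${node.getOperatorName}")
    }
  }
}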

Hopefully that is a little bit helpful. Good luck; it sounds like a fun
course!


Re: Using Flink in a university course

Posted by Wouter Zorgdrager <W....@tudelft.nl>.
Hey all,

Thanks for the replies. The issues we were running into (which are not
specific to Docker):
- Students who changed the template incorrectly caused the container to fail.
- We give full points if the output matches our solutions (and none
otherwise), but it would be nice if we could give partial grades per
assignment (and better feedback). This would require looking not only at
the results but also at the operators used (a sketch of partial scoring on
results alone follows this list). The pitfall is that in many cases a
correct solution can be achieved in multiple ways. I came across a Flink
test library [1] which allows testing Flink code more extensively, but it
seems to be available only for Java.
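
One direction for partial credit from results alone is a simple overlap
score. A minimal sketch, assuming line-based output files and ignoring
record order (the file paths are hypothetical):

import scala.io.Source

object PartialGrader {
  // Fraction of the expected output records present in the student's output.
  def partialScore(expectedPath: String, actualPath: String): Double = {
    val expected = Source.fromFile(expectedPath).getLines().toSet
    val actual = Source.fromFile(actualPath).getLines().toSet
    if (expected.isEmpty) 0.0
    else expected.intersect(actual).size.toDouble / expected.size
  }
}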

In retrospect, I do think using Docker is a good approach, as Fabian
confirms. However, the way we currently assess student solutions might be
improved. I assume that in your trainings feedback is given manually, but
unfortunately that is quite difficult for so many students.

Cheers,
Wouter

1: https://github.com/ottogroup/flink-spector


Re: Using Flink in a university course

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Wouter,

We are using Docker Compose (Flink JM, Flink TM, Kafka, ZooKeeper) setups
for our trainings and it is working very well.
We have an additional container that feeds a Kafka topic via the
command-line producer to simulate somewhat realistic behavior.
Of course, you can also do it without Kafka and use some kind of
data-generating source that reads from a file that is replaced for
evaluation.
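
A minimal sketch of such a Compose file (the image tags and environment
variables here are assumptions for illustration, not our exact setup):

version: "2"
services:
  jobmanager:
    image: flink:1.7
    command: jobmanager
    ports:
      - "8081:8081"
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: flink:1.7
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  zookeeper:
    image: wurstmeister/zookeeper
  kafka:
    image: wurstmeister/kafka
    depends_on:
      - zookeeper
    environment:
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_HOST_NAME=kafka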

The biggest benefit that I see with using Docker is that the students
have an environment for development and testing that is close to the
grading situation.
You do not need to provide infrastructure; everyone runs it locally in a
well-defined context.

So, as Joern said, what problems do you see with Docker?

Best,
Fabian

Re: Using Flink in a university course

Posted by Jörn Franke <jo...@gmail.com>.
It would help to understand the current issues that you have with this approach. I used a similar approach (not with Flink, but with a similar big data technology) some years ago.
