Posted to dev@systemml.apache.org by "Kunft, Andreas" <an...@tu-berlin.de> on 2016/03/02 20:49:47 UTC

Add Apache Flink as new backend

Hi all,

We are a group of researchers from the Database group (DIMA) at TU Berlin. We would like to add Apache Flink as an execution backend to SystemML, in addition to Hadoop MR and Spark.
To this end, we have started implementing a proof of concept consisting of several instructions together with the necessary de-/serialization and execution logic.
You can see the current state of our fork [1], including two test cases showing what we currently support [2][3].
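
To give a rough idea of the flavor of these instructions (the sketch below is purely illustrative and not the actual code from the fork or the tests above), a blockwise tsmm, i.e. t(X) %*% X, on Flink's DataSet API boils down to a map over row blocks followed by a reduce that sums the partial products; plain double[][] blocks stand in for SystemML's MatrixBlock here:

    import java.util.Arrays;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class TsmmSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Two row blocks of a 4 x 2 matrix X (toy data).
            DataSet<double[][]> rowBlocks = env.fromElements(
                    new double[][] {{1, 2}, {3, 4}},
                    new double[][] {{5, 6}, {7, 8}});

            // Each block contributes B^T * B; summing the partial results
            // yields t(X) %*% X without shuffling any rows of X.
            DataSet<double[][]> xtx = rowBlocks
                    .map(new MapFunction<double[][], double[][]>() {
                        public double[][] map(double[][] block) {
                            int n = block[0].length;
                            double[][] partial = new double[n][n];
                            for (double[] row : block)
                                for (int i = 0; i < n; i++)
                                    for (int j = 0; j < n; j++)
                                        partial[i][j] += row[i] * row[j];
                            return partial;
                        }
                    })
                    .reduce(new ReduceFunction<double[][]>() {
                        public double[][] reduce(double[][] a, double[][] b) {
                            for (int i = 0; i < a.length; i++)
                                for (int j = 0; j < a[i].length; j++)
                                    a[i][j] += b[i][j];
                            return a;
                        }
                    });

            // Expected: [[84.0, 100.0], [100.0, 120.0]]
            System.out.println(Arrays.deepToString(xtx.collect().get(0)));
        }
    }

The actual instruction would of course operate on SystemML's binary-block representation rather than raw arrays, but the map-then-reduce structure is the point here.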

For our simple POC implementation, we realized that we had to duplicate a lot of functionality (especially from the Spark instructions). We saw that the topic of refactoring the runtime package has already been raised [4][5], which could make it easier to integrate further backend systems.
Given that this would be a bigger change, it would be helpful to get some input from the SystemML community regarding this effort.

In particular, we would like to discuss the following questions:

  * How should we deal with shared functionality between the different backends (Flink, Spark, etc.) to avoid code duplication, especially in instructions, but also introduce modularity? And is this modularization even desired?
  * How should we integrate Flink into the different runtime modes (Flink-only, Flink-hybrid, etc.)?
  * How should we structure the integration (multiple commits vs. a single commit)?

We’re looking forward to feedback and hope the community likes the idea of adding Flink as an execution backend to SystemML.

Best,
Andreas Kunft
Christoph Brücke
Felix Schüler

[1] https://github.com/stratosphere/incubator-systemml/tree/flink-integration
[2] https://github.com/stratosphere/incubator-systemml/blob/flink-integration/src/test/java/org/apache/sysml/runtime/instructions/flink/TsmmFLInstructionTest.java
[3] https://github.com/stratosphere/incubator-systemml/blob/flink-integration/src/test/java/org/apache/sysml/runtime/instructions/flink/utils/DataSetConverterUtilsTest.java
[4] https://issues.apache.org/jira/browse/SYSTEMML-33
[5] https://www.mail-archive.com/search?l=dev%40systemml.incubator.apache.org&q=subject%3A%22Runtime+package+refactoring%22&o=newest&f=1


Re: AW: Add Apache Flink as new backend

Posted by Niketan Pansare <np...@us.ibm.com>.
Hi Andreas,

I too like the idea of having Flink as a backend for SystemML, and the POC
is a step in the right direction. I agree with Matthias's comments regarding
shared functionality and multi-stage integration. I would also recommend
using naming conventions similar to our Hadoop and Spark backends, i.e.,
hybrid_flink and flink. Looking forward to more discussions on this
topic :)

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



AW: Add Apache Flink as new backend

Posted by "Kunft, Andreas" <an...@tu-berlin.de>.
Hello,


Thank you for the fast reply. We are glad you like the idea!


As a next step, we will focus on implementing an end-to-end integration based on your suggestions. We think this initial integration will be a good starting point for further discussions based on the concrete implementation in the pull request.


Best

Andreas


Re: Add Apache Flink as new backend

Posted by Matthias Boehm <mb...@us.ibm.com>.
Thanks, guys, for sharing the details of this prototype. In general, I
really like the idea of having a Flink backend in SystemML. We just need to
structure the code (similar to our Spark backend) in a way that Flink
libraries are not necessarily required when running in Spark or MapReduce
execution modes.

To answer your questions in detail:

1) Shared Functionality: I would recommend reusing the upper levels (i.e.,
language, hops, lops, etc.) and the core block operations, but keeping the
instructions (and everything that accesses Flink APIs) independent. Yes,
this separation comes at the cost of some code duplication, but it allows
running each backend without the libraries of the other backends. Note that
we did the same for our Spark backend, which allows us to run the same jar
in old MapReduce v1 environments where these libraries are not available at
runtime. Down the road we might consolidate common functionality such as
the runtime maintenance of matrix characteristics.
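
As a rough illustration of this separation (all class names and package placement below are hypothetical, not actual SystemML code): the block-level kernel has no Flink or Spark imports and could ship with the core jar, while the Flink instruction code is only a thin wrapper around it.

    // BlockOps.java -- backend-agnostic kernel, no Flink or Spark imports,
    // reusable from local (CP), MR, Spark, and Flink instructions alike.
    public class BlockOps {
        /** Cell-wise addition of two equally sized dense blocks. */
        public static double[][] add(double[][] a, double[][] b) {
            double[][] c = new double[a.length][a[0].length];
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < a[0].length; j++)
                    c[i][j] = a[i][j] + b[i][j];
            return c;
        }
    }

    // BlockAddFlinkFunction.java -- the only code that touches Flink APIs, so
    // it can live in a flink-only package that other execution modes never load.
    import org.apache.flink.api.common.functions.ReduceFunction;

    public class BlockAddFlinkFunction implements ReduceFunction<double[][]> {
        public double[][] reduce(double[][] a, double[][] b) {
            return BlockOps.add(a, b);  // delegate to the shared kernel
        }
    }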

2) Execution Modes: Yes, please add two new execution modes in
DMLScript.RUNTIME_PLATFORM and one new execution type in Lop.ExecType. Once
this is done, you can already run end-to-end scripts with '-exec
hybrid_flink'. Of course we can have more detailed discussions about how
and when to select Flink operators during operator selection in
hybrid_flink mode. As a start, I would recommend our default heuristic of
compiling Flink operators whenever the memory estimate of an operation
exceeds the local memory budget of the driver/client process.
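
A simplified sketch of what this could look like (the enums and method below are illustrative stand-ins, not the actual DMLScript.RUNTIME_PLATFORM / Lop.ExecType definitions):

    // Illustrative stand-ins only; the real enums also contain the existing
    // Hadoop/Spark modes and execution types.
    enum RuntimePlatform { SINGLE_NODE, HYBRID_FLINK, FLINK /* , ... */ }
    enum ExecType { CP, FLINK /* , ... */ }

    class OperatorSelection {
        // Default heuristic: compile a distributed Flink operator whenever the
        // operation's memory estimate does not fit into the local memory budget
        // of the driver/client process.
        static ExecType chooseExecType(RuntimePlatform platform,
                                       double memEstimate, double localMemBudget) {
            if (platform == RuntimePlatform.FLINK)
                return ExecType.FLINK;   // forced distributed execution
            if (platform == RuntimePlatform.HYBRID_FLINK && memEstimate > localMemBudget)
                return ExecType.FLINK;   // too large for the driver -> distribute
            return ExecType.CP;          // fits locally -> run in the driver (CP)
        }
    }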

3) Pull Requests: I would recommend multiple stages: (1) initially a
minimal end-to-end integration, (2) multiple packages of "instruction sets"
incl. tests, (3) specific rewrites / optimizer extensions, and later (4)
continuous improvements. For the initial end-to-end integration, I would
focus on two or three simple yet very important instructions (tsmm, mapmm,
mapmmchain), basic converter utils, and a basic end-to-end integration
(execution types, serialization, buffer pool, etc.). Having tsmm and mapmm
(plus optionally mapmmchain) would already allow you to run end-to-end
algorithms like LinregDS, LinregCG, GLM, L2SVM, PageRank, etc. for common
scenarios where only transpose-self matrix multiplications or matrix-vector
multiplications are compiled to distributed operations, while the remaining
operators are executed in the driver/client and vectors are small enough to
be broadcast to mapmm/mapmmchain.
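
To make the broadcast-based mapmm concrete, here is a small illustrative sketch on Flink's DataSet API (again with plain double arrays in place of SystemML's block types): the small vector is attached as a broadcast set, so each row block is multiplied locally and the large matrix is never shuffled.

    import java.util.Arrays;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class MapmmSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Row blocks of a 4 x 2 matrix X and a small vector v.
            DataSet<double[][]> rowBlocks = env.fromElements(
                    new double[][] {{1, 2}, {3, 4}},
                    new double[][] {{5, 6}, {7, 8}});
            DataSet<double[]> vector = env.fromElements(new double[] {1, -1});

            // mapmm: every block is multiplied with the broadcast vector locally,
            // so only the small vector is sent to the tasks, not the matrix.
            DataSet<double[]> result = rowBlocks
                    .map(new RichMapFunction<double[][], double[]>() {
                        private double[] v;

                        public void open(Configuration parameters) {
                            v = getRuntimeContext()
                                .<double[]>getBroadcastVariable("v").get(0);
                        }

                        public double[] map(double[][] block) {
                            double[] out = new double[block.length];
                            for (int i = 0; i < block.length; i++)
                                for (int j = 0; j < v.length; j++)
                                    out[i] += block[i][j] * v[j];
                            return out;
                        }
                    })
                    .withBroadcastSet(vector, "v");

            // Each array holds the rows of X %*% v for one block: [-1, -1] twice.
            for (double[] r : result.collect())
                System.out.println(Arrays.toString(r));
        }
    }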

4) Refactoring runtime package: Again, don't worry about the refactoring
of our runtime packages. The focus there is mainly on our block runtime:
restructuring everything such that this runtime can be easily packaged as
an individual jar and distributed/consumed independently of SystemML.

I'm looking forward to many more discussions on this topic.

Regards
Matthias


