Posted to dev@systemml.apache.org by Janardhan <ja...@gmail.com> on 2020/04/10 02:39:45 UTC

Re: Roadmap Merge and Rename SystemDS

Hi Matthias,

   Would you be so kind as to announce the following:
1. Apache Infra JIRA ticket for the name change
2. New committers (welcome!) and, of course, contributors
3. New release version number (is it SYSTEMDS-0.3.0-SNAPSHOT?)

Thank you,
Janardhan

On Tue, Mar 24, 2020 at 6:28 PM Matthias Boehm <mb...@gmail.com> wrote:

> that's a good point Henry. Yes, with SystemDS 0.1.0, we removed the
> MapReduce compiler and runtime backend, the PyDML parser and language
> support, the Java-UDF framework, and the script-level debugger. We are
> now concentrating on the local, Spark, GPU, and federated backends, and
> added new language bindings including an initial Python binding. However,
> the script-level operation support remains intact and is even largely
> extended by builtins for algorithms, data cleaning, and debugging.
>
> Accordingly, it might be good to deprecate the removed things while
> merging the code in and then make the next Apache SystemDS (pending
> approval) release a major release which allows us to break external APIs.
>
> Regards,
> Matthias
>
> On 3/24/2020 2:07 AM, Henry Saputra wrote:
> > Thanks for starting this discussion, Matthias.
> >
> > Are there any features from SystemML that could be removed or deprecated
> > when SystemDS is merged into the SystemML repository?
> >
> > - Henry
> >
> > On Sat, Mar 21, 2020 at 2:47 PM Matthias Boehm <mb...@gmail.com> wrote:
> >
> >> just FYI, we created a ticket for the suitable name search, and shared
> >> the related results [1]. So from my perspective, it really boils down to
> >> the question of whether we accept the closeness to 'Linux systemd'. Back
> >> in 2018 (when starting SystemDS), I came to the conclusion that it's fine
> >> because of the very different objectives and because SystemDS reflects
> >> both the origin from SystemML and its new focus on data science pipelines.
> >>
> >> [1] https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues
> >>
> >> Regards,
> >> Matthias
> >>
> >> On 3/9/2020 6:37 PM, Matthias Boehm wrote:
> >>> Hi all,
> >>>
> >>> as you're probably aware, development activities of Apache SystemML
> >>> significantly slowed down and were virtually non-existent in the last
> >>> year for various reasons. Part of that was that my team and I [1]
> >>> decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a
> >>> new vision and roadmap for the future.
> >>>
> >>> During PMC discussions regarding the retirement of SystemML, we came to
> >>> the conclusion that the best path forward -- for the entire community
> >>> -- would be to merge SystemDS back into Apache SystemML, rename it to
> >>> SystemDS, and continue jointly. Before doing so, I want to share the
> >>> plan with the entire community.
> >>>
> >>> SystemDS aims at providing better systems support for the end-to-end
> >>> data science lifecycle, with a special focus on ML pipelines from data
> >>> integration, cleaning, and preparation, through efficient ML model
> >>> training, to model debugging and serving. A key observation is that
> >>> state-of-the-art data integration and cleaning primitives are themselves
> >>> based on machine learning. Our main objectives are to support effective
> >>> and efficient data preparation, ML training, and debugging at scale,
> >>> something that cannot be composed from existing libraries. The game plan
> >>> includes three major parts:
> >>>
> >>> 1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of
> >>> abstractions for the different lifecycle tasks as well as users with
> >>> different expertise (ML researchers, data scientists, domain experts),
> >>> based on our DSL for ML training and scoring. Exploratory data science
> >>> interleaves data preparation, ML training, scoring, and debugging in an
> >>> iterative process; and once these tasks are expressed in dense or sparse
> >>> linear algebra, we expect very good performance.
> >>>
> >>> 2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide
> >>> variety of algorithm classes, we will continue to provide different
> >>> parallelization strategies, enriched by a new backend for federated ML
> >>> and privacy enhancing technologies. Since the hierarchy of language
> >>> abstractions inevitably leads to redundancy, we further aim to improve
> >>> the automatic optimization capabilities of the compiler and underlying
> >>> runtime.
> >>>
> >>> 3) Data Model - Heterogeneous Tensors: Supporting data integration and
> >>> cleaning primitives in linear algebra programs requires a more generic
> >>> data model for handling heterogeneous and structured data. In contrast
> >>> to existing ML systems, our central data model is heterogeneous
> >>> tensors. Thus, we generalize SystemML's FP64 matrices to
> >>> multi-dimensional arrays where one dimension may have a schema including
> >>> JSON strings to represent nested data.
> >>>
> >>> Admin: We intend to create the SystemDS 0.2 release in March. Afterwards
> >>> we would then rebase all our commits (369) back onto the SystemML
> >>> codeline. Subsequently, we will rename Apache SystemML to Apache
> >>> SystemDS and continue our development under the Apache umbrella. I just
> >>> went through the Apache name search guidelines and we'll perform a
> >>> 'suitable name search' accordingly and then transfer SystemDS. The
> >>> existing PMC and committer status of course stays intact unless people
> >>> want to leave. Shortly after the merge, I will nominate the four most
> >>> active contributors of the last year to become committers. Regarding
> >>> releases (and JIRA numbers), it's up for discussion, but both continuing
> >>> with SystemML versions (i.e., 1.3) and SystemDS versions (0.3) seem fine
> >>> to me.
> >>>
> >>> Roadmap: At the technical level, SystemDS will continue to support all
> >>> operations and algorithms SystemML provided but significantly extend the
> >>> scope and functionality via the mentioned hierarchy of language
> >>> abstractions (in the form of builtin functions). However, during the fork
> >>> we already removed old baggage like the MR backend, the script-level
> >>> debugger, the PyDML frontend, and several other things [4]. Major new
> >>> internals are native support for lineage tracing and reuse, the data
> >>> model of heterogeneous tensors, and a new federated backend.
> >>>
> >>> [1] https://damslab.github.io/
> >>> [2] https://github.com/tugraz-isds/systemds
> >>> [3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
> >>> [4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0
> >>>
> >>> Regards,
> >>> Matthias
> >>
> >
>

Re: Roadmap Merge and Rename SystemDS

Posted by Janardhan <ja...@gmail.com>.
Thank you.

> ad 3) It's still not decided yet, if we go directly to 2.0 in order to
> avoid tag conflicts or 0.3. Feel free to express your opinion here too.
2.0 can be an option: since SystemDS is taking over SystemML, it should
not rewrite history.

A few concerns:
1. [COMMUNITY] As you know, open discussions are critical for the Apache
Way. But the way I, as a previous SystemML committer, learn about the
project is through the git history and by talking directly to the authors
of PRs. Some discussions seem to have happened offline.

Do all contributors know about the mailing list?
TL;DR: An effective communication process is not yet in shape.

2. [TECHNICAL] The mllearn framework does not seem to be available, but
there seems to be preliminary work on ONNX import.
Also, the model serving framework does not seem to be available. Are there
any alternatives proposed?
Link to GitHub PR: https://github.com/apache/systemml/pull/860

3. [DOCUMENTATION] The design decisions, such as the data types available
for tensors and matrices, are not documented anywhere as far as I could
see (forgive me if I missed some folder).

Thanks a lot,
Janardhan


On Fri, Apr 10, 2020 at 5:48 PM Matthias Boehm <mb...@gmail.com> wrote:

> yes, all that will be covered, but there are official processes to follow:
>
> ad 1) yesterday, the PODLINGNAMESEARCH ticket for the suitable name search
> was approved with additional comments to the PMC.
>
> https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues
>
> ad 2) we follow the official new committer process, and there are still
> additional steps to do
> http://community.apache.org/newcommitter.html
>
> ad 3) It's still not decided yet, if we go directly to 2.0 in order to
> avoid tag conflicts or 0.3. Feel free to express your opinion here too.
>
> Regards,
> Matthias

Re: Roadmap Merge and Rename SystemDS

Posted by Matthias Boehm <mb...@gmail.com>.
yes, all that will be covered, but there are official processes to follow:

ad 1) yesterday, the PODLINGNAMESEARCH ticket for the suitable name search
was approved with additional comments to the PMC.
https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues

ad 2) we follow the official new committer process, and there are still 
additional steps to do
http://community.apache.org/newcommitter.html

ad 3) It's still not decided yet, if we go directly to 2.0 in order to 
avoid tag conflicts or 0.3. Feel free to express your opinion here too.

Regards,
Matthias

On 4/10/2020 4:39 AM, Janardhan wrote:
> Hi Matthias,
> 
>     Would you be so kind as to announce the following:
> 1.  Apache Infra jira ticket for name change
> 2. new committers (welcome!) and of course contributors.
> 3. New release version number (is it SYSTEMDS-0.3.0-SNAPSHOT)
> 
> Thank you,
> Janardhan
> 
> On Tue, Mar 24, 2020 at 6:28 PM Matthias Boehm <mb...@gmail.com> wrote:
> 
>> that's a good point Henry. Yes, with SystemDS 0.1.0, we removed the
>> MapReduce compiler and runtime backend, the pydml parser and language
>> support, the Java-UDF framework, and the script-level debugger. We are
>> concentrating on local, spark, GPU, and federated backends now, added
>> new language bindings including an initial Python binding. However, the
>> script-level operation support remains intact and is even largely
>> extended by builtins for algorithms, data cleaning, and debugging.
>>
>> Accordingly, it might be good to deprecate the removed things while
>> merging the code in and then make the next Apache SystemDS (pending
>> approval) release a major release which allows us to break external APIs.
>>
>> Regards,
>> Matthias
>>
>> On 3/24/2020 2:07 AM, Henry Saputra wrote:
>>> Thanks for starting this discussions, Matthias.
>>>
>>> Are there any features from SystemML that could be be removed or
>> deprecated
>>> when SystemDS being merged to SystemML repository?
>>>
>>> - Henry
>>>
>>> On Sat, Mar 21, 2020 at 2:47 PM Matthias Boehm <mb...@gmail.com>
>> wrote:
>>>
>>>> just FYI, we created a ticket for the suitable name search, and shared
>>>> the related results [1]. So from my perspective, it really boils down to
>>>> the question if we accept the closeness to 'Linux systemd'. Back in 2018
>>>> (when starting SystemDS), I came to the conclusion that it's fine
>>>> because of the very different objectives and because SystemDS reflects
>>>> both the origin from SystemML and its new focus on data science
>> pipelines.
>>>>
>>>> [1]
>>>>
>>>>
>> https://issues.apache.org/jira/projects/PODLINGNAMESEARCH/issues/PODLINGNAMESEARCH-179?filter=allissues
>>>>
>>>> Regards,
>>>> Matthias
>>>>
>>>> On 3/9/2020 6:37 PM, Matthias Boehm wrote:
>>>>> Hi all,
>>>>>
>>>>> as you're probably aware, development activities of Apache SystemML
>>>>> significantly slowed down and were virtually non-existing in the last
>>>>> year for various reasons. Part of that was that my team and I [1]
>>>>> decided to start SystemDS [2,3] as a fork of SystemML in 09/2018 with a
>>>>> new vision and roadmap for the future.
>>>>>
>>>>> During PMC discussions regarding the retirement of SystemML, we came to
>>>>> the conclusions that the best path forward -- for the entire community
>>>>> -- would be to merge SystemDS back into Apache SystemML, rename it to
>>>>> SystemDS, and continue jointly. Before doing so, I want to share the
>>>>> plan with the entire community.
>>>>>
>>>>> SystemDS aims at providing better systems support for the end-to-end
>>>>> data science lifecycle, with a special focus on ML pipelines from data
>>>>> integration, cleaning, and preparation, over efficient ML model
>>>>> training, to model debugging and serving. A key observation is that
>>>>> state-of-the-art data integration and cleaning primitives are
>> themselves
>>>>> based on machine learning. Our main objectives are to support effective
>>>>> and efficient data preparation, ML training and debugging at scale,
>>>>> something that cannot be composed from existing libraries. The game
>> plan
>>>>> includes three major parts:
>>>>>
>>>>> 1) DSL-based, High-level Abstractions: We aim to provide a hierarchy of
>>>>> abstractions for the different lifecycle tasks as well as users with
>>>>> different expertise (ML researchers, data scientists, domain experts),
>>>>> based on our DSL for ML training and scoring. Exploratory data science
>>>>> interleaves data preparation, ML training, scoring, and debugging in an
>>>>> iterative process; and once these tasks are expressed in dense or
>> sparse
>>>>> linear algebra, we expect very good performance.
>>>>>
>>>>> 2) Hybrid Runtime Plans and Optimizing Compiler: To support the wide
>>>>> variety of algorithm classes, we will continue to provide different
>>>>> parallelization strategies, enriched by a new backend for federated ML
>>>>> and privacy enhancing technologies. Since the hierarchy of language
>>>>> abstractions inevitably leads to redundancy, we further aim to improve
>>>>> the automatic optimization capabilities of the compiler and underlying
>>>>> runtime.
>>>>>
>>>>> 3) Data Model - Heterogeneous Tensors: To support data integration and
>>>>> cleaning primitives in linear algebra programs requires a more generic
>>>>> data model for handling heterogeneous and structured data. In contrast
>>>>> to existing ML systems, our central data model are heterogeneous
>>>>> tensors. Thus, we generalize SystemML's FP64 matrices to
>>>>> multi-dimensional arrays where one dimension may have a schema
>> including
>>>>> JSON strings to represent nested data.
>>>>>
>>>>> Admin: We intend to create the SystemDS 0.2 release in March.
>> Afterwards
>>>>> we would then rebase all our commits (369) back onto the SystemML
>>>>> codeline. Subsequently, we will rename Apache SystemML to Apache
>>>>> SystemDS and continue our development under Apache umbrella. I just
>> went
>>>>> through the Apache name search guidelines and we'll perform a 'suitable
>>>>> name search' accordingly and then transfer SystemDS. The existing PMC
>>>>> and committer status stays of course intact unless people want to
>> leave.
>>>>> Shortly after the merge, I will nominate the four most active
>>>>> contributors of the last year to become committers. Regarding releases
>>>>> (and JIRA numbers), it's up for discussion but both, continuing with
>>>>> SystemML versions (i.e., 1.3) or SystemDS versions (0.3) seem fine to
>> me.
>>>>>
>>>>> Roadmap: At technical level, SystemDS will continue to support all
>>>>> operations and algorithms SystemML provided but significantly extent
>> the
>>>>> scope and functionality via the mentioned hierarchy of language
>>>>> abstractions (in form of builtin functions). However, during the fork
>> we
>>>>> already removed old baggage like the MR backend, the scrip-level
>>>>> debugger, the PyDML frontend and several other things [4]. Major new
>>>>> internals are native support for lineage tracing and reuse, the data
>>>>> model of heterogeneous tensors, and a new federated backend.
>>>>>
>>>>> [1] https://damslab.github.io/
>>>>> [2] https://github.com/tugraz-isds/systemds
>>>>> [3] http://cidrdb.org/cidr2020/papers/p22-boehm-cidr20.pdf
>>>>> [4] https://github.com/tugraz-isds/systemds/releases/tag/v0.1.0
>>>>>
>>>>> Regards,
>>>>> Matthias
>>>>
>>>
>>
>