Posted to dev@systemds.apache.org by Badrul Chowdhury <ba...@gmail.com> on 2022/11/24 03:27:18 UTC

Re: Plan for Builtin Functions Parity with Numpy, etc

Hi All,

Following up on this thread. I have created a PR with the basic template
for the comparison here: https://github.com/apache/systemds/pull/1735

Please feel free to comment on the outline for the survey or suggest ideas.
I can start filling in the details of the actual comparison once we agree
on the template for comparison.


Thanks,
Badrul

On Fri, 5 Aug 2022 at 23:55, Badrul Chowdhury <ba...@gmail.com>
wrote:

> Thanks for sharing your thoughts Matthias! Ack, I will create the PR in
> the main repo.
>
> Thanks,
> Badrul
>
> On Fri, 5 Aug 2022 at 11:42, Matthias Boehm <mb...@gmail.com> wrote:
>
>> Thanks for driving this discussion, Badrul. In general, I think it's a
>> great idea to do an assessment of coverage as a basis for discussions
>> regarding further development, API consistency, and improved
>> documentation. At the algorithm level, you will encounter subtle
>> differences due to different algorithmic choices, implementation
>> details, and related parameters. Here we should not make ourselves
>> dependent on existing libraries but make case-by-case decisions,
>> balancing various constraints with the benefits of API similarity.
>>
>> By default, we stick to the names of R builtin functions from selected
>> packages (e.g., Matrix, stats, algorithms) and R indexing semantics
>> (e.g., copy-on-write, 1-based indexing), but we should look more
>> broadly (numpy, pandas) for missing functionality at the DSL and API
>> levels. The overall vision of SystemDS is to build up a hierarchy of
>> builtin functions for the entire data science lifecycle (data
>> preparation, cleaning, training, scoring, debugging) while still being
>> able to compile hybrid runtime plans for local CPU/GPU, distributed,
>> and federated backends.
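>>
>> To make the indexing and copy semantics concrete, here is a minimal
>> sketch (numpy shown directly; the DML equivalent is given as comments
>> and assumes the R-like semantics described above):
>>
>>   import numpy as np
>>
>>   # numpy: 0-based indexing, half-open slices, slices are views
>>   X = np.arange(1, 10).reshape(3, 3)
>>   first_row = X[0, :]      # first row has index 0
>>   block = X[0:2, 0:2]      # end index exclusive
>>   first_row[0] = 99        # mutates X too (view, no copy-on-write)
>>
>>   # SystemDS DML (1-based, inclusive ranges, copy-on-write):
>>   #   X = matrix(seq(1, 9), rows=3, cols=3)
>>   #   firstRow = X[1, ]    # first row has index 1
>>   #   block = X[1:2, 1:2]  # end index inclusive, as in R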
>>
>> Let's do this assessment in the main github repo (e.g., as markdown
>> files in docs) before we put anything on the main website as we need to
>> distinguish the assessment from actual documentation. Thanks.
>>
>> Regards,
>> Matthias
>>
>> On 8/5/2022 8:23 PM, Badrul Chowdhury wrote:
>> > Thank you both for your thoughtful comments. Agreed: we should not force
>> > parity; rather, we should make sure that SystemDS built-in functions
>> > "cover" important use cases. I will start with an audit of SystemDS's
>> > existing capabilities and create a PR on systemds-website
>> > <https://github.com/apache/systemds-website> with my findings. This
>> > would also be a good way to identify gaps in the documentation for
>> > existing builtins so we can update it.
>> >
>> > Thanks,
>> > Badrul
>> >
>> > On Tue, 2 Aug 2022 at 06:12, arnab phani <ph...@gmail.com> wrote:
>> >
>> >> In my understanding, parity matters if 1) frameworks share a similar
>> >> user base and use cases (sklearn, pandas, etc.) or 2) one framework
>> >> shares APIs with another (dask, modin, pandas). Otherwise, forcing
>> >> parity can be counterproductive. During our work on feature
>> >> transformations, we have seen major differences in supported feature
>> >> transformations, user APIs, and configurations among ML systems.
>> >> For instance, TensorFlow tunes its APIs based on the expected use
>> >> cases (neural networks) and data characteristics (text, image),
>> >> while sklearn aims for traditional ML jobs. Moreover, some API
>> >> changes are required to be able to use certain underlying
>> >> optimizations.
>> >> Having said that, it is definitely important to support popular
>> >> builtins; however, I don't think it is necessary to use the same
>> >> names, APIs, and flags. I like the idea of writing our documentation
>> >> in a way that helps new users draw similarities with popular
>> >> libraries. A capability matrix mapping builtins from other systems
>> >> to ours would be helpful.
>> >>
>> >> Regards,
>> >> Arnab..
>> >>
>> >> On Tue, Aug 2, 2022 at 6:16 AM Janardhan <ja...@apache.org> wrote:
>> >>
>> >>> Hi Badrul,
>> >>>
>> >>> Adding to this discussion,
>> >>> I think we can start with what we already have implemented. We do
>> >>> not need to implement every last function; we can choose a use-case
>> >>> based approach for best results. I would start with the present
>> >>> status of the builtins - they are enough for a lot of use cases! -
>> >>> and then implement more one by one based on priority. Most of our
>> >>> builtin functions other than ML (including the NN library) are
>> >>> inspired by the R language.
>> >>>
>> >>> During implementation/testing, we might also need to modify our
>> >>> system internals, or find optimization opportunities in them.
>> >>>
>> >>> One possible approach (see the sketch below):
>> >>> 1. Take an algorithm/product that is already implemented in another
>> >>> system/library.
>> >>> 2. Find places where SystemDS can perform better - the low-hanging
>> >>> fruit: can we use one of our Python builtins, or a combination of
>> >>> them, to achieve similar or better results, and can we improve it
>> >>> further?
>> >>> 3. This identifies a candidate for a new builtin.
>> >>> 4. Repeat the cycle.
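>> >>>
>> >>> As a sketch of step 2, a comparison harness through the Python API
>> >>> could look like the following (the systemds module paths and
>> >>> signatures here are my assumptions and should be verified against
>> >>> the current Python API docs):
>> >>>
>> >>>   import time
>> >>>   import numpy as np
>> >>>   from sklearn.linear_model import LinearRegression
>> >>>   # assumed SystemDS Python API entry points:
>> >>>   from systemds.context import SystemDSContext
>> >>>   from systemds.operator.algorithm import lm
>> >>>
>> >>>   X = np.random.rand(10_000, 100)
>> >>>   y = np.random.rand(10_000, 1)
>> >>>
>> >>>   # baseline: sklearn linear regression
>> >>>   t0 = time.time()
>> >>>   LinearRegression().fit(X, y)
>> >>>   print(f"sklearn lm:  {time.time() - t0:.3f}s")
>> >>>
>> >>>   # candidate: the SystemDS lm builtin via the Python API
>> >>>   with SystemDSContext() as sds:
>> >>>       t0 = time.time()
>> >>>       w = lm(sds.from_numpy(X), sds.from_numpy(y)).compute()
>> >>>       print(f"systemds lm: {time.time() - t0:.3f}s")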
>> >>>
>> >>>
>> >>> Best regards,
>> >>> Janardhan
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Aug 2, 2022 at 2:09 AM Badrul Chowdhury
>> >>> <ba...@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I wanted to start a discussion on building parity of built-in
>> >>>> functions with popular OSS libraries. I am thinking of attaining
>> >>>> parity as a 3-step process:
>> >>>>
>> >>>> *Step 1*
>> >>>> As far as I can tell from the existing built-in functions,
>> >>>> SystemDS aims to offer users a hybrid set of APIs for scientific
>> >>>> computing and ML (data engineering included). Therefore, the most
>> >>>> obvious OSS libraries for comparison would be numpy, sklearn
>> >>>> (scipy), and pandas. Apache DataSketches would be another relevant
>> >>>> system for specialized use cases (sketches).
>> >>>>
>> >>>> *Step 2*
>> >>>> Once we have established a set of libraries, I would propose that
>> >>>> we create a capability matrix with sections for each library, like
>> >>>> so:
>> >>>>
>> >>>> Section 1: numpy
>> >>>>   f_1
>> >>>>   f_2
>> >>>>   [..]
>> >>>>   f_n
>> >>>>
>> >>>> Section 2: sklearn
>> >>>>   [..]
>> >>>>
>> >>>> The columns could be a checklist like this:
>> >>>> f_i -> (DML, Python, CP, SP, RowCol, Row, Col, Federated,
>> >>>> documentationPublished)
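>> >>>>
>> >>>> For illustration only (a placeholder sketch, not an actual
>> >>>> assessment; every entry would be filled in during the audit), one
>> >>>> row of such a matrix could look like:
>> >>>>
>> >>>>   Builtin   | DML | Python | CP | SP | RowCol | Row | Col | Fed | Docs
>> >>>>   ----------+-----+--------+----+----+--------+-----+-----+-----+-----
>> >>>>   numpy.sum |  ?  |   ?    |  ? |  ? |   ?    |  ?  |  ?  |  ?  |  ?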
>> >>>>
>> >>>> *Step 3*
>> >>>> Create JIRA tasks, assign them, and start coding.
>> >>>>
>> >>>>
>> >>>> Thoughts?
>> >>>>
>> >>>>
>> >>>> Thanks,
>> >>>> Badrul
>> >>>
>> >>
>> >
>> >
>>
>
>
> --
>
> Cheers,
> Badrul
>


-- 

Cheers,
Badrul