Posted to dev@apex.apache.org by Amol Kekre <am...@datatorrent.com> on 2016/11/28 23:26:57 UTC

Re: Megh operator library

I am not sure where we are on this. As the Apex community we need to take a
hard look before we reject code that passes license and normal pull request
requirements. A lot of Megh code is in production, and is stable. Is there
a reason why we cannot accommodate Megh code in a directory that clarifies
its origins? Assuming there are duplicates, the code can still be taken in a
directory that marks it as such.

Without this, a lot of custom customer code that they want to contribute to
Malhar will be stuck in "replace with ...". That will not happen, as folks
do not change production code once it works and has stabilized. If the word
"contrib" is an issue, I suggest we get a new name. HDHT etc. are in
production and it makes sense to let them reside in a directory (to be
named) in Malhar.

Thks
Amol


On Mon, Sep 26, 2016 at 10:22 PM, Pramod Immaneni <pr...@datatorrent.com>
wrote:

> Added a section for flume based on the feedback.
>
> Thanks
>
> On Mon, Sep 26, 2016 at 8:51 AM, Pramod Immaneni <pr...@datatorrent.com>
> wrote:
>
> > Hi Thomas,
> >
> > My responses are inline
> >
> > On Sun, Sep 25, 2016 at 11:39 AM, Thomas Weise <th...@gmail.com>
> > wrote:
> >
> >> Thanks for putting it together. It looks like there are really only 2
> >> operators?
> >>
> >
> > There were others, but it looked like there were already good
> > implementations or alternatives for them in Malhar. For example,
> > enrichment and deduper have implementations already; for the laggards
> > operator, it looked like the concept is already covered in the new
> > windowing work.
> >
> >
> >>
> >> +1 for the Flume connector. It would be good to also look at what has
> >> changed in Flume since it was written. It needs its own Maven module,
> >> and documentation is also needed.
> >>
> >
> > Yes in the table in the document I have it going to its own module and
> > path. Will make a note in the document about checking against newer flume
> > versions and documentation.
> >
> >
> >> I don't agree with the proposed "as-is" move for the dimension compute
> >> operator into contrib. It does not belong there. Contrib is for new,
> >> incomplete work ("immature" and under the radar WRT CI etc.), with
> >> particular focus to provide an easier entry path for new contributors.
> >>
> >> I would like to see the following changes to dimension computation:
> >> * Replace HDHT with managed state (or spillable DS)
> >> * Move to org.apache.apex.malhar.lib.*
> >> * Documentation (your draft is a good start towards that), it also needs
> >> to
> >> cover query support.
> >>
> >> I think it is a very valuable operator that should be a first class
> >> citizen
> >> and the folks familiar with the operator and state management should
> take
> >> up the work to port it. Tim indicated he may be able to take it up.
> >>
> >> In the meantime, the operator can remain in the Megh repository under
> >> existing name and consumed from there.
> >>
> >
> > I thought it could eventually have its own module under Malhar but
> > suggested contrib as an intermediate location till any porting is
> > completed. I agree on the documentation; I just wrote up something quick
> > to highlight the operator, and Tim has more detailed docs for it, I think.
> > Since the operator(s) are readily usable in production applications and
> > implement quite a bit of valuable functionality, I am of the opinion that
> > we do the minimum now to make it available and, in parallel, start the
> > work on porting some of the internal subsystems to newer components.
> >
> > Thanks
> >
> >
> >>
> >> Thomas
> >>
> >> On Sat, Sep 24, 2016 at 12:29 PM, Pramod Immaneni <
> pramod@datatorrent.com
> >> >
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > Here is the initial proposal. Please go through it and you can comment
> >> > right on the document. Regarding the discussions around Dimensional
> >> > operators, there is a specific section for it and future plans. After
> >> the
> >> > comments are addressed, I can start with one of the components such as
> >> > flume and document the steps involved. Then others can take up the
> other
> >> > components and use the steps in a similar fashion.
> >> >
> >> > https://docs.google.com/document/d/1BzWAwJDEUs0G42DWTuGYvM5sm0Uu5
> >> > nTP7cUQOAlVs0g
> >> >
> >> > Thanks
> >> >
> >> > On Sat, Sep 10, 2016 at 10:29 AM, Amol Kekre <am...@datatorrent.com>
> >> wrote:
> >> >
> >> > > Thomas,
> >> > > IMHO we should also look at the cost to users of keeping code in a
> >> > > GitHub repository (even if under the ASF 2.0 license) outside Malhar.
> >> > > There is value in deprecating code in Megh and moving it to Malhar.
> >> > > Volunteers in this effort could decide how much overlap means "mark
> >> > > as overlapping". My suggestion is to absorb overlapping operators
> >> > > into a directory in Malhar that marks them as such. A lot of these
> >> > > operators are being used in production and it makes sense to absorb
> >> > > them into the Apache GitHub.
> >> > >
> >> > > Thks
> >> > > Amol
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Sat, Sep 10, 2016 at 7:20 AM, Pramod Immaneni <
> >> pramod@datatorrent.com
> >> > >
> >> > > wrote:
> >> > >
> >> > > > It would be great to have Tim's help with dimension computation,
> >> > > > but I think we can still debate whether the HDHT dependency needs
> >> > > > to be removed before contribution or whether it can be done as a
> >> > > > two-step process, since we also have a place to put experimental
> >> > > > code, contrib, and HDHT could go in there till we can
> >> > > > determine/port it to use managed state.
> >> > > >
> >> > > > My thought on this is that if it is going to be a significant
> >> porting
> >> > > > effort then we do it as a two step process.
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > > > On Sep 9, 2016, at 11:52 PM, Thomas Weise <
> thomas@datatorrent.com
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > Tim,
> >> > > > >
> >> > > > > The functionality of the dimension compute operator should be
> >> > available
> >> > > > in
> >> > > > > Malhar. My concern is moving things without regard to code
> >> > duplication
> >> > > > and
> >> > > > > long term maintenance cost. There are several pieces to the
> >> dimension
> >> > > > > compute operator that in fact are (or should be) reusable
> >> components
> >> > by
> >> > > > > themselves. Live querying (queryable state) with schemas is one
> >> such
> >> > > > > example. It's a major feature and not limited to the dimension
> >> > compute
> >> > > > > operator. It should ideally work with the new windowing support
> as
> >> > > well.
> >> > > > > But the main area that needs work is the state store - the
> >> dependency
> >> > > on
> >> > > > > HDHT needs to be removed and replaced with managed state. Also
> I'm
> >> > > > curious
> >> > > > > why the window operator should not scale for large time buckets?
> >> Are
> >> > > you
> >> > > > > referring to the current intermediate implementation or the work
> >> in
> >> > > > > progress that will use incremental state saving? If so, please
> >> bring
> >> > it
> >> > > > up
> >> > > > > on APEXMALHAR-2130 as it is pretty important.
> >> > > > >
> >> > > > > Since you have written almost all of the dimension compute code,
> >> > could
> >> > > > you
> >> > > > > help with the changes needed to bring it over? It would also be
> >> good
> >> > to
> >> > > > see
> >> > > > > the user documentation in Malhar.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Thomas
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Sep 9, 2016 at 10:52 PM, Timothy Farkas <
> >> > > > timothyfarkas@apache.org>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> Hi Thomas,
> >> > > > >>
> >> > > > >> With respect to the dimension operator, I would like to learn
> >> more
> >> > > about
> >> > > > >> the underlying framework you mentioned and the code
> duplication.
> >> If
> >> > > you
> >> > > > are
> >> > > > >> talking about the Window operator framework, that framework is
> >> not
> >> > > > suitable
> >> > > > >> for the dimension computation use case because it doesn't scale
> >> for
> >> > > > large
> >> > > > >> timebuckets. Furthermore that framework has no support for
> >> Querying.
> >> > > The
> >> > > > >> dimension operators support live queries of the aggregated
> data.
> >> > > > Querying
> >> > > > >> of live data streams is a popular feature in other open source
> >> > > > platforms,
> >> > > > >> and I believe it is a worthwhile addition to Malhar.
> >> > > > >>
> >> > > > >> Given the fact that the dimension framework has been used in
> many
> >> > POCs
> >> > > > and
> >> > > > >> is even running in production and has novel features like live
> >> > > > querying, it
> >> > > > >> more than meets the bar for a malhar contribution. If a
> concrete
> >> > > > argument
> >> > > > >> cannot be provided to prevent this work from going into Malhar,
> >> then
> >> > > > these
> >> > > > >> efforts should not be blocked.
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >> Tim
> >> > > > >>
> >> > > > >>> On 2016-09-09 17:18 (-0700), Thomas Weise <
> >> thomas@datatorrent.com>
> >> > > > wrote:
> >> > > > >>> I see no reason to move the dimension operator along with
> >> > everything
> >> > > it
> >> > > > >>> duplicates to Malhar. It's available to use for everyone as it
> >> is
> >> > and
> >> > > > >> there
> >> > > > >>> should be an initiative to make it conform to the underlying
> >> > > framework
> >> > > > to
> >> > > > >>> be part of Malhar.
> >> > > > >>>
> >> > > > >>> Also there is already an enrichment operator, there is even
> >> > > > documentation
> >> > > > >>> for it.
> >> > > > >>>
> >> > > > >>> Hence, this needs to be analyzed properly.
> >> > > > >>>
> >> > > > >>> Thomas
> >> > > > >>>
> >> > > > >>> On Fri, Sep 9, 2016 at 5:10 PM, Pramod Immaneni <
> >> > > > pramod@datatorrent.com>
> >> > > > >>> wrote:
> >> > > > >>>
> >> > > > >>>> Yes, I do plan to come up with a proposal with a list. The
> ones
> >> > that
> >> > > > >> come
> >> > > > >>>> to mind are flume, enrichment, various dimensional operators
> >> and
> >> > any
> >> > > > >> custom
> >> > > > >>>> partitioners. The dimensional operators are in a mature state
> >> and
> >> > > > >> usable
> >> > > > >>>> today, in future they could also be ported onto the new
> >> windowing
> >> > > and
> >> > > > >>>> managed state operator framework.
> >> > > > >>>>
> >> > > > >>>> Thanks
> >> > > > >>>>
> >> > > > >>>> On Fri, Sep 9, 2016 at 4:29 PM, Thomas Weise <
> >> > > thomas@datatorrent.com>
> >> > > > >>>> wrote:
> >> > > > >>>>
> >> > > > >>>>> A cursory look suggests there is a lot of overlap. I'm looking
> >> > > > >>>>> forward to seeing a proposal that reflects a vision of how to
> >> > > > >>>>> evolve Malhar rather than just moving around code.
> >> > > > >>>>>
> >> > > > >>>>> Thomas
> >> > > > >>>>>
> >> > > > >>>>>
> >> > > > >>>>> On Thu, Sep 8, 2016 at 2:40 PM, Pramod Immaneni <
> >> > > > >> pramod@datatorrent.com>
> >> > > > >>>>> wrote:
> >> > > > >>>>>
> >> > > > >>>>>> Hi,
> >> > > > >>>>>>
> >> > > > >>>>>> DataTorrent, the initial contributor to Apex and the
> company
> >> I
> >> > > work
> >> > > > >>>> for,
> >> > > > >>>>>> has opened up a library of operators called Megh recently
> to
> >> the
> >> > > > >> public
> >> > > > >>>>> and
> >> > > > >>>>>> has made the repository available under the Apache License.
> >> The
> >> > > > >> link to
> >> > > > >>>>> the
> >> > > > >>>>>> repository is below. These operators, for the most part,
> >> contain
> >> > > > >>>>>> functionality that is complementary to what Malhar library
> >> > > > >> provides and
> >> > > > >>>>>> were developed to solve business use cases that arose over
> >> time.
> >> > > > >> Also,
> >> > > > >>>>> some
> >> > > > >>>>>> operators in Malhar were inspired from early
> implementations
> >> in
> >> > > the
> >> > > > >>>> Megh
> >> > > > >>>>>> library and were built upon knowledge gained in doing the
> >> > original
> >> > > > >>>>>> implementations.
> >> > > > >>>>>>
> >> > > > >>>>>> Our goal is to not have Megh as a separate library but
> rather
> >> > > bring
> >> > > > >>>> these
> >> > > > >>>>>> operators into Malhar in a fashion that it is consistent
> with
> >> > the
> >> > > > >>>> Malhar
> >> > > > >>>>>> project and repository. In the upcoming days, in a gradual
> >> > > > >> fashion, we
> >> > > > >>>>> will
> >> > > > >>>>>> have more details on the individual operators that we would
> >> like
> >> > > to
> >> > > > >>>>>> contribute. Also, if you are interested in helping with
> this
> >> > > effort
> >> > > > >>>>> please
> >> > > > >>>>>> raise your hand.
> >> > > > >>>>>>
> >> > > > >>>>>> https://github.com/DataTorrent/Megh/
> >> > > > >>>>>>
> >> > > > >>>>>> Thanks
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>