Posted to dev@asterixdb.apache.org by Torsten Bergh Moss <to...@ig.ntnu.no> on 2019/11/17 14:43:39 UTC

UDF Lifecycle

Dear developers,


I am trying to build a machine learning-based UDF for classification. This involves loading in a model that has been trained offline, which in practice basically is deserialization of a big object. This process of deserialization takes a significant amount of time, but it only "needs" to happen once, and after that the model can do the classification rather rapidly.


Therefore, in order to avoid having to load the model every time the UDF is called, I am wondering where in the UDF lifecycle I can do the loading in order to achieve a "load model once, classify infinitely"-scenario, and how to implement it. I am assuming it should be done somewhere inside the factory-function-relationship, but I am not sure where/how and can't seem to find a lot of documentation on it.


All help is appreciated, thanks!


Best wishes,

Torsten

Re: UDF Lifecycle

Posted by Torsten Bergh Moss <to...@ig.ntnu.no>.

Thanks,

Using the initialize-method solved the problem perfectly.

It does add a delay of about 70 seconds before the classification process starts, but in the context of a very large dataset that's not going to be significant, and even less so in a feed context where the feed might collect tweets for days.

Also, to specifically analyse the classification speed I can just time it inside the UDF and write it to the nc-log.
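
A minimal sketch of that kind of timing, so that the one-off model load is kept out of the measurement. Everything here is an assumption made for illustration: the EvalTimer helper is hypothetical, and java.util.logging is only a stand-in, since which logger actually ends up in the nc-log depends on the deployment.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper: accumulates the time spent in classification calls only.
class EvalTimer {
    private static final Logger LOG = Logger.getLogger(EvalTimer.class.getName());

    private final long reportEvery;
    private long totalNanos = 0;
    private long calls = 0;

    EvalTimer(long reportEvery) {
        if (reportEvery <= 0) {
            throw new IllegalArgumentException("reportEvery must be positive");
        }
        this.reportEvery = reportEvery;
    }

    // Runs one classification and adds its duration to the running total.
    <T> T time(Supplier<T> classification) {
        long start = System.nanoTime();
        try {
            return classification.get();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
            if (calls % reportEvery == 0) {
                // Log a summary every reportEvery calls rather than on each call.
                LOG.info(String.format("%d calls, mean %.3f ms per call", calls, meanMillis()));
            }
        }
    }

    long calls() {
        return calls;
    }

    double meanMillis() {
        return calls == 0 ? 0.0 : totalNanos / 1e6 / calls;
    }
}

class EvalTimerDemo {
    public static void main(String[] args) {
        EvalTimer timer = new EvalTimer(1000);
        for (int i = 0; i < 5000; i++) {
            final int n = i;
            // Stand-in for the real classification call.
            timer.time(() -> Integer.toString(n).length());
        }
        System.out.println("mean ms per call: " + timer.meanMillis());
    }
}
```

Inside the UDF, the call to the model in the evaluate method would be wrapped in timer.time(...), with the timer created once alongside the model.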

Best wishes,
Torsten

________________________________________
From: Ian Maxon <im...@uci.edu>
Sent: Monday, November 18, 2019 8:27 PM
To: dev@asterixdb.apache.org
Subject: Re: UDF Lifecycle

Yeah, I think a global init is probably not appropriate for this
case, though: we don't want UDFs taking memory outside of jobs (I
think?). initialize() in the current API sounds appropriate for the
problem described, if I understand it right. Gift's Stanford
NLP demo has a similar problem that's solved that way. I agree,
however, that it would be useful in other cases.

Do please let us know how it performs (or if it is troublesome) that
way, Torsten.

On Sun, Nov 17, 2019 at 11:38 PM Mike Carey <dt...@gmail.com> wrote:
>
> @Ian: This is a very interesting use/test case for the work you are
> doing at UCI on the new more dynamic deployment model, and on how the
> underlying UDF infrastructure can best support ML-model-based UDFs...!
>
> On 11/17/19 4:56 PM, Xikui Wang wrote:
> > I wonder what the deployment initialization would do?
> >
> > btw, the UDF does have a deinitialize() method which is expected to be
> > invoked when the UDF is deinitialized, but that is ignored for now, as the
> > IScalarEvaluator in general doesn't deinitialize. To make that work, we
> > would need a bigger change in Hyracks to make it aware of that step. This
> > could be one improvement as well...
> >
> > Best,
> > Xikui
> >
> > On Sun, Nov 17, 2019 at 11:30 AM Till Westmann <ti...@apache.org> wrote:
> >
> >> It seems that it'd be nice if we had a step (similar to the
> >> initialization step) in the deployment lifecycle as well.
> >> And I guess that we'd need a corresponding clean-up step for
> >> un-deployment as well.
> >>
> >> Does that make sense? If so, should we file an improvement for this?
> >>
> >> Cheers,
> >> Till
> >>
> >> On 17 Nov 2019, at 9:29, Xikui Wang wrote:
> >>
> >>> The UDF interface has an initialize method which is invoked once per
> >>> lifecycle. Putting the model loading code in there can probably solve
> >>> your problem. The initialization is done per query (Hyracks job). For
> >>> example, if you do
> >>>
> >>> SELECT mylib#myudf(t) FROM Tweets t;
> >>>
> >>> and there are 100 tweets in the Tweets dataset, the initialization
> >>> method will be called once and the evaluate method will be invoked 100
> >>> times. In the context of feeds attached with UDFs, the
> >>> initialization happens only once, when the feed starts.
> >>>
> >>> Best,
> >>> Xikui
> >>>
> >>> On Sun, Nov 17, 2019 at 6:44 AM Torsten Bergh Moss <
> >>> torsten.b.moss@ig.ntnu.no> wrote:
> >>>
> >>>> Dear developers,
> >>>>
> >>>>
> >>>> I am trying to build a machine learning-based UDF for classification. This
> >>>> involves loading in a model that has been trained offline, which in
> >>>> practice basically is deserialization of a big object. This process of
> >>>> deserialization takes a significant amount of time, but it only "needs" to
> >>>> happen once, and after that the model can do the classification rather
> >>>> rapidly.
> >>>>
> >>>>
> >>>> Therefore, in order to avoid having to load the model every time the UDF
> >>>> is called, I am wondering where in the UDF lifecycle I can do the loading
> >>>> in order to achieve a "load model once, classify infinitely"-scenario, and
> >>>> how to implement it. I am assuming it should be done somewhere inside the
> >>>> factory-function-relationship, but I am not sure where/how and can't seem
> >>>> to find a lot of documentation on it.
> >>>>
> >>>>
> >>>> All help is appreciated, thanks!
> >>>>
> >>>>
> >>>> Best wishes,
> >>>>
> >>>> Torsten
> >>>>

Re: UDF Lifecycle

Posted by Ian Maxon <im...@uci.edu>.

Yeah, I think a global init is probably not appropriate for this
case, though: we don't want UDFs taking memory outside of jobs (I
think?). initialize() in the current API sounds appropriate for the
problem described, if I understand it right. Gift's Stanford
NLP demo has a similar problem that's solved that way. I agree,
however, that it would be useful in other cases.

Do please let us know how it performs (or if it is troublesome) that
way, Torsten.

Re: UDF Lifecycle

Posted by Mike Carey <dt...@gmail.com>.

@Ian: This is a very interesting use/test case for the work you are 
doing at UCI on the new more dynamic deployment model, and on how the 
underlying UDF infrastructure can best support ML-model-based UDFs...!

Re: UDF Lifecycle

Posted by Xikui Wang <xi...@uci.edu>.

I wonder what the deployment initialization would do?

btw, the UDF does have a deinitialize() method which is expected to be
invoked when the UDF is deinitialized, but that is ignored for now, as the
IScalarEvaluator in general doesn't deinitialize. To make that work, we
would need a bigger change in Hyracks to make it aware of that step. This
could be one improvement as well...
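
Until then, one way to live with that is to not leave anything for deinitialize() to release. The following is only a sketch under assumptions: EagerCleanupUdf and TrackingStream are hypothetical names, the class does not implement any real AsterixDB interface, and a small HashMap stands in for the trained model. The stream used for deserialization is closed inside initialize(), so a deinitialize() that never runs leaks no handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Hypothetical stream that records whether it was closed (for the demo only).
class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream(byte[] bytes) {
        super(bytes);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

// Hypothetical UDF body; the method names mirror those mentioned in the thread.
class EagerCleanupUdf {
    private HashMap<String, String> model;

    // Deserializes the model and closes the stream before returning.
    void initialize(InputStream source) {
        try (ObjectInputStream in = new ObjectInputStream(source)) {
            @SuppressWarnings("unchecked")
            HashMap<String, String> loaded = (HashMap<String, String>) in.readObject();
            model = loaded;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not load model", e);
        }
    }

    String evaluate(String word) {
        return model.getOrDefault(word, "unknown");
    }

    // Best effort only: per the thread, this may never be invoked.
    void deinitialize() {
        model = null;
    }
}

class EagerCleanupDemo {
    // Serializes a toy model to bytes, standing in for a model trained offline.
    static byte[] serialize(HashMap<String, String> model) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize model", e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        HashMap<String, String> toyModel = new HashMap<>();
        toyModel.put("great", "positive");
        TrackingStream source = new TrackingStream(serialize(toyModel));

        EagerCleanupUdf udf = new EagerCleanupUdf();
        udf.initialize(source);
        System.out.println("stream closed after initialize: " + source.closed);
        System.out.println("great -> " + udf.evaluate("great"));
    }
}
```

The memory held by the model itself is still only freed when the instance becomes unreachable, so this covers external handles, not the memory concern.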

Best,
Xikui

Re: UDF Lifecycle

Posted by Till Westmann <ti...@apache.org>.

It seems that it'd be nice if we had a step (similar to the
initialization step) in the deployment lifecycle as well.
And I guess that we'd need a corresponding clean-up step for
un-deployment as well.

Does that make sense? If so, should we file an improvement for this?

Cheers,
Till

Re: UDF Lifecycle

Posted by Xikui Wang <xi...@uci.edu>.

The UDF interface has an initialize method which is invoked once per
lifecycle. Putting the model loading code in there can probably solve your
problem. The initialization is done per query (Hyracks job). For example, if
you do

SELECT mylib#myudf(t) FROM Tweets t;

and there are 100 tweets in the Tweets dataset, the initialization
method will be called once and the evaluate method will be invoked 100
times. In the context of feeds attached with UDFs, the
initialization happens only once, when the feed starts.
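
As a rough illustration of that call pattern (not the actual AsterixDB API), the sketch below uses a hypothetical ScalarUdf interface and a toy stand-in for the model, and simply counts the calls. The interface, the class names, and the toy classifier are all assumptions made for the example; the real interface and its signatures may differ.

```java
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical stand-in for the UDF contract described above.
interface ScalarUdf {
    void initialize();             // once per query (Hyracks job) or feed start
    String evaluate(String input); // once per record
    void deinitialize();           // clean-up hook
}

// Hypothetical classifier whose expensive model load happens in initialize().
class ClassifierUdf implements ScalarUdf {
    int loads = 0;       // how many times the model was loaded
    int evaluations = 0; // how many records were classified
    private Function<String, String> model;

    @Override
    public void initialize() {
        // Stand-in for the slow deserialization of a model trained offline.
        model = text -> text.contains("!") ? "excited" : "neutral";
        loads++;
    }

    @Override
    public String evaluate(String input) {
        evaluations++;
        return model.apply(input);
    }

    @Override
    public void deinitialize() {
        model = null;
    }
}

class LifecycleDemo {
    // Mimics the described behaviour for one query: one initialize call,
    // then one evaluate call per record.
    static ClassifierUdf runQuery(String[] tweets) {
        ClassifierUdf udf = new ClassifierUdf();
        udf.initialize();
        for (String tweet : tweets) {
            udf.evaluate(tweet);
        }
        return udf;
    }

    public static void main(String[] args) {
        String[] tweets = new String[100];
        Arrays.fill(tweets, "hello");
        ClassifierUdf udf = runQuery(tweets);
        // prints "loads=1 evaluations=100"
        System.out.println("loads=" + udf.loads + " evaluations=" + udf.evaluations);
    }
}
```

The model is kept in an instance field, so every evaluate call within the same query or feed reuses the object loaded in initialize().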

Best,
Xikui
