Posted to user@ignite.apache.org by rothnorton <ro...@yahoo.com> on 2021/03/12 07:59:02 UTC

Maximising online ML performance

I am trying to build a low-latency machine learning system from scratch using
Apache Ignite.  Note: I am in the design phase and have not implemented
anything yet.

The general data pipeline is:
JSON data via socket -> Ignite cache -> Ignite ML (updating) -> Ignite cache
-> App (via continuous query), or maybe Ignite ML -> App via socket for
improved latency.
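
For the cache -> app leg, a continuous query pushes each update to a local
listener instead of having the app poll the cache. A minimal sketch, assuming
an already-started Ignite instance named ignite; the cache name "predictions"
and the handle() callback are placeholders:

    import javax.cache.event.CacheEntryEvent;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.cache.query.ContinuousQuery;

    IgniteCache<Long, Double> predictions = ignite.cache("predictions");

    ContinuousQuery<Long, Double> qry = new ContinuousQuery<>();

    // Invoked on this node whenever a prediction entry is created or updated.
    qry.setLocalListener(events -> {
        for (CacheEntryEvent<? extends Long, ? extends Double> e : events)
            handle(e.getKey(), e.getValue());
    });

    predictions.query(qry); // keep the returned cursor open while listening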

I am trying to minimise latency and also improve ML speed.
Obviously, distributed in-memory colocated processing is quite useful for
high-performance ML when dealing with lots of data.

However, I am wondering:
1) What is best practice for performing various operations to improve
latency / ML performance?
2) Could there be fundamental changes in the Ignite framework to better
support this?

One important factor here could be serialisation / deserialisation speed.
This includes JSON -> (some object) -> cache, and cache -> (some object) ->
ML vector.

So the optimum would be to serialise from JSON directly into the cache
representation (binary serialisation?), and straight from the cache into an
ML vector.  Is this possible?  Is there any best practice?  Did this make it
easier: https://issues.apache.org/jira/browse/IGNITE-13672
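
For what it's worth, the write side can already skip the intermediate POJO:
parse the JSON and build the cache's binary representation directly with a
BinaryObjectBuilder, then keep the cache in withKeepBinary() mode. A minimal
sketch, assuming Jackson for parsing; the type name "Tick", its fields, and
the cache name "ticks" are invented for illustration:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.binary.BinaryObject;

    JsonNode json = new ObjectMapper().readTree(rawJsonBytes);

    // Build the binary (cache-side) form directly, no POJO in between.
    BinaryObject row = ignite.binary().builder("Tick")
        .setField("price", json.get("price").asDouble())
        .setField("volume", json.get("volume").asDouble())
        .build();

    // withKeepBinary() keeps entries in binary form on reads as well,
    // avoiding deserialisation on the way back out.
    IgniteCache<Long, BinaryObject> ticks =
        ignite.<Long, BinaryObject>cache("ticks").withKeepBinary();

    ticks.put(json.get("id").asLong(), row);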

I am also looking at more optimal ways of representing data to improve ML
performance.  I have heard that storing data in columnar form is quite
useful: https://arrow.apache.org/overview/
Could something like this be implemented as an alternative cache memory
architecture?  If this is not possible, is there an alternative to the
on-heap Java array [] / Vector that the ML algorithms seem to use?  Is it
possible for the ML algorithms to work on the data in place (in the cache),
without having to retrieve it (is this what IgniteRDD does for Spark?)
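
On the "in place" question: Ignite ML trainers are built around
partition-based datasets, so fit() executes against the cache partitions
where the data lives rather than pulling all rows to the client. A minimal
sketch using the standard vectorizer API, assuming a cache of Vector values
whose first coordinate is the label; the cache name "trainingData" is a
placeholder:

    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.ml.dataset.feature.extractor.Vectorizer;
    import org.apache.ignite.ml.dataset.feature.extractor.impl.DummyVectorizer;
    import org.apache.ignite.ml.math.primitives.vector.Vector;
    import org.apache.ignite.ml.math.primitives.vector.VectorUtils;
    import org.apache.ignite.ml.regressions.linear.LinearRegressionLSQRTrainer;
    import org.apache.ignite.ml.regressions.linear.LinearRegressionModel;

    IgniteCache<Integer, Vector> data = ignite.cache("trainingData");

    // First coordinate of each stored vector is the label, the rest features.
    Vectorizer<Integer, Vector, Integer, Double> vectorizer =
        new DummyVectorizer<Integer>().labeled(Vectorizer.LabelCoordinate.FIRST);

    // Training runs distributed over the cache partitions, colocated with data.
    LinearRegressionModel mdl = new LinearRegressionLSQRTrainer()
        .fit(ignite, data, vectorizer);

    double prediction = mdl.predict(VectorUtils.of(1.0, 2.0)); // features only

The resulting model can then be applied on the hot path without touching the
training data again.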

The two considerations are improving ML algorithm performance and
minimising (de)serialisation overhead.

Thanks!




Re: Maximising online ML performance

Posted by Vladimir Pligin <vo...@yandex.ru>.
+ zaleslaw.sin@

Hi,

It seems that for some reason we've missed this thread (something with nabble
maybe?).

Maybe we could ask Alexey Zinoviev to have a look at this one if he is
available.

Thanks a lot in advance.

13.03.2021, 16:36, "rothnorton" <ro...@yahoo.com>:

> After reading the examples etc., more generally, I have two questions:
>
> 1) What is the best way to stream JSON data into a machine learning
> algorithm to minimise the latency from receiving data to the result?
>
> Including serialisation into the cache from JSON for vector representation
> for ML algs, cache setup, datastore style, etc.  It's unclear from the
> documentation.
> I'm happy to benchmark different methods, but maybe someone already knows?
>
> 2) How does column-store ML compare to row-based ML, and if it's better for
> things like linear regression, is there a comparison?
>
> And for both of these, what is the optimal setup (if any) for minimising
> latency for real-time machine learning, and can the Apache Ignite
> infrastructure be modified to achieve this?

--

Warm Regards,

Vladimir Pligin


Re: Maximising online ML performance

Posted by rothnorton <ro...@yahoo.com>.
After reading the examples etc., more generally, I have two questions:

1) What is the best way to stream JSON data into a machine learning
algorithm to minimise the latency from receiving data to the result?

Including serialisation into the cache from JSON for vector representation
for ML algs, cache setup, datastore style, etc.  It's unclear from the
documentation.
I'm happy to benchmark different methods, but maybe someone already knows?
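
One knob relevant to (1): IgniteDataStreamer batches writes into a cache and
routes them to the owning partitions, and its auto-flush frequency bounds how
long entries wait in the client-side buffer. A minimal sketch, with the cache
name "features" and the 10 ms flush interval as arbitrary placeholders:

    import org.apache.ignite.IgniteDataStreamer;

    try (IgniteDataStreamer<Long, double[]> streamer =
             ignite.dataStreamer("features")) {
        streamer.allowOverwrite(true);    // permit updates of existing keys
        streamer.autoFlushFrequency(10);  // flush buffered entries every 10 ms
        streamer.addData(key, features);  // batched, partition-aware write
    }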

2) How does column-store ML compare to row-based ML, and if it's better for
things like linear regression, is there a comparison?

And for both of these, what is the optimal setup (if any) for minimising
latency for real-time machine learning, and can the Apache Ignite
infrastructure be modified to achieve this?


