Posted to user@ignite.apache.org by "aaron@tophold.com" <aa...@tophold.com> on 2017/09/25 11:30:59 UTC

What's the best practice to init the cache when in cluster env?

Hi all,

Suppose we have a dozen nodes that cache millions of records from a DB.

At initialization, what is the best way to load that data? We use the data streamer to load data, and every entry includes a partition ID when it is inserted into the DB.

Because the nodes are started one by one, loading everything from one node and then rebalancing seems impractical and wasteful.

Is there any guideline or best practice for this scenario?

Thanks for your time!


Regards
Aaron


aaron@tophold.com

Re: Re: What's the best practice to init the cache when in cluster env?

Posted by Denis Mekhanikov <dm...@gmail.com>.

Well, if you cannot know when the topology is complete, then you cannot
guarantee that no rebalancing will happen.

If backups are not configured, then data that moves to other nodes will be
removed from the initial node. Rebalancing happens according to the
configured affinity function
<https://apacheignite.readme.io/docs/affinity-collocation#affinity-function>.
By default it is RendezvousAffinityFunction
<https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/affinity/rendezvous/RendezvousAffinityFunction.html>,
which aims to minimize data transmission between nodes.
You can implement your own affinity function that works better by
using additional knowledge about your topology.
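
To illustrate why a rendezvous-style affinity function keeps data movement
small, here is a minimal Java sketch of rendezvous (highest-random-weight)
hashing. It is a simplified illustration, not Ignite's
RendezvousAffinityFunction: the class name, the CRC32-based weight, the node
names, and the partition count of 1024 are all assumptions made for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Simplified rendezvous hashing demo; an illustration, not Ignite's implementation. */
class RendezvousSketch {
    /** Pseudo-random, non-negative weight for a (partition, node) pair. */
    static long weight(int partition, String nodeId) {
        CRC32 crc = new CRC32();
        crc.update((partition + ":" + nodeId).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** The owner is the node with the highest weight; ties go to the larger node id. */
    static String owner(int partition, List<String> nodes) {
        String best = null;
        long bestWeight = -1L;
        for (String node : nodes) {
            long w = weight(partition, node);
            if (w > bestWeight || (w == bestWeight && node.compareTo(best) > 0)) {
                best = node;
                bestWeight = w;
            }
        }
        return best;
    }

    /** Number of partitions whose owner differs between the two topologies. */
    static int moved(int partitions, List<String> before, List<String> after) {
        int count = 0;
        for (int p = 0; p < partitions; p++) {
            if (!owner(p, before).equals(owner(p, after))) count++;
        }
        return count;
    }

    /** True if every partition that changed owner now lives on a node absent from `before`. */
    static boolean movesOnlyToNewNodes(int partitions, List<String> before, List<String> after) {
        for (int p = 0; p < partitions; p++) {
            String newOwner = owner(p, after);
            if (!owner(p, before).equals(newOwner) && before.contains(newOwner)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> three = Arrays.asList("node-1", "node-2", "node-3");
        List<String> four = Arrays.asList("node-1", "node-2", "node-3", "node-4");
        System.out.println("partitions moved: " + moved(1024, three, four) + " of 1024");
        System.out.println("moved only to the new node: " + movesOnlyToNewNodes(1024, three, four));
    }
}
```

In this model, when a node joins, a partition changes owner only if the new
node wins the highest weight for it, so the existing nodes never trade
partitions among themselves.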

Denis

On Tue, 26 Sep 2017 at 11:46, aaron@tophold.com <aa...@tophold.com> wrote:

> Thanks Denis,
>
> The really tough issue is that we are not sure when the entire cluster
> will be ready, as we may add or remove nodes at run time.
>
> Another question: if I load the data once on the first started node, and
> other nodes then come up and rebalancing completes, will the primary nodes
> evict the entries that do not belong to them?
>
> We run regular aggregations locally on each node, so we do not want the
> load to be too heavy on the first node.
>
>
> Regards
> Aaron
> ------------------------------
> aaron@tophold.com
>

Re: Re: What's the best practice to init the cache when in cluster env?

Posted by "aaron@tophold.com" <aa...@tophold.com>.

Got it, thanks Denis! Much appreciated!


Regards
Aaron


aaron@tophold.com
 
From: Denis Magda
Date: 2017-09-27 06:03
To: user
Subject: Re: What's the best practice to init the cache when in cluster env?
Aaron,

It’s safe to preload the data on a changing cluster topology. Both the data streamer and CacheStore approaches handle this:
https://apacheignite.readme.io/docs/data-loading

During rebalancing, a node evicts the data for which it is no longer a primary or backup. You don’t need to worry about this; it’s Ignite’s job.

—
Denis M.


Re: What's the best practice to init the cache when in cluster env?

Posted by Denis Magda <dm...@apache.org>.

Aaron,

It’s safe to preload the data on a changing cluster topology. Both the data streamer and CacheStore approaches handle this:
https://apacheignite.readme.io/docs/data-loading

During rebalancing, a node evicts the data for which it is no longer a primary or backup. You don’t need to worry about this; it’s Ignite’s job.
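
What "evicts the data for which it is no longer a primary or backup" means can be illustrated with a toy ownership model. This is a simulation under assumptions, not Ignite code: EvictionSketch, the CRC32-based weight, the single backup, the 1024 partitions, and the node names are all made up for the demo.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.CRC32;

/** Toy model of partition ownership with backups; an illustration, not Ignite's logic. */
class EvictionSketch {
    /** Pseudo-random, non-negative weight for a (partition, node) pair. */
    static long weight(int partition, String nodeId) {
        CRC32 crc = new CRC32();
        crc.update((partition + ":" + nodeId).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Owners of a partition: the (1 + backups) highest-weight nodes; the first is the primary. */
    static List<String> owners(int partition, List<String> nodes, int backups) {
        List<String> sorted = new ArrayList<>(nodes);
        sorted.sort((a, b) -> {
            int c = Long.compare(weight(partition, b), weight(partition, a)); // descending weight
            return c != 0 ? c : a.compareTo(b);
        });
        return sorted.subList(0, Math.min(sorted.size(), backups + 1));
    }

    /** Partitions that `node` stores, as primary or backup, in the given topology. */
    static Set<Integer> held(String node, List<String> nodes, int partitions, int backups) {
        Set<Integer> result = new TreeSet<>();
        for (int p = 0; p < partitions; p++) {
            if (owners(p, nodes, backups).contains(node)) result.add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> alone = Arrays.asList("node-1");
        List<String> grown = Arrays.asList("node-1", "node-2", "node-3", "node-4");

        Set<Integer> before = held("node-1", alone, 1024, 1); // everything was loaded on node-1
        Set<Integer> after = held("node-1", grown, 1024, 1);  // what node-1 still owns afterwards

        Set<Integer> evicted = new TreeSet<>(before);
        evicted.removeAll(after); // no longer primary or backup on node-1

        System.out.println("held before: " + before.size());
        System.out.println("held after:  " + after.size());
        System.out.println("evicted:     " + evicted.size());
    }
}
```

In this model the first node starts out holding every partition and, once the other nodes have joined, keeps only those for which it is still among the owners.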

—
Denis M.

> On Sep 26, 2017, at 1:46 AM, aaron@tophold.com wrote:
> 
> Thanks Denis, 
> 
> The really tough issue is that we are not sure when the entire cluster will be ready, as we may add or remove nodes at run time.
> 
> Another question: if I load the data once on the first started node, and other nodes then come up and rebalancing completes, will the primary nodes evict the entries that do not belong to them?
> 
> We run regular aggregations locally on each node, so we do not want the load to be too heavy on the first node.
> 
> 
> Regards
> Aaron
> aaron@tophold.com <ma...@tophold.com>

Re: Re: What's the best practice to init the cache when in cluster env?

Posted by "aaron@tophold.com" <aa...@tophold.com>.

Thanks Denis, 

The really tough issue is that we are not sure when the entire cluster will be ready, as we may add or remove nodes at run time.

Another question: if I load the data once on the first started node, and other nodes then come up and rebalancing completes, will the primary nodes evict the entries that do not belong to them?

We run regular aggregations locally on each node, so we do not want the load to be too heavy on the first node.


Regards
Aaron


aaron@tophold.com
 
From: Denis Mekhanikov
Date: 2017-09-25 19:46
To: user
Subject: Re: What's the best practice to init the cache when in cluster env?
Hi Aaron!

There are two good options for data loading: using DataStreamer or IgniteCache.loadCache(...). The second option is good when the initial data is stored in a database.

If you are worried about the overhead of data rebalancing, you can start the cluster and begin streaming data once all nodes are up. In that case, records arrive at their final destination immediately, with no need to move to other nodes.

Denis


Re: What's the best practice to init the cache when in cluster env?

Posted by Denis Mekhanikov <dm...@gmail.com>.

Hi Aaron!

There are two good options for data loading: using DataStreamer or
IgniteCache.loadCache(...)
<https://apacheignite.readme.io/docs/3rd-party-store#section-loadcache->.
The second option is good when the initial data is stored in a database.

If you are worried about the overhead of data rebalancing, you can start the
cluster and begin streaming data once all nodes are up. In that case, records
arrive at their final destination immediately, with no need to move to other
nodes.
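
The "start streaming once all nodes are up" advice could be sketched as a
small gate that polls the visible server count before loading begins. This is
only a sketch under assumptions: TopologyGate and awaitServers are
hypothetical names, and the count supplier in main is a stand-in for a real
cluster. With Ignite, the supplier would presumably be something like
() -> ignite.cluster().forServers().nodes().size(), and the loading step
would use ignite.dataStreamer(cacheName) with addData(key, value).

```java
import java.util.function.IntSupplier;

/** Hypothetical helper that delays data loading until enough server nodes are visible. */
class TopologyGate {
    /**
     * Polls serverCount until it reports at least `expected` nodes.
     * Returns false if the timeout elapses or the thread is interrupted first.
     */
    static boolean awaitServers(IntSupplier serverCount, int expected, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (serverCount.getAsInt() >= expected) return true;
            if (System.nanoTime() >= deadline) return false;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a growing cluster: one more "node" becomes visible on every poll.
        int[] visible = {0};
        boolean ready = awaitServers(() -> ++visible[0], 12, 5_000L, 10L);

        if (ready) {
            // Assumption: this is where the data streamer would be opened and fed.
            System.out.println("topology complete, start streaming");
        } else {
            System.out.println("timed out waiting for nodes");
        }
    }
}
```

A gate like this only helps when the expected node count is known in advance;
on a cluster that grows and shrinks at run time, rebalancing can still occur
after the load.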

Denis

On Mon, 25 Sep 2017 at 14:31, aaron@tophold.com <aa...@tophold.com> wrote:

> Hi all,
>
> Suppose we have a dozen nodes that cache millions of records from a DB.
>
> At initialization, what is the best way to load that data? We use the data
> streamer to load data, and every entry includes a partition ID when it is
> inserted into the DB.
>
> Because the nodes are started one by one, loading everything from one node
> and then rebalancing seems impractical and wasteful.
>
> Is there any guideline or best practice for this scenario?
>
> Thanks for your time!
>
>
> Regards
> Aaron
> ------------------------------
> aaron@tophold.com
>