Posted to user@geode.apache.org by Denis Magda <ma...@gmail.com> on 2016/08/19 18:59:52 UTC

Persistence and OQL over cold data

Hello Geode community,

I've been investigating Geode persistence for a while and still can't work
out whether I need to have all my data in memory if I want to execute OQL
queries, or whether the OQL engine works over persisted data as well.

My use case is the following. During cluster startup I don't want to wait
until all the data has been preloaded from persistence into RAM; I want to
execute OQL queries right away. Is this feasible with Geode? Please point me
to any links where I can read more about this.

Regards,
Denis

Re: Persistence and OQL over cold data

Posted by Anthony Baker <ab...@pivotal.io>.

Hey Denis,

Like all ASF projects we’re an open community that welcomes discussion and collaboration.  Thanks for posting your question!

Anthony

> On Aug 19, 2016, at 12:29 PM, Denis Magda <ma...@gmail.com> wrote:
> 
> If I'm an Ignite committer it doesn't mean that my curiosity is driven by a kind of marketing needs :) Moreover, Ignite community doesn't provide any kind of comparison sheets or whatever.
>

Re: Persistence and OQL over cold data

Posted by Denis Magda <ma...@gmail.com>.

Hi Greg,

Being an Ignite committer doesn't mean that my curiosity is driven by some
kind of marketing need :) Moreover, the Ignite community doesn't provide any
kind of comparison sheets or whatever.

So, returning to my original question. I have had to deal with this use
case many times in practice, and every time I had to preload all the data
from persistence into memory if I wanted to execute SQL or other kinds of
non-key-based queries right away. Ignite, Hazelcast and many other platforms
I know don't support such a feature.

But it looks like Geode, at least, can now deal with this use case,
according to the text on the main page (Object Query Language allows
distributed query execution on hot and *cold* data, with SQL-like
capabilities, including joins.). Is that so, guys?

-
Denis

On Fri, Aug 19, 2016 at 12:08 PM, Greg Chase <gc...@gmail.com> wrote:

> Hi Denis,
> How does Ignite handle this use case?
>
> I trust you are fishing for comparisons.
>
> Greg
>
> This email encrypted by tiny buttons & fat thumbs, beta voice recognition,
> and autocorrect on my iPhone.
>
> On Aug 19, 2016, at 11:59 AM, Denis Magda <ma...@gmail.com> wrote:
>
> Hello Geode community,
>
> I've been investigating possibilities of Geode Persistence for a while and
> still can't get it clear whether I need to have all my data in memory if I
> want to execute OQL queries or OQL engine works over the persistence as
> well.
>
> My use case is the following. During the cluster startup I don't want to
> wait while all the data has been pre-loaded from the persistence to RAM and
> want to execute OQL queries right away. Is it feasible to implement with
> Geode? Please provide me with the links where I can read more about this.
>
> Regards,
> Denis
>
>

-- 
Good luck,
Denis Magda

Re: Persistence and OQL over cold data

Posted by Greg Chase <gc...@gmail.com>.

Hi Denis,
How does Ignite handle this use case?

I trust you are fishing for comparisons.

Greg 

This email encrypted by tiny buttons & fat thumbs, beta voice recognition, and autocorrect on my iPhone.

> On Aug 19, 2016, at 11:59 AM, Denis Magda <ma...@gmail.com> wrote:
> 
> Hello Geode community,
> 
> I've been investigating possibilities of Geode Persistence for a while and still can't get it clear whether I need to have all my data in memory if I want to execute OQL queries or OQL engine works over the persistence as well. 
> 
> My use case is the following. During the cluster startup I don't want to wait while all the data has been pre-loaded from the persistence to RAM and want to execute OQL queries right away. Is it feasible to implement with Geode? Please provide me with the links where I can read more about this.
> 
> Regards,
> Denis

Re: Persistence and OQL over cold data

Posted by Michael Stolz <ms...@pivotal.io>.

JB: But, by combining the Function Execution service with querying (on
PARTITIONED data) [2], you could target the nodes that would supposedly
hold the data of interests, and execute the queries there.
MS: In order to target the nodes that would supposedly hold the data of
interest you need to know the keys you are looking for. If you know the
keys why are you querying in the first place? Just do getAll(keys).

JB: Additionally, assuming the Indexes were defined properly based on the
predicates in the queries (most often) used, that it would target the data
on disk matching the predicate and load only the data required
MS: Correct, and that's exactly how Geode works.

JB: (no data store, RDBMS or otherwise, especially disk-bound stores, should
have to load the entire table/Region/Map/whatever to access the data
matching the predicate; that's absurd, OOMEs galore).
MS: The trouble happens when you are NOT hitting your indices. If you do a
query that requires a full table scan, then every row in the database table
needs to be examined, and to examine it, it has to be in memory at least
briefly.
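
To make the index-versus-scan point concrete, here is a minimal sketch in
plain Java. It is a toy model under stated assumptions, not Geode code: the
class OverflowQuerySketch, the Order type and the "status" field are all
hypothetical, the "disk" is just a second map, and reading a cold value does
not fault it back into memory. It only counts how many overflowed values each
query style has to touch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: none of these class or method names are Geode APIs.
final class OverflowQuerySketch {

    /** Hypothetical record type: an id plus the field the predicate filters on. */
    static final class Order {
        final int id;
        final String status;

        Order(int id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    private final Map<Integer, Order> inMemory = new HashMap<>();            // hot values
    private final Map<Integer, Order> onDisk = new HashMap<>();              // overflowed (cold) values
    private final Map<String, List<Integer>> statusIndex = new HashMap<>();  // index stays in memory
    private int diskReads = 0;

    /** Stores a value either hot or overflowed and records its key in the index. */
    void put(Order order, boolean overflowed) {
        (overflowed ? onDisk : inMemory).put(order.id, order);
        statusIndex.computeIfAbsent(order.status, s -> new ArrayList<>()).add(order.id);
    }

    /** Returns the value for a key, counting one disk read if it is overflowed. */
    private Order load(int id) {
        Order hot = inMemory.get(id);
        if (hot != null) {
            return hot;
        }
        diskReads++;
        return onDisk.get(id);
    }

    /** Index hit: only the entries matching the predicate are touched. */
    List<Order> queryWithIndex(String status) {
        List<Order> result = new ArrayList<>();
        for (int id : statusIndex.getOrDefault(status, Collections.<Integer>emptyList())) {
            result.add(load(id));
        }
        return result;
    }

    /** No usable index: every entry is examined, so every cold value is read. */
    List<Order> queryWithFullScan(String status) {
        List<Order> result = new ArrayList<>();
        TreeSet<Integer> allIds = new TreeSet<>(inMemory.keySet());
        allIds.addAll(onDisk.keySet());
        for (int id : allIds) {
            Order order = load(id);
            if (order.status.equals(status)) {
                result.add(order);
            }
        }
        return result;
    }

    int diskReads() { return diskReads; }

    void resetDiskReads() { diskReads = 0; }

    /** 100 orders, every tenth one OPEN; ids 20..99 are overflowed to "disk". */
    static OverflowQuerySketch sample() {
        OverflowQuerySketch region = new OverflowQuerySketch();
        for (int id = 0; id < 100; id++) {
            String status = (id % 10 == 0) ? "OPEN" : "CLOSED";
            region.put(new Order(id, status), id >= 20);
        }
        return region;
    }

    public static void main(String[] args) {
        OverflowQuerySketch region = sample();
        int matches = region.queryWithIndex("OPEN").size();
        System.out.println("index: " + matches + " matches, " + region.diskReads() + " disk reads");
        region.resetDiskReads();
        matches = region.queryWithFullScan("OPEN").size();
        System.out.println("scan:  " + matches + " matches, " + region.diskReads() + " disk reads");
    }
}
```

With the sample data both queries return the same 10 matches, but the indexed
query reads only the 8 matching values that are cold, while the scan reads all
80 overflowed values. The numbers depend entirely on the made-up data; the
point is only the shape of the difference.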


--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 4:39 PM, John Blum <jb...@pivotal.io> wrote:

> Hi All-
>
> DISCLAIMER: I am no expert in querying and index
> architecture/implementation; mostly a consumer.
>
> Perhaps *Anil* or *Jason* can shed more light on the subject, but for my
> own understanding/sanity, it would seem we could do better than this,
> meaning...
>
> I would think any UC partially depends on the organization of your data in
> the grid as well.  If you used a PARTITION data management policy [1],
> for instance, then, of course, your data would be distributed and
> partitioned across all the data nodes in the grid (cluster) holding the
> data (i.e. data nodes that have declared the same PARTITION Region).  It
> should then be possible to make this more optimal by have a redundancy
> level of 1 or more (depending on the frequency of transactions and data
> changes) to parallelize the data access.
>
> Not only does having more nodes mean better (or more optimal)
> organization, but more memory.  Still, given a very large data set, clearly
> some of the data will need to OVERFLOW (to disk).
>
> But, by combining the Function Execution service with querying (on
> PARTITIONED data) [2], you could target the nodes that would supposedly
> hold the data of interests, and execute the queries there.
>
> Additionally, assuming the Indexes were defined properly based on the
> predicates in the queries (most often) used, that it would target the data
> on disk matching the predicate and load only the data required (no data
> store, RDBMS or otherwise, especially disk-bound stores, should have to
> load the entire table/Region/Map/whatever to access the data matching the
> predicate; that's absurd, OOMEs galore).
>
> TMK, Geode keeps Indexes in memory (even loads them on startup) and
> updates them (either sync/async depending on your configuration) as the
> data changes.  You would assume the data would not be changing in the
> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
> also assume that that data would then have to be in-memory (I think so).
>
> Please let me know if I am way of basis here, but I would think Geode
> gives you enough options that particular UCs could be made, with nominal
> effort, more optimal.
>
> Additional references...
>
> * Query Partitioned Regions [3]
> * Working with Indexes [4], and then...
> * Tips and Guidelines on Using Indexes [5], but also important...
> * Using Indexes with Overflow Regions [6]
>
> Hope this helps.
>
> Cheers!
> -John
>
>
> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>
>
> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com> wrote:
>
>> Thanks, now I see.
>>
>> This works the same way as in Ignite then. If you set up an eviction
>> policy in Ignite the data may be evicted to swap at some point of time and
>> if a query is executed right after that the it may swap in the data back to
>> memory. However the indexes must always be in memory.
>>
>> --
>> Denis
>>
>>
>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io>
>> wrote:
>>
>>> There is a notion of data aging out in Geode. We call it overflow to
>>> disk.
>>>
>>> The idea is that as data gets old you can have the records in memory
>>> expire, and that expiry can be to disk. That's the cold data.
>>>
>>> You may have built an index while you were initially loading the data,
>>> and if your predicates only hit the indexes you will still get really fast
>>> queries if the result sets aren't large.
>>>
>>> If, however, you ever resort to hitting the disk-based data for a query
>>> it is going to have to read every record that isn't in memory from disk
>>> which is going to be extremely slow. I personally would never use Geode
>>> that way.
>>>
>>>
>>> --
>>> Mike Stolz
>>> Principal Engineer, GemFire Product Manager
>>> Mobile: 631-835-4771
>>>
>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>
>>>> I just thought that you were able to do something with indexes in a
>>>> such way that there is no need to preload everything from disk into memory
>>>> when a query is executed over cold data.
>>>>
>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>> following sentence from the main page:
>>>>
>>>> *Object Query Language allows distributed query execution on hot and
>>>> cold data, with SQL-like capabilities, including joins.*
>>>>
>>>> --
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <ms...@pivotal.io>
>>>> wrote:
>>>>
>>>>> Here's the thing...
>>>>>
>>>>> On any In-memory data grid, if you run a query before the data has
>>>>> been loaded into memory, it is going to cause the exact same amount of disk
>>>>> i/o to do the query as it will take to load everything into memory.
>>>>>
>>>>> And the system will still have to go ahead and load everything into
>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>
>>>>> Geode DOES have a nice feature for key based access though. We
>>>>> actually store the keys in a separate file from the data and we can load
>>>>> that file very quickly. Then if you go after the data for one of those keys
>>>>> we can lazily load it from disk on demand if it hasn't yet been loaded into
>>>>> memory.
>>>>>
>>>>> The Lucene integration work that is going on in Geode might also make
>>>>> it possible to load the indexes first and lazily load the data based on
>>>>> queries against the indexes.
>>>>>
>>>>>
>>>>> --
>>>>> Mike Stolz
>>>>> Principal Engineer, GemFire Product Manager
>>>>> Mobile: 631-835-4771
>>>>>
>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Geode community,
>>>>>>
>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>> persistence as well.
>>>>>>
>>>>>> My use case is the following. During the cluster startup I don't want
>>>>>> to wait while all the data has been pre-loaded from the persistence to RAM
>>>>>> and want to execute OQL queries right away. Is it feasible to implement
>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>> this.
>>>>>>
>>>>>> Regards,
>>>>>> Denis
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Good luck,
>>>> Denis Magda
>>>>
>>>
>>>
>>
>>
>> --
>> Good luck,
>> Denis Magda
>>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>

Re: Persistence and OQL over cold data

Posted by Alan Kash <cr...@gmail.com>.

Will creating another region for indexes make them persistent?


We should capture this information in the documentation.

1. Local / Distributed Read
2. Local / Distributed Write
3. Local / Distributed Indexing

Thanks


On Fri, Aug 19, 2016 at 7:47 PM, John Blum <jb...@pivotal.io> wrote:

> My apologies for confusing Index storage with Geode; thought I heard this
> somewhere in the context of GemFire/Geode before.  No doubt confused this
> with other data stores I work with.  (So) much to learn yet.
>
> On Fri, Aug 19, 2016 at 4:16 PM, Michael Stolz <ms...@pivotal.io> wrote:
>
>> Unfortunately the indexes are not stored. They need to be rebuilt on
>> restart. For that reason, on start up, the whole diskstore needs to be read.
>>
>> --
>> Mike Stolz
>> Principal Engineer, GemFire Product Manager
>> Mobile: 631-835-4771
>>
>> On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:
>>
>>> *Jason, Mike*: first, thank you.
>>>
>>> > *In order to target the nodes that would supposedly hold the data of
>>> interest you need to know the keys you are looking for. If you know the
>>> keys why are you querying in the first place? Just do getAll(keys).*
>>>
>>> Two reasons...
>>>
>>> 1. I want to apply some "additional filtering" that can only be handled
>>> elegantly by a OQL query predicate after a subset of the data has been
>>> identified/targeted (using keys).  I have example of this somewhere (doh)
>>> after working with a customer on this exact UC
>>>
>>> 2. I don't want the entire object (i.e. row); I only need a specific
>>> "projection" of the (object) data.  This is particularly important if I
>>> have very large and complex object graph and I am streaming data across the
>>> wire (client/server).
>>>
>>>
>>> > *The trouble happens when you are NOT hitting your indices. *
>>>
>>> Yes, good point.
>>>
>>> > *If you do a query that requires a full table scan, then every row in
>>> the database table needs to be examined, and to examine it, it has to be in
>>> memory at least briefly.*
>>>
>>> Of course.
>>>
>>> *Denis*-
>>>
>>> > *The disk entries that are mentioned by John were located in memory
>>> before and were overflowed on disk at some point of time. It means that if
>>> you start your cluster from scratch and want to run OQL queries over the
>>> indexed data then you have to preload all the data from the persistence.*
>>>
>>> I don't specifically recall how much persistent data Geode reloads on
>>> restart (Geode is a shared-nothing architecture though so each data node
>>> has it's own persistence; additionally primaries must come online before
>>> secondaries are accessible).  The question is how much data gets reloaded
>>> on restart.  It would seem silly if the disk store contained more data then
>>> would fit in memory and reload everything knowing some of the data would be
>>> OVERFLOW on preload when it would not all fit.  Geode will reload the Index
>>> though, which is stored as well.
>>>
>>> I let the experts answer this one.
>>>
>>>
>>> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hi John, Jason,
>>>>
>>>> If to expand more on this
>>>>
>>>>
>>>> *If an index can be used, the index look up is executed and entries
>>>> added to the result set.  If any of the entries that match the predicates
>>>> is actually on disk, those values will need to be loaded to memory before
>>>> being returned as a result.*
>>>>
>>>> The disk entries that are mentioned by John were located in memory
>>>> before and were overflowed on disk at some point of time. It means that if
>>>> you start your cluster from scratch and want to run OQL queries over the
>>>> indexed data then you have to preload all the data from the persistence.
>>>> Yes, some of the data may be overflowed back to disk during the preloading
>>>> but you'll have your indexes in a valid state.
>>>>
>>>> Correct me if I'm still missing something.
>>>>
>>>> --
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jh...@pivotal.io> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I think you were referring to Mike's explanation of:
>>>>> "If, however, you ever resort to hitting the disk-based data for a
>>>>> query it is going to have to read every record that isn't in memory from
>>>>> disk which is going to be extremely slow. I personally would never use
>>>>> Geode that way."
>>>>>
>>>>> When stating:
>>>>> "Additionally, assuming the Indexes were defined properly based on the
>>>>> predicates in the queries (most often) used, that it would target the data
>>>>> on disk matching the predicate and load only the data required (no data
>>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>>> predicate; that's absurd, OOMEs galore)."
>>>>>
>>>>> Let me try to clear things up slightly...hopefully not causing more
>>>>> confusion...
>>>>> If an index can be used, the index look up is executed and entries
>>>>> added to the result set.  If any of the entries that match the predicates
>>>>> is actually on disk, those values will need to be loaded to memory before
>>>>> being returned as a result.
>>>>> I think what Mike was saying was that if an index is not used, then
>>>>> the query itself would execute across the entire region, which means
>>>>> loading every entry into memory.  We would need to inspect each entry to
>>>>> see if fulfill the criteria.
>>>>>
>>>>> -Jason
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>>>
>>>>>> Hi All-
>>>>>>
>>>>>> DISCLAIMER: I am no expert in querying and index
>>>>>> architecture/implementation; mostly a consumer.
>>>>>>
>>>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but
>>>>>> for my own understanding/sanity, it would seem we could do better than
>>>>>> this, meaning...
>>>>>>
>>>>>> I would think any UC partially depends on the organization of your
>>>>>> data in the grid as well.  If you used a PARTITION data management
>>>>>> policy [1], for instance, then, of course, your data would be distributed
>>>>>> and partitioned across all the data nodes in the grid (cluster) holding the
>>>>>> data (i.e. data nodes that have declared the same PARTITION
>>>>>> Region).  It should then be possible to make this more optimal by have a
>>>>>> redundancy level of 1 or more (depending on the frequency of transactions
>>>>>> and data changes) to parallelize the data access.
>>>>>>
>>>>>> Not only does having more nodes mean better (or more optimal)
>>>>>> organization, but more memory.  Still, given a very large data set, clearly
>>>>>> some of the data will need to OVERFLOW (to disk).
>>>>>>
>>>>>> But, by combining the Function Execution service with querying (on
>>>>>> PARTITIONED data) [2], you could target the nodes that would
>>>>>> supposedly hold the data of interests, and execute the queries there.
>>>>>>
>>>>>> Additionally, assuming the Indexes were defined properly based on the
>>>>>> predicates in the queries (most often) used, that it would target the data
>>>>>> on disk matching the predicate and load only the data required (no data
>>>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>>>> predicate; that's absurd, OOMEs galore).
>>>>>>
>>>>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>>>>> updates them (either sync/async depending on your configuration) as the
>>>>>> data changes.  You would assume the data would not be changing in the
>>>>>> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
>>>>>> also assume that that data would then have to be in-memory (I think so).
>>>>>>
>>>>>> Please let me know if I am way of basis here, but I would think Geode
>>>>>> gives you enough options that particular UCs could be made, with nominal
>>>>>> effort, more optimal.
>>>>>>
>>>>>> Additional references...
>>>>>>
>>>>>> * Query Partitioned Regions [3]
>>>>>> * Working with Indexes [4], and then...
>>>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>>>> * Using Indexes with Overflow Regions [6]
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Cheers!
>>>>>> -John
>>>>>>
>>>>>>
>>>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, now I see.
>>>>>>>
>>>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>>>> policy in Ignite the data may be evicted to swap at some point of time and
>>>>>>> if a query is executed right after that the it may swap in the data back to
>>>>>>> memory. However the indexes must always be in memory.
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There is a notion of data aging out in Geode. We call it overflow
>>>>>>>> to disk.
>>>>>>>>
>>>>>>>> The idea is that as data gets old you can have the records in
>>>>>>>> memory expire, and that expiry can be to disk. That's the cold data.
>>>>>>>>
>>>>>>>> You may have built an index while you were initially loading the
>>>>>>>> data, and if your predicates only hit the indexes you will still get really
>>>>>>>> fast queries if the result sets aren't large.
>>>>>>>>
>>>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>>>> Geode that way.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Stolz
>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>> Mobile: 631-835-4771
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Mike,
>>>>>>>>>
>>>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>>>
>>>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>>>> a such way that there is no need to preload everything from disk into
>>>>>>>>> memory when a query is executed over cold data.
>>>>>>>>>
>>>>>>>>> Then what does "execution over cold data" mean? I'm referring to
>>>>>>>>> the following sentence from the main page:
>>>>>>>>>
>>>>>>>>> *Object Query Language allows distributed query execution on hot
>>>>>>>>> and cold data, with SQL-like capabilities, including joins.*
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <mstolz@pivotal.io>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Here's the thing...
>>>>>>>>>>
>>>>>>>>>> On any In-memory data grid, if you run a query before the data
>>>>>>>>>> has been loaded into memory, it is going to cause the exact same amount of
>>>>>>>>>> disk i/o to do the query as it will take to load everything into memory.
>>>>>>>>>>
>>>>>>>>>> And the system will still have to go ahead and load everything
>>>>>>>>>> into memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>>>>
>>>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>>>> actually store the keys in a separate file from the data and we can load
>>>>>>>>>> that file very quickly. Then if you go after the data for one of those keys
>>>>>>>>>> we can lazily load it from disk on demand if it hasn't yet been loaded into
>>>>>>>>>> memory.
>>>>>>>>>>
>>>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>>>> make it possible to load the indexes first and lazily load the data based
>>>>>>>>>> on queries against the indexes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mike Stolz
>>>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>>>> Mobile: 631-835-4771
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <magda7817@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Geode community,
>>>>>>>>>>>
>>>>>>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>>>>>>> persistence as well.
>>>>>>>>>>>
>>>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>>>> want to wait while all the data has been pre-loaded from the persistence to
>>>>>>>>>>> RAM and want to execute OQL queries right away. Is it feasible to implement
>>>>>>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Good luck,
>>>>>>>>> Denis Magda
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -John
>>>>>> 503-504-8657
>>>>>> john.blum10101 (skype)
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Good luck,
>>>> Denis Magda
>>>>
>>>
>>>
>>>
>>> --
>>> -John
>>> 503-504-8657
>>> john.blum10101 (skype)
>>>
>>
>>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>

Re: Persistence and OQL over cold data

Posted by John Blum <jb...@pivotal.io>.

My apologies for confusing Index storage with Geode; thought I heard this
somewhere in the context of GemFire/Geode before.  No doubt confused this
with other data stores I work with.  (So) much to learn yet.
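
For anyone following along, the two restart behaviours Mike describes in this
thread (keys recovered from a separate file with values loaded lazily, versus
indexes that are not stored and must be rebuilt by reading the whole disk
store) can be sketched as a toy model in plain Java. Everything here is an
assumption for illustration: RestartSketch, its methods and the "city" field
are hypothetical names, not Geode APIs, and the disk store is just a map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these class or method names are Geode APIs.
final class RestartSketch {
    private final Map<String, String> valuesOnDisk;                      // key -> city, the indexed field
    private final Map<String, String> valuesInMemory = new HashMap<>();  // values already faulted in
    private final Set<String> keysInMemory = new HashSet<>();
    private int valueReads = 0;

    RestartSketch(Map<String, String> persisted) {
        this.valuesOnDisk = persisted;
    }

    /** Models loading the separate keys file: cheap, touches no values. */
    void recoverKeys() {
        keysInMemory.addAll(valuesOnDisk.keySet());
    }

    /** Key-based access: the value is read from disk on first use only. */
    String get(String key) {
        if (!keysInMemory.contains(key)) {
            return null;
        }
        String value = valuesInMemory.get(key);
        if (value == null) {
            valueReads++;
            value = valuesOnDisk.get(key);
            valuesInMemory.put(key, value);
        }
        return value;
    }

    /** Rebuilding an index on a value field has to read every value. */
    Map<String, Set<String>> rebuildCityIndex() {
        Map<String, Set<String>> index = new HashMap<>();
        for (String key : keysInMemory) {
            index.computeIfAbsent(get(key), city -> new TreeSet<>()).add(key);
        }
        return index;
    }

    int valueReads() { return valueReads; }

    /** Five persisted entries with made-up keys and cities. */
    static RestartSketch sample() {
        Map<String, String> persisted = new LinkedHashMap<>();
        persisted.put("k1", "London");
        persisted.put("k2", "Paris");
        persisted.put("k3", "London");
        persisted.put("k4", "Berlin");
        persisted.put("k5", "Paris");
        return new RestartSketch(persisted);
    }

    public static void main(String[] args) {
        RestartSketch node = sample();
        node.recoverKeys();
        System.out.println("after key recovery: " + node.valueReads() + " value reads");
        node.get("k1");
        System.out.println("after one get:      " + node.valueReads() + " value reads");
        Map<String, Set<String>> index = node.rebuildCityIndex();
        System.out.println("after index build:  " + node.valueReads() + " value reads, London -> "
                + index.get("London"));
    }
}
```

In this model a key lookup right after restart costs one value read, whereas
having a usable index means all five values have been read. That is the
asymmetry behind the answer to the original question.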

On Fri, Aug 19, 2016 at 4:16 PM, Michael Stolz <ms...@pivotal.io> wrote:

> Unfortunately the indexes are not stored. They need to be rebuilt on
> restart. For that reason, on start up, the whole diskstore needs to be read.
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Manager
> Mobile: 631-835-4771
>
> On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:
>
>> *Jason, Mike*: first, thank you.
>>
>> > *In order to target the nodes that would supposedly hold the data of
>> interest you need to know the keys you are looking for. If you know the
>> keys why are you querying in the first place? Just do getAll(keys).*
>>
>> Two reasons...
>>
>> 1. I want to apply some "additional filtering" that can only be handled
>> elegantly by a OQL query predicate after a subset of the data has been
>> identified/targeted (using keys).  I have example of this somewhere (doh)
>> after working with a customer on this exact UC
>>
>> 2. I don't want the entire object (i.e. row); I only need a specific
>> "projection" of the (object) data.  This is particularly important if I
>> have very large and complex object graph and I am streaming data across the
>> wire (client/server).
>>
>>
>> > *The trouble happens when you are NOT hitting your indices. *
>>
>> Yes, good point.
>>
>> > *If you do a query that requires a full table scan, then every row in
>> the database table needs to be examined, and to examine it, it has to be in
>> memory at least briefly.*
>>
>> Of course.
>>
>> *Denis*-
>>
>> > *The disk entries that are mentioned by John were located in memory
>> before and were overflowed on disk at some point of time. It means that if
>> you start your cluster from scratch and want to run OQL queries over the
>> indexed data then you have to preload all the data from the persistence.*
>>
>> I don't specifically recall how much persistent data Geode reloads on
>> restart (Geode is a shared-nothing architecture though so each data node
>> has it's own persistence; additionally primaries must come online before
>> secondaries are accessible).  The question is how much data gets reloaded
>> on restart.  It would seem silly if the disk store contained more data then
>> would fit in memory and reload everything knowing some of the data would be
>> OVERFLOW on preload when it would not all fit.  Geode will reload the Index
>> though, which is stored as well.
>>
>> I let the experts answer this one.
>>
>>
>> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <ma...@gmail.com> wrote:
>>
>>> Hi John, Jason,
>>>
>>> If to expand more on this
>>>
>>>
>>> *If an index can be used, the index look up is executed and entries
>>> added to the result set.  If any of the entries that match the predicates
>>> is actually on disk, those values will need to be loaded to memory before
>>> being returned as a result.*
>>>
>>> The disk entries that are mentioned by John were located in memory
>>> before and were overflowed on disk at some point of time. It means that if
>>> you start your cluster from scratch and want to run OQL queries over the
>>> indexed data then you have to preload all the data from the persistence.
>>> Yes, some of the data may be overflowed back to disk during the preloading
>>> but you'll have your indexes in a valid state.
>>>
>>> Correct me if I'm still missing something.
>>>
>>> --
>>> Denis
>>>
>>>
>>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jh...@pivotal.io> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I think you were referring to Mike's explanation of:
>>>> "If, however, you ever resort to hitting the disk-based data for a
>>>> query it is going to have to read every record that isn't in memory from
>>>> disk which is going to be extremely slow. I personally would never use
>>>> Geode that way."
>>>>
>>>> When stating:
>>>> "Additionally, assuming the Indexes were defined properly based on the
>>>> predicates in the queries (most often) used, that it would target the data
>>>> on disk matching the predicate and load only the data required (no data
>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>> predicate; that's absurd, OOMEs galore)."
>>>>
>>>> Let me try to clear things up slightly...hopefully not causing more
>>>> confusion...
>>>> If an index can be used, the index look up is executed and entries
>>>> added to the result set.  If any of the entries that match the predicates
>>>> is actually on disk, those values will need to be loaded to memory before
>>>> being returned as a result.
>>>> I think what Mike was saying was that if an index is not used, then the
>>>> query itself would execute across the entire region, which means loading
>>>> every entry into memory.  We would need to inspect each entry to see if
>>>> fulfill the criteria.
>>>>
>>>> -Jason
>>>>
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>>
>>>>> Hi All-
>>>>>
>>>>> DISCLAIMER: I am no expert in querying and index
>>>>> architecture/implementation; mostly a consumer.
>>>>>
>>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>>>> my own understanding/sanity, it would seem we could do better than this,
>>>>> meaning...
>>>>>
>>>>> I would think any UC partially depends on the organization of your
>>>>> data in the grid as well.  If you used a PARTITION data management
>>>>> policy [1], for instance, then, of course, your data would be distributed
>>>>> and partitioned across all the data nodes in the grid (cluster) holding the
>>>>> data (i.e. data nodes that have declared the same PARTITION Region).
>>>>> It should then be possible to make this more optimal by having a redundancy
>>>>> level of 1 or more (depending on the frequency of transactions and data
>>>>> changes) to parallelize the data access.
>>>>>
>>>>> Not only does having more nodes mean better (or more optimal)
>>>>> organization, but more memory.  Still, given a very large data set, clearly
>>>>> some of the data will need to OVERFLOW (to disk).
>>>>>
>>>>> But, by combining the Function Execution service with querying (on
>>>>> PARTITIONED data) [2], you could target the nodes that would
>>>>> supposedly hold the data of interest, and execute the queries there.
>>>>>
>>>>> Additionally, assuming the Indexes were defined properly based on the
>>>>> predicates in the queries (most often) used, that it would target the data
>>>>> on disk matching the predicate and load only the data required (no data
>>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>>> predicate; that's absurd, OOMEs galore).
>>>>>
>>>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>>>> updates them (either sync/async depending on your configuration) as the
>>>>> data changes.  You would assume the data would not be changing in the
>>>>> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
>>>>> also assume that that data would then have to be in memory?  (I think so.)
>>>>>
>>>>> Please let me know if I am way off base here, but I would think Geode
>>>>> gives you enough options that particular UCs could be made, with nominal
>>>>> effort, more optimal.
>>>>>
>>>>> Additional references...
>>>>>
>>>>> * Query Partitioned Regions [3]
>>>>> * Working with Indexes [4], and then...
>>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>>> * Using Indexes with Overflow Regions [6]
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Cheers!
>>>>> -John
>>>>>
>>>>>
>>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, now I see.
>>>>>>
>>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>>> policy in Ignite, the data may be evicted to swap at some point in time,
>>>>>> and if a query is executed right after that, it may swap the data back
>>>>>> into memory. However, the indexes must always be in memory.
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io>
>>>>>> wrote:
>>>>>>
>>>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>>>> disk.
>>>>>>>
>>>>>>> The idea is that as data gets old you can have the records in memory
>>>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>>>
>>>>>>> You may have built an index while you were initially loading the
>>>>>>> data, and if your predicates only hit the indexes you will still get really
>>>>>>> fast queries if the result sets aren't large.
>>>>>>>
>>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>>> Geode that way.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mike Stolz
>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>> Mobile: 631-835-4771
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mike,
>>>>>>>>
>>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>>
>>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>>> such a way that there is no need to preload everything from disk into memory
>>>>>>>> when a query is executed over cold data.
>>>>>>>>
>>>>>>>> Then what does "execution over cold data" mean? I'm referring to
>>>>>>>> the following sentence from the main page:
>>>>>>>>
>>>>>>>> *Object Query Language allows distributed query execution on hot
>>>>>>>> and cold data, with SQL-like capabilities, including joins.*
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <ms...@pivotal.io>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here's the thing...
>>>>>>>>>
>>>>>>>>> On any In-memory data grid, if you run a query before the data has
>>>>>>>>> been loaded into memory, it is going to cause the exact same amount of disk
>>>>>>>>> i/o to do the query as it will take to load everything into memory.
>>>>>>>>>
>>>>>>>>> And the system will still have to go ahead and load everything
>>>>>>>>> into memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>>>
>>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>>> actually store the keys in a separate file from the data and we can load
>>>>>>>>> that file very quickly. Then if you go after the data for one of those keys
>>>>>>>>> we can lazily load it from disk on demand if it hasn't yet been loaded into
>>>>>>>>> memory.
>>>>>>>>>
>>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>>> make it possible to load the indexes first and lazily load the data based
>>>>>>>>> on queries against the indexes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mike Stolz
>>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>>> Mobile: 631-835-4771
>>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Geode community,
>>>>>>>>>>
>>>>>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>>>>>> persistence as well.
>>>>>>>>>>
>>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>>> want to wait while all the data has been pre-loaded from the persistence to
>>>>>>>>>> RAM and want to execute OQL queries right away. Is it feasible to implement
>>>>>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Good luck,
>>>>>>>> Denis Magda
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Good luck,
>>>>>> Denis Magda
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -John
>>>>> 503-504-8657
>>>>> john.blum10101 (skype)
>>>>>
>>>>
>>>
>>>
>>> --
>>> Good luck,
>>> Denis Magda
>>>
>>
>>
>>
>> --
>> -John
>> 503-504-8657
>> john.blum10101 (skype)
>>
>
>


-- 
-John
503-504-8657
john.blum10101 (skype)

Re: Persistence and OQL over cold data

Posted by Michael Stolz <ms...@pivotal.io>.

Unfortunately the indexes are not stored. They need to be rebuilt on
restart. For that reason, on start up, the whole diskstore needs to be read.
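
To make the cost concrete, here is a minimal sketch of the behavior described in this thread. It is a toy model in plain Java, not Geode code: the class name OverflowRegionSketch, its maps, and the diskReads counter are all hypothetical names invented for illustration. It assumes the index lives only in memory, so rebuilding it has to touch every record on disk, while an indexed query afterwards reads only the values that match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only (hypothetical names): an in-memory index over values kept on "disk". */
class OverflowRegionSketch {
    // Stand-in for the disk store: key -> age.
    private final Map<String, Integer> disk = new HashMap<>();
    // Values currently resident in memory.
    private final Map<String, Integer> memory = new HashMap<>();
    // In-memory index: age -> keys. It is not persisted in this model.
    private final TreeMap<Integer, List<String>> ageIndex = new TreeMap<>();
    // Counts simulated disk reads so the two query paths can be compared.
    int diskReads = 0;

    void persist(String key, int age) {
        disk.put(key, age);
    }

    private int readFromDisk(String key) {
        diskReads++;
        return disk.get(key);
    }

    /** On restart the index is rebuilt, which reads every record on disk. */
    void rebuildIndex() {
        ageIndex.clear();
        for (String key : disk.keySet()) {
            int age = readFromDisk(key);
            ageIndex.computeIfAbsent(age, a -> new ArrayList<String>()).add(key);
        }
    }

    /** Indexed query: only the matching values are faulted in from disk. */
    List<String> queryWithIndex(int age) {
        List<String> keys = ageIndex.getOrDefault(age, new ArrayList<String>());
        for (String key : keys) {
            if (!memory.containsKey(key)) {
                memory.put(key, readFromDisk(key));
            }
        }
        return new ArrayList<>(keys);
    }

    /** Unindexed query: every value that is not in memory is read and inspected. */
    List<String> queryWithScan(int age) {
        List<String> result = new ArrayList<>();
        for (String key : disk.keySet()) {
            Integer value = memory.get(key);
            if (value == null) {
                value = readFromDisk(key); // inspected, but not cached in this model
            }
            if (value == age) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        OverflowRegionSketch region = new OverflowRegionSketch();
        region.persist("a", 30);
        region.persist("b", 41);
        region.persist("c", 30);
        region.persist("d", 52);
        region.persist("e", 29);

        region.rebuildIndex();
        System.out.println("disk reads to rebuild index: " + region.diskReads);

        region.diskReads = 0;
        region.queryWithIndex(30);
        System.out.println("disk reads for indexed query: " + region.diskReads);

        region.diskReads = 0;
        region.queryWithScan(30);
        System.out.println("disk reads for full scan: " + region.diskReads);
    }
}
```

With five records of which two match, the rebuild reads all five, the indexed query reads only the two matches, and a scan run afterwards still has to read the three values that are not yet in memory.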

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:

> *Jason, Mike*: first, thank you.
>
> > *In order to target the nodes that would supposedly hold the data of
> interest you need to know the keys you are looking for. If you know the
> keys why are you querying in the first place? Just do getAll(keys).*
>
> Two reasons...
>
> 1. I want to apply some "additional filtering" that can only be handled
> elegantly by an OQL query predicate after a subset of the data has been
> identified/targeted (using keys).  I have an example of this somewhere (doh)
> after working with a customer on this exact UC.
>
> 2. I don't want the entire object (i.e. row); I only need a specific
> "projection" of the (object) data.  This is particularly important if I
> have a very large and complex object graph and I am streaming data across the
> wire (client/server).
>
>
> > *The trouble happens when you are NOT hitting your indices.*
>
> Yes, good point.
>
> > *If you do a query that requires a full table scan, then every row in
> the database table needs to be examined, and to examine it, it has to be in
> memory at least briefly.*
>
> Of course.
>
> *Denis*-
>
> > *The disk entries that are mentioned by John were located in memory
> before and were overflowed on disk at some point of time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.*
>
> I don't specifically recall how much persistent data Geode reloads on
> restart (Geode is a shared-nothing architecture though, so each data node
> has its own persistence; additionally, primaries must come online before
> secondaries are accessible).  The question is how much data gets reloaded
> on restart.  It would seem silly if the disk store contained more data than
> would fit in memory and Geode reloaded everything, knowing some of the data
> would OVERFLOW on preload when it would not all fit.  Geode will reload the
> Index though, which is stored as well.
>
> I'll let the experts answer this one.
>
>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>

Re: Persistence and OQL over cold data

Posted by John Blum <jb...@pivotal.io>.

Tangent... & an FYI regarding #2...

> 2. *I don't want the entire object (i.e. row); I only need a specific
"projection" of the (object) data.*

This is something I plan to expand on more in *Spring Data Geode
(/GemFire)* in the near term.

Currently, SD has a very robust infrastructure for handling "projection"
types in SD *Repositories*.  Generally, a *Repository* is defined around
the application domain type, which is the actual object stored in the
corresponding Geode cache *Region*.  For instance, say I have...

@Region("Contacts")
class Contact {

  @Id
  Long id;

  Address address;

  Person name;

  PhoneNumber phoneNumber;

  String email;

 ...
}

I would then define a *Repository* like so...

interface ContactRepository extends CrudRepository<Contact, Long> {

  List<Contact> findAllCustomersWithContacts();

}

The finder, query method "findAllCustomersWithContacts()" currently returns
a List of Contacts, with all the data in the Contact for all Contacts
matching my predicate (all customers with contact information).  However,
if I only wanted partial Contact details, then I could define the
following...

class ContactProjection {

  String email;
  String personsName;
  String phoneNumber;

  ...

}

And then redefine my finder, query method like so...

  List<ContactProjection> findAllCustomersWithContacts();

And SD (Geode/GemFire) would do the right thing (i.e. specifically craft an
(OQL) query with only the required/necessary information; NOTE: internally
Geode creates a special *struct* for the projected data returned).

The beauty of this is that the abstraction (& interface) is the same across
all data stores; thus it is handled generically.
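
For readers without a Spring Data setup at hand, the idea of a projection can be sketched in plain Java. This is only an illustration under stated assumptions: ProjectionSketch, the flattened String fields, and the ILLUSTRATIVE_OQL constant are hypothetical and are not taken from SD or Geode; in particular, the query text is an assumed shape of an OQL projection list, not the query SD would actually generate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of a projection; no Spring Data or Geode types are used. */
class ProjectionSketch {

    /** Simplified stand-in for the Contact domain object (fields flattened to Strings). */
    static class Contact {
        final long id;
        final String personsName;
        final String email;
        final String phoneNumber;
        final String address;

        Contact(long id, String personsName, String email, String phoneNumber, String address) {
            this.id = id;
            this.personsName = personsName;
            this.email = email;
            this.phoneNumber = phoneNumber;
            this.address = address;
        }
    }

    /** Carries only the fields the caller asked for. */
    static class ContactProjection {
        final String email;
        final String personsName;
        final String phoneNumber;

        ContactProjection(String email, String personsName, String phoneNumber) {
            this.email = email;
            this.personsName = personsName;
            this.phoneNumber = phoneNumber;
        }
    }

    // Assumed shape of an OQL query with a projection list (illustrative only).
    static final String ILLUSTRATIVE_OQL =
        "SELECT c.email, c.name, c.phoneNumber FROM /Contacts c";

    /** Copies just the projected fields, leaving id and address behind. */
    static List<ContactProjection> project(List<Contact> contacts) {
        List<ContactProjection> projections = new ArrayList<>();
        for (Contact c : contacts) {
            projections.add(new ContactProjection(c.email, c.personsName, c.phoneNumber));
        }
        return projections;
    }

    public static void main(String[] args) {
        List<Contact> contacts = Arrays.asList(
            new Contact(1L, "Jon Doe", "jon@example.com", "503-555-0100", "1 Main St"));
        for (ContactProjection p : project(contacts)) {
            System.out.println(p.personsName + " <" + p.email + "> " + p.phoneNumber);
        }
    }
}
```

The point of doing this on the server side, as opposed to the client-side copy shown here, is that the unneeded fields never cross the wire.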

More here [1].

Cheers,
John

[1]
http://docs.spring.io/spring-data/commons/docs/current/reference/html/#repositories


>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Geode community,
>>>>>>>>>
>>>>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>>>>> persistence as well.
>>>>>>>>>
>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>> want to wait while all the data has been pre-loaded from the persistence to
>>>>>>>>> RAM and want to execute OQL queries right away. Is it feasible to implement
>>>>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Good luck,
>>>>> Denis Magda
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -John
>>>> 503-504-8657
>>>> john.blum10101 (skype)
>>>>
>>>
>>
>>
>> --
>> Good luck,
>> Denis Magda
>>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>



-- 
-John
503-504-8657
john.blum10101 (skype)

Re: Persistence and OQL over cold data

Posted by Denis Magda <ma...@gmail.com>.

*JB*-

> *The disk entries that are mentioned by John were located in memory
> before and were overflowed on disk at some point of time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.*

*I don't specifically recall how much persistent data Geode reloads on
restart (Geode is a shared-nothing architecture though, so each data node
has its own persistence; additionally, primaries must come online before
secondaries are accessible). The question is how much data gets reloaded
on restart. It would seem silly if the disk store contained more data than
would fit in memory and Geode reloaded everything, knowing some of the data
would OVERFLOW on preload because it would not all fit. Geode will reload
the Index though, which is stored as well.*

The Geode documentation says that indexes are not stored on disk. Look for
"Geode does not write indexes to disk." on page [1]. Moreover, in-memory
indexes point to entries stored at particular addresses in RAM. Those
addresses differ from the on-disk locations where entries are persisted, so
simply storing the indexes wouldn't work.

Actually, my picture of the world is as follows.

1. There is a difference between data that is persisted (just a copy of
the data we have in memory) and data that is overflowed to disk.
2. Indexed OQL queries don't work over persisted data, but
3. Indexed OQL queries work perfectly well over overflowed data.

Geode gurus, please correct me if I'm wrong.

[1] http://geode.docs.pivotal.io/docs/managing/disk_storage/how_disk_stores_work.html

--
Denis

On Fri, Aug 19, 2016 at 2:30 PM, John Blum <jb...@pivotal.io> wrote:

> *Jason, Mike*: first, thank you.
>
> > *In order to target the nodes that would supposedly hold the data of
> interest you need to know the keys you are looking for. If you know the
> keys why are you querying in the first place? Just do getAll(keys).*
>
> Two reasons...
>
> 1. I want to apply some "additional filtering" that can only be handled
> elegantly by a OQL query predicate after a subset of the data has been
> identified/targeted (using keys).  I have example of this somewhere (doh)
> after working with a customer on this exact UC
>
> 2. I don't want the entire object (i.e. row); I only need a specific
> "projection" of the (object) data.  This is particularly important if I
> have very large and complex object graph and I am streaming data across the
> wire (client/server).
>
>
> > *The trouble happens when you are NOT hitting your indices. *
>
> Yes, good point.
>
> > *If you do a query that requires a full table scan, then every row in
> the database table needs to be examined, and to examine it, it has to be in
> memory at least briefly.*
>
> Of course.
>
> *Denis*-
>
> > *The disk entries that are mentioned by John were located in memory
> before and were overflowed on disk at some point of time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.*
>
> I don't specifically recall how much persistent data Geode reloads on
> restart (Geode is a shared-nothing architecture though so each data node
> has it's own persistence; additionally primaries must come online before
> secondaries are accessible).  The question is how much data gets reloaded
> on restart.  It would seem silly if the disk store contained more data then
> would fit in memory and reload everything knowing some of the data would be
> OVERFLOW on preload when it would not all fit.  Geode will reload the Index
> though, which is stored as well.
>
> I let the experts answer this one.
>
>
> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <ma...@gmail.com> wrote:
>
>> Hi John, Jason,
>>
>> If to expand more on this
>>
>>
>> *If an index can be used, the index look up is executed and entries added
>> to the result set.  If any of the entries that match the predicates is
>> actually on disk, those values will need to be loaded to memory before
>> being returned as a result.*
>>
>> The disk entries that are mentioned by John were located in memory before
>> and were overflowed on disk at some point of time. It means that if you
>> start your cluster from scratch and want to run OQL queries over the
>> indexed data then you have to preload all the data from the persistence.
>> Yes, some of the data may be overflowed back to disk during the preloading
>> but you'll have your indexes in a valid state.
>>
>> Correct me if I'm still missing something.
>>
>> --
>> Denis
>>
>>
>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jh...@pivotal.io> wrote:
>>
>>> Hi John,
>>>
>>> I think you were referring to Mike's explanation of:
>>> "If, however, you ever resort to hitting the disk-based data for a query
>>> it is going to have to read every record that isn't in memory from disk
>>> which is going to be extremely slow. I personally would never use Geode
>>> that way."
>>>
>>> When stating:
>>> "Additionally, assuming the Indexes were defined properly based on the
>>> predicates in the queries (most often) used, that it would target the data
>>> on disk matching the predicate and load only the data required (no data
>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>> load the entire table/Region/Map/whatever to access the data matching the
>>> predicate; that's absurd, OOMEs galore)."
>>>
>>> Let me try to clear things up slightly...hopefully not causing more
>>> confusion...
>>> If an index can be used, the index look up is executed and entries added
>>> to the result set.  If any of the entries that match the predicates is
>>> actually on disk, those values will need to be loaded to memory before
>>> being returned as a result.
>>> I think what Mike was saying was that if an index is not used, then the
>>> query itself would execute across the entire region, which means loading
>>> every entry into memory.  We would need to inspect each entry to see if
>>> fulfill the criteria.
>>>
>>> -Jason
>>>
>>>
>>>
>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>
>>>> Hi All-
>>>>
>>>> DISCLAIMER: I am no expert in querying and index
>>>> architecture/implementation; mostly a consumer.
>>>>
>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>>> my own understanding/sanity, it would seem we could do better than this,
>>>> meaning...
>>>>
>>>> I would think any UC partially depends on the organization of your data
>>>> in the grid as well.  If you used a PARTITION data management policy
>>>> [1], for instance, then, of course, your data would be distributed and
>>>> partitioned across all the data nodes in the grid (cluster) holding the
>>>> data (i.e. data nodes that have declared the same PARTITION Region).
>>>> It should then be possible to make this more optimal by have a redundancy
>>>> level of 1 or more (depending on the frequency of transactions and data
>>>> changes) to parallelize the data access.
>>>>
>>>> Not only does having more nodes mean better (or more optimal)
>>>> organization, but more memory.  Still, given a very large data set, clearly
>>>> some of the data will need to OVERFLOW (to disk).
>>>>
>>>> But, by combining the Function Execution service with querying (on
>>>> PARTITIONED data) [2], you could target the nodes that would
>>>> supposedly hold the data of interests, and execute the queries there.
>>>>
>>>> Additionally, assuming the Indexes were defined properly based on the
>>>> predicates in the queries (most often) used, that it would target the data
>>>> on disk matching the predicate and load only the data required (no data
>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>> predicate; that's absurd, OOMEs galore).
>>>>
>>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>>> updates them (either sync/async depending on your configuration) as the
>>>> data changes.  You would assume the data would not be changing in the
>>>> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
>>>> also assume that that data would then have to be in-memory (I think so).
>>>>
>>>> Please let me know if I am way of basis here, but I would think Geode
>>>> gives you enough options that particular UCs could be made, with nominal
>>>> effort, more optimal.
>>>>
>>>> Additional references...
>>>>
>>>> * Query Partitioned Regions [3]
>>>> * Working with Indexes [4], and then...
>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>> * Using Indexes with Overflow Regions [6]
>>>>
>>>> Hope this helps.
>>>>
>>>> Cheers!
>>>> -John
>>>>
>>>>
>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, now I see.
>>>>>
>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>> policy in Ignite the data may be evicted to swap at some point of time and
>>>>> if a query is executed right after that the it may swap in the data back to
>>>>> memory. However the indexes must always be in memory.
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io>
>>>>> wrote:
>>>>>
>>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>>> disk.
>>>>>>
>>>>>> The idea is that as data gets old you can have the records in memory
>>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>>
>>>>>> You may have built an index while you were initially loading the
>>>>>> data, and if your predicates only hit the indexes you will still get really
>>>>>> fast queries if the result sets aren't large.
>>>>>>
>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>> Geode that way.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mike Stolz
>>>>>> Principal Engineer, GemFire Product Manager
>>>>>> Mobile: 631-835-4771
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>
>>>>>>> I just thought that you were able to do something with indexes in a
>>>>>>> such way that there is no need to preload everything from disk into memory
>>>>>>> when a query is executed over cold data.
>>>>>>>
>>>>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>>>>> following sentence from the main page:
>>>>>>>
>>>>>>> *Object Query Language allows distributed query execution on hot and
>>>>>>> cold data, with SQL-like capabilities, including joins.*
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <ms...@pivotal.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's the thing...
>>>>>>>>
>>>>>>>> On any In-memory data grid, if you run a query before the data has
>>>>>>>> been loaded into memory, it is going to cause the exact same amount of disk
>>>>>>>> i/o to do the query as it will take to load everything into memory.
>>>>>>>>
>>>>>>>> And the system will still have to go ahead and load everything into
>>>>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>>
>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>> actually store the keys in a separate file from the data and we can load
>>>>>>>> that file very quickly. Then if you go after the data for one of those keys
>>>>>>>> we can lazily load it from disk on demand if it hasn't yet been loaded into
>>>>>>>> memory.
>>>>>>>>
>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>> make it possible to load the indexes first and lazily load the data based
>>>>>>>> on queries against the indexes.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Stolz
>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>> Mobile: 631-835-4771
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Geode community,
>>>>>>>>>
>>>>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>>>>> persistence as well.
>>>>>>>>>
>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>> want to wait while all the data has been pre-loaded from the persistence to
>>>>>>>>> RAM and want to execute OQL queries right away. Is it feasible to implement
>>>>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>>>>> this.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Good luck,
>>>>> Denis Magda
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -John
>>>> 503-504-8657
>>>> john.blum10101 (skype)
>>>>
>>>
>>
>>
>> --
>> Good luck,
>> Denis Magda
>>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>



-- 
Good luck,
Denis Magda

Re: Persistence and OQL over cold data

Posted by John Blum <jb...@pivotal.io>.

*Jason, Mike*: first, thank you.

> *In order to target the nodes that would supposedly hold the data of
> interest you need to know the keys you are looking for. If you know the
> keys why are you querying in the first place? Just do getAll(keys).*

Two reasons...

1. I want to apply some "additional filtering" that can only be handled
elegantly by an OQL query predicate after a subset of the data has been
identified/targeted (using keys). I have an example of this somewhere (doh)
from working with a customer on this exact use case.

2. I don't want the entire object (i.e. row); I only need a specific
"projection" of the (object) data. This is particularly important if I
have a very large and complex object graph and I am streaming data across
the wire (client/server).
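
The two reasons above can be pictured with a small plain-Java sketch. It
is only a stand-in: it does not use the Geode API, and the Order class, the
map standing in for a Region, and the method names are all hypothetical.
It narrows the data to a known set of keys first, then applies a predicate
and returns a single field instead of whole objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Plain-Java stand-in, NOT the Geode API; every name here is hypothetical.
final class KeyFilterProjectionSketch {

    // A "large" object: the attachment is what we do not want to ship.
    static final class Order {
        final String customer;
        final double total;
        final byte[] attachment;

        Order(String customer, double total, byte[] attachment) {
            this.customer = customer;
            this.total = total;
            this.attachment = attachment;
        }
    }

    // Roughly what a query such as
    //   SELECT o.customer FROM /Orders o WHERE o.total >= $1
    // would express when it is restricted to a known set of keys.
    static List<String> customersWithLargeOrders(
            Map<Integer, Order> region, Set<Integer> keys, double minTotal) {
        return keys.stream()
                .map(region::get)                  // step 1: target entries by key
                .filter(Objects::nonNull)          // ignore keys that are not present
                .filter(o -> o.total >= minTotal)  // step 2: additional predicate
                .map(o -> o.customer)              // step 3: projection of one field
                .sorted()
                .collect(Collectors.toList());
    }

    static Map<Integer, Order> sampleRegion() {
        Map<Integer, Order> region = new TreeMap<>();
        region.put(1, new Order("alice", 250.0, new byte[1024]));
        region.put(2, new Order("bob", 40.0, new byte[1024]));
        region.put(3, new Order("carol", 120.0, new byte[1024]));
        region.put(4, new Order("dave", 900.0, new byte[1024]));
        return region;
    }

    public static void main(String[] args) {
        Set<Integer> keys = new TreeSet<>(Arrays.asList(1, 2, 3, 99));
        System.out.println(customersWithLargeOrders(sampleRegion(), keys, 100.0));
    }
}
```

A plain getAll(keys) would correspond to step 1 alone and would return the
full objects, attachments included; the predicate and the projection are
what the query adds on top.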


> *The trouble happens when you are NOT hitting your indices.*

Yes, good point.

> *If you do a query that requires a full table scan, then every row in the
> database table needs to be examined, and to examine it, it has to be in
> memory at least briefly.*

Of course.

*Denis*-

> *The disk entries that are mentioned by John were located in memory
> before and were overflowed on disk at some point of time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.*

I don't specifically recall how much persistent data Geode reloads on
restart (Geode is a shared-nothing architecture though, so each data node
has its own persistence; additionally, primaries must come online before
secondaries are accessible). The question is how much data gets reloaded
on restart. It would seem silly if the disk store contained more data than
would fit in memory and Geode reloaded everything, knowing some of the data
would OVERFLOW on preload because it would not all fit. Geode will reload
the Index though, which is stored as well.

I'll let the experts answer this one.


On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <ma...@gmail.com> wrote:

> Hi John, Jason,
>
> If to expand more on this
>
>
> *If an index can be used, the index look up is executed and entries added
> to the result set.  If any of the entries that match the predicates is
> actually on disk, those values will need to be loaded to memory before
> being returned as a result.*
>
> The disk entries that are mentioned by John were located in memory before
> and were overflowed on disk at some point of time. It means that if you
> start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.
> Yes, some of the data may be overflowed back to disk during the preloading
> but you'll have your indexes in a valid state.
>
> Correct me if I'm still missing something.
>
> --
> Denis
>
>
> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jh...@pivotal.io> wrote:
>
>> Hi John,
>>
>> I think you were referring to Mike's explanation of:
>> "If, however, you ever resort to hitting the disk-based data for a query
>> it is going to have to read every record that isn't in memory from disk
>> which is going to be extremely slow. I personally would never use Geode
>> that way."
>>
>> When stating:
>> "Additionally, assuming the Indexes were defined properly based on the
>> predicates in the queries (most often) used, that it would target the data
>> on disk matching the predicate and load only the data required (no data
>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>> load the entire table/Region/Map/whatever to access the data matching the
>> predicate; that's absurd, OOMEs galore)."
>>
>> Let me try to clear things up slightly...hopefully not causing more
>> confusion...
>> If an index can be used, the index look up is executed and entries added
>> to the result set.  If any of the entries that match the predicates is
>> actually on disk, those values will need to be loaded to memory before
>> being returned as a result.
>> I think what Mike was saying was that if an index is not used, then the
>> query itself would execute across the entire region, which means loading
>> every entry into memory.  We would need to inspect each entry to see if
>> fulfill the criteria.
>>
>> -Jason
>>
>>
>>
>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>
>>> Hi All-
>>>
>>> DISCLAIMER: I am no expert in querying and index
>>> architecture/implementation; mostly a consumer.
>>>
>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>> my own understanding/sanity, it would seem we could do better than this,
>>> meaning...
>>>
>>> I would think any UC partially depends on the organization of your data
>>> in the grid as well.  If you used a PARTITION data management policy
>>> [1], for instance, then, of course, your data would be distributed and
>>> partitioned across all the data nodes in the grid (cluster) holding the
>>> data (i.e. data nodes that have declared the same PARTITION Region).
>>> It should then be possible to make this more optimal by have a redundancy
>>> level of 1 or more (depending on the frequency of transactions and data
>>> changes) to parallelize the data access.
>>>
>>> Not only does having more nodes mean better (or more optimal)
>>> organization, but more memory.  Still, given a very large data set, clearly
>>> some of the data will need to OVERFLOW (to disk).
>>>
>>> But, by combining the Function Execution service with querying (on
>>> PARTITIONED data) [2], you could target the nodes that would supposedly
>>> hold the data of interests, and execute the queries there.
>>>
>>> Additionally, assuming the Indexes were defined properly based on the
>>> predicates in the queries (most often) used, that it would target the data
>>> on disk matching the predicate and load only the data required (no data
>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>> load the entire table/Region/Map/whatever to access the data matching the
>>> predicate; that's absurd, OOMEs galore).
>>>
>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>> updates them (either sync/async depending on your configuration) as the
>>> data changes.  You would assume the data would not be changing in the
>>> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
>>> also assume that that data would then have to be in-memory (I think so).
>>>
>>> Please let me know if I am way of basis here, but I would think Geode
>>> gives you enough options that particular UCs could be made, with nominal
>>> effort, more optimal.
>>>
>>> Additional references...
>>>
>>> * Query Partitioned Regions [3]
>>> * Working with Indexes [4], and then...
>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>> * Using Indexes with Overflow Regions [6]
>>>
>>> Hope this helps.
>>>
>>> Cheers!
>>> -John
>>>
>>>
>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>
>>>
>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, now I see.
>>>>
>>>> This works the same way as in Ignite then. If you set up an eviction
>>>> policy in Ignite the data may be evicted to swap at some point of time and
>>>> if a query is executed right after that the it may swap in the data back to
>>>> memory. However the indexes must always be in memory.
>>>>
>>>> --
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io>
>>>> wrote:
>>>>
>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>> disk.
>>>>>
>>>>> The idea is that as data gets old you can have the records in memory
>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>
>>>>> You may have built an index while you were initially loading the data,
>>>>> and if your predicates only hit the indexes you will still get really fast
>>>>> queries if the result sets aren't large.
>>>>>
>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>> query it is going to have to read every record that isn't in memory from
>>>>> disk which is going to be extremely slow. I personally would never use
>>>>> Geode that way.
>>>>>
>>>>>
>>>>> --
>>>>> Mike Stolz
>>>>> Principal Engineer, GemFire Product Manager
>>>>> Mobile: 631-835-4771
>>>>>
>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>
>>>>>> I just thought that you were able to do something with indexes in a
>>>>>> such way that there is no need to preload everything from disk into memory
>>>>>> when a query is executed over cold data.
>>>>>>
>>>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>>>> following sentence from the main page:
>>>>>>
>>>>>> *Object Query Language allows distributed query execution on hot and
>>>>>> cold data, with SQL-like capabilities, including joins.*
>>>>>>
>>>>>> --
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <ms...@pivotal.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Here's the thing...
>>>>>>>
>>>>>>> On any In-memory data grid, if you run a query before the data has
>>>>>>> been loaded into memory, it is going to cause the exact same amount of disk
>>>>>>> i/o to do the query as it will take to load everything into memory.
>>>>>>>
>>>>>>> And the system will still have to go ahead and load everything into
>>>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>
>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>> actually store the keys in a separate file from the data and we can load
>>>>>>> that file very quickly. Then if you go after the data for one of those keys
>>>>>>> we can lazily load it from disk on demand if it hasn't yet been loaded into
>>>>>>> memory.
>>>>>>>
>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>> make it possible to load the indexes first and lazily load the data based
>>>>>>> on queries against the indexes.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mike Stolz
>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>> Mobile: 631-835-4771
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Geode community,
>>>>>>>>
>>>>>>>> I've been investigating possibilities of Geode Persistence for a
>>>>>>>> while and still can't get it clear whether I need to have all my data in
>>>>>>>> memory if I want to execute OQL queries or OQL engine works over the
>>>>>>>> persistence as well.
>>>>>>>>
>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>> want to wait while all the data has been pre-loaded from the persistence to
>>>>>>>> RAM and want to execute OQL queries right away. Is it feasible to implement
>>>>>>>> with Geode? Please provide me with the links where I can read more about
>>>>>>>> this.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Denis
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Good luck,
>>>>>> Denis Magda
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Good luck,
>>>> Denis Magda
>>>>
>>>
>>>
>>>
>>> --
>>> -John
>>> 503-504-8657
>>> john.blum10101 (skype)
>>>
>>
>
>
> --
> Good luck,
> Denis Magda
>



-- 
-John
503-504-8657
john.blum10101 (skype)

Re: Persistence and OQL over cold data

Posted by Denis Magda <ma...@gmail.com>.

Hi John, Jason,

To expand on this:


*If an index can be used, the index look up is executed and entries added
to the result set.  If any of the entries that match the predicates is
actually on disk, those values will need to be loaded to memory before
being returned as a result.*

The disk entries that John mentioned were in memory earlier and were
overflowed to disk at some point. This means that if you start your
cluster from scratch and want to run OQL queries over the indexed data,
you have to preload all the data from persistence. Yes, some of the data
may be overflowed back to disk during preloading, but your indexes will be
in a valid state.
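
The overflow behaviour discussed here can be pictured with a toy model in
plain Java. It is only a sketch under stated assumptions: it is not
Geode's implementation or API, every class and method name is hypothetical,
and a simple counter stands in for disk I/O. It keeps the index in memory,
moves some values to "disk", and compares how many disk reads an indexed
lookup needs with how many a full scan needs. It does not model recovery
after a restart, which is the open question in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only. NOT Geode's implementation or API; every name is hypothetical.
final class OverflowRegionSketch {
    // key -> value for entries whose values are still in memory ("hot").
    private final Map<Integer, String> memory = new LinkedHashMap<>();
    // key -> value for entries whose values were overflowed to disk ("cold").
    private final Map<Integer, String> disk = new LinkedHashMap<>();
    // In-memory index on the first character of the value: field -> keys.
    private final Map<Character, List<Integer>> index = new HashMap<>();
    // Number of simulated disk reads performed so far.
    int diskReads = 0;

    void put(int key, String value) {
        memory.put(key, value);
        index.computeIfAbsent(value.charAt(0), c -> new ArrayList<>()).add(key);
    }

    // Overflow moves the value to disk; the index entry stays in memory.
    void overflow(int key) {
        String value = memory.remove(key);
        if (value != null) {
            disk.put(key, value);
        }
    }

    // Returns the value, counting a disk read when it is not in memory.
    private String read(int key) {
        String value = memory.get(key);
        if (value == null) {
            diskReads++;
            value = disk.get(key);
        }
        return value;
    }

    // Indexed query: only the values of matching entries are read.
    List<String> queryWithIndex(char first) {
        List<String> result = new ArrayList<>();
        for (int key : index.getOrDefault(first, Collections.<Integer>emptyList())) {
            result.add(read(key));
        }
        return result;
    }

    // Non-indexed query: every value has to be read and inspected.
    List<String> queryWithScan(char first) {
        List<Integer> keys = new ArrayList<>(memory.keySet());
        keys.addAll(disk.keySet());
        Collections.sort(keys);
        List<String> result = new ArrayList<>();
        for (int key : keys) {
            String value = read(key);
            if (value.charAt(0) == first) {
                result.add(value);
            }
        }
        return result;
    }

    // Four entries, three of which have been overflowed to disk.
    static OverflowRegionSketch sample() {
        OverflowRegionSketch region = new OverflowRegionSketch();
        region.put(1, "apple");
        region.put(2, "avocado");
        region.put(3, "banana");
        region.put(4, "cherry");
        region.overflow(2);
        region.overflow(3);
        region.overflow(4);
        return region;
    }

    public static void main(String[] args) {
        OverflowRegionSketch indexed = sample();
        System.out.println(indexed.queryWithIndex('a') + " disk reads: " + indexed.diskReads);
        OverflowRegionSketch scanned = sample();
        System.out.println(scanned.queryWithScan('a') + " disk reads: " + scanned.diskReads);
    }
}
```

In this model both queries return the same two values, but the indexed one
touches the disk only for the matching cold entry, while the scan reads
every cold entry.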

Correct me if I'm still missing something.

--
Denis


On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jh...@pivotal.io> wrote:

> Hi John,
>
> I think you were referring to Mike's explanation of:
> "If, however, you ever resort to hitting the disk-based data for a query
> it is going to have to read every record that isn't in memory from disk
> which is going to be extremely slow. I personally would never use Geode
> that way."
>
> When stating:
> "Additionally, assuming the Indexes were defined properly based on the
> predicates in the queries (most often) used, that it would target the data
> on disk matching the predicate and load only the data required (no data
> store, RDBMS or otherwise, especially disk-bound stores, should have to
> load the entire table/Region/Map/whatever to access the data matching the
> predicate; that's absurd, OOMEs galore)."
>
> Let me try to clear things up slightly...hopefully not causing more
> confusion...
> If an index can be used, the index look up is executed and entries added
> to the result set.  If any of the entries that match the predicates is
> actually on disk, those values will need to be loaded to memory before
> being returned as a result.
> I think what Mike was saying was that if an index is not used, then the
> query itself would execute across the entire region, which means loading
> every entry into memory.  We would need to inspect each entry to see if
> fulfill the criteria.
>
> -Jason
>


-- 
Good luck,
Denis Magda

Re: Persistence and OQL over cold data

Posted by Jason Huynh <jh...@pivotal.io>.

Hi John,

I think you were referring to Mike's explanation of:
"If, however, you ever resort to hitting the disk-based data for a query it
is going to have to read every record that isn't in memory from disk which
is going to be extremely slow. I personally would never use Geode that way."

When stating:
"Additionally, assuming the Indexes were defined properly based on the
predicates in the queries (most often) used, that it would target the data
on disk matching the predicate and load only the data required (no data
store, RDBMS or otherwise, especially disk-bound stores, should have to
load the entire table/Region/Map/whatever to access the data matching the
predicate; that's absurd, OOMEs galore)."

Let me try to clear things up slightly...hopefully not causing more
confusion...
If an index can be used, the index lookup is executed and the matching
entries are added to the result set.  If any of the entries that match the
predicates are actually on disk, those values will need to be loaded into
memory before being returned as a result.
I think what Mike was saying was that if an index is not used, then the
query itself would execute across the entire region, which means loading
every entry into memory.  We would need to inspect each entry to see if it
fulfills the criteria.
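
To make the difference concrete, here is a minimal sketch in plain Java. It
is a toy model, not Geode code: the class OverflowQueryDemo, its maps, and
the "status" field are all hypothetical names invented for illustration. It
only counts how many overflowed values have to be read from "disk" when a
query goes through an in-memory index versus when it scans the whole region.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (assumption): none of these names are Geode classes.
class OverflowQueryDemo {
    private final Map<Integer, String> memory = new HashMap<>(); // resident values
    private final Map<Integer, String> disk = new HashMap<>();   // overflowed values
    // The index stays in memory: status -> keys of entries with that status.
    private final Map<String, List<Integer>> statusIndex = new HashMap<>();
    int diskReads = 0;

    void put(int key, String status, boolean overflowed) {
        (overflowed ? disk : memory).put(key, status);
        statusIndex.computeIfAbsent(status, s -> new ArrayList<>()).add(key);
    }

    // Return the value for a key, faulting it in from "disk" if necessary.
    private String load(int key) {
        String value = memory.get(key);
        if (value == null) {
            value = disk.remove(key);
            memory.put(key, value);
            diskReads++;
        }
        return value;
    }

    // Index path: only entries that match the predicate are loaded.
    List<Integer> queryWithIndex(String status) {
        List<Integer> result = new ArrayList<>();
        for (int key : statusIndex.getOrDefault(status, Collections.emptyList())) {
            load(key);
            result.add(key);
        }
        return result;
    }

    // No index: every entry must be loaded and inspected.
    List<Integer> queryWithScan(String status) {
        List<Integer> allKeys = new ArrayList<>(memory.keySet());
        allKeys.addAll(disk.keySet());
        Collections.sort(allKeys);
        List<Integer> result = new ArrayList<>();
        for (int key : allKeys) {
            if (status.equals(load(key))) {
                result.add(key);
            }
        }
        return result;
    }

    // Ten entries; keys 0 and 5 are "active"; keys 2..9 are overflowed.
    static OverflowQueryDemo sample() {
        OverflowQueryDemo store = new OverflowQueryDemo();
        for (int key = 0; key < 10; key++) {
            String status = (key % 5 == 0) ? "active" : "closed";
            store.put(key, status, key >= 2);
        }
        return store;
    }

    public static void main(String[] args) {
        OverflowQueryDemo indexed = sample();
        System.out.println("index: " + indexed.queryWithIndex("active")
                + " diskReads=" + indexed.diskReads);
        OverflowQueryDemo scanned = sample();
        System.out.println("scan:  " + scanned.queryWithScan("active")
                + " diskReads=" + scanned.diskReads);
    }
}
```

Both paths return the same keys, but in this model the indexed query reads
one overflowed value while the scan reads all eight.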

-Jason



On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:

> Hi All-
>
> DISCLAIMER: I am no expert in querying and index
> architecture/implementation; mostly a consumer.
>
> Perhaps *Anil* or *Jason* can shed more light on the subject, but for my
> own understanding/sanity, it would seem we could do better than this,
> meaning...
>
> I would think any use case partially depends on the organization of your
> data in the grid as well.  If you used a PARTITION data management policy
> [1], for instance, then, of course, your data would be distributed and
> partitioned across all the data nodes in the grid (cluster) holding the
> data (i.e. data nodes that have declared the same PARTITION Region).  It
> should then be possible to make this more optimal by having a redundancy
> level of 1 or more (depending on the frequency of transactions and data
> changes) to parallelize the data access.
>
> Not only does having more nodes mean better (or more optimal)
> organization, but more memory.  Still, given a very large data set, clearly
> some of the data will need to OVERFLOW (to disk).
>
> But, by combining the Function Execution service with querying (on
> PARTITIONED data) [2], you could target the nodes that would supposedly
> hold the data of interest, and execute the queries there.
>
> Additionally, assuming the Indexes were defined properly based on the
> predicates in the queries (most often) used, that it would target the data
> on disk matching the predicate and load only the data required (no data
> store, RDBMS or otherwise, especially disk-bound stores, should have to
> load the entire table/Region/Map/whatever to access the data matching the
> predicate; that's absurd, OOMEs galore).
>
> To my knowledge, Geode keeps Indexes in memory (even loads them on
> startup) and updates them (either sync/async depending on your
> configuration) as the data changes.  You would assume the data would not
> be changing in the OVERFLOW, disk-based data set.  If the data did change,
> then wouldn't you also assume that this data would then have to be in
> memory?  (I think so.)
>
> Please let me know if I am way off base here, but I would think Geode
> gives you enough options that particular use cases could be made, with
> nominal effort, more optimal.
>
> Additional references...
>
> * Query Partitioned Regions [3]
> * Working with Indexes [4], and then...
> * Tips and Guidelines on Using Indexes [5], but also important...
> * Using Indexes with Overflow Regions [6]
>
> Hope this helps.
>
> Cheers!
> -John
>
>
> [1]
> http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
> [2]
> http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
> [3]
> http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
> [4]
> http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
> [5]
> http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
> [6]
> http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>

Re: Persistence and OQL over cold data

Posted by John Blum <jb...@pivotal.io>.

Hi All-

DISCLAIMER: I am no expert in querying and index
architecture/implementation; mostly a consumer.

Perhaps *Anil* or *Jason* can shed more light on the subject, but for my
own understanding/sanity, it would seem we could do better than this,
meaning...

I would think any use case partially depends on the organization of your
data in the grid as well.  If you used a PARTITION data management policy
[1], for instance, then, of course, your data would be distributed and
partitioned across all the data nodes in the grid (cluster) holding the data
(i.e. data nodes that have declared the same PARTITION Region).  It should
then be possible to make this more optimal by having a redundancy level of 1
or more (depending on the frequency of transactions and data changes) to
parallelize the data access.

Not only does having more nodes mean better (or more optimal) organization,
but more memory.  Still, given a very large data set, clearly some of the
data will need to OVERFLOW (to disk).

But, by combining the Function Execution service with querying (on
PARTITIONED data) [2], you could target the nodes that would supposedly
hold the data of interest, and execute the queries there.
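
As a rough illustration of what "targeting the nodes" means, here is a toy
routing sketch in plain Java. It is not Geode's bucket assignment: the class
PartitionRoutingDemo, the modulo-hash rule, and the node numbering are
assumptions made up for this example. In Geode itself, as far as I know, the
equivalent is running a function on a region with a key filter
(FunctionService.onRegion(...).withFilter(...)), which routes the function to
the members hosting those keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model (assumption): simple hash routing, not Geode's real scheme.
class PartitionRoutingDemo {
    // Map a key to the node that owns it.
    static int nodeFor(String key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    // Nodes that have to run the query when it is restricted to these keys.
    static Set<Integer> targetNodes(List<String> filterKeys, int nodeCount) {
        Set<Integer> nodes = new TreeSet<>();
        for (String key : filterKeys) {
            nodes.add(nodeFor(key, nodeCount));
        }
        return nodes;
    }

    public static void main(String[] args) {
        // With a key filter, only the owning nodes are contacted.
        System.out.println(targetNodes(Arrays.asList("a", "b"), 4));
    }
}
```

With four nodes, a query filtered on the keys "a" and "b" only needs nodes 1
and 2 in this model; the other nodes, and whatever they have overflowed to
disk, are never touched.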

Additionally, assuming the Indexes were defined properly based on the
predicates in the queries (most often) used, that it would target the data
on disk matching the predicate and load only the data required (no data
store, RDBMS or otherwise, especially disk-bound stores, should have to
load the entire table/Region/Map/whatever to access the data matching the
predicate; that's absurd, OOMEs galore).

To my knowledge, Geode keeps Indexes in memory (even loads them on startup)
and updates them (either sync/async depending on your configuration) as the
data changes.  You would assume the data would not be changing in the
OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
also assume that this data would then have to be in memory?  (I think so.)

Please let me know if I am way off base here, but I would think Geode gives
you enough options that particular use cases could be made, with nominal
effort, more optimal.

Additional references...

* Query Partitioned Regions [3]
* Working with Indexes [4], and then...
* Tips and Guidelines on Using Indexes [5], but also important...
* Using Indexes with Overflow Regions [6]

Hope this helps.

Cheers!
-John

[1]
http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
[2]
http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
[3]
http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
[4]
http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
[5]
http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
[6]
http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html

On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <ma...@gmail.com> wrote:

> Thanks, now I see.
>
> This works the same way as in Ignite then. If you set up an eviction
> policy in Ignite, the data may be evicted to swap at some point in time,
> and if a query is executed right after that, it may swap the data back
> into memory. However, the indexes must always be in memory.
>
> --
> Denis
>
> --
> Good luck,
> Denis Magda
>

-- 
-John
503-504-8657
john.blum10101 (skype)

Re: Persistence and OQL over cold data

Posted by Denis Magda <ma...@gmail.com>.

Thanks, now I see.

This works the same way as in Ignite then. If you set up an eviction policy
in Ignite, the data may be evicted to swap at some point in time, and if a
query is executed right after that, it may swap the data back into
memory. However, the indexes must always be in memory.

--
Denis


On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <ms...@pivotal.io> wrote:

> There is a notion of data aging out in Geode. We call it overflow to disk.
>
> The idea is that as data gets old you can have the records in memory
> expire, and that expiry can be to disk. That's the cold data.
>
> You may have built an index while you were initially loading the data, and
> if your predicates only hit the indexes you will still get really fast
> queries if the result sets aren't large.
>
> If, however, you ever resort to hitting the disk-based data for a query it
> is going to have to read every record that isn't in memory from disk which
> is going to be extremely slow. I personally would never use Geode that way.
>
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Manager
> Mobile: 631-835-4771
>


-- 
Good luck,
Denis Magda

Re: Persistence and OQL over cold data

Posted by Michael Stolz <ms...@pivotal.io>.

There is a notion of data aging out in Geode. We call it overflow to disk.

The idea is that as data gets old you can have the records in memory
expire, and that expiry can be to disk. That's the cold data.

You may have built an index while you were initially loading the data, and
if your predicates only hit the indexes you will still get really fast
queries if the result sets aren't large.

If, however, you ever resort to hitting the disk-based data for a query it
is going to have to read every record that isn't in memory from disk which
is going to be extremely slow. I personally would never use Geode that way.
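
For anyone who wants to see the overflow idea in code, here is a minimal
sketch in plain Java. It is a toy model and not how Geode implements it: the
class LruOverflowStore and the in-process "disk" map are hypothetical. In
Geode, overflow is configured through a region's eviction settings with an
overflow-to-disk action; this sketch only mimics the observable behaviour,
where the least recently used entries leave memory and are faulted back in
on access.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model (assumption): an LRU store that overflows to a "disk" map.
class LruOverflowStore {
    private final int capacity;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, String> memory =
            new LinkedHashMap<>(16, 0.75f, true);
    private final Map<String, String> disk = new HashMap<>();

    LruOverflowStore(int capacity) {
        this.capacity = capacity;
    }

    void put(String key, String value) {
        disk.remove(key);          // a fresh write makes the entry hot again
        memory.put(key, value);
        overflowIfNeeded();
    }

    String get(String key) {
        String value = memory.get(key);
        if (value == null && disk.containsKey(key)) {
            value = disk.remove(key);   // fault the cold value back in
            memory.put(key, value);
            overflowIfNeeded();
        }
        return value;
    }

    // Move the least recently used entries to "disk" until we fit again.
    private void overflowIfNeeded() {
        Iterator<Map.Entry<String, String>> it = memory.entrySet().iterator();
        while (memory.size() > capacity) {
            Map.Entry<String, String> eldest = it.next();
            disk.put(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    boolean inMemory(String key) {
        return memory.containsKey(key);
    }

    boolean onDisk(String key) {
        return disk.containsKey(key);
    }

    public static void main(String[] args) {
        LruOverflowStore store = new LruOverflowStore(2);
        store.put("a", "1");
        store.put("b", "2");
        store.put("c", "3");                      // "a" becomes cold data
        System.out.println("a on disk: " + store.onDisk("a"));
        System.out.println("get a: " + store.get("a"));
        System.out.println("b on disk: " + store.onDisk("b"));
    }
}
```

Reading "a" after it has overflowed brings it back into memory and pushes
"b" out, which is why a query that touches every cold entry ends up cycling
the whole data set through memory.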


--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <ma...@gmail.com> wrote:

> Hi Mike,
>
> Thanks a lot for the explanation! It makes perfect sense to me.
>
> I just thought that you were able to do something with indexes in such a
> way that there is no need to preload everything from disk into memory when
> a query is executed over cold data.
>
> Then what does "execution over cold data" mean? I'm referring to the
> following sentence from the main page:
>
> *Object Query Language allows distributed query execution on hot and cold
> data, with SQL-like capabilities, including joins.*
>
> --
> Denis
>
> --
> Good luck,
> Denis Magda
>

Re: Persistence and OQL over cold data

Posted by Denis Magda <ma...@gmail.com>.

Hi Mike,

Thanks a lot for the explanation! It makes perfect sense to me.

I just thought that you were able to do something with indexes in such a
way that there is no need to preload everything from disk into memory when
a query is executed over cold data.

Then what does "execution over cold data" mean? I'm referring to the
following sentence from the main page:

*Object Query Language allows distributed query execution on hot and cold
data, with SQL-like capabilities, including joins.*

--
Denis


On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <ms...@pivotal.io> wrote:

> Here's the thing...
>
> On any In-memory data grid, if you run a query before the data has been
> loaded into memory, it is going to cause the exact same amount of disk i/o
> to do the query as it will take to load everything into memory.
>
> And the system will still have to go ahead and load everything into memory
> anyway so you're going to end up doing all that disk i/o TWICE.
>
> Geode DOES have a nice feature for key-based access though. We actually
> store the keys in a separate file from the data and we can load that file
> very quickly. Then if you go after the data for one of those keys we can
> lazily load it from disk on demand if it hasn't yet been loaded into memory.
>
> The Lucene integration work that is going on in Geode might also make it
> possible to load the indexes first and lazily load the data based on
> queries against the indexes.
>
>
> --
> Mike Stolz
> Principal Engineer, GemFire Product Manager
> Mobile: 631-835-4771
>


-- 
Good luck,
Denis Magda

Re: Persistence and OQL over cold data

Posted by Michael Stolz <ms...@pivotal.io>.

Here's the thing...

On any In-memory data grid, if you run a query before the data has been
loaded into memory, it is going to cause the exact same amount of disk i/o
to do the query as it will take to load everything into memory.

And the system will still have to go ahead and load everything into memory
anyway so you're going to end up doing all that disk i/o TWICE.

Geode DOES have a nice feature for key-based access though. We actually
store the keys in a separate file from the data and we can load that file
very quickly. Then if you go after the data for one of those keys we can
lazily load it from disk on demand if it hasn't yet been loaded into memory.

The Lucene integration work that is going on in Geode might also make it
possible to load the indexes first and lazily load the data based on
queries against the indexes.
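
To show what "load the keys first, values on demand" looks like, here is a
minimal sketch in plain Java. It is a toy model under stated assumptions, not
Geode's recovery code: the class LazyValueStore and the diskReader function
(standing in for a read from the value file) are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Toy model (assumption): keys are known at startup, values load lazily.
class LazyValueStore {
    private final Set<String> keys;                     // loaded eagerly, small
    private final Function<String, String> diskReader;  // stand-in for the value file
    private final Map<String, String> loaded = new HashMap<>();
    int diskReads = 0;

    LazyValueStore(Set<String> keys, Function<String, String> diskReader) {
        this.keys = keys;
        this.diskReader = diskReader;
    }

    // Answered from the key set alone, without touching any value.
    boolean containsKey(String key) {
        return keys.contains(key);
    }

    // The first access reads the value from "disk"; later accesses hit memory.
    String get(String key) {
        if (!keys.contains(key)) {
            return null;
        }
        String value = loaded.get(key);
        if (value == null) {
            value = diskReader.apply(key);
            loaded.put(key, value);
            diskReads++;
        }
        return value;
    }

    int valuesInMemory() {
        return loaded.size();
    }

    public static void main(String[] args) {
        Map<String, String> valueFile = new HashMap<>();
        valueFile.put("k1", "v1");
        valueFile.put("k2", "v2");
        valueFile.put("k3", "v3");

        LazyValueStore store = new LazyValueStore(valueFile.keySet(), valueFile::get);
        System.out.println("knows k2 before any load: " + store.containsKey("k2"));
        System.out.println("get k1: " + store.get("k1"));
        System.out.println("values in memory: " + store.valuesInMemory()
                + ", disk reads: " + store.diskReads);
    }
}
```

A get by key touches exactly one value, which is why key-based access can
start right after the keys are recovered. A query with no usable index cannot
take this shortcut because it has to look inside every value.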

--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <ma...@gmail.com> wrote:

> Hello Geode community,
>
> I've been investigating possibilities of Geode Persistence for a while and
> still can't get it clear whether I need to have all my data in memory if I
> want to execute OQL queries or OQL engine works over the persistence as
> well.
>
> My use case is the following. During the cluster startup I don't want to
> wait while all the data has been pre-loaded from the persistence to RAM and
> want to execute OQL queries right away. Is it feasible to implement with
> Geode? Please provide me with the links where I can read more about this.
>
> Regards,
> Denis
>