Posted to user@ignite.apache.org by Randy Harmon <rj...@gmail.com> on 2016/12/16 23:16:50 UTC

Indexing, operational performance and data eviction

Hi all,

I'm seeking a fuller understanding of how Apache Ignite manages datasets,
both for indexes and for the underlying data.

In particular, I'm looking at what practical constraints exist for overall
data size (beyond the obvious 'how much memory do you have?'), and
functional characteristics when working near the constraint boundaries.

My assumptions (corrections welcome) include:

   - The underlying objects (the Value part of the cache) do not need to be
   in-memory on any cache node to execute an indexed query (though
   performance naturally suffers if they have been evicted from the cache).

   - The indexed keys need to be in-memory for all indexed lookups.  If the
   referenced Value is not in-memory, it will be loaded by a call to the
   backing store when that value is needed: load(key)

   - Indexed keys do not need to be in-memory for any table-scan queries to
   work, but loadCache() (?) is called to bring these data into memory.  This
   may result in eviction of other values. Once the queries on these data are
   complete, the keys (at least) will tend to remain in-memory (how to
   forcibly remove?)

In this latter case, can large datasets be queried, with earlier records in
the dataset progressively evicted to make room for later records in the
dataset (e.g. SUM(x) GROUP BY y)?
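For concreteness, here is roughly the kind of cache configuration I'm
picturing for the eviction + read-through behavior above (just a sketch;
the cache name and sizes are made up, and I may well be misusing the API):

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Hypothetical cache of operational metrics. -->
    <property name="name" value="metricsCache"/>

    <!-- On a miss, read the individual value back from the backing store: load(key). -->
    <property name="readThrough" value="true"/>

    <!-- Evict least-recently-used entries once the cache holds ~1M entries. -->
    <property name="evictionPolicy">
        <bean class="org.apache.ignite.cache.eviction.lru.LruEvictionPolicy">
            <property name="maxSize" value="1000000"/>
        </bean>
    </property>
</bean>
```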

A sample use case might include a set of metadata objects (megabytes to
gigabytes, in various Ignite caches) and a much larger set of operational
metrics with fine-grained slicing, or even fully-granular facts
(GB/TB/PB).  In this use-case, the metadata might well have "hot" subsets
that (we hope) are not evicted by an LFU cache, as well as some
less-frequently-used data; meanwhile, the operational metrics may also have
tiers, even to the extent where the least frequently-used metrics should be
evicted after a rather short idle time, recovering both Value memory as
well as Key memory.

In this case ^, can "small" data and "big" data co-exist within an Ignite
cluster, and are there any particular techniques needed to assure
operational performance, particularly for keeping hot data hot, when total
data-size exceeds total-available-memory?

   - a) Can "indexed" queries be executed across datasets that need to be
   loaded with loadCache() or would they execute as table-scans?

   - b) Would such a query run incrementally with progressive eviction of
   data, in the case of big data?

I guess I'm unclear on the sequence of data-loading vs data-scanning - are
they parallel operations, or would we expect the data-loading phase to
block the data-scanning phase?

Hopefully these questions and sample scenario are clear enough to get
experienced perspective & input from y'all... thanks in advance.

R

Re: Indexing, operational performance and data eviction

Posted by vkulichenko <va...@gmail.com>.
Hi Randy,

There is no "must" for having indexes, but indexes can in many cases speed
up execution. If you remove the index from the 'salary' field in the
example, the query will still work, but it will imply a full scan. However,
some queries do not benefit from any index at all, like calculating an
aggregate value over the whole table.

Scan queries always do the scan and they do not support indexes at all. If
you need indexed search, use SQL.

As for loading from the store, you're not required to finish it before query
execution (i.e., it's safe to run the two processes concurrently). However,
keep in mind that for any query (SQL or scan), all the data required by that
query must be in memory before the query is executed. So if you're still
loading the data that a particular query needs, you have to wait for that
load to finish first.
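To illustrate with the Person example from CacheQueryExample (a rough
sketch; it assumes an IgniteCache<Long, Person> named 'cache' and a
getSalary() accessor on Person):

```java
// SQL query: will use the index on 'salary' if one is configured;
// otherwise it falls back to scanning all in-memory rows.
SqlFieldsQuery sql = new SqlFieldsQuery(
    "select name from Person where salary > ?").setArgs(1000);
List<List<?>> rows = cache.query(sql).getAll();

// Scan query: always a full scan over in-memory entries; indexes are never used.
ScanQuery<Long, Person> scan =
    new ScanQuery<>((key, p) -> p.getSalary() > 1000);
List<javax.cache.Cache.Entry<Long, Person>> hits = cache.query(scan).getAll();
```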

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Indexing-operational-performance-and-data-eviction-tp9617p9629.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Indexing, operational performance and data eviction

Posted by Randy Harmon <rj...@gmail.com>.
Thanks, Alex.  I understand that any SQL query relying on indexed data
would need to block until the data is loaded, else it could miss index
entries for important rows.

Do I understand that all Ignite SQL queries must use some index?  In
context of a classic example scenario (
https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/datagrid/CacheQueryExample.java):
if a SQL query for Person.salary > 1000 can use a predefined index on the
salary column, we can expect it to seek to 1000 in that index, then scan
through the remainder of the index (all this within each distributed
partition).  Without an index on salary, would that same SQL execute by
scanning through the index on Long Person.id to find candidate rows
(evaluating the salary > 1000 expression on each candidate row)?

Is that also true for the ScanQuery case?  IOW: does loadCache() from the
backing store (-> distributed localLoadCache) have to be *completed* before
ScanQuery's predicate sees any records?  Or will the IgniteBiPredicate
<https://ignite.apache.org/releases/mobile/org/apache/ignite/lang/IgniteBiPredicate.html>
for that ScanQuery get to process some batches of records in one thread,
while a separate data-loading thread continues to load more records?

I found a description along those lines in the javadocs for
CacheLoadOnlyStoreAdapter, but it's not clear which typical client-facing
use-cases, if any, can take advantage of the batched/parallel-processing
behavior described there.
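For reference, my reading of that javadoc suggests a store along these
lines (purely a sketch; the Metric record type, the CSV format, and the
externalLinesIterator() helper are all invented):

```java
// Hypothetical load-only store that streams metric records from an external source.
public class MetricLoadOnlyStore
        extends CacheLoadOnlyStoreAdapter<Long, Metric, String> {
    /** Iterator over raw input records; consumed by internal worker threads. */
    @Override protected Iterator<String> inputIterator(Object... args) {
        return externalLinesIterator(); // assumed helper returning CSV lines
    }

    /** Parses one raw record into a key/value pair; called in parallel, in batches. */
    @Override protected IgniteBiTuple<Long, Metric> parse(String rec, Object... args) {
        String[] f = rec.split(",");
        return new IgniteBiTuple<>(Long.parseLong(f[0]), new Metric(f[1]));
    }
}
```

If that reading is right, it would be the parse() calls that run in
parallel worker threads, which sounds like the batched behavior the javadoc
describes.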

Thanks again,

R




On Mon, Dec 19, 2016 at 7:10 AM, Alexander Paschenko <
alexander.a.paschenko@gmail.com> wrote:

> Hi Randy,
>
> Currently, indexes are built only based on what is in cache - i.e.
> contents of the backing store not present in cache are not presented
> in index in any way, and hence yes, indexing blocks scanning.
> Moreover, even non indexed columns in Ignite tables contain only data
> actually loaded to cache.
>
> Significant changes in this aspect should be expected with arrival of
> Ignite 2.0, but that is not yet to happen until some time in 2017.
>
> Regards,
> Alex
>

Re: Indexing, operational performance and data eviction

Posted by Alexander Paschenko <al...@gmail.com>.
Hi Randy,

Currently, indexes are built only from what is in cache - i.e. contents of
the backing store that are not present in cache are not represented in the
index in any way, and hence yes, indexing blocks scanning. Moreover, even
non-indexed columns in Ignite tables contain only data actually loaded
into cache.

Significant changes in this area should be expected with the arrival of
Ignite 2.0, but that is not due to happen until some time in 2017.

Regards,
Alex
