Posted to dev@couchdb.apache.org by Garren Smith <ga...@apache.org> on 2019/03/28 11:01:18 UTC

[DISCUSS] Implementing Mango Indexes for FoundationDB

Hi everyone,


I want to start a discussion, with the aim of an RFC, around implementing
Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
above CouchDB map/reduce indexes, but with FoundationDB we can make them
separate indexes in FoundationDB. This gives us the possibility of being
able to update the indexes in the same transaction that a document is being
saved in. Later we can look at adding Mango-specific features such as covering indexes.


Let's dive into the data model. Currently a user defines an index like this:


{
  name: 'view-name',               - optional, will be auto-generated
  index: {
    fields: ['fieldA', 'fieldB']
  },
  partial_filter_selector: {}      - optional
}


For query planning we need to be able to access the list of available
indexes. So we would have an index_definitions subspace with the following
content:


(<fieldname1>, …<rest of fields>) = (<index_name>,
<partial_filter_selector>)


Alternatively, we could store the index definitions as:

(index_name) = ((fields), partial_filter_selector).


At this stage, I can’t think of a clever way of storing the index
definitions such that, when we need to select an index for a query, we could
fetch only the relevant subset. I think the best option is to fetch them all,
as we currently do, and process them. However, we can look at caching these
index definitions in the application layer, and using FoundationDB
watches[0] to notify us when a definition has changed so we can update the
cached definitions.
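
To make the layout concrete, here is a minimal sketch of that subspace and
the watch-based cache invalidation, using the FoundationDB Python bindings
purely for illustration. The subspace names, the JSON encoding of the
selector, and the single "definitions changed" key bumped on every change
are all assumptions, not a settled design:

    import json
    import fdb
    fdb.api_version(600)

    db = fdb.open()
    defs = fdb.Subspace(('index_definitions',))                           # assumed subspace name
    defs_changed = fdb.Subspace(('meta',)).pack(('index_defs_changed',))  # assumed key

    @fdb.transactional
    def save_definition(tr, fields, name, partial_filter_selector):
        # (<fieldname1>, ...<rest of fields>) = (<index_name>, <partial_filter_selector>)
        tr[defs.pack(tuple(fields))] = fdb.tuple.pack(
            (name, json.dumps(partial_filter_selector)))
        tr.add(defs_changed, b'\x01' + b'\x00' * 7)  # atomic counter bump

    @fdb.transactional
    def load_definitions(tr):
        # One range read fetches every definition, as the query planner would.
        return [(defs.unpack(k), fdb.tuple.unpack(v))
                for k, v in tr.get_range_startswith(defs.key())]

    @fdb.transactional
    def watch_definitions(tr):
        # The application layer can cache load_definitions() and refresh it
        # whenever this watch fires, i.e. whenever a definition changed.
        return tr.watch(defs_changed)

The alternative (index_name) = ((fields), partial_filter_selector) layout
would simply swap what goes into the key versus the value.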


Then each index definition will have its own dedicated subspace for the
actual built index key/values. Keys in this subspace would be the values of
the fields defined in the index, with the doc id at the end of the tuple,
e.g. for an index with fields name and age, it would be:


("john", 40, "doc-id-1") = null

("mary", 30, "doc-id-2") = null


This follows the same key format that the Document Layer[1] uses for its
indexes. One point to make here is that the doc id is kept in the key so
that entries from different documents with the same field values cannot
collide.
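
A sketch of what writing one of those entries could look like with the
tuple layer (the subspace name is an assumption; only the key shape matches
the example above):

    import fdb
    fdb.api_version(600)
    db = fdb.open()

    # Assumed dedicated subspace for the (name, age) index.
    idx = fdb.Subspace(('mango_idx', 'name-age-index'))

    @fdb.transactional
    def add_index_entry(tr, doc):
        # Indexed field values first, doc id last, empty value:
        # the same shape as ("john", 40, "doc-id-1") = null above.
        tr[idx.pack((doc['name'], doc['age'], doc['_id']))] = b''

    add_index_entry(db, {'_id': 'doc-id-1', 'name': 'john', 'age': 40})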


Then in terms of sorting the keys: CouchDB currently uses ICU to sort all
secondary indexes. We would still need ICU-based ordering for the FDB
indexes, but we would have to achieve it differently. We will not be able
to use ICU collation operations directly; instead, we are going to have to
look at using ICU’s sort keys[2] to generate a sort key ahead of time. At
the same time we need to look at creating a binary encoding that captures
the way CouchDB currently sorts objects, arrays and numbers. This would
most likely be a type prefix that we add to each key field along with the
sort key generated from ICU.
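
As a rough illustration of that encoding, a sketch only: the type tags are
invented here just to show the idea of ordering null < booleans < numbers <
strings, it assumes PyICU's Collator.getSortKey for the ICU part, and a
real number encoding would need to be properly order-preserving:

    import struct
    from icu import Collator, Locale

    # Invented one-byte type tags that mimic CouchDB's collation order.
    TAG_NULL, TAG_BOOL, TAG_NUMBER, TAG_STRING = 0x00, 0x10, 0x20, 0x30

    collator = Collator.createInstance(Locale('en_US'))  # CouchDB would pick its own locale

    def encode_key_field(value):
        if value is None:
            return bytes([TAG_NULL])
        if isinstance(value, bool):            # check bool before int: bool is an int subclass
            return bytes([TAG_BOOL, int(value)])
        if isinstance(value, (int, float)):
            # Placeholder only: a real encoding must keep byte order == numeric order
            # (plain IEEE-754 big-endian bytes do not sort negatives correctly).
            return bytes([TAG_NUMBER]) + struct.pack('>d', float(value))
        if isinstance(value, str):
            # ICU sort key: bytes whose memcmp order matches ICU collation order.
            return bytes([TAG_STRING]) + bytes(collator.getSortKey(value))
        raise NotImplementedError('arrays and objects would get their own tags and recurse')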


In terms of keeping Mango indexes up to date, we should be able to update
all existing indexes in the same transaction in which a document is
updated/created, which means we shouldn’t need any background process to
keep Mango indexes updated. However, I imagine we are going to have to look
at a background process that builds newly defined indexes over the existing
documents. We will have to do some decent performance testing around this
to determine the best solution, but looking at the Document Layer, they
seem to recommend updating the indexes in the transaction rather than in a
background process.
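
A hedged sketch of what maintaining the indexes inside the document-write
transaction could look like. The document layout here is a single KV for
brevity rather than the real "exploded" model, and every name is
illustrative:

    import json
    import fdb
    fdb.api_version(600)

    @fdb.transactional
    def update_doc_and_indexes(tr, docs, indexes, doc, old_doc):
        # docs: Subspace for document storage; indexes: list of (Subspace, [field names]);
        # old_doc: the parent revision's body, or None for a new document.
        tr[docs.pack((doc['_id'],))] = json.dumps(doc).encode()
        for subspace, fields in indexes:
            if old_doc is not None:
                # Clear the entry built from the parent revision's field values.
                old_key = tuple(old_doc.get(f) for f in fields) + (doc['_id'],)
                tr.clear(subspace.pack(old_key))
            new_key = tuple(doc.get(f) for f in fields) + (doc['_id'],)
            tr[subspace.pack(new_key)] = b''

Partial filter selectors would add a check before each insert, and a
background variant of the same loop would handle building a brand-new
index over existing documents.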


In the future, we could look at using the value space to store covering
indexes or materialized views. That way we would not always need to read
from the by_id space when querying with Mango, which would be a nice
performance improvement.
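
For instance, a covering index on name and age that also needs to return
an email field could carry that field in the value. Again just a sketch of
the idea, with the field choice and encoding invented for illustration:

    import fdb
    fdb.api_version(600)

    @fdb.transactional
    def add_covering_entry(tr, idx, doc):
        # Key: indexed field values plus doc id, exactly as before.
        # Value: the extra fields the query needs, so Mango could answer it
        # without a second read against the document (by_id) space.
        key = idx.pack((doc['name'], doc['age'], doc['_id']))
        tr[key] = fdb.tuple.pack((doc.get('email'),))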



Please let me know any thoughts, improvements, suggestions or questions
around this.



[0] https://apple.github.io/foundationdb/features.html#watches

[1] https://github.com/FoundationDB/fdb-document-layer

[2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Jan Lehnardt <ja...@apache.org>.

> On 2. Apr 2019, at 15:10, Adam Kocoloski <ko...@apache.org> wrote:
> 
> 
>> On Apr 2, 2019, at 8:10 AM, Jan Lehnardt <ja...@apache.org> wrote:
>> 
>>> On 28. Mar 2019, at 12:01, Garren Smith <ga...@apache.org> wrote:
>>> 
>>> In terms of keeping mango indexes up to date, we should be able to update
>>> all existing indexes in the same transaction as a document is
>>> updated/created, this means we shouldn’t have to have any background
>>> process keeping mango indexes updated.
>> 
>> 
>> Is this a specific design goal and if yes why? I don’t mind to have this
>> as an option, but I feel it might be easier to go with the lazy/batch-update
>> process in a first pass, as that should both keep the implementation simpler
>> and keep room in doc update transactions.
>> 
>> It’s rather a nice point when discussing CouchDB that you can have as many
>> indexes as you like and incur no write penalties, whereas in other dbs you
>> usually are encouraged to have exactly as many indexes as needed to avoid
>> a write penalty.
>> 
>> Best
>> Jan
> 
> Fair point. There are two drivers for me:
> 
> - Space efficiency
> - Avoiding operational complexities with indexes falling behind
> 
> When I was looking into this design I realized that, for Mango, the async approach produces indexes that are ~2x as large as the “index-on-write” approach. The extra docid -> view_key mapping needed for async indexing is the same size as the view_key ->  docid one that the user actually wants. Indexing on write allows us to reuse the document data space to retrieve the old docid -> view_key mapping just before removing that data from the datastore.
> 
> I also know that we’ve had plenty of instances over the years where we needed to do careful tuning to keep critical query response times low in the face of a write-intensive application. We’d end up pushing the latency back onto the writers indirectly through queue tuning, or the user would have to resort to stale=ok indexes.
> 
> As a bit of an aside, I also expect that the efficiencies we achieved through batching in our current implementation will be less pronounced when working with FoundationDB. Currently the by_seq index contains pointers directly to the document data on disk and we could skip the entire btree overhead while streaming that data to the indexers, whereas in FDB it’ll just have a bunch of docid/rev pairs and the actual document data will be dispersed elsewhere.
> 

Thanks Adam, this all makes a lot of sense.

I don’t feel strongly about whether we should do this right away, but since we need the async behaviour as well anyway, it might be a worthwhile time-trade-off to do this optimisation at a later point.

Best
Jan
-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/


Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Adam Kocoloski <ko...@apache.org>.
> On Apr 2, 2019, at 8:10 AM, Jan Lehnardt <ja...@apache.org> wrote:
> 
>> On 28. Mar 2019, at 12:01, Garren Smith <ga...@apache.org> wrote:
>> 
>> In terms of keeping mango indexes up to date, we should be able to update
>> all existing indexes in the same transaction as a document is
>> updated/created, this means we shouldn’t have to have any background
>> process keeping mango indexes updated.
> 
> 
> Is this a specific design goal and if yes why? I don’t mind to have this
> as an option, but I feel it might be easier to go with the lazy/batch-update
> process in a first pass, as that should both keep the implementation simpler
> and keep room in doc update transactions.
> 
> It’s rather a nice point when discussing CouchDB that you can have as many
> indexes as you like and incur no write penalties, whereas in other dbs you
> usually are encouraged to have exactly as many indexes as needed to avoid
> a write penalty.
> 
> Best
> Jan

Fair point. There are two drivers for me:

- Space efficiency
- Avoiding operational complexities with indexes falling behind

When I was looking into this design I realized that, for Mango, the async approach produces indexes that are ~2x as large as the “index-on-write” approach. The extra docid -> view_key mapping needed for async indexing is the same size as the view_key ->  docid one that the user actually wants. Indexing on write allows us to reuse the document data space to retrieve the old docid -> view_key mapping just before removing that data from the datastore.

I also know that we’ve had plenty of instances over the years where we needed to do careful tuning to keep critical query response times low in the face of a write-intensive application. We’d end up pushing the latency back onto the writers indirectly through queue tuning, or the user would have to resort to stale=ok indexes.

As a bit of an aside, I also expect that the efficiencies we achieved through batching in our current implementation will be less pronounced when working with FoundationDB. Currently the by_seq index contains pointers directly to the document data on disk and we could skip the entire btree overhead while streaming that data to the indexers, whereas in FDB it’ll just have a bunch of docid/rev pairs and the actual document data will be dispersed elsewhere.

Adam

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Jan Lehnardt <ja...@apache.org>.

> On 28. Mar 2019, at 12:01, Garren Smith <ga...@apache.org> wrote:
> 
> Hi everyone,
> 
> 
> I want to start a discussion, with the aim of an RFC, around implementing
> Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
> above CouchDB map/reduce indexes, but with FoundationDB we can make them
> separate indexes in FoundationDB. This gives us the possibility of being
> able to update the indexes in the same transaction that a document is being
> saved in. Later we can look at adding specific mango like covering indexes.
> 
> 
> Lets dive into the data model. Currently a user defines an index like this:
> 
> 
> {
> 
>  name: ‘view-name’ - optional will be auto-generated
> 
>  index: {
> 
>    fields: [‘fieldA’, ‘fieldB’]
> 
>  },
> 
>  partial_filter_selector {} - optional
> 
> }
> 
> 
> For query planning we need to be able to access the list of available
> indexes. So we would have a index_definitions subspace with the following
> content:
> 
> 
> (<fieldname1>, …<rest of fields>) = (<index_name>,
> <partial_filter_selector>)
> 
> 
> Otherwise we could just store the index definitions as:
> 
> (index_name) = ((fields), partial_filter_selector).
> 
> 
> At this stage, I can’t think of a fancy way of storing the index
> definitions so that when we need to select an index for a query there would
> be a fast way to only fetch a subset of the indexes. I think the best is to
> rather fetch them all like we currently do and process them. However, we
> can look at caching these index definitions in the application layer, and
> using FoundationDB watches[0] to notify us when a definition has changed so
> we can update the cached definitions.
> 
> 
> Then each index definition will have its own dedicated subspace for the
> actual built index key/values. Keys in this subspace would be the fields
> defined in the index with the doc id at the end of the tuple, e.g for an
> index with fields name and age, it would be:
> 
> 
> (“john”, 40, “doc-id-1) = null
> 
> (“mary”, 30, “doc-id-2) = null
> 
> 
> This follows the same key format that document layer[1] does for its
> indexes. One point to make here is that the doc id is kept in the key part
> so that we can avoid duplicate keys.
> 
> 
> Then in terms of sorting the keys, current CouchDB uses ICU to sort all
> secondary indexes. We would need to use ICU to sort the indexes for FDB but
> we would have to do it differently. We will not be able to use ICU
> collation operations directly, instead, we are going to have to look at
> using ICU’s sort key[1] to generate a sort key ahead of time. At the same
> time we need to look at creating binary encoding to capture the way that
> CouchDB currently sorts object, array and numbers. This would most likely
> be a sort of key prefix that we add to each key field along with the sort
> key generated from ICU.
> 
> 
> In terms of keeping mango indexes up to date, we should be able to update
> all existing indexes in the same transaction as a document is
> updated/created, this means we shouldn’t have to have any background
> process keeping mango indexes updated.


Is this a specific design goal and if yes why? I don’t mind to have this
as an option, but I feel it might be easier to go with the lazy/batch-update
process in a first pass, as that should both keep the implementation simpler
and keep room in doc update transactions.

It’s rather a nice point when discussing CouchDB that you can have as many
indexes as you like and incur no write penalties, whereas in other dbs you
usually are encouraged to have exactly as many indexes as needed to avoid
a write penalty.

Best
Jan
—


> Though I imagine we going to have to
> look at a background process that does update and build new indexes on an
> existing index. We will have to do some decent performance testing around
> this to determine the best solution, but looking at document layer they
> seem to recommend updating the indexes in the transaction rather than in a
> background process.
> 
> 
> In the future, we could look at using the value space to store covering
> indexed or materialized views. That way we would not need to always read
> from the by_id when quering with Mango. Which would be a nice performance
> improvement.
> 
> 
> 
> Please let me know any thoughts, improvements, suggestions or questions
> around this.
> 
> 
> 
> [0] https://apple.github.io/foundationdb/features.html#watches
> 
> [1] https://github.com/FoundationDB/fdb-document-layer
> 
> [2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/


Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Garren Smith <ga...@apache.org>.
On Tue, Apr 2, 2019 at 1:38 AM Alex Miller <al...@apple.com.invalid>
wrote:

>
> > On Apr 1, 2019, at 2:21 AM, Will Holley <wi...@gmail.com> wrote:
> >
> > 5. I have a few concerns about the performance given we'll no longer be
> > able to push down filtering to the data storage tier. It sounds [1] like
> a
> > key thing will be the extent to which we can concurrently fetch/filter
> > non-overlapping key ranges from FDB - likely something that will come up
> in
> > relation to views as well. This gets a bit more fun when considering
> > multi-tenancy, and is perhaps something to stew on whilst getting
> something
> > functional together.
> >
> > Will
> >
> > [1]
> https://forums.foundationdb.org/t/secondary-indexing-approaches/792/4
>
> There is an open request to folk writing applications or layers that
> desire predicate pushdown to please add their use case and requirements to
> https://forums.foundationdb.org/t/feature-request-predicate-pushdown/954/2
> , to help guide what the eventual implementation of this feature will be.
> If someone wouldn’t mind adding what Mango would need or desire, it’d be
> much appreciated.


Thanks Alex, I've left a summary of our use case. As I mention in the post,
I think the use case that Ryan describes fits ours pretty nicely as well.

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Alex Miller <al...@apple.com.INVALID>.
> On Apr 1, 2019, at 2:21 AM, Will Holley <wi...@gmail.com> wrote:
> 
> 5. I have a few concerns about the performance given we'll no longer be
> able to push down filtering to the data storage tier. It sounds [1] like a
> key thing will be the extent to which we can concurrently fetch/filter
> non-overlapping key ranges from FDB - likely something that will come up in
> relation to views as well. This gets a bit more fun when considering
> multi-tenancy, and is perhaps something to stew on whilst getting something
> functional together.
> 
> Will
> 
> [1] https://forums.foundationdb.org/t/secondary-indexing-approaches/792/4

There is an open request to folk writing applications or layers that desire predicate pushdown to please add their use case and requirements to https://forums.foundationdb.org/t/feature-request-predicate-pushdown/954/2 , to help guide what the eventual implementation of this feature will be.  If someone wouldn’t mind adding what Mango would need or desire, it’d be much appreciated.

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Adam Kocoloski <ko...@apache.org>.
Thanks Alex!

I didn’t mention it on the mailing list, but in the RFC for the “exploded KV” document storage I did propose a 1MB document size limit. We’re frankly overdue for a restriction of that nature.

Continuing on that theme, we also do not currently have an enforced limit on the number of indexes that could be created in a single database. I remember seeing a database a few years ago that had over 250 active indexes. That ... wasn’t pretty. I would happily support introducing a limit there, though I don’t yet have a good sense of where to place that limit to minimize the disruption for existing users while staying safely within FDB limits.

Great tip on the use of addWriteConflictRange, that hadn’t occurred to me at all.

Adam

> On Apr 11, 2019, at 9:52 PM, Alex Miller <al...@apple.com.invalid> wrote:
> 
> There was some discussion on the FDB forums previously about how to do index backfill,
> If you hadn’t seen it before, but it’s already reasonably similar to what you’ve proposed:
> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
> 
> There's also two related topics that I'd like to provide some context and flavor on:
> 1. The 10MB transaction size limit.
> 2. The performance/latency/throughput implications of writing to multiple
>    disjoint keys versus multiple nearby keys.
> 
> ### Transaction Size Limit
> 
> There was some previous concern surrounding the FDB 10MB transaction limit,
> what it means for CouchDB, and how to communicate the effective maximum
> document size to users.  Index-on-write will impact this as well, as index
> writes will be counted against the 10MB limit.
> 
> As I don't believe I've seen a precise definition yet of the maximum size of a
> document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB
> will accept".  It would then be possible that a document can be inserted that
> just barely fits within the size limits, and then indexes can be defined that
> apply to the document.  One is now left with a document that exists, and is
> valid in the database, but couldn't be read and re-inserted into the database,
> because it would exceed the size limit.  My largest concern would be that this
> would have bad implications on CouchDB replication.
> 
> Although the transaction limit is 10MB, it's generally advisable to aim to stay
> under 1MB for the cluster to perform well.  I've generally seen latency start
> to get strange for all clients when one client starts exceeding ~5MB.  If one
> requires the total size of a document to stay less than 1MB, then this
> comfortably leaves room for indexes in the commit request, and the above case
> would be relatively unlikely.
> 
> Exactly how unlikely involves dissecting exactly how the transaction size limit
> is calculated.
> 
> The 10MB limit applies to the size of the commit request, and not to the sum of
> the key-value pair sizes, as one might expect.  The largest difference between
> the two meanings is that read and write conflict ranges count against the 10MB
> limit.  The rough formula for calculating commit size is:
> 
> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
> 
> Each `get()` adds a read conflict range of the read key and the sequentially
> next key.  Each `set()` adds a write conflict range the same way, in addition
> to having to send the actual key and value.  This is the source of the `2*` and
> `3*` multipliers.
> 
> `getRange()` adds a read conflict range that exactly reflects the requested
> range.  However, there is no `setRange()` to automatically optimize the write
> conflict ranges added for inserting sequential values.  If one is inserting
> keys that given the data model represent one collective, sequential unit, such
> as a full document exploded across multiple key-value pairs, it's advisable to
> use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each individual
> write conflict range into one large write conflict range, and thus greatly
> reduce the size of the commit request that will be sent.
> 
> So using this optimization, blindly inserting a document using the scheme
> proposed by the "exploded KV" approach, one would end up with a formula looking
> more like
> 
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
> 
> With index-on-write, we now need to add inserting (prefix/value/key, "") for each
> index entry into the equation:
> 
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) +
>   number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
> 
> And there's no similar write conflict range optimization that can be done, as
> index writes will be non-sequential and spread widely across the keyspace.
> 
> Index types that can cause a large number of entries to be written will have an
> even larger affect on total commit size.  I’ve been using the mental example of
> indexing all the elements of an array nested deep within a document.  However,
> if I’ve read CouchDB documentation correctly, all of the index types that can
> cause this explosion are grouped as part of Mango Search indexes, and thus out
> of scope for this proposal.  Mango JSON indexes seem to have a direct correlation
> that one (index,field) pair will turn into one Key-Value write to FDB. Convenient!
> 
> Therefore, in order for a user to run into Mango JSON index-induced transaction
> size limits, they would have to manually add an astounding number of indexes.
> I've failed to quickly find an answer as to if there's an existing maximum
> number of Mango JSON indexes that can be added in a CouchDB instance, but if
> there is, this might not be a concern anyway.
> 
> The worst case would be that each field of a document is indexed.  Assuming the
> "exploded KV" proposal as the representation, each key and value would be
> repeated 3 more times, and thus this would turn into
> 
> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) + 4*len(values written) $$
> 
> So if an maximum exploded KV size of 1MB is chosen, transaction commit requests
> would be expected to stay roughly under 4MB.
> 
> However, if you're thinking into the future, and considering that whatever
> choice is taken for Mango JSON indexes should also be taken for Mango Search
> indexes as well, the above is definitely a concern.
> 
> ### Indexing Write Penalty
> 
> And, as we're now discussing what would happen with thousands of indexes
> defined, let's briefly discuss what impact that would have on actually
> committing that transaction.
> 
> With most distributed databases, each index added equates to an additional
> shard of data involved in the transaction, and thus likely an additional server
> and RPC.  This would lead to an increase in latency and concurrency control
> costs, as more indexes are added.  This adds a strong cost to each additional
> index.  One much larger than the actual cost of the disk IO involved in doing
> the actual index write.
> 
> FoundationDB does all commits against a distributed write-ahead-log, and
> asynchronously to the commit, applies the data to each shard of data involved in
> the transaction.  This means that a commit of 5 adjacent keys and 5 keys
> randomly distributed across the keyspace will be processed in the same way by
> the transaction subsystem.  One would actually see better throughput at
> saturation when writing to randomly distributed keys, as the work of applying
> those mutations gets sharded across multiple storage servers.
> 
> Creating many indexes still means more disk IO, but is comparatively cheaper on
> FDB than other distributed database architectures.
> 
>> On Apr 11, 2019, at 5:42 AM, Garren Smith <ga...@apache.org> wrote:
>> 
>> 
>> I was chatting to Adam yesterday and I want to explore the index-on-write
>> indexing for Mango a bit more. I know there has been a bit of a discussion
>> that we should only use a background process to build mango indexes but I
>> think that building indexes as documents are created/updated along combined
>> with background processing for existing documents will be easier to
>> implement. Especially in the beginning as we build the new fdb layer.
>> 
>> Below is the process for building a new index:
>> 
>> 1. When a user defines a new index on an existing database, save the index
>> definition and also save the sequence that the index was added at. The
>> index should also be marked that it is in a `building`  phase so it won’t
>> be used yet to service queries. (I’ll come back to this later)
>> 2. Any write requests after that must read the new index definition and
>> update the index. When updating the new index, the writers should assume
>> that previous versions of the document have already been indexed.
>> 3. At the same time a background process will start reading sections of the
>> changes feed and building the index, this background process will keep
>> processing the changes read until it reaches the sequence number that the
>> index was saved at. Once it reaches that point, the index is up to date and
>> will be marked as `active` and can be used to service queries.
>> 4. There are some subtle behaviour around step 3 that is worth mentioning.
>> The background process will have the 5 second transaction limit, so it will
>> process smaller parts of the changes feed. Which means that it won’t have
>> one consistent view of the changes feed throughout the index building
>> process. This will lead to a conflict situation. For example when the
>> background process transaction is adding a document to the index while at
>> the same time a write request has a transaction that is updating the same
>> document. There are two possible outcomes to this, if the background
>> process wins, the write request will get a conflict. At that point the
>> write request will try to process the document again, read the old values
>> for that document, remove them from the index and add the new values to the
>> index. If the write request wins, and the background process gets a
>> conflict, then the background process can try again, the document would
>> have been removed from its old position in the changes feed and moved to
>> the later position, so the background process won’t see the document and
>> will then move on to the next one.
>> 5. One other feature to add is to an index progress tracker. We can do this
>> by using doc_count for the database, and then have a counter value that the
>> background workers can increment with the number of documents it updated
>> for each batch update.  We would also have to update this counter on write
>> requests while the index is in building mode.
>> 6. Something we can also explore is splitting the building of the index
>> across multiple workers, we can use the `get_boundary_keys` [1] API call on
>> the changes feed to get the full list of the changes feed keys grouped by
>> partition boundaries and then split that by workers.
>> 
>> Adding a building and active state to the index’s is a bit of a breaking
>> change, but I think its the right way to go. Currently with Mango what can
>> happen is a user creates an index and then immediately does a query that
>> would use that index. Mango would then have to build the whole index before
>> responding to that request. In this new index-on-write process, Mango would
>> ignore the new index until it is active which I think is the better way to
>> go on this.
>> 
>> Finally, a big acknowledgment to Adam who is the major brains behind this
>> design.
>> 
>> What do you think, I would like to hear any thoughts, questions or
>> suggestions on this design.
>> 
>> Cheers
>> Garren
>> 
>> [1]
>> https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
>> 
>>> On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <ga...@apache.org> wrote:
>>> 
>>> 
>>> 
>>>> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org> wrote:
>>>> 
>>>> Hi Will, great comments, I have replies to a couple of them.
>>>> 
>>>>> On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
>>>>> 
>>>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>>>>> whether we can guarantee that the generated keys will fit within the
>>>>> maximum FDB key length or if there needs to be some thought as to the
>>>>> failure mode / workaround. As Adam mentioned, it seems fine to store an
>>>>> encoded key given Mango (currently) always fetches the associated
>>>> document
>>>>> / fields from the primary index to filter on anyway. It might even be
>>>>> beneficial to have an additional layer of indirection and allow multiple
>>>>> docs to be associated with each row so that we can maintain compact
>>>> keys.
>>>> 
>>>> Interesting thought on that layer of indirection; it reminds me of an
>>>> optimization applied in the Record Layer’s text indexes. Would have to
>>>> compare whether the extra reads needed to maintain the index that way are
>>>> an acceptable tradeoff.
>>>> 
>>>> Good point on the sort key sizes, I’ve not seen any way to place a
>>>> reliably safe upper bound on the size of one that might be generated. The
>>>> ICU folks have some hand-wavey guidance at
>>>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
>>>> but it seems like we might be able to dig a little deeper.
>>>> 
>>>> I personally haven’t given much thought to a workaround where a
>>>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
>>>> failure mode safely even without the sort key complication — people try
>>>> crazy things :)
>>>> 
>>> 
>>> For the 10 KB error, I think we should just return an error. As a
>>> comparison, MongoDB has a 1024 Byte limit
>>> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
>>> 
>>> 
>>>>> 3. I don't immediately see how you clear previous values from the index
>>>>> when a doc is updated, but I could easily be missing something obvious
>>>> :)
>>>> 
>>>> Ah yeah, this part wasn’t explicit, was it?
>>>> 
>>>> I think the idea is that these are simple indexes on specific fields of a
>>>> document, and we have a data model where those fields are already stored as
>>>> their own keys in FDB, so there’s no need (in the case of Mango) to
>>>> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
>>>> each view group. Rather, the flow would go something like
>>>> 
>>>> 1) Check which fields are supposed to be indexed
>>>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>>>> parent revision
>>>> 3) Compare the parent values with the ones supplied in this transaction;
>>>> if any indexed values change, clear the old ones and insert the new ones
>>>> 
>>>> with some additional caveats around checking that the supplied edit is
>>>> actually going to be winning (and therefore indexed) version after the
>>>> commit succeeds.
>>>> 
>>>>> 4. Regarding "Index on write" behaviour, is there something in the
>>>> existing
>>>>> design (Mango overlaying mrview / lucene) that would prevent this? I can
>>>>> see some benefit for certain workloads (and headaches for others) but I
>>>>> don't see that it's necessarily coupled to the Mango design given
>>>>> background indexing of new/changed indexes needs to be supported anyway.
>>>> 
>>>> I’m not sure I understand your question. In my mind the reason “index on
>>>> write" is more applicable for Mango JSON than for generalized views is
>>>> because in the view case batching is currently quite important to achieve
>>>> good throughput to the JS system. You’re of course correct that we need to
>>>> be able to re-generate Mango JSON indexes in the background as well.
>>>> 
>>>> Adam
>>>> 
>>>> 
>>>> 
> 


Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Garren Smith <ga...@apache.org>.
I've created an RFC for this
https://github.com/apache/couchdb-documentation/pull/407

Thanks to everyone who participated in this discussion.

Cheers
Garren

On Mon, Apr 15, 2019 at 11:58 AM Garren Smith <ga...@apache.org> wrote:

> Hi Alex,
>
> Thanks for the really detailed explanation on transaction size
> calculation. That is really helpful. Adding a limit to number of indexes is
> a good idea.
>
> Cheers
> Garren
>
> On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <al...@apple.com.invalid>
> wrote:
>
>> There was some discussion on the FDB forums previously about how to do
>> index backfill,
>> If you hadn’t seen it before, but it’s already reasonably similar to what
>> you’ve proposed:
>>
>> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>>
>> There's also two related topics that I'd like to provide some context and
>> flavor on:
>> 1. The 10MB transaction size limit.
>> 2. The performance/latency/throughput implications of writing to multiple
>>     disjoint keys versus multiple nearby keys.
>>
>> ### Transaction Size Limit
>>
>> There was some previous concern surrounding the FDB 10MB transaction
>> limit,
>> what it means for CouchDB, and how to communicate the effective maximum
>> document size to users.  Index-on-write will impact this as well, as index
>> writes will be counted against the 10MB limit.
>>
>> As I don't believe I've seen a precise definition yet of the maximum size
>> of a
>> document allowed in CouchDB-on-FDB, let us assume it's defined as
>> "whatever FDB
>> will accept".  It would then be possible that a document can be inserted
>> that
>> just barely fits within the size limits, and then indexes can be defined
>> that
>> apply to the document.  One is now left with a document that exists, and
>> is
>> valid in the database, but couldn't be read and re-inserted into the
>> database,
>> because it would exceed the size limit.  My largest concern would be that
>> this
>> would have bad implications on CouchDB replication.
>>
>> Although the transaction limit is 10MB, it's generally advisable to aim
>> to stay
>> under 1MB for the cluster to perform well.  I've generally seen latency
>> start
>> to get strange for all clients when one client starts exceeding ~5MB.  If
>> one
>> requires the total size of a document to stay less than 1MB, then this
>> comfortably leaves room for indexes in the commit request, and the above
>> case
>> would be relatively unlikely.
>>
>> Exactly how unlikely involves dissecting exactly how the transaction size
>> limit
>> is calculated.
>>
>> The 10MB limit applies to the size of the commit request, and not to the
>> sum of
>> the key-value pair sizes, as one might expect.  The largest difference
>> between
>> the two meanings is that read and write conflict ranges count against the
>> 10MB
>> limit.  The rough formula for calculating commit size is:
>>
>> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>>
>> Each `get()` adds a read conflict range of the read key and the
>> sequentially
>> next key.  Each `set()` adds a write conflict range the same way, in
>> addition
>> to having to send the actual key and value.  This is the source of the
>> `2*` and
>> `3*` multipliers.
>>
>> `getRange()` adds a read conflict range that exactly reflects the
>> requested
>> range.  However, there is no `setRange()` to automatically optimize the
>> write
>> conflict ranges added for inserting sequential values.  If one is
>> inserting
>> keys that given the data model represent one collective, sequential unit,
>> such
>> as a full document exploded across multiple key-value pairs, it's
>> advisable to
>> use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each
>> individual
>> write conflict range into one large write conflict range, and thus greatly
>> reduce the size of the commit request that will be sent.
>>
>> So using this optimization, blindly inserting a document using the scheme
>> proposed by the "exploded KV" approach, one would end up with a formula
>> looking
>> more like
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values
>> written) $$
>>
>> With index-on-write, we now need to add inserting (prefix/value/key, "")
>> for each
>> index entry into the equation:
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values
>> written) +
>>    number of indexes * 3 * ( len(prefix) + len(one value) + len(one key)
>> ) $$
>>
>> And there's no similar write conflict range optimization that can be
>> done, as
>> index writes will be non-sequential and spread widely across the keyspace.
>>
>> Index types that can cause a large number of entries to be written will
>> have an
>> even larger affect on total commit size.  I’ve been using the mental
>> example of
>> indexing all the elements of an array nested deep within a document.
>> However,
>> if I’ve read CouchDB documentation correctly, all of the index types that
>> can
>> cause this explosion are grouped as part of Mango Search indexes, and
>> thus out
>> of scope for this proposal.  Mango JSON indexes seem to have a direct
>> correlation
>> that one (index,field) pair will turn into one Key-Value write to FDB.
>> Convenient!
>>
>> Therefore, in order for a user to run into Mango JSON index-induced
>> transaction
>> size limits, they would have to manually add an astounding number of
>> indexes.
>> I've failed to quickly find an answer as to if there's an existing maximum
>> number of Mango JSON indexes that can be added in a CouchDB instance, but
>> if
>> there is, this might not be a concern anyway.
>>
>> The worst case would be that each field of a document is indexed.
>> Assuming the
>> "exploded KV" proposal as the representation, each key and value would be
>> repeated 3 more times, and thus this would turn into
>>
>> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys
>> written) + 4*len(values written) $$
>>
>> So if an maximum exploded KV size of 1MB is chosen, transaction commit
>> requests
>> would be expected to stay roughly under 4MB.
>>
>> However, if you're thinking into the future, and considering that whatever
>> choice is taken for Mango JSON indexes should also be taken for Mango
>> Search
>> indexes as well, the above is definitely a concern.
>>
>> ### Indexing Write Penalty
>>
>> And, as we're now discussing what would happen with thousands of indexes
>> defined, let's briefly discuss what impact that would have on actually
>> committing that transaction.
>>
>> With most distributed databases, each index added equates to an additional
>> shard of data involved in the transaction, and thus likely an additional
>> server
>> and RPC.  This would lead to an increase in latency and concurrency
>> control
>> costs, as more indexes are added.  This adds a strong cost to each
>> additional
>> index.  One much larger than the actual cost of the disk IO involved in
>> doing
>> the actual index write.
>>
>> FoundationDB does all commits against a distributed write-ahead-log, and
>> asynchronously to the commit, applies the data to each shard of data
>> involved in
>> the transaction.  This means that a commit of 5 adjacent keys and 5 keys
>> randomly distributed across the keyspace will be processed in the same
>> way by
>> the transaction subsystem.  One would actually see better throughput at
>> saturation when writing to randomly distributed keys, as the work of
>> applying
>> those mutations gets sharded across multiple storage servers.
>>
>> Creating many indexes still means more disk IO, but is comparatively
>> cheaper on
>> FDB than other distributed database architectures.
>>
>> > On Apr 11, 2019, at 5:42 AM, Garren Smith <ga...@apache.org> wrote:
>> >
>> >
>> > I was chatting to Adam yesterday and I want to explore the
>> index-on-write
>> > indexing for Mango a bit more. I know there has been a bit of a
>> discussion
>> > that we should only use a background process to build mango indexes but
>> I
>> > think that building indexes as documents are created/updated along
>> combined
>> > with background processing for existing documents will be easier to
>> > implement. Especially in the beginning as we build the new fdb layer.
>> >
>> > Below is the process for building a new index:
>> >
>> > 1. When a user defines a new index on an existing database, save the
>> index
>> > definition and also save the sequence that the index was added at. The
>> > index should also be marked that it is in a `building`  phase so it
>> won’t
>> > be used yet to service queries. (I’ll come back to this later)
>> > 2. Any write requests after that must read the new index definition and
>> > update the index. When updating the new index, the writers should assume
>> > that previous versions of the document have already been indexed.
>> > 3. At the same time a background process will start reading sections of
>> the
>> > changes feed and building the index, this background process will keep
>> > processing the changes read until it reaches the sequence number that
>> the
>> > index was saved at. Once it reaches that point, the index is up to date
>> and
>> > will be marked as `active` and can be used to service queries.
>> > 4. There are some subtle behaviour around step 3 that is worth
>> mentioning.
>> > The background process will have the 5 second transaction limit, so it
>> will
>> > process smaller parts of the changes feed. Which means that it won’t
>> have
>> > one consistent view of the changes feed throughout the index building
>> > process. This will lead to a conflict situation. For example when the
>> > background process transaction is adding a document to the index while
>> at
>> > the same time a write request has a transaction that is updating the
>> same
>> > document. There are two possible outcomes to this, if the background
>> > process wins, the write request will get a conflict. At that point the
>> > write request will try to process the document again, read the old
>> values
>> > for that document, remove them from the index and add the new values to
>> the
>> > index. If the write request wins, and the background process gets a
>> > conflict, then the background process can try again, the document would
>> > have been removed from its old position in the changes feed and moved to
>> > the later position, so the background process won’t see the document and
>> > will then move on to the next one.
>> > 5. One other feature to add is to an index progress tracker. We can do
>> this
>> > by using doc_count for the database, and then have a counter value that
>> the
>> > background workers can increment with the number of documents it updated
>> > for each batch update.  We would also have to update this counter on
>> write
>> > requests while the index is in building mode.
>> > 6. Something we can also explore is splitting the building of the index
>> > across multiple workers, we can use the `get_boundary_keys` [1] API
>> call on
>> > the changes feed to get the full list of the changes feed keys grouped
>> by
>> > partition boundaries and then split that by workers.
>> >
>> > Adding a building and active state to the index’s is a bit of a breaking
>> > change, but I think its the right way to go. Currently with Mango what
>> can
>> > happen is a user creates an index and then immediately does a query that
>> > would use that index. Mango would then have to build the whole index
>> before
>> > responding to that request. In this new index-on-write process, Mango
>> would
>> > ignore the new index until it is active which I think is the better way
>> to
>> > go on this.
>> >
>> > Finally, a big acknowledgment to Adam who is the major brains behind
>> this
>> > design.
>> >
>> > What do you think, I would like to hear any thoughts, questions or
>> > suggestions on this design.
>> >
>> > Cheers
>> > Garren
>> >
>> > [1]
>> >
>> https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
>> >
>> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <ga...@apache.org> wrote:
>> >
>> >>
>> >>
>> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org>
>> wrote:
>> >>
>> >>> Hi Will, great comments, I have replies to a couple of them.
>> >>>
>> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com>
>> wrote:
>> >>>>
>> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>> >>>> whether we can guarantee that the generated keys will fit within the
>> >>>> maximum FDB key length or if there needs to be some thought as to the
>> >>>> failure mode / workaround. As Adam mentioned, it seems fine to store
>> an
>> >>>> encoded key given Mango (currently) always fetches the associated
>> >>> document
>> >>>> / fields from the primary index to filter on anyway. It might even be
>> >>>> beneficial to have an additional layer of indirection and allow
>> multiple
>> >>>> docs to be associated with each row so that we can maintain compact
>> >>> keys.
>> >>>
>> >>> Interesting thought on that layer of indirection; it reminds me of an
>> >>> optimization applied in the Record Layer’s text indexes. Would have to
>> >>> compare whether the extra reads needed to maintain the index that way
>> are
>> >>> an acceptable tradeoff.
>> >>>
>> >>> Good point on the sort key sizes, I’ve not seen any way to place a
>> >>> reliably safe upper bound on the size of one that might be generated.
>> The
>> >>> ICU folks have some hand-wavey guidance at
>> >>>
>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size
>> ,
>> >>> but it seems like we might be able to dig a little deeper.
>> >>>
>> >>> I personally haven’t given much thought to a workaround where a
>> >>> user-defined index key exceeds 10 KB. We’ll definitely need to handle
>> that
>> >>> failure mode safely even without the sort key complication — people
>> try
>> >>> crazy things :)
>> >>>
>> >>
>> >> For the 10 KB error, I think we should just return an error. As a
>> >> comparison, MongoDB has a 1024 Byte limit
>> >> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
>> >>
>> >>
>> >>>> 3. I don't immediately see how you clear previous values from the
>> index
>> >>>> when a doc is updated, but I could easily be missing something
>> obvious
>> >>> :)
>> >>>
>> >>> Ah yeah, this part wasn’t explicit, was it?
>> >>>
>> >>> I think the idea is that these are simple indexes on specific fields
>> of a
>> >>> document, and we have a data model where those fields are already
>> stored as
>> >>> their own keys in FDB, so there’s no need (in the case of Mango) to
>> >>> maintain a separate docid -> {viewid, [keys]} mapping like we do
>> today in
>> >>> each view group. Rather, the flow would go something like
>> >>>
>> >>> 1) Check which fields are supposed to be indexed
>> >>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>> >>> parent revision
>> >>> 3) Compare the parent values with the ones supplied in this
>> transaction;
>> >>> if any indexed values change, clear the old ones and insert the new
>> ones
>> >>>
>> >>> with some additional caveats around checking that the supplied edit is
>> >>> actually going to be winning (and therefore indexed) version after the
>> >>> commit succeeds.
>> >>>
>> >>>> 4. Regarding "Index on write" behaviour, is there something in the
>> >>> existing
>> >>>> design (Mango overlaying mrview / lucene) that would prevent this? I
>> can
>> >>>> see some benefit for certain workloads (and headaches for others)
>> but I
>> >>>> don't see that it's necessarily coupled to the Mango design given
>> >>>> background indexing of new/changed indexes needs to be supported
>> anyway.
>> >>>
>> >>> I’m not sure I understand your question. In my mind the reason “index
>> on
>> >>> write" is more applicable for Mango JSON than for generalized views is
>> >>> because in the view case batching is currently quite important to
>> achieve
>> >>> good throughput to the JS system. You’re of course correct that we
>> need to
>> >>> be able to re-generate Mango JSON indexes in the background as well.
>> >>>
>> >>> Adam
>> >>>
>> >>>
>> >>>
>>
>>

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Garren Smith <ga...@apache.org>.
Hi Alex,

Thanks for the really detailed explanation of the transaction size calculation.
That is really helpful. Adding a limit to the number of indexes is a good idea.

Cheers
Garren

On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <al...@apple.com.invalid>
wrote:

> There was some discussion on the FDB forums previously about how to do
> index backfill,
> If you hadn’t seen it before, but it’s already reasonably similar to what
> you’ve proposed:
>
> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>
> There's also two related topics that I'd like to provide some context and
> flavor on:
> 1. The 10MB transaction size limit.
> 2. The performance/latency/throughput implications of writing to multiple
>     disjoint keys versus multiple nearby keys.
>
> ### Transaction Size Limit
>
> There was some previous concern surrounding the FDB 10MB transaction limit,
> what it means for CouchDB, and how to communicate the effective maximum
> document size to users.  Index-on-write will impact this as well, as index
> writes will be counted against the 10MB limit.
>
> As I don't believe I've seen a precise definition yet of the maximum size
> of a
> document allowed in CouchDB-on-FDB, let us assume it's defined as
> "whatever FDB
> will accept".  It would then be possible that a document can be inserted
> that
> just barely fits within the size limits, and then indexes can be defined
> that
> apply to the document.  One is now left with a document that exists, and is
> valid in the database, but couldn't be read and re-inserted into the
> database,
> because it would exceed the size limit.  My largest concern would be that
> this
> would have bad implications on CouchDB replication.
>
> Although the transaction limit is 10MB, it's generally advisable to aim to
> stay
> under 1MB for the cluster to perform well.  I've generally seen latency
> start
> to get strange for all clients when one client starts exceeding ~5MB.  If
> one
> requires the total size of a document to stay less than 1MB, then this
> comfortably leaves room for indexes in the commit request, and the above
> case
> would be relatively unlikely.
>
> Exactly how unlikely involves dissecting exactly how the transaction size
> limit
> is calculated.
>
> The 10MB limit applies to the size of the commit request, and not to the
> sum of
> the key-value pair sizes, as one might expect.  The largest difference
> between
> the two meanings is that read and write conflict ranges count against the
> 10MB
> limit.  The rough formula for calculating commit size is:
>
> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>
> Each `get()` adds a read conflict range of the read key and the
> sequentially
> next key.  Each `set()` adds a write conflict range the same way, in
> addition
> to having to send the actual key and value.  This is the source of the
> `2*` and
> `3*` multipliers.
>
> `getRange()` adds a read conflict range that exactly reflects the requested
> range.  However, there is no `setRange()` to automatically optimize the
> write
> conflict ranges added for inserting sequential values.  If one is inserting
> keys that given the data model represent one collective, sequential unit,
> such
> as a full document exploded across multiple key-value pairs, it's
> advisable to
> use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each
> individual
> write conflict range into one large write conflict range, and thus greatly
> reduce the size of the commit request that will be sent.
>
> So using this optimization, blindly inserting a document using the scheme
> proposed by the "exploded KV" approach, one would end up with a formula
> looking
> more like
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values
> written) $$
>
> With index-on-write, we now need to add inserting (prefix/value/key, "")
> for each
> index entry into the equation:
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values
> written) +
>    number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) )
> $$
>
> And there's no similar write conflict range optimization that can be done,
> as
> index writes will be non-sequential and spread widely across the keyspace.
>
> Index types that can cause a large number of entries to be written will
> have an
> even larger affect on total commit size.  I’ve been using the mental
> example of
> indexing all the elements of an array nested deep within a document.
> However,
> if I’ve read CouchDB documentation correctly, all of the index types that
> can
> cause this explosion are grouped as part of Mango Search indexes, and thus
> out
> of scope for this proposal.  Mango JSON indexes seem to have a direct
> correlation
> that one (index,field) pair will turn into one Key-Value write to FDB.
> Convenient!
>
> Therefore, in order for a user to run into Mango JSON index-induced
> transaction
> size limits, they would have to manually add an astounding number of
> indexes.
> I've failed to quickly find an answer as to if there's an existing maximum
> number of Mango JSON indexes that can be added in a CouchDB instance, but
> if
> there is, this might not be a concern anyway.
>
> The worst case would be that each field of a document is indexed.
> Assuming the
> "exploded KV" proposal as the representation, each key and value would be
> repeated 3 more times, and thus this would turn into
>
> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written)
> + 4*len(values written) $$
>
> So if an maximum exploded KV size of 1MB is chosen, transaction commit
> requests
> would be expected to stay roughly under 4MB.
>
> However, if you're thinking into the future, and considering that whatever
> choice is taken for Mango JSON indexes should also be taken for Mango
> Search
> indexes as well, the above is definitely a concern.
>
> ### Indexing Write Penalty
>
> And, as we're now discussing what would happen with thousands of indexes
> defined, let's briefly discuss what impact that would have on actually
> committing that transaction.
>
> With most distributed databases, each index added equates to an additional
> shard of data involved in the transaction, and thus likely an additional
> server
> and RPC.  This would lead to an increase in latency and concurrency control
> costs, as more indexes are added.  This adds a strong cost to each
> additional
> index.  One much larger than the actual cost of the disk IO involved in
> doing
> the actual index write.
>
> FoundationDB does all commits against a distributed write-ahead-log, and
> asynchronously to the commit, applies the data to each shard of data
> involved in
> the transaction.  This means that a commit of 5 adjacent keys and 5 keys
> randomly distributed across the keyspace will be processed in the same way
> by
> the transaction subsystem.  One would actually see better throughput at
> saturation when writing to randomly distributed keys, as the work of
> applying
> those mutations gets sharded across multiple storage servers.
>
> Creating many indexes still means more disk IO, but is comparatively
> cheaper on
> FDB than other distributed database architectures.
>
> > On Apr 11, 2019, at 5:42 AM, Garren Smith <ga...@apache.org> wrote:
> >
> >
> > I was chatting to Adam yesterday and I want to explore the index-on-write
> > indexing for Mango a bit more. I know there has been a bit of a
> discussion
> > that we should only use a background process to build mango indexes but I
> > think that building indexes as documents are created/updated along
> combined
> > with background processing for existing documents will be easier to
> > implement. Especially in the beginning as we build the new fdb layer.
> >
> > Below is the process for building a new index:
> >
> > 1. When a user defines a new index on an existing database, save the
> index
> > definition and also save the sequence that the index was added at. The
> > index should also be marked that it is in a `building`  phase so it won’t
> > be used yet to service queries. (I’ll come back to this later)
> > 2. Any write requests after that must read the new index definition and
> > update the index. When updating the new index, the writers should assume
> > that previous versions of the document have already been indexed.
> > 3. At the same time a background process will start reading sections of
> the
> > changes feed and building the index, this background process will keep
> > processing the changes read until it reaches the sequence number that the
> > index was saved at. Once it reaches that point, the index is up to date
> and
> > will be marked as `active` and can be used to service queries.
> > 4. There are some subtle behaviour around step 3 that is worth
> mentioning.
> > The background process will have the 5 second transaction limit, so it
> will
> > process smaller parts of the changes feed. Which means that it won’t have
> > one consistent view of the changes feed throughout the index building
> > process. This will lead to a conflict situation. For example when the
> > background process transaction is adding a document to the index while at
> > the same time a write request has a transaction that is updating the same
> > document. There are two possible outcomes to this, if the background
> > process wins, the write request will get a conflict. At that point the
> > write request will try to process the document again, read the old values
> > for that document, remove them from the index and add the new values to
> the
> > index. If the write request wins, and the background process gets a
> > conflict, then the background process can try again, the document would
> > have been removed from its old position in the changes feed and moved to
> > the later position, so the background process won’t see the document and
> > will then move on to the next one.
> > 5. One other feature to add is to an index progress tracker. We can do
> this
> > by using doc_count for the database, and then have a counter value that
> the
> > background workers can increment with the number of documents it updated
> > for each batch update.  We would also have to update this counter on
> write
> > requests while the index is in building mode.
> > 6. Something we can also explore is splitting the building of the index
> > across multiple workers, we can use the `get_boundary_keys` [1] API call
> on
> > the changes feed to get the full list of the changes feed keys grouped by
> > partition boundaries and then split that by workers.
> >
> > Adding a building and active state to the index’s is a bit of a breaking
> > change, but I think its the right way to go. Currently with Mango what
> can
> > happen is a user creates an index and then immediately does a query that
> > would use that index. Mango would then have to build the whole index
> before
> > responding to that request. In this new index-on-write process, Mango
> would
> > ignore the new index until it is active which I think is the better way
> to
> > go on this.
> >
> > Finally, a big acknowledgment to Adam who is the major brains behind this
> > design.
> >
> > What do you think, I would like to hear any thoughts, questions or
> > suggestions on this design.
> >
> > Cheers
> > Garren
> >
> > [1]
> >
> https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
> >
> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <ga...@apache.org> wrote:
> >
> >>
> >>
> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org>
> wrote:
> >>
> >>> Hi Will, great comments, I have replies to a couple of them.
> >>>
> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
> >>>>
> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
> >>>> whether we can guarantee that the generated keys will fit within the
> >>>> maximum FDB key length or if there needs to be some thought as to the
> >>>> failure mode / workaround. As Adam mentioned, it seems fine to store
> an
> >>>> encoded key given Mango (currently) always fetches the associated
> >>> document
> >>>> / fields from the primary index to filter on anyway. It might even be
> >>>> beneficial to have an additional layer of indirection and allow
> multiple
> >>>> docs to be associated with each row so that we can maintain compact
> >>> keys.
> >>>
> >>> Interesting thought on that layer of indirection; it reminds me of an
> >>> optimization applied in the Record Layer’s text indexes. Would have to
> >>> compare whether the extra reads needed to maintain the index that way
> are
> >>> an acceptable tradeoff.
> >>>
> >>> Good point on the sort key sizes, I’ve not seen any way to place a
> >>> reliably safe upper bound on the size of one that might be generated.
> The
> >>> ICU folks have some hand-wavey guidance at
> >>>
> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
> >>> but it seems like we might be able to dig a little deeper.
> >>>
> >>> I personally haven’t given much thought to a workaround where a
> >>> user-defined index key exceeds 10 KB. We’ll definitely need to handle
> that
> >>> failure mode safely even without the sort key complication — people try
> >>> crazy things :)
> >>>
> >>
> >> For the 10 KB error, I think we should just return an error. As a
> >> comparison, MongoDB has a 1024 Byte limit
> >> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
> >>
> >>
> >>>> 3. I don't immediately see how you clear previous values from the
> index
> >>>> when a doc is updated, but I could easily be missing something obvious
> >>> :)
> >>>
> >>> Ah yeah, this part wasn’t explicit, was it?
> >>>
> >>> I think the idea is that these are simple indexes on specific fields
> of a
> >>> document, and we have a data model where those fields are already
> stored as
> >>> their own keys in FDB, so there’s no need (in the case of Mango) to
> >>> maintain a separate docid -> {viewid, [keys]} mapping like we do today
> in
> >>> each view group. Rather, the flow would go something like
> >>>
> >>> 1) Check which fields are supposed to be indexed
> >>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
> >>> parent revision
> >>> 3) Compare the parent values with the ones supplied in this
> transaction;
> >>> if any indexed values change, clear the old ones and insert the new
> ones
> >>>
> >>> with some additional caveats around checking that the supplied edit is
> >>> actually going to be winning (and therefore indexed) version after the
> >>> commit succeeds.
> >>>
> >>>> 4. Regarding "Index on write" behaviour, is there something in the
> >>> existing
> >>>> design (Mango overlaying mrview / lucene) that would prevent this? I
> can
> >>>> see some benefit for certain workloads (and headaches for others) but
> I
> >>>> don't see that it's necessarily coupled to the Mango design given
> >>>> background indexing of new/changed indexes needs to be supported
> anyway.
> >>>
> >>> I’m not sure I understand your question. In my mind the reason “index
> on
> >>> write" is more applicable for Mango JSON than for generalized views is
> >>> because in the view case batching is currently quite important to
> achieve
> >>> good throughput to the JS system. You’re of course correct that we
> need to
> >>> be able to re-generate Mango JSON indexes in the background as well.
> >>>
> >>> Adam
> >>>
> >>>
> >>>
>
>

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Alex Miller <al...@apple.com.INVALID>.
There was some discussion on the FDB forums previously about how to do index backfill.
In case you hadn’t seen it before, it’s already reasonably similar to what you’ve proposed:
https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2

There are also two related topics that I'd like to provide some context and flavor on:
1. The 10MB transaction size limit.
2. The performance/latency/throughput implications of writing to multiple
    disjoint keys versus multiple nearby keys.

### Transaction Size Limit

There was some previous concern surrounding the FDB 10MB transaction limit,
what it means for CouchDB, and how to communicate the effective maximum
document size to users.  Index-on-write will impact this as well, as index
writes will be counted against the 10MB limit.

As I don't believe I've seen a precise definition yet of the maximum size of a
document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB
will accept".  It would then be possible that a document can be inserted that
just barely fits within the size limits, and then indexes can be defined that
apply to the document.  One is now left with a document that exists, and is
valid in the database, but couldn't be read and re-inserted into the database,
because it would exceed the size limit.  My largest concern would be that this
would have bad implications for CouchDB replication.

Although the transaction limit is 10MB, it's generally advisable to aim to stay
under 1MB for the cluster to perform well.  I've generally seen latency start
to get strange for all clients when one client starts exceeding ~5MB.  If one
requires the total size of a document to stay less than 1MB, then this
comfortably leaves room for indexes in the commit request, and the above case
would be relatively unlikely.

How unlikely that is depends on exactly how the transaction size limit
is calculated.

The 10MB limit applies to the size of the commit request, and not to the sum of
the key-value pair sizes, as one might expect.  The largest difference between
the two meanings is that read and write conflict ranges count against the 10MB
limit.  The rough formula for calculating commit size is:

$$ 2*len(keys read) + 3*len(keys written) + len(values written) $$

Each `get()` adds a read conflict range of the read key and the sequentially
next key.  Each `set()` adds a write conflict range the same way, in addition
to having to send the actual key and value.  This is the source of the `2*` and
`3*` multipliers.

`getRange()` adds a read conflict range that exactly reflects the requested
range.  However, there is no `setRange()` to automatically optimize the write
conflict ranges added for inserting sequential values.  If one is inserting
keys that, given the data model, represent one collective, sequential unit, such
as a full document exploded across multiple key-value pairs, it's advisable to
use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each individual
write conflict range into one large write conflict range, and thus greatly
reduce the size of the commit request that will be sent.
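
To make that concrete, here is a hedged sketch using the FDB Python bindings of
how an "exploded" document write might collapse its write conflict ranges. The
subspace layout, function name and argument shapes are illustrative assumptions,
not the actual CouchDB-on-FDB data model:

    import fdb
    fdb.api_version(610)

    @fdb.transactional
    def write_exploded_doc(tr, doc_space, doc_id, field_kvs):
        # field_kvs: list of (field_tuple, value_bytes) pairs that together
        # make up one document exploded across adjacent keys (assumption).
        keys = sorted(doc_space.pack((doc_id,) + fields) for fields, _ in field_kvs)
        for fields, value in field_kvs:
            # Suppress the per-key write conflict range for this set() ...
            tr.options.set_next_write_no_write_conflict_range()
            tr[doc_space.pack((doc_id,) + fields)] = value
        # ... and cover the whole document with a single range instead, which
        # is what keeps the commit request small.
        tr.add_write_conflict_range(keys[0], keys[-1] + b'\x00')
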

So using this optimization, blindly inserting a document using the scheme
proposed by the "exploded KV" approach, one would end up with a formula looking
more like

$$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$

With index-on-write, we now need to add inserting (prefix/value/key, "") for each
index entry into the equation:

$$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) +
   number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$

And there's no similar write conflict range optimization that can be done, as
index writes will be non-sequential and spread widely across the keyspace.
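
For a rough feel of the arithmetic, here is a small back-of-the-envelope helper
following the formulas above; the real accounting is done inside the client
library, so treat this strictly as an estimate with assumed argument shapes:

    def estimate_commit_size(keys_read, doc_kvs, index_kvs):
        # keys_read: keys fetched with get(); doc_kvs / index_kvs: (key, value)
        # byte pairs written for the exploded document and for index entries.
        read_conflicts = sum(2 * len(k) for k in keys_read)
        # Exploded document: one collapsed write conflict range (begin + end)
        # plus the keys and values themselves.
        doc_cost = 0
        if doc_kvs:
            doc_cost = 2 * max(len(k) for k, _ in doc_kvs) + \
                       sum(len(k) + len(v) for k, v in doc_kvs)
        # Index entries: no range collapsing possible, so each write pays for
        # the key once as a mutation and twice more as its conflict range.
        index_cost = sum(3 * len(k) + len(v) for k, v in index_kvs)
        return read_conflicts + doc_cost + index_cost
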

Index types that can cause a large number of entries to be written will have an
even larger effect on total commit size.  I’ve been using the mental example of
indexing all the elements of an array nested deep within a document.  However,
if I’ve read CouchDB documentation correctly, all of the index types that can
cause this explosion are grouped as part of Mango Search indexes, and thus out
of scope for this proposal.  Mango JSON indexes seem to have a direct correlation
that one (index,field) pair will turn into one Key-Value write to FDB. Convenient!

Therefore, in order for a user to run into Mango JSON index-induced transaction
size limits, they would have to manually add an astounding number of indexes.
I've failed to quickly find an answer as to whether there's an existing maximum
number of Mango JSON indexes that can be added in a CouchDB instance, but if
there is, this might not be a concern anyway.

The worst case would be that each field of a document is indexed.  Assuming the
"exploded KV" proposal as the representation, each key and value would be
repeated 3 more times, and thus this would turn into

$$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) + 4*len(values written) $$

So if a maximum exploded KV size of 1MB is chosen, transaction commit requests
would be expected to stay roughly under 4MB.

However, if you're thinking into the future, and considering that whatever
choice is taken for Mango JSON indexes should also be taken for Mango Search
indexes, the above is definitely a concern.

### Indexing Write Penalty

And, as we're now discussing what would happen with thousands of indexes
defined, let's briefly discuss what impact that would have on actually
committing that transaction.

With most distributed databases, each index added equates to an additional
shard of data involved in the transaction, and thus likely an additional server
and RPC.  This would lead to an increase in latency and concurrency control
costs as more indexes are added.  This adds a significant cost to each
additional index, one much larger than the cost of the disk IO involved in
doing the index write itself.

FoundationDB does all commits against a distributed write-ahead-log, and
asynchronously to the commit, applies the data to each shard of data involved in
the transaction.  This means that a commit of 5 adjacent keys and 5 keys
randomly distributed across the keyspace will be processed in the same way by
the transaction subsystem.  One would actually see better throughput at
saturation when writing to randomly distributed keys, as the work of applying
those mutations gets sharded across multiple storage servers.

Creating many indexes still means more disk IO, but is comparatively cheaper on
FDB than other distributed database architectures.

> On Apr 11, 2019, at 5:42 AM, Garren Smith <ga...@apache.org> wrote:
> 
> 
> I was chatting to Adam yesterday and I want to explore the index-on-write
> indexing for Mango a bit more. I know there has been a bit of a discussion
> that we should only use a background process to build mango indexes but I
> think that building indexes as documents are created/updated along combined
> with background processing for existing documents will be easier to
> implement. Especially in the beginning as we build the new fdb layer.
> 
> Below is the process for building a new index:
> 
> 1. When a user defines a new index on an existing database, save the index
> definition and also save the sequence that the index was added at. The
> index should also be marked that it is in a `building`  phase so it won’t
> be used yet to service queries. (I’ll come back to this later)
> 2. Any write requests after that must read the new index definition and
> update the index. When updating the new index, the writers should assume
> that previous versions of the document have already been indexed.
> 3. At the same time a background process will start reading sections of the
> changes feed and building the index, this background process will keep
> processing the changes read until it reaches the sequence number that the
> index was saved at. Once it reaches that point, the index is up to date and
> will be marked as `active` and can be used to service queries.
> 4. There are some subtle behaviour around step 3 that is worth mentioning.
> The background process will have the 5 second transaction limit, so it will
> process smaller parts of the changes feed. Which means that it won’t have
> one consistent view of the changes feed throughout the index building
> process. This will lead to a conflict situation. For example when the
> background process transaction is adding a document to the index while at
> the same time a write request has a transaction that is updating the same
> document. There are two possible outcomes to this, if the background
> process wins, the write request will get a conflict. At that point the
> write request will try to process the document again, read the old values
> for that document, remove them from the index and add the new values to the
> index. If the write request wins, and the background process gets a
> conflict, then the background process can try again, the document would
> have been removed from its old position in the changes feed and moved to
> the later position, so the background process won’t see the document and
> will then move on to the next one.
> 5. One other feature to add is to an index progress tracker. We can do this
> by using doc_count for the database, and then have a counter value that the
> background workers can increment with the number of documents it updated
> for each batch update.  We would also have to update this counter on write
> requests while the index is in building mode.
> 6. Something we can also explore is splitting the building of the index
> across multiple workers, we can use the `get_boundary_keys` [1] API call on
> the changes feed to get the full list of the changes feed keys grouped by
> partition boundaries and then split that by workers.
> 
> Adding a building and active state to the index’s is a bit of a breaking
> change, but I think its the right way to go. Currently with Mango what can
> happen is a user creates an index and then immediately does a query that
> would use that index. Mango would then have to build the whole index before
> responding to that request. In this new index-on-write process, Mango would
> ignore the new index until it is active which I think is the better way to
> go on this.
> 
> Finally, a big acknowledgment to Adam who is the major brains behind this
> design.
> 
> What do you think, I would like to hear any thoughts, questions or
> suggestions on this design.
> 
> Cheers
> Garren
> 
> [1]
> https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
> 
> On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <ga...@apache.org> wrote:
> 
>> 
>> 
>> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org> wrote:
>> 
>>> Hi Will, great comments, I have replies to a couple of them.
>>> 
>>>> On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
>>>> 
>>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>>>> whether we can guarantee that the generated keys will fit within the
>>>> maximum FDB key length or if there needs to be some thought as to the
>>>> failure mode / workaround. As Adam mentioned, it seems fine to store an
>>>> encoded key given Mango (currently) always fetches the associated
>>> document
>>>> / fields from the primary index to filter on anyway. It might even be
>>>> beneficial to have an additional layer of indirection and allow multiple
>>>> docs to be associated with each row so that we can maintain compact
>>> keys.
>>> 
>>> Interesting thought on that layer of indirection; it reminds me of an
>>> optimization applied in the Record Layer’s text indexes. Would have to
>>> compare whether the extra reads needed to maintain the index that way are
>>> an acceptable tradeoff.
>>> 
>>> Good point on the sort key sizes, I’ve not seen any way to place a
>>> reliably safe upper bound on the size of one that might be generated. The
>>> ICU folks have some hand-wavey guidance at
>>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
>>> but it seems like we might be able to dig a little deeper.
>>> 
>>> I personally haven’t given much thought to a workaround where a
>>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
>>> failure mode safely even without the sort key complication — people try
>>> crazy things :)
>>> 
>> 
>> For the 10 KB error, I think we should just return an error. As a
>> comparison, MongoDB has a 1024 Byte limit
>> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
>> 
>> 
>>>> 3. I don't immediately see how you clear previous values from the index
>>>> when a doc is updated, but I could easily be missing something obvious
>>> :)
>>> 
>>> Ah yeah, this part wasn’t explicit, was it?
>>> 
>>> I think the idea is that these are simple indexes on specific fields of a
>>> document, and we have a data model where those fields are already stored as
>>> their own keys in FDB, so there’s no need (in the case of Mango) to
>>> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
>>> each view group. Rather, the flow would go something like
>>> 
>>> 1) Check which fields are supposed to be indexed
>>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>>> parent revision
>>> 3) Compare the parent values with the ones supplied in this transaction;
>>> if any indexed values change, clear the old ones and insert the new ones
>>> 
>>> with some additional caveats around checking that the supplied edit is
>>> actually going to be winning (and therefore indexed) version after the
>>> commit succeeds.
>>> 
>>>> 4. Regarding "Index on write" behaviour, is there something in the
>>> existing
>>>> design (Mango overlaying mrview / lucene) that would prevent this? I can
>>>> see some benefit for certain workloads (and headaches for others) but I
>>>> don't see that it's necessarily coupled to the Mango design given
>>>> background indexing of new/changed indexes needs to be supported anyway.
>>> 
>>> I’m not sure I understand your question. In my mind the reason “index on
>>> write" is more applicable for Mango JSON than for generalized views is
>>> because in the view case batching is currently quite important to achieve
>>> good throughput to the JS system. You’re of course correct that we need to
>>> be able to re-generate Mango JSON indexes in the background as well.
>>> 
>>> Adam
>>> 
>>> 
>>> 


Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Garren Smith <ga...@apache.org>.
Hi All,

I was chatting to Adam yesterday and I want to explore the index-on-write
indexing for Mango a bit more. I know there has been a bit of a discussion
that we should only use a background process to build mango indexes, but I
think that building indexes as documents are created/updated, combined
with background processing for existing documents, will be easier to
implement, especially in the beginning as we build the new fdb layer.

Below is the process for building a new index:

1. When a user defines a new index on an existing database, save the index
definition and also save the sequence that the index was added at. The
index should also be marked as being in a `building` phase so it won’t
be used yet to service queries. (I’ll come back to this later)
2. Any write requests after that must read the new index definition and
update the index. When updating the new index, the writers should assume
that previous versions of the document have already been indexed.
3. At the same time a background process will start reading sections of the
changes feed and building the index, this background process will keep
processing the changes read until it reaches the sequence number that the
index was saved at. Once it reaches that point, the index is up to date and
will be marked as `active` and can be used to service queries.
4. There is some subtle behaviour around step 3 that is worth mentioning.
The background process is subject to the 5 second transaction limit, so it
will process smaller parts of the changes feed, which means it won’t have
one consistent view of the changes feed throughout the index building
process. This can lead to conflicts, for example when a background process
transaction is adding a document to the index while a write request has a
transaction updating that same document at the same time. There are two
possible outcomes. If the background process wins, the write request will
get a conflict; at that point the write request will try to process the
document again, read the old values for that document, remove them from
the index and add the new values to the index. If the write request wins
and the background process gets a conflict, the background process can
simply try again: the document would have been removed from its old
position in the changes feed and moved to a later position, so the
background process won’t see it and will move on to the next one.
5. One other feature to add is an index progress tracker. We can do this
by using doc_count for the database, and then have a counter value that the
background workers can increment with the number of documents they updated
for each batch update.  We would also have to update this counter on write
requests while the index is in building mode.
6. Something we can also explore is splitting the building of the index
across multiple workers. We can use the `get_boundary_keys` [1] API call on
the changes feed to get the full list of the changes feed keys grouped by
partition boundaries and then split that across workers (see the sketch
after this list).
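
As a hedged sketch of point 6, something like the following could carve the
changes-feed key range into chunks along FDB's own shard boundaries; the
function name and the begin/end key arguments are assumptions, not an agreed
API:

    import fdb
    fdb.api_version(610)

    def changes_feed_chunks(db, begin_key, end_key):
        # Boundary keys returned by the locality API roughly follow shard
        # boundaries; clamp them to the changes-feed range we care about.
        boundaries = [k for k in fdb.locality.get_boundary_keys(db, begin_key, end_key)
                      if begin_key < k < end_key]
        edges = [begin_key] + boundaries + [end_key]
        # Each (start, stop) pair can be handed to a separate worker.
        return list(zip(edges, edges[1:]))

Each worker would then read its slice of the changes feed in small batches,
keeping every transaction comfortably under the 5 second limit.
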

Adding a building and active state to the indexes is a bit of a breaking
change, but I think it’s the right way to go. Currently with Mango what can
happen is a user creates an index and then immediately does a query that
would use that index; Mango would then have to build the whole index before
responding to that request. In this new index-on-write process, Mango would
ignore the new index until it is active, which I think is the better way to
go on this.

Finally, a big acknowledgment to Adam who is the major brains behind this
design.

What do you think, I would like to hear any thoughts, questions or
suggestions on this design.

Cheers
Garren

[1]
https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys

On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <ga...@apache.org> wrote:

>
>
> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org> wrote:
>
>> Hi Will, great comments, I have replies to a couple of them.
>>
>> > On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
>> >
>> > 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>> > whether we can guarantee that the generated keys will fit within the
>> > maximum FDB key length or if there needs to be some thought as to the
>> > failure mode / workaround. As Adam mentioned, it seems fine to store an
>> > encoded key given Mango (currently) always fetches the associated
>> document
>> > / fields from the primary index to filter on anyway. It might even be
>> > beneficial to have an additional layer of indirection and allow multiple
>> > docs to be associated with each row so that we can maintain compact
>> keys.
>>
>> Interesting thought on that layer of indirection; it reminds me of an
>> optimization applied in the Record Layer’s text indexes. Would have to
>> compare whether the extra reads needed to maintain the index that way are
>> an acceptable tradeoff.
>>
>> Good point on the sort key sizes, I’ve not seen any way to place a
>> reliably safe upper bound on the size of one that might be generated. The
>> ICU folks have some hand-wavey guidance at
>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
>> but it seems like we might be able to dig a little deeper.
>>
>> I personally haven’t given much thought to a workaround where a
>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
>> failure mode safely even without the sort key complication — people try
>> crazy things :)
>>
>
> For the 10 KB error, I think we should just return an error. As a
> comparison, MongoDB has a 1024 Byte limit
> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
>
>
>> > 3. I don't immediately see how you clear previous values from the index
>> > when a doc is updated, but I could easily be missing something obvious
>> :)
>>
>> Ah yeah, this part wasn’t explicit, was it?
>>
>> I think the idea is that these are simple indexes on specific fields of a
>> document, and we have a data model where those fields are already stored as
>> their own keys in FDB, so there’s no need (in the case of Mango) to
>> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
>> each view group. Rather, the flow would go something like
>>
>> 1) Check which fields are supposed to be indexed
>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>> parent revision
>> 3) Compare the parent values with the ones supplied in this transaction;
>> if any indexed values change, clear the old ones and insert the new ones
>>
>> with some additional caveats around checking that the supplied edit is
>> actually going to be winning (and therefore indexed) version after the
>> commit succeeds.
>>
>> > 4. Regarding "Index on write" behaviour, is there something in the
>> existing
>> > design (Mango overlaying mrview / lucene) that would prevent this? I can
>> > see some benefit for certain workloads (and headaches for others) but I
>> > don't see that it's necessarily coupled to the Mango design given
>> > background indexing of new/changed indexes needs to be supported anyway.
>>
>> I’m not sure I understand your question. In my mind the reason “index on
>> write" is more applicable for Mango JSON than for generalized views is
>> because in the view case batching is currently quite important to achieve
>> good throughput to the JS system. You’re of course correct that we need to
>> be able to re-generate Mango JSON indexes in the background as well.
>>
>> Adam
>>
>>
>>

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Garren Smith <ga...@apache.org>.
On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <ko...@apache.org> wrote:

> Hi Will, great comments, I have replies to a couple of them.
>
> > On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
> >
> > 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
> > whether we can guarantee that the generated keys will fit within the
> > maximum FDB key length or if there needs to be some thought as to the
> > failure mode / workaround. As Adam mentioned, it seems fine to store an
> > encoded key given Mango (currently) always fetches the associated
> document
> > / fields from the primary index to filter on anyway. It might even be
> > beneficial to have an additional layer of indirection and allow multiple
> > docs to be associated with each row so that we can maintain compact keys.
>
> Interesting thought on that layer of indirection; it reminds me of an
> optimization applied in the Record Layer’s text indexes. Would have to
> compare whether the extra reads needed to maintain the index that way are
> an acceptable tradeoff.
>
> Good point on the sort key sizes, I’ve not seen any way to place a
> reliably safe upper bound on the size of one that might be generated. The
> ICU folks have some hand-wavey guidance at
> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
> but it seems like we might be able to dig a little deeper.
>
> I personally haven’t given much thought to a workaround where a
> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
> failure mode safely even without the sort key complication — people try
> crazy things :)
>

For keys that would exceed the 10 KB limit, I think we should just return an
error. As a comparison, MongoDB has a 1024-byte limit
https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
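
A minimal sketch of that behaviour; the constant and the error shape here are
assumptions rather than a settled API:

    MAX_FDB_KEY_SIZE = 10_000  # FoundationDB rejects keys larger than 10 kB

    def validate_index_key(encoded_key):
        # Fail fast when building or updating the index, rather than letting
        # the transaction surface an opaque error at commit time.
        if len(encoded_key) > MAX_FDB_KEY_SIZE:
            raise ValueError(
                "indexed field values encode to %d bytes, which exceeds the "
                "10 kB FoundationDB key limit" % len(encoded_key))
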


> > 3. I don't immediately see how you clear previous values from the index
> > when a doc is updated, but I could easily be missing something obvious :)
>
> Ah yeah, this part wasn’t explicit, was it?
>
> I think the idea is that these are simple indexes on specific fields of a
> document, and we have a data model where those fields are already stored as
> their own keys in FDB, so there’s no need (in the case of Mango) to
> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
> each view group. Rather, the flow would go something like
>
> 1) Check which fields are supposed to be indexed
> 2) Retrieve values for those fields in the ?DOCUMENTS space for the parent
> revision
> 3) Compare the parent values with the ones supplied in this transaction;
> if any indexed values change, clear the old ones and insert the new ones
>
> with some additional caveats around checking that the supplied edit is
> actually going to be winning (and therefore indexed) version after the
> commit succeeds.
>
> > 4. Regarding "Index on write" behaviour, is there something in the
> existing
> > design (Mango overlaying mrview / lucene) that would prevent this? I can
> > see some benefit for certain workloads (and headaches for others) but I
> > don't see that it's necessarily coupled to the Mango design given
> > background indexing of new/changed indexes needs to be supported anyway.
>
> I’m not sure I understand your question. In my mind the reason “index on
> write" is more applicable for Mango JSON than for generalized views is
> because in the view case batching is currently quite important to achieve
> good throughput to the JS system. You’re of course correct that we need to
> be able to re-generate Mango JSON indexes in the background as well.
>
> Adam
>
>
>

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Will, great comments, I have replies to a couple of them.

> On Apr 1, 2019, at 5:21 AM, Will Holley <wi...@gmail.com> wrote:
> 
> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
> whether we can guarantee that the generated keys will fit within the
> maximum FDB key length or if there needs to be some thought as to the
> failure mode / workaround. As Adam mentioned, it seems fine to store an
> encoded key given Mango (currently) always fetches the associated document
> / fields from the primary index to filter on anyway. It might even be
> beneficial to have an additional layer of indirection and allow multiple
> docs to be associated with each row so that we can maintain compact keys.

Interesting thought on that layer of indirection; it reminds me of an optimization applied in the Record Layer’s text indexes. Would have to compare whether the extra reads needed to maintain the index that way are an acceptable tradeoff.

Good point on the sort key sizes, I’ve not seen any way to place a reliably safe upper bound on the size of one that might be generated. The ICU folks have some hand-wavey guidance at http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size, but it seems like we might be able to dig a little deeper.

I personally haven’t given much thought to a workaround where a user-defined index key exceeds 10 KB. We’ll definitely need to handle that failure mode safely even without the sort key complication — people try crazy things :)

> 3. I don't immediately see how you clear previous values from the index
> when a doc is updated, but I could easily be missing something obvious :)

Ah yeah, this part wasn’t explicit, was it?

I think the idea is that these are simple indexes on specific fields of a document, and we have a data model where those fields are already stored as their own keys in FDB, so there’s no need (in the case of Mango) to maintain a separate docid -> {viewid, [keys]} mapping like we do today in each view group. Rather, the flow would go something like

1) Check which fields are supposed to be indexed
2) Retrieve values for those fields in the ?DOCUMENTS space for the parent revision
3) Compare the parent values with the ones supplied in this transaction; if any indexed values change, clear the old ones and insert the new ones

with some additional caveats around checking that the supplied edit is actually going to be the winning (and therefore indexed) version after the commit succeeds.
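
A hedged sketch of that flow with the Python bindings; the subspace layout,
field extraction and names are illustrative assumptions, not the actual
CouchDB-on-FDB data model:

    import fdb
    fdb.api_version(610)

    @fdb.transactional
    def maintain_index_entry(tr, index_space, indexed_fields, doc_id, old_doc, new_doc):
        # Steps 1-3 above: compare the indexed field values of the parent
        # revision against the new edit and only touch the index if they differ.
        # old_doc stands in for values read back from the ?DOCUMENTS space.
        old_key = tuple(old_doc.get(f) for f in indexed_fields) if old_doc else None
        new_key = tuple(new_doc.get(f) for f in indexed_fields)
        if old_key == new_key:
            return  # indexed values unchanged; nothing to clear or insert
        if old_key is not None:
            del tr[index_space.pack(old_key + (doc_id,))]  # clear the stale row
        tr[index_space.pack(new_key + (doc_id,))] = b''    # insert the new row
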

> 4. Regarding "Index on write" behaviour, is there something in the existing
> design (Mango overlaying mrview / lucene) that would prevent this? I can
> see some benefit for certain workloads (and headaches for others) but I
> don't see that it's necessarily coupled to the Mango design given
> background indexing of new/changed indexes needs to be supported anyway.

I’m not sure I understand your question. In my mind the reason “index on write" is more applicable for Mango JSON than for generalized views is because in the view case batching is currently quite important to achieve good throughput to the JS system. You’re of course correct that we need to be able to re-generate Mango JSON indexes in the background as well.

Adam



Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Will Holley <wi...@gmail.com>.
Thanks for kicking this off Garren! I have a few questions / thoughts:

1. Will these indexes continue to have associated design documents? I
assume this aspect wouldn't change, but it would be good to be explicit.
Aside from being required for replication, it's useful to be able to query
the /db/_design/<ddoc>/_info endpoint to get the index size etc. Perhaps
this also addresses the issue around how to get the set of available
indexes?

2. Does the ICU sort key have a bounded length? Mostly I'm wondering
whether we can guarantee that the generated keys will fit within the
maximum FDB key length or if there needs to be some thought as to the
failure mode / workaround. As Adam mentioned, it seems fine to store an
encoded key given Mango (currently) always fetches the associated document
/ fields from the primary index to filter on anyway. It might even be
beneficial to have an additional layer of indirection and allow multiple
docs to be associated with each row so that we can maintain compact keys.

3. I don't immediately see how you clear previous values from the index
when a doc is updated, but I could easily be missing something obvious :)

4. Regarding "Index on write" behaviour, is there something in the existing
design (Mango overlaying mrview / lucene) that would prevent this? I can
see some benefit for certain workloads (and headaches for others) but I
don't see that it's necessarily coupled to the Mango design given
background indexing of new/changed indexes needs to be supported anyway.

5. I have a few concerns about the performance given we'll no longer be
able to push down filtering to the data storage tier. It sounds [1] like a
key thing will be the extent to which we can concurrently fetch/filter
non-overlapping key ranges from FDB - likely something that will come up in
relation to views as well. This gets a bit more fun when considering
multi-tenancy, and is perhaps something to stew on whilst getting something
functional together.

Will

[1] https://forums.foundationdb.org/t/secondary-indexing-approaches/792/4


On Thu, 28 Mar 2019 at 17:48, Adam Kocoloski <ko...@apache.org> wrote:

> Hi Garren, cool, this is a good start.
>
> On the ICU side of things, Russell pointed out that sort keys are a
> one-way trip; i.e., there’s no way to recover the original string from a
> sort key. For the initial pass at Mango I think that’s OK, as we’re reading
> the indexed documents anyway. When we get to views I guess the design will
> need to store the original string in the value so that we can return it as
> the “key” field in the response.
>
> Adam
>
> > On Mar 28, 2019, at 7:01 AM, Garren Smith <ga...@apache.org> wrote:
> >
> > Hi everyone,
> >
> >
> > I want to start a discussion, with the aim of an RFC, around implementing
> > Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
> > above CouchDB map/reduce indexes, but with FoundationDB we can make them
> > separate indexes in FoundationDB. This gives us the possibility of being
> > able to update the indexes in the same transaction that a document is
> being
> > saved in. Later we can look at adding specific mango like covering
> indexes.
> >
> >
> > Lets dive into the data model. Currently a user defines an index like
> this:
> >
> >
> > {
> >
> >  name: ‘view-name’ - optional will be auto-generated
> >
> >  index: {
> >
> >    fields: [‘fieldA’, ‘fieldB’]
> >
> >  },
> >
> >  partial_filter_selector {} - optional
> >
> > }
> >
> >
> > For query planning we need to be able to access the list of available
> > indexes. So we would have a index_definitions subspace with the following
> > content:
> >
> >
> > (<fieldname1>, …<rest of fields>) = (<index_name>,
> > <partial_filter_selector>)
> >
> >
> > Otherwise we could just store the index definitions as:
> >
> > (index_name) = ((fields), partial_filter_selector).
> >
> >
> > At this stage, I can’t think of a fancy way of storing the index
> > definitions so that when we need to select an index for a query there
> would
> > be a fast way to only fetch a subset of the indexes. I think the best is
> to
> > rather fetch them all like we currently do and process them. However, we
> > can look at caching these index definitions in the application layer, and
> > using FoundationDB watches[0] to notify us when a definition has changed
> so
> > we can update the cached definitions.
> >
> >
> > Then each index definition will have its own dedicated subspace for the
> > actual built index key/values. Keys in this subspace would be the fields
> > defined in the index with the doc id at the end of the tuple, e.g for an
> > index with fields name and age, it would be:
> >
> >
> > (“john”, 40, “doc-id-1) = null
> >
> > (“mary”, 30, “doc-id-2) = null
> >
> >
> > This follows the same key format that document layer[1] does for its
> > indexes. One point to make here is that the doc id is kept in the key
> part
> > so that we can avoid duplicate keys.
> >
> >
> > Then in terms of sorting the keys, current CouchDB uses ICU to sort all
> > secondary indexes. We would need to use ICU to sort the indexes for FDB
> but
> > we would have to do it differently. We will not be able to use ICU
> > collation operations directly, instead, we are going to have to look at
> > using ICU’s sort key[1] to generate a sort key ahead of time. At the same
> > time we need to look at creating binary encoding to capture the way that
> > CouchDB currently sorts object, array and numbers. This would most likely
> > be a sort of key prefix that we add to each key field along with the sort
> > key generated from ICU.
> >
> >
> > In terms of keeping mango indexes up to date, we should be able to update
> > all existing indexes in the same transaction as a document is
> > updated/created, this means we shouldn’t have to have any background
> > process keeping mango indexes updated. Though I imagine we going to have
> to
> > look at a background process that does update and build new indexes on an
> > existing index. We will have to do some decent performance testing around
> > this to determine the best solution, but looking at document layer they
> > seem to recommend updating the indexes in the transaction rather than in
> a
> > background process.
> >
> >
> > In the future, we could look at using the value space to store covering
> > indexed or materialized views. That way we would not need to always read
> > from the by_id when quering with Mango. Which would be a nice performance
> > improvement.
> >
> >
> >
> > Please let me know any thoughts, improvements, suggestions or questions
> > around this.
> >
> >
> >
> > [0] https://apple.github.io/foundationdb/features.html#watches
> >
> > [1] https://github.com/FoundationDB/fdb-document-layer
> >
> > [2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
>
>

Re: [DISCUSS] Implementing Mango Indexes for FoundationDB

Posted by Adam Kocoloski <ko...@apache.org>.
Hi Garren, cool, this is a good start.

On the ICU side of things, Russell pointed out that sort keys are a one-way trip; i.e., there’s no way to recover the original string from a sort key. For the initial pass at Mango I think that’s OK, as we’re reading the indexed documents anyway. When we get to views I guess the design will need to store the original string in the value so that we can return it as the “key” field in the response.
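
For illustration, a small sketch with PyICU (assuming its Collator.getSortKey
binding) of how an index row might carry the one-way sort key in the key while
keeping the original string in the value; the locale choice and row shape are
assumptions:

    import icu  # PyICU

    def sortable_index_row(raw_value, doc_id):
        # Sort keys can't be reversed, so the key holds the collation-ordered
        # bytes and the value keeps the original string for returning "key"
        # later in a view response.
        collator = icu.Collator.createInstance(icu.Locale('en'))
        sort_key = collator.getSortKey(raw_value)
        return (sort_key, doc_id), raw_value.encode('utf-8')
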

Adam

> On Mar 28, 2019, at 7:01 AM, Garren Smith <ga...@apache.org> wrote:
> 
> Hi everyone,
> 
> 
> I want to start a discussion, with the aim of an RFC, around implementing
> Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
> above CouchDB map/reduce indexes, but with FoundationDB we can make them
> separate indexes in FoundationDB. This gives us the possibility of being
> able to update the indexes in the same transaction that a document is being
> saved in. Later we can look at adding specific mango like covering indexes.
> 
> 
> Lets dive into the data model. Currently a user defines an index like this:
> 
> 
> {
> 
>  name: ‘view-name’ - optional will be auto-generated
> 
>  index: {
> 
>    fields: [‘fieldA’, ‘fieldB’]
> 
>  },
> 
>  partial_filter_selector {} - optional
> 
> }
> 
> 
> For query planning we need to be able to access the list of available
> indexes. So we would have a index_definitions subspace with the following
> content:
> 
> 
> (<fieldname1>, …<rest of fields>) = (<index_name>,
> <partial_filter_selector>)
> 
> 
> Otherwise we could just store the index definitions as:
> 
> (index_name) = ((fields), partial_filter_selector).
> 
> 
> At this stage, I can’t think of a fancy way of storing the index
> definitions so that when we need to select an index for a query there would
> be a fast way to only fetch a subset of the indexes. I think the best is to
> rather fetch them all like we currently do and process them. However, we
> can look at caching these index definitions in the application layer, and
> using FoundationDB watches[0] to notify us when a definition has changed so
> we can update the cached definitions.
> 
> 
> Then each index definition will have its own dedicated subspace for the
> actual built index key/values. Keys in this subspace would be the fields
> defined in the index with the doc id at the end of the tuple, e.g for an
> index with fields name and age, it would be:
> 
> 
> (“john”, 40, “doc-id-1) = null
> 
> (“mary”, 30, “doc-id-2) = null
> 
> 
> This follows the same key format that document layer[1] does for its
> indexes. One point to make here is that the doc id is kept in the key part
> so that we can avoid duplicate keys.
> 
> 
> Then in terms of sorting the keys, current CouchDB uses ICU to sort all
> secondary indexes. We would need to use ICU to sort the indexes for FDB but
> we would have to do it differently. We will not be able to use ICU
> collation operations directly, instead, we are going to have to look at
> using ICU’s sort key[1] to generate a sort key ahead of time. At the same
> time we need to look at creating binary encoding to capture the way that
> CouchDB currently sorts object, array and numbers. This would most likely
> be a sort of key prefix that we add to each key field along with the sort
> key generated from ICU.
> 
> 
> In terms of keeping mango indexes up to date, we should be able to update
> all existing indexes in the same transaction as a document is
> updated/created, this means we shouldn’t have to have any background
> process keeping mango indexes updated. Though I imagine we going to have to
> look at a background process that does update and build new indexes on an
> existing index. We will have to do some decent performance testing around
> this to determine the best solution, but looking at document layer they
> seem to recommend updating the indexes in the transaction rather than in a
> background process.
> 
> 
> In the future, we could look at using the value space to store covering
> indexed or materialized views. That way we would not need to always read
> from the by_id when quering with Mango. Which would be a nice performance
> improvement.
> 
> 
> 
> Please let me know any thoughts, improvements, suggestions or questions
> around this.
> 
> 
> 
> [0] https://apple.github.io/foundationdb/features.html#watches
> 
> [1] https://github.com/FoundationDB/fdb-document-layer
> 
> [2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features