Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2016/08/11 07:29:51 UTC

Oak Indexing. Was Re: Property index replacement / evolution

Hi Michael,
It's probably rude of me to reply to this as you addressed it to Davide not
me.
I have changed the subject line as you said...
"This discussion is only about balancing property indexes vs Lucene indexes"

although, the content remains relevant to the original thread.

On 10 August 2016 at 20:44, Michael Marth <mm...@adobe.com> wrote:

> Hi Davide,
>
> My POV:
> Storing the indexes within the repo itself allows for operational
> simplicity. In particular: it allows to create a backup of the persistence
> (including the indexes) in a consistent form - without having to stop
> writes to the repo. In JR2 it is not possible to create a consistent backup
> of nodes and indexes without stopping writes to the repo (to my knowledge
> at least).
> You also extend your question to “what would happen if separate cluster
> nodes would maintain their own indexes (on local/private disc)?”. Two
> things:
> 1. Each cluster node would have to process full text extraction - i.e.
> Computationally expensive
>

Full text extraction should be separated from indexing: since DS blobs are
immutable, so is their full text. There is code to do this in the Oak
indexer, but it's not used to write to the DS at present. It should be done
in a job, distributed to all nodes, and run only once per item. Full text
extraction is hugely expensive.
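
To make the "extract once per immutable blob" idea concrete, here is a
minimal sketch, assuming a hypothetical cache keyed by the blob id (the
class and the cache are illustrative, not existing Oak APIs; only the Tika
call is real):

import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.tika.Tika;

// Illustrative only: extract full text at most once per immutable blob.
// The in-memory map stands in for a shared store (e.g. extracted text
// written back next to the blob) that a distributed job would populate,
// rather than extracting inline with indexing.
public class OncePerBlobTextExtractor {

    private final Tika tika = new Tika();
    private final Map<String, String> extractedText = new ConcurrentHashMap<>();

    public String fullTextFor(String blobId, InputStream data) throws Exception {
        String cached = extractedText.get(blobId);
        if (cached != null) {
            return cached;                      // blob is immutable, so its text never changes
        }
        String text = tika.parseToString(data); // the expensive part
        extractedText.put(blobId, text);
        return text;
    }
}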


> 2. Really bad: if a new node joins the cluster then that node would have
> to re-index the full repo.
>

Building the same index on every node doesn't scale for the reasons you
point out, and eventually hits a brick wall:
http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/codecs/lucene60/package-summary.html#Limitations
(document IDs are Int32, so each index is limited to roughly 2.1 billion
documents). One of the reasons for the Hybrid approach was that the number
of Oak documents in some repositories will exceed that limit.


>
> IMHO the current design (to store indexes in the repo itself) is totally
> the right approach.
>

I am reticent to disagree with you, but I feel I have no option, based on
research, history and first hand experience over the past 10 years.

Storing indexes in a repo is what Compass did from 2004 onwards, until
after the third version they gave up trying to build a scalable and near
real time search engine. Version 4 was a rewrite that became ElasticSearch
0.4.0. The history is documented here
https://en.wikipedia.org/wiki/Elasticsearch and was presented at Berlin
Buzzwords in 2010 with a detailed description of why each approach fails. I
have shared this information before. I am not sharing it to confront. I am
sharing it because it pains me to see Oak repeating history. I don't feel I
can stand by and watch in silence.

If Oak does not want to use ES as a library, then learn from the history, as
it addresses your concerns (1, 2, + brick wall) and those of Davide, and
satisfies many of the other issues, potentially eliminating property
indexes completely. It will, however, only ever be as NRT as the root
document commit period (1s), well above the 100ms data latency a model like
the one used by ES delivers under production load.

 IMHO, the Hybrid approach being proposed is a step along the same history
that Compass started treading in 2004. It is an innovative solution to a
constrained problem space.

Sorry if I sound like a broken record. I did exactly what Oak has done/is
doing, from 2006 onwards, but without a vast user base I was able to be more
agile.

Apache is about doing, not standing by, about fact not fiction, about
evidence and reasoned argument. If there is any interest, I have an Oak PoC
somewhere that ports the Lucene index plugin to use embedded ES instances,
1 per VM as an embedded ES cluster. It's not complete, as I gave up on it
when I realised data latency would be fixed by the Oak root document sync. My
interest was proper real-time indexing over the cluster.

Best Regards
Ian



> This discussion is only about balancing property indexes vs Lucene indexes
>
> Michael
>
>
>
>
> On 10/08/16 15:11, "Davide Giannella" <da...@apache.org> wrote:
>
> >On 09/08/2016 13:18, Ian Boston wrote:
> >> Alternatively, move the indexes so that a sync property index update
> >> doesn't perform a conditional change to the  global root document ? ( A
> new
> >> thread would be required to discuss this if worth talking about.)
> >
> >I'm stubborn and maybe even slow in learning, but again I ask myself:
> >why are we storing the indexes in the repository itself?
> >
> >I was not part of the original discussion around this; but frankly I
> >would have expected to have the indexes stored separately from the
> >repository. Let's say on the file system. Something like JR2 where it
> >was even possible to delete a directory and all the indexes were
> >re-generated from scratch.
> >
> >What do we lose if we were to move the indexes outside of the
> >repository? Which means each AEM node will have its own index(es).
> >
> >Cheers
> >Davide
> >
> >
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ard Schrijvers <a....@onehippo.com>.
Hey,

I've caught up with all mails in this thread, and would like to make
some general remarks. Admittedly, I do not yet work with oak and do
not yet know much about its indexing strategy/implementation,  but I
do know quite some details about the old JR2 index implementation,
about ES, about Lucene and about JCR in general.

I do agree with Ian that in the past every attempt to store the Lucene
index not near the code has failed. I think he forgot to mention
Lucandra :-). About 8 years ago Simon Willnauer was pretty explicit in
a talk with me about it: Bring the computation to the data with
Lucene, every other attempt will fail. I also talked with Jukka (5
years ago?) when he explained the oak indexing setup to me. I asked him
about how this would work, because bringing the data to the code
(during query execution) doesn't perform. Obviously Jukka was aware.

AFAIU, oak does have the Lucene segments from the storage (mongoDB)
locally. So it doesn't bring the data to the computation during query
execution; the Lucene data is local. In this sense, I think Ian's fear
about the Lucene index not being local is not correct (it is confusing:
it is stored externally, but when used it is copied locally... that is at
least what I understand).

With respect to using ES (and sharding) embedded or not in oak, I
consider the crux of the requirement being well explained by Chetan:

QR1. Application Query - These mostly involve some property
restrictions and are invoked by code itself to perform some operation.
<snip/>

QR2. User provided query - These queries would consist of both or
either of property restriction and fulltext constraints. <snip/>

With ES (with sharding), the QR1 type queries will never be fast
enough. We (Hippo) have code that can result in hundreds of queries
for a single request (for example for every document in a folder show
the translations of the document). In JR2, simple queries return
within 1 ms (and faster). You'll never be able to deliver this with ES
(clustered with sharding). The network latency alone is orders of magnitude higher.
Obviously I do *not* claim that ES has a worse Lucene implementation
than JR2 has. Quite surely the opposite, but the implementation serves
a very different purpose. It is like comparing a ConcurrentHashMap as
cache with a Terracotta cluster wide cache. Some use cases require the
one, some the other.

Also, what I did not see mentioned in this thread is
authorization (aka fine-grained ACLs). If you include the ACL
requirements, using an ES index (with sharding) becomes even more
problematic: how many query results do you fetch if you don't know how
many are allowed to be read? What if you want to return 1,000 hits,
but the JCR user has read access to only about 1% of them? Fetch 100,000 hits
from ES? And then 100,000 more if you did not find 1,000 authorized?

In JR2, at Hippo we combine with every query also a 'Lucene
authorization query'. This authorization query easily becomes a nested
boolean query with hundreds of boolean queries nested. The only way
this performs is using a caching Lucene filter [1]. I doubt if this is
possible with ES (perhaps with a custom endpoint and some token that
maps to an authorization query). Either way, long story short, I think
ES serves different use cases much much better than JR2 or oak will
ever be able to do. At Hippo we store, for example, every visitor's
page requests, including metadata, in ES to support trend analysis. ES
is perfect for this. I'd never want to store this in a hierarchical
content structure, with versioning, with eventual consistency, with
ACL support, with support for moving of subtrees, etc : But it is
these features that imho make ES in turn unsuited for supporting the
QR1 type of queries for JCR.
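
For readers unfamiliar with the caching-filter pattern described above, a
rough sketch of the general idea using stock Lucene 4.x classes (Hippo's
actual implementation is the CachingMultiReaderQueryFilter linked in [1];
the field and group names below are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class AuthorizedSearch {

    // Potentially hundreds of clauses describing what this session may read;
    // wrapped in a caching filter so the expensive evaluation happens once
    // per index reader, not once per query.
    private final Filter aclFilter;

    public AuthorizedSearch(BooleanQuery authorizationQuery) {
        this.aclFilter = new CachingWrapperFilter(new QueryWrapperFilter(authorizationQuery));
    }

    public TopDocs search(IndexReader reader, Query userQuery, int n) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Only hits matching both the user query and the ACL filter come back,
        // so there is no over-fetching of unauthorized results.
        return searcher.search(userQuery, aclFilter, n);
    }

    static BooleanQuery exampleAuthorizationQuery() {
        BooleanQuery acl = new BooleanQuery();
        acl.add(new TermQuery(new Term("readGroup", "editors")), Occur.SHOULD);
        acl.add(new TermQuery(new Term("readGroup", "everyone")), Occur.SHOULD);
        return acl;
    }
}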

As far as I can judge, the hybrid approach suggested by Chetan makes sense to
me. Next to that, ES support for QR2-type queries makes
sense (possibly with a delay, because they are less application-driven
queries). However, I consider ES support more of an integration
feature, not a core oak requirement.

Some general other remarks:

Some mails argued that text extraction is expensive, and that this
justifies having the index in the database. I don't fully agree. Text
extraction is only expensive for (some) binaries, most notably PDFs.
At Hippo we therefore store a sibling of jcr:data, namely the
binary 'hippo:text'. If hippo:text is present, we do not extract from
jcr:data but use the hippo:text binary (which is the extracted text
and thus only needs to be extracted once). With this kind of approach,
text extraction also happens only once, and it does not require an index
to be stored in the repository.
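
A rough sketch of that lookup in plain JCR (node and property names follow
the description above; the extraction itself and writing hippo:text back
are left out):

import java.io.InputStream;

import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.RepositoryException;

public final class ExtractedTextLookup {

    // Returns the stream to index for a jcr:content-like node: if the
    // pre-extracted 'hippo:text' binary is present it is used, otherwise
    // the caller falls back to extracting from 'jcr:data' (and would
    // typically store the result back as 'hippo:text').
    public static InputStream fullTextStream(Node contentNode) throws RepositoryException {
        if (contentNode.hasProperty("hippo:text")) {
            Binary text = contentNode.getProperty("hippo:text").getBinary();
            return text.getStream();            // extraction already done once
        }
        Binary raw = contentNode.getProperty("jcr:data").getBinary();
        return raw.getStream();                 // expensive extraction still needed
    }
}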

Some mention was made that in JR2, when a cluster node crashes, its
index might be corrupt. Perhaps when the node crashes because the disk is
full, but otherwise the index is in general not corrupt: there
is a redo.log file on the FS which contains the jcr nodes that are
indexed in the 'in memory index' but not yet flushed to disk.

Some remark was made about bringing up a new cluster node requiring a
(re)build of the entire index. This is partially true. From a shut-down
cluster node, you can copy the index and make sure that when the new
cluster node starts up, its revision number is set equal to the
revision number of the cluster node at the time the index was copied.
For a local POC (because we want to more easily scale out in the
cloud), I already have it working to create Lucene snapshots of a
running JR2 repository. It's not that hard if you use an existing multi
reader in JR (that one also contains the in-memory index) and flush
it to the file system. What we then also flush is the revision id of
the cluster node at that time. A new node can then start up with the
exported index and set the revision correctly.

Regards Ard

[1] https://code.onehippo.org/cms-community/hippo-repository/blob/master/engine/src/main/java/org/hippoecm/repository/query/lucene/util/CachingMultiReaderQueryFilter.java

On Fri, Aug 12, 2016 at 8:23 AM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 7:33 PM, Ian Boston <ie...@tfd.co.uk> wrote:
>> That probably means the queue should only
>> contain pointers to Documents and only index the Document as retrieved. I
>> dont know if that can ever work.
>
> That would not work as what document look like across cluster node
> would vary and what is to be considered valid entries is also not
> defined at that level
>
>> Run a single thread on the master, that indexes into a co-located ES
> cluster.
>
> While keeping things simple that looks like the safe way
>
>> BTW, how does Hybrid manage to parallelise the indexing and maintain
> consistency ?
>
> Hybrid indexes does not affect async indexes. Under this each cluster
> node maintain there local indexes which only contain local changes
> [1]. These indexes are not aware about similar index on other cluster
> node. Further the indexes are supposed to only contain entry from last
> async indexing cycle. Older entries are purged [2]. The query would
> then be consulting both indexes (IndexSearcher backed via MultiReader
> , 1 reader from async index and 1 (or 2) from local index).
>
> Also note that QueryEngine would enforce and reevaluate the property
> restrictions. So even if index has an entry based on old state QE
> would filter it out if it does not match the criteria per current
> repository state. So aim here is to have index provide a super set of
> result set.
>
> In all this async index logic remains same (single threaded) and based
> on diff. So it would remain consistent with repository state
>
> Chetan Mehrotra
> [1] They might also contain entries which are determined based on
> external diff. Read [3] for details
> [2] Purge here is done by maintaining different local index copy for
> each async indexing cycle. At max only 2 indexes are retained and
> older indexes are removed. This keeps index small
> [3] https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340



-- 
Hippo Netherlands, Oosteinde 11, 1017 WT Amsterdam, Netherlands
Hippo USA, Inc. 71 Summer Street, 2nd Floor Boston, MA 02110, United
states of America.

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Thu, Aug 11, 2016 at 7:33 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> That probably means the queue should only
> contain pointers to Documents and only index the Document as retrieved. I
> dont know if that can ever work.

That would not work, as what a document looks like varies across
cluster nodes, and what is to be considered a valid entry is also not
defined at that level.

> Run a single thread on the master, that indexes into a co-located ES
cluster.

While keeping things simple that looks like the safe way

> BTW, how does Hybrid manage to parallelise the indexing and maintain
consistency ?

Hybrid indexing does not affect async indexes. Under this approach each cluster
node maintains its own local indexes which only contain local changes
[1]. These indexes are not aware of the similar index on other cluster
nodes. Further, the indexes are supposed to only contain entries from the last
async indexing cycle; older entries are purged [2]. The query would
then consult both indexes (an IndexSearcher backed by a MultiReader:
1 reader from the async index and 1 (or 2) from the local index).
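
At the Lucene level (4.x, as bundled by oak-lucene) that combination is
roughly the following sketch; the Directory arguments are placeholders for
the async index copy and the per-cycle local indexes:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Sketch: one reader over the checkpoint-based async index plus one or two
// readers over the small local indexes holding changes since the last
// async indexing cycle.
public class HybridReaderSketch {

    public static IndexSearcher searcher(Directory asyncIndex,
                                         Directory... localIndexes) throws Exception {
        IndexReader[] readers = new IndexReader[localIndexes.length + 1];
        readers[0] = DirectoryReader.open(asyncIndex);
        for (int i = 0; i < localIndexes.length; i++) {
            readers[i + 1] = DirectoryReader.open(localIndexes[i]);
        }
        return new IndexSearcher(new MultiReader(readers));
    }
}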

Also note that the QueryEngine would enforce and re-evaluate the property
restrictions. So even if the index has an entry based on old state, the QE
would filter it out if it does not match the criteria per the current
repository state. So the aim here is to have the index provide a superset of
the result set.

In all this the async index logic remains the same (single threaded) and based
on diffs, so it would remain consistent with the repository state.

Chetan Mehrotra
[1] They might also contain entries which are determined based on an
external diff. Read [3] for details.
[2] Purging here is done by maintaining a different local index copy for
each async indexing cycle. At most only 2 indexes are retained and
older indexes are removed. This keeps the index small.
[3] https://issues.apache.org/jira/browse/OAK-4412?focusedCommentId=15405340&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15405340

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 11 August 2016 at 13:03, Chetan Mehrotra <ch...@gmail.com>
wrote:

> On Thu, Aug 11, 2016 at 5:19 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> > correct.
> > Documents are sharded by ID so all updates hit the same shard.
> > That may result in network traffic if the shard is not local.
>
> Focusing on ordering part as that is the most critical aspect compared
> to other. (Backup and Restore with sharded index is a separate problem
> to discuss but later)
>
> So even if there is a single master for a given path how would it
> order the changes. Given local changes only give partial view of end
> state.
>

In theory, the index should be driven by the eventual consistency of the
source repository, eventually reaching the same consistent state, and
updating on each state change. That probably means the queue should only
contain pointers to Documents and only index the Document as retrieved. I
don't know if that can ever work.


>
> Also in such a setup would each query need to consider multiple shards
> for final result or each node would "eventually" sync index changes
> from other nodes (complete replication) and query would only use local
> index
>
> For me ensuring consistency in how index updates are sent to ES wrt
> Oak view of changes was kind of blocking feature to enable
> parallelization of indexing process. It needs to be ensured that for
> concurrent commit end result in index is in sync with repository
> state.
>

agreed, me also on various attempts.


>
> Current single thread async index update avoid all such race condition.
>

Perhaps this is the "root" of the problem. The only way to index Oak
consistently is with a single thread globally, as is done now.

That's still possible with ES.
Run a single thread on the master, that indexes into a co-located ES
cluster.
If the full text extraction is distributed, then the master only needs the
resources to write the local shard.
It's not as good as parallelising the queue, but given the structure of Oak
it might be the only way.

Even so, future revisions will be in the index long before Oak has synced
the root document.

The current implementation doesn't have to think about this, as the indexing
is single threaded globally *and* each segment update is committed first by a
hard Lucene commit and second by a root document sync, guaranteeing the
sequential update nature.

BTW, how does Hybrid manage to parallelise the indexing and maintain
consistency ?

Best Regards
Ian



>
> Chetan Mehrotra
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Thu, Aug 11, 2016 at 5:19 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> correct.
> Documents are sharded by ID so all updates hit the same shard.
> That may result in network traffic if the shard is not local.

Focusing on the ordering part, as that is the most critical aspect compared
to the others. (Backup and restore with a sharded index is a separate problem
to discuss, but later.)

So even if there is a single master for a given path, how would it
order the changes, given that local changes only give a partial view of the
end state?

Also, in such a setup, would each query need to consider multiple shards
for the final result, or would each node "eventually" sync index changes
from other nodes (complete replication) so that a query would only use the
local index?

For me, ensuring consistency in how index updates are sent to ES wrt
Oak's view of changes was kind of a blocking feature for enabling
parallelization of the indexing process. It needs to be ensured that for
concurrent commits the end result in the index is in sync with the repository
state.

The current single-threaded async index update avoids all such race conditions.

Chetan Mehrotra

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
On 11 August 2016 at 11:10, Chetan Mehrotra <ch...@gmail.com>
wrote:

> On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> > Both Solr Cloud and ES address this by sharding and
> > replicating the indexes, so that all commits are soft, instant and real
> > time. That introduces problems.
> ...
> > Both Solr Cloud and ES address this by sharding and
> > replicating the indexes, so that all commits are soft, instant and real
> > time.
>
> This would really be useful. However I have couple of aspects to clear
>
> Index Update Guarantee
> --------------------------------
>
> Lets say if commit succeeds and then we update the index and index
> update fails for some reason. Then would that update be missed or
> there can be some mechanism to recover. I am not very sure about WAL
> here that may be the answer here but still confirming.
>

For ES (I don't know how the Solr Cloud WAL behaves):
The update isn't acknowledged until it's written to the WAL, so if something
fails before that, it's up to how the queue of updates is managed, which
is client side.
If it's written to the WAL, whatever happens it will be indexed eventually,
provided the WAL is available. Think of the WAL as equivalent to the Oak
journal, IIUC. The WAL is present on all replicas, so provided 1 replica of
the shard is available, no data is lost.




>
> In Oak with the way async index update works based on checkpoint its
> ensured that index would "eventually" contain the right data and no
> update would be lost. if there is a failure in index update then that
> would fail and next cycle would start again from same base state
>

Sounds like the same level of guarantee, depending on how the client side is
implemented. Typically I didn't bother with a queue between the application
and the ES client because the ES client was so fast.


>
> Order of index update
> -----------------------------
>
> Lets say I have 2 cluster nodes where same node is being performed
>
> Original state /a {x:1}
>
> Cluster Node N1 - /a {x:1, y:2}
> Cluster Node N2 - /a {x:1, z:3}
>
> End State /a {x:1, y:2, z:3}
>
> At Oak level both the commits would succeed as there is no conflict.
> However N1 and N2 would not be seeing each other updates immediately
> and that would depend on background read. So in this case how would
> index update would look like.
>
> 1. Would index update for specific paths go to some master which would
> order the update
>

Correct.
Documents are sharded by ID so all updates hit the same shard.
That may result in network traffic if the shard is not local.



> 2. Or it would end up with with either of {x:1, y:2} or {x:1, z:3}
>
> Here current async index update logic ensures that it sees the
> eventually expected order of changes and hence would be consistent
> with repository state.


> Backup and Restore
> ---------------------------
>
> Would the backup now involve backup of ES index files from each
> cluster node. Or assuming full replication it would involve backup of
> files from any one of the nodes. Would the back be in sync with last
> changes done in repository (assuming sudden shutdown where changes got
> committed to repository but not yet to any index)
>
> Here current approach of storing index files as part of MVCC storage
> ensures that index state is consistent to some "checkpointed" state in
> repository. And post restart it would eventually catch up with the
> current repository state and hence would not require complete rebuild
> of index in case of unclean shutdowns
>

If the revision is present in the document, then I assume it can be
filtered at query time.
However, there may be problems here, as one might have to find some way of
indexing the revision history of a document... like the format in
MongoDB... I did wonder if a better solution was to use ES as the primary
storage, as then all the property indexes would be present by default with no
need for any Lucene index plugin... but I stopped thinking about that
with the 1s root document sync, as my interest was real time.

Best Regards
Ian


>
>
> Chetan Mehrotra
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 11 August 2016 at 11:43, Chetan Mehrotra <ch...@gmail.com>
wrote:

> > https://github.com/ieb/oak-es
>
> btw this looks interesting and something we can build upon. This can
> benefit from a refactoring of LuceneIndexEditor to separate the logic
> of interpreting the Oak indexing config during editor invocation from
> constructing Lucene document. If we decouple that logic then it would
> be possible to plugin in a ES Editor which just converts those
> properties per ES requirement. Hence it gets all benefits of
> aggregation, relative property implementation etc (which is very Oak
> specific stuff). This effort has been discussed but we never got time
> to do that so far. Something on the lines which you are doing at [2]
>
> Another approach - With recent refactoring done in  OAK-4566 my plan
> was to plugin a ES based LuceneIndexWriter (ignore the name for now!)
> and convert the Lucene Document to some ES Document counterpart. And
> then provide just the query implementation. This would also allow to
> reuse most of testcase we have in oak-lucene
>

You know this code far better than I do.
Given this was a PoC I didn't want to try and refactor or extend the Lucene
code but felt it was better to work with the plugin API, hence the
completely separate bundle. If you think it would work better to plug in an
ES-based LuceneIndexWriter and provide a query implementation, then that
might be less effort.

One point: the ES client manages its update queue well and is async, so it
often does not need the plumbing that would be required for a pure Lucene
implementation. You might be able to simplify the impl.
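
For illustration, the kind of client-side plumbing meant here, using the
ES 2.x-era BulkProcessor (the index and type names are placeholders):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class EsIndexQueue {

    private final BulkProcessor processor;

    public EsIndexQueue(Client client) {
        // The processor batches and flushes asynchronously; callers just
        // add() documents, so no separate application-side queue is needed.
        this.processor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) { }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {
                // a failed batch: log / retry as appropriate
            }
        })
        .setBulkActions(500)                              // flush every 500 docs
        .setFlushInterval(TimeValue.timeValueMillis(100)) // or every 100 ms
        .setConcurrentRequests(1)
        .build();
    }

    public void index(String path, String json) {
        // "oak" / "node" are placeholder index and type names
        processor.add(new IndexRequest("oak", "node", path).source(json));
    }
}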

Best Regards
Ian



>
> Chetan Mehrotra
> [2] https://github.com/ieb/oak-es/blob/master/src/main/java/org/
> apache/jackrabbit/oak/plusing/index/es/index/take2/
> ESIndexEditorContext.java
>
> On Thu, Aug 11, 2016 at 3:40 PM, Chetan Mehrotra
> <ch...@gmail.com> wrote:
> > On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> >> Both Solr Cloud and ES address this by sharding and
> >> replicating the indexes, so that all commits are soft, instant and real
> >> time. That introduces problems.
> > ...
> >> Both Solr Cloud and ES address this by sharding and
> >> replicating the indexes, so that all commits are soft, instant and real
> >> time.
> >
> > This would really be useful. However I have couple of aspects to clear
> >
> > Index Update Guarantee
> > --------------------------------
> >
> > Lets say if commit succeeds and then we update the index and index
> > update fails for some reason. Then would that update be missed or
> > there can be some mechanism to recover. I am not very sure about WAL
> > here that may be the answer here but still confirming.
> >
> > In Oak with the way async index update works based on checkpoint its
> > ensured that index would "eventually" contain the right data and no
> > update would be lost. if there is a failure in index update then that
> > would fail and next cycle would start again from same base state
> >
> > Order of index update
> > -----------------------------
> >
> > Lets say I have 2 cluster nodes where same node is being performed
> >
> > Original state /a {x:1}
> >
> > Cluster Node N1 - /a {x:1, y:2}
> > Cluster Node N2 - /a {x:1, z:3}
> >
> > End State /a {x:1, y:2, z:3}
> >
> > At Oak level both the commits would succeed as there is no conflict.
> > However N1 and N2 would not be seeing each other updates immediately
> > and that would depend on background read. So in this case how would
> > index update would look like.
> >
> > 1. Would index update for specific paths go to some master which would
> > order the update
> > 2. Or it would end up with with either of {x:1, y:2} or {x:1, z:3}
> >
> > Here current async index update logic ensures that it sees the
> > eventually expected order of changes and hence would be consistent
> > with repository state.
> >
> > Backup and Restore
> > ---------------------------
> >
> > Would the backup now involve backup of ES index files from each
> > cluster node. Or assuming full replication it would involve backup of
> > files from any one of the nodes. Would the back be in sync with last
> > changes done in repository (assuming sudden shutdown where changes got
> > committed to repository but not yet to any index)
> >
> > Here current approach of storing index files as part of MVCC storage
> > ensures that index state is consistent to some "checkpointed" state in
> > repository. And post restart it would eventually catch up with the
> > current repository state and hence would not require complete rebuild
> > of index in case of unclean shutdowns
> >
> >
> > Chetan Mehrotra
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Torgeir Veimo <to...@gmail.com>.
ES 2.3.5 is currently on Lucene 5.5, while oak-lucene is at 4.7.1. Maybe
this would inspire upgrading oak-lucene as well, to avoid having multiple
different bundled Lucene versions?

https://issues.apache.org/jira/browse/OAK-3150

On 11 August 2016 at 20:43, Chetan Mehrotra <ch...@gmail.com>
wrote:

> > https://github.com/ieb/oak-es
>
> btw this looks interesting and something we can build upon. This can
> benefit from a refactoring of LuceneIndexEditor to separate the logic
> of interpreting the Oak indexing config during editor invocation from
> constructing Lucene document. If we decouple that logic then it would
> be possible to plugin in a ES Editor which just converts those
> properties per ES requirement. Hence it gets all benefits of
> aggregation, relative property implementation etc (which is very Oak
> specific stuff). This effort has been discussed but we never got time
> to do that so far. Something on the lines which you are doing at [2]
>
> Another approach - With recent refactoring done in  OAK-4566 my plan
> was to plugin a ES based LuceneIndexWriter (ignore the name for now!)
> and convert the Lucene Document to some ES Document counterpart. And
> then provide just the query implementation. This would also allow to
> reuse most of testcase we have in oak-lucene
>
> Chetan Mehrotra
> [2] https://github.com/ieb/oak-es/blob/master/src/main/java/org/
> apache/jackrabbit/oak/plusing/index/es/index/take2/
> ESIndexEditorContext.java
>
> On Thu, Aug 11, 2016 at 3:40 PM, Chetan Mehrotra
> <ch...@gmail.com> wrote:
> > On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> >> Both Solr Cloud and ES address this by sharding and
> >> replicating the indexes, so that all commits are soft, instant and real
> >> time. That introduces problems.
> > ...
> >> Both Solr Cloud and ES address this by sharding and
> >> replicating the indexes, so that all commits are soft, instant and real
> >> time.
> >
> > This would really be useful. However I have couple of aspects to clear
> >
> > Index Update Guarantee
> > --------------------------------
> >
> > Lets say if commit succeeds and then we update the index and index
> > update fails for some reason. Then would that update be missed or
> > there can be some mechanism to recover. I am not very sure about WAL
> > here that may be the answer here but still confirming.
> >
> > In Oak with the way async index update works based on checkpoint its
> > ensured that index would "eventually" contain the right data and no
> > update would be lost. if there is a failure in index update then that
> > would fail and next cycle would start again from same base state
> >
> > Order of index update
> > -----------------------------
> >
> > Lets say I have 2 cluster nodes where same node is being performed
> >
> > Original state /a {x:1}
> >
> > Cluster Node N1 - /a {x:1, y:2}
> > Cluster Node N2 - /a {x:1, z:3}
> >
> > End State /a {x:1, y:2, z:3}
> >
> > At Oak level both the commits would succeed as there is no conflict.
> > However N1 and N2 would not be seeing each other updates immediately
> > and that would depend on background read. So in this case how would
> > index update would look like.
> >
> > 1. Would index update for specific paths go to some master which would
> > order the update
> > 2. Or it would end up with with either of {x:1, y:2} or {x:1, z:3}
> >
> > Here current async index update logic ensures that it sees the
> > eventually expected order of changes and hence would be consistent
> > with repository state.
> >
> > Backup and Restore
> > ---------------------------
> >
> > Would the backup now involve backup of ES index files from each
> > cluster node. Or assuming full replication it would involve backup of
> > files from any one of the nodes. Would the back be in sync with last
> > changes done in repository (assuming sudden shutdown where changes got
> > committed to repository but not yet to any index)
> >
> > Here current approach of storing index files as part of MVCC storage
> > ensures that index state is consistent to some "checkpointed" state in
> > repository. And post restart it would eventually catch up with the
> > current repository state and hence would not require complete rebuild
> > of index in case of unclean shutdowns
> >
> >
> > Chetan Mehrotra
>



-- 
-Tor

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Chetan Mehrotra <ch...@gmail.com>.
> https://github.com/ieb/oak-es

btw this looks interesting and something we can build upon. This can
benefit from a refactoring of LuceneIndexEditor to separate the logic
of interpreting the Oak indexing config during editor invocation from
constructing the Lucene document. If we decouple that logic then it would
be possible to plug in an ES editor which just converts those
properties per ES requirements. Hence it gets all the benefits of
aggregation, the relative property implementation etc. (which is very Oak
specific stuff). This effort has been discussed but we never got time
to do it so far. Something along the lines of what you are doing at [2]

Another approach - with the recent refactoring done in OAK-4566 my plan
was to plug in an ES-based LuceneIndexWriter (ignore the name for now!)
and convert the Lucene Document to some ES Document counterpart, and
then provide just the query implementation. This would also allow reusing
most of the test cases we have in oak-lucene.

Chetan Mehrotra
[2] https://github.com/ieb/oak-es/blob/master/src/main/java/org/apache/jackrabbit/oak/plusing/index/es/index/take2/ESIndexEditorContext.java

On Thu, Aug 11, 2016 at 3:40 PM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <ie...@tfd.co.uk> wrote:
>> Both Solr Cloud and ES address this by sharding and
>> replicating the indexes, so that all commits are soft, instant and real
>> time. That introduces problems.
> ...
>> Both Solr Cloud and ES address this by sharding and
>> replicating the indexes, so that all commits are soft, instant and real
>> time.
>
> This would really be useful. However I have couple of aspects to clear
>
> Index Update Guarantee
> --------------------------------
>
> Lets say if commit succeeds and then we update the index and index
> update fails for some reason. Then would that update be missed or
> there can be some mechanism to recover. I am not very sure about WAL
> here that may be the answer here but still confirming.
>
> In Oak with the way async index update works based on checkpoint its
> ensured that index would "eventually" contain the right data and no
> update would be lost. if there is a failure in index update then that
> would fail and next cycle would start again from same base state
>
> Order of index update
> -----------------------------
>
> Lets say I have 2 cluster nodes where same node is being performed
>
> Original state /a {x:1}
>
> Cluster Node N1 - /a {x:1, y:2}
> Cluster Node N2 - /a {x:1, z:3}
>
> End State /a {x:1, y:2, z:3}
>
> At Oak level both the commits would succeed as there is no conflict.
> However N1 and N2 would not be seeing each other updates immediately
> and that would depend on background read. So in this case how would
> index update would look like.
>
> 1. Would index update for specific paths go to some master which would
> order the update
> 2. Or it would end up with with either of {x:1, y:2} or {x:1, z:3}
>
> Here current async index update logic ensures that it sees the
> eventually expected order of changes and hence would be consistent
> with repository state.
>
> Backup and Restore
> ---------------------------
>
> Would the backup now involve backup of ES index files from each
> cluster node. Or assuming full replication it would involve backup of
> files from any one of the nodes. Would the back be in sync with last
> changes done in repository (assuming sudden shutdown where changes got
> committed to repository but not yet to any index)
>
> Here current approach of storing index files as part of MVCC storage
> ensures that index state is consistent to some "checkpointed" state in
> repository. And post restart it would eventually catch up with the
> current repository state and hence would not require complete rebuild
> of index in case of unclean shutdowns
>
>
> Chetan Mehrotra

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time. That introduces problems.
...
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time.

This would really be useful. However, I have a couple of aspects to clarify.

Index Update Guarantee
--------------------------------

Let's say a commit succeeds, we then update the index, and the index
update fails for some reason. Would that update be missed, or
can there be some mechanism to recover? I am not very sure about the WAL
here; it may be the answer, but I am still confirming.

In Oak, with the way the async index update works based on checkpoints, it is
ensured that the index would "eventually" contain the right data and no
update would be lost. If there is a failure in an index update then that
cycle would fail and the next cycle would start again from the same base state.

Order of index update
-----------------------------

Let's say I have 2 cluster nodes where an update to the same node is being performed:

Original state /a {x:1}

Cluster Node N1 - /a {x:1, y:2}
Cluster Node N2 - /a {x:1, z:3}

End State /a {x:1, y:2, z:3}

At the Oak level both commits would succeed as there is no conflict.
However, N1 and N2 would not see each other's updates immediately;
that depends on the background read. So in this case what would the
index update look like?

1. Would the index update for a specific path go to some master which would
order the updates?
2. Or would it end up with either of {x:1, y:2} or {x:1, z:3}?

Here the current async index update logic ensures that it sees the
eventually expected order of changes and hence would be consistent
with the repository state.

Backup and Restore
---------------------------

Would the backup now involve backing up ES index files from each
cluster node? Or, assuming full replication, would it involve backing up
files from any one of the nodes? Would the backup be in sync with the last
changes done in the repository (assuming a sudden shutdown where changes got
committed to the repository but not yet to any index)?

Here the current approach of storing index files as part of MVCC storage
ensures that the index state is consistent with some "checkpointed" state in the
repository. And post restart it would eventually catch up with the
current repository state and hence would not require a complete rebuild
of the index in the case of unclean shutdowns.


Chetan Mehrotra

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 11 August 2016 at 09:14, Michael Marth <mm...@adobe.com> wrote:

> To the second part of your mail:
>
> You touch on a couple of different topics:
> A) a design where indexes are not stored in the repo - but specifically a
> design that uses an external (shared) indexer like ES
> B) indexing latency
> C) embedded Lucene vs embedded ES
>
> I am not entirely sure what you are suggesting, TBH. On one hand you seem
> to suggest to use an external indexer like ES (external to the repo) on the
> other hand you mention embedded ES (which would lead to the repo and the
> embedded ES be colocated in the same JVM IIUC).
>

ES can be run colocated in the same JVM, colocated in a cluster of the same
JVMs, or in a dedicated cluster of JVMs. Which mode is used is determined by the URL
given to the Client. The Client provisions an ES server colocated in the same
JVM if instructed to do so.
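
For illustration, with the ES 2.x embedded-node API that colocation looks
roughly like this (the cluster name, paths and settings are placeholders,
not taken from the PoC; with an external URL a TransportClient would be
used instead):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class EmbeddedEs {

    // Start an ES node inside the current JVM and return a client to it.
    // With local(false) and the same cluster name on every repository JVM,
    // the embedded nodes discover each other and form one index cluster.
    public static Client startEmbedded(String clusterName, String dataDir) {
        Settings settings = Settings.settingsBuilder()
                .put("path.home", dataDir)     // required by ES 2.x
                .build();
        Node node = NodeBuilder.nodeBuilder()
                .clusterName(clusterName)
                .local(false)                  // join/form a cluster over the network
                .settings(settings)
                .node();                       // starts the node
        return node.client();
    }
}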



>
> In order to understand let me ask you this way:
> A) Oak does support an external Solr(Cloud) instance as a shared indexer
> that is external to the repo. Same could be done with an external ES (in
> fact, Tommaso has written a POC for that). On the very high level question
> whether the index and the repo should be separate: does this address your
> concern? (meaning: if we leave the relative benefits of Solr vs ES aside
> for a second)
>

The External Solr Cloud plugin does not have the same level of
functionality as the Lucene plugin from an indexing and querying pov. If it
did, and leaving the relative benefits of Solr vs ES aside for a second,
yes, it does answer my concern, but see my last point at the end.

If Tommaso has ported the Lucene plugin to ES, then has anyone tried to run
it co-located, or has there always been an assumption that it requires an
external ES cluster?


> B) indexing latency is a very different concern. In the current design it
> is relevant in deployments when there is high latency between the
> persistence and Oak. OAK-4638 and OAK-4412 are about addressing this. I am
> not sure in how far separating out the indexer from the repo persistence
> would help.
>

Oak commit -> Lucene query latency is currently governed by the following
steps:
1. Index on the Oak master.
2. Hard commit the segment.
3. Write the segment to the repository DS.
4. Perform an Oak commit.
5. Wait for the root document to sync (1s).
Then, on each slave, when the root document is synced:
6. Transfer the segment from the DS to a local copy.
7. Open the segment and merge it into the index reader (may require segment
merge and optimise).

If the index is local a soft commit can be performed with zero latency,
leaving only the root document sync to make the revision visible in the
query results. On TarMK this would not be an issue as there is only 1 JVM.
On DocumentMK, as you point out, every JVM having a complete copy of the
index is also wrong. Both Solr Cloud and ES address this by sharding and
replicating the indexes, so that all commits are soft, instant and real
time. That introduces problems. When a JVM leaves the cluster for
whatever reason, its local copy must have been replicated somewhere. When a
new JVM joins the cluster it must become part of a replica set. The soft
commit introduces problems, as if the JVM dies everything not hard
committed is lost. This is where Solr Cloud and ES diverge (IIUC). Both
have a write ahead log containing changes since the last hard commit.
Solr's WAL is low level segment information. ES WAL contains the update
documents. This gives ES the edge in terms of data latency, but that's
irrelevant here as the Oak root document sync imposes a 1s data latency.

Making the indexes local and adopting the proven pattern used by both Solr
Cloud and ES would address this.




> C) you seem to suggest that an embedded ES has advantages over the
> embedded Lucene. What I do not understand in that comparison where you
> would store the index. If locally we would be back to the JR2 design. If
> somewhere remote then why embed ES at all?
>

All the research and history says that a Lucene index must be stored
locally to the Index Update operation and the Index Search operation and
not shipped or shared between instances. For Solr Cloud that history
started with Nutch which did segment shipping from a master indexer. IIRC
it used NFS as its segment repository. Others used RDBMSs, some used scp.
Most were in the "nowhere near real time search, trivial scale" domain.

IIUC The JR2 design put the index locally but made each index a complete
copy of everything, with every JR2 instance performing indexing triggered
by events provided by JR2. That is not how ES or Solr Cloud work. JR2
provided no scalability of the index update operation and no resilience in
the event of a cluster change.

If a JR2 instance died, and it was soft committing, IIRC everything not
hard committed was lost. (no WAL, although perhaps the commit log provided
that)
Backing up the index was hard (but possible, repeated rsyncs worked under
light load)
New JVMs had to either be provisioned with a backup index or re-index from
scratch.
Many years ago I ran a JR2 cluster backed by MySQL on 16 machines
supporting 25K users, 5K concurrent, so, like many JR2 users, I experienced
these issues. Others tried to run larger clusters. I think 200K users with about
90K concurrent was the largest attempted, but it failed load tests. Most
switched to ES or SolrCloud.

When a new ES or Solr Cloud JVM is added it gets segments from other
replica set members. No full reindex. No downtime.
If the JVM crashed, it replays its WAL to recover to the last update from
the last hard commit.

AFAIK, although co-locating a standard Solr instance in a JVM is trivial,
co-locating Solr Cloud is not trivial, as it requires ZooKeeper and is
not fully self contained. ES on the other hand is fully self contained and
trivial to co-locate. There are some requirements for 100% HA. There must
be at least 3 JVM members to maintain 100% HA, and I think ES recommends
more if there is to be no reliance on backups at all.

I have vastly simplified the details of ES and Solr Cloud, partly to keep
the post short and partly because in the case of ES I haven't had to dig
too deeply to make ES work for production, because I have not seen
production problems. In the case of Solr Cloud I have read the docs and
some of the code but not run it in serious production. (TechOps teams
favoured ES over Solr Cloud, having had bad experiences with SolrCloud and
good with ES. I gave both options.)

There is one final point to make. If Oak is "chatty" talking to its Lucene
index, making 100s of sub-queries for each Oak query, then running a sharded
index may not be possible, as each query may require a round trip. I know
the ES query language supports complex multistep queries, I don't know
about Solr Cloud but I would expect/hope it does also.

Best Regards
Ian



>
> Thanks for clarifying
> Cheers
> Michael
>
>
>
> >I am reticent to disagree with you, but I feel I have no option, based on
> >research, history and first hand experience over the past 10 years.
> >
> >Storing indexes in a repo is what Compass did from 2004 onwards, until
> >after the third version they gave up trying to build a scalable and near
> >real time search engine. Version 4 was a rewrite that became ElasticSearch
> >0.4.0. The history is documented here
> >https://en.wikipedia.org/wiki/Elasticsearch and was presented at Berlin
> >Buzzwords in 2010 with a detailed description of why each approach fails. I
> >have shared this information before. I am not sharing it to confront. I am
> >sharing it because it pains me to see Oak repeating history. I don't feel
> I
> >can stand by and watch in silence.
> >
> >If Oak does not want to use ES as a library, then learn from the history
> as
> >it addresses your concerns (1,2, + brick wall) and those of Davide, and
> >satisfies the many of the other issues potentially eliminating property
> >indexes completely. It will however, only ever be as NRT as the root
> >document commit period (1s), well above the 100ms data latency a model
> like
> >used by ES delivers under production load.
> >
> > IMHO, the Hybrid approach being proposed is a step along the same history
> >that Compass started treading in 2004. It is an innovative solution to a
> >constrained problem space.
> >
> >Sorry if I sound like a broken record. I did exactly what Oak has done/is
> >doing in 2006 onwards but without a vast user base was able to be more
> >agile.
> >
> >Apache is about doing, not standing by, about fact not fiction, about
> >evidence and reasoned argument. If there is any interest, I have an Oak
> PoC
> >somewhere that ports the Lucene index plugin to use embedded ES instances,
> >1 per VM as an embedded ES cluster. It's not complete as I gave up on it
> >when I realised data latency would be fixed by the Oak root document. My
> >interest was proper real time indexing over the cluster.
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Michael Marth <mm...@adobe.com>.
To the second part of your mail:

You touch on a couple of different topics:
A) a design where indexes are not stored in the repo - but specifically a design that uses an external (shared) indexer like ES
B) indexing latency
C) embedded Lucene vs embedded ES

I am not entirely sure what you are suggesting, TBH. On one hand you seem to suggest using an external indexer like ES (external to the repo); on the other hand you mention embedded ES (which would lead to the repo and the embedded ES being colocated in the same JVM, IIUC).

In order to understand let me ask you this way:
A) Oak does support an external Solr(Cloud) instance as a shared indexer that is external to the repo. Same could be done with an external ES (in fact, Tommaso has written a POC for that). On the very high level question whether the index and the repo should be separate: does this address your concern? (meaning: if we leave the relative benefits of Solr vs ES aside for a second)
B) indexing latency is a very different concern. In the current design it is relevant in deployments when there is high latency between the persistence and Oak. OAK-4638 and OAK-4412 are about addressing this. I am not sure in how far separating out the indexer from the repo persistence would help.
C) you seem to suggest that an embedded ES has advantages over the embedded Lucene. What I do not understand in that comparison where you would store the index. If locally we would be back to the JR2 design. If somewhere remote then why embed ES at all?

Thanks for clarifying
Cheers
Michael



>I am reticent to disagree with you, but I feel I have no option, based on
>research, history and first hand experience over the past 10 years.
>
>Storing indexes in a repo is what Compass did from 2004 onwards, until
>after the third version they gave up trying to build a scalable and near
>real time search engine. Version 4 was a rewrite that became ElasticSearch
>0.4.0. The history is documented here
>https://en.wikipedia.org/wiki/Elasticsearch and was presented at Berlin
>Buzzwords in 2010 with a detailed description of why each approach fails. I
>have shared this information before. I am not sharing it to confront. I am
>sharing it because it pains me to see Oak repeating history. I don't feel I
>can stand by and watch in silence.
>
>If Oak does not want to use ES as a library, then learn from the history as
>it addresses your concerns (1,2, + brick wall) and those of Davide, and
>satisfies the many of the other issues potentially eliminating property
>indexes completely. It will however, only ever be as NRT as the root
>document commit period (1s), well above the 100ms data latency a model like
>used by ES delivers under production load.
>
> IMHO, the Hybrid approach being proposed is a step along the same history
>that Compass started treading in 2004. It is an innovative solution to a
>constrained problem space.
>
>Sorry if I sound like a broken record. I did exactly what Oak has done/is
>doing in 2006 onwards but without a vast user base was able to be more
>agile.
>
>Apache is about doing, not standing by, about fact not fiction, about
>evidence and reasoned argument. If there is any interest, I have an Oak PoC
>somewhere that ports the Lucene index plugin to use embedded ES instances,
>1 per VM as an embedded ES cluster. It's not complete as I gave up on it
>when I realised data latency would be fixed by the Oak root document. My
>interest was proper real time indexing over the cluster.

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

There is no need to have several different plugins to deal with the
standalone, small-scale cluster, and large-scale cluster deployments. It might
be desirable for some reason, but it's not necessary.

I have pushed the code I was working on before I got distracted to a GitHub
repo. [1] is where the co-located ES cluster starts. If the
property es-server-url is defined, an external ES cluster is used.

The repo is WIP and incomplete, and you will see 2 attempts to port the Lucene
plugin; take2 is the second. As I said, I stopped when it became apparent
there was a 1s latency imposed by Oak. I think you enlightened me to that
behavior on oak-dev.

I don't know how to co-locate a Solr Cloud cluster in the same way, given it
needs ZooKeeper. (I don't know enough about Solr Cloud, TBH.)
If Oak can't stomach using ES as a library, it could, with enough time
and resources, re-implement the pattern or something close.

Best Regards
Ian

1
https://github.com/ieb/oak-es/blob/master/src/main/java/org/apache/jackrabbit/oak/plusing/index/es/index/ESServer.java#L27

On 11 August 2016 at 09:58, Chetan Mehrotra <ch...@gmail.com>
wrote:

> Couple of points around the motivation, target usecase around Hybrid
> Indexing and Oak indexing in general.
>
> Based on my understanding of various deployments. Any application
> based on Oak has 2 type of query requirements
>
> QR1. Application Query - These mostly involve some property
> restrictions and are invoked by code itself to perform some operation.
> The property involved here in most cases would be sparse i.e. present
> in small subset of whole repository content. Such queries need to be
> very fast and they might be invoked very frequently. Such queries
> should also be more accurate and result should not lag repository
> state much.
>
> QR2. User provided query - These queries would consist of both or
> either of property restriction and fulltext constraints. The target
> nodes may form majority part of overall repository content. Such
> queries need to be fast but given user driven need not be very fast.
>
> Note that speed criteria is very subjective and relative here.
>
> Further Oak needs to support deployments
>
> 1. On single setup - For dev, prod on SegmentNodeStore
> 2. Cluster Setup on premise
> 3. Deployment in some DataCenter
>
> So Oak should enable deployments where for smaller setups it does not
> require any thirdparty system while still allow plugging in a dedicate
> system like ES/Solr if need arises. So both usecases need to be
> supported.
>
> And further even if it has access to such third party server it might
> be fine to rely on embedded Lucene for #QR1 and just delegate queries
> under #QR2 to remote. This would ensure that query results are still
> fast for usage falling under #QR1.
>
> Hybrid Index Usecase
> -----------------------------
>
> So far for #QR1 we only had property indexes and to an extent Lucene
> based property index where results lag repository state and lag might
> be significant depending on load.
>
> Hybrid index aim to support queries under  #QR1 and can be seen as
> replacement for existing non unique property indexes. Such indexes
> would have lower storage requirement and would not put much load on
> remote storage for execution. Its not meant as a replacement for
> ES/Solr but then intends to address different type of usage
>
> Very large Indexes
> -------------------------
>
> For deployments having very large repository Solr or ES based indexes
> would be preferable and there oak-solr can be used (some day oak-es!)
>
> So in brief Oak should be self sufficient for smaller deployment and
> still allow plugging in Solr/ES for large deployment and there also
> provide a choice to admin to configure a sub set of index for such
> usage depending on the size.
>
>
>
>
>
>
> Chetan Mehrotra
>
>
> On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> > Hi,
> >
> > On 11 August 2016 at 09:14, Michael Marth <mm...@adobe.com> wrote:
> >
> >> Hi Ian,
> >>
> >> No worries - good discussion.
> >>
> >> I should point out though that my reply to Davide was based on a
> >> comparison of the current design vs the Jackrabbit 2 design (in which
> >> indexes were stored locally). Maybe I misunderstood Davide’s comment.
> >>
> >> I will split my answer to your mail in 2 parts:
> >>
> >>
> >> >
> >> >Full text extraction should be separated from indexing, as the DS blobs
> >> are
> >> >immutable, so is the full text. There is code to do this in the Oak
> >> >indexer, but it's not used to write to the DS at present. It should be
> >> done
> >> >in a Job, distributed to all nodes, run only once per item. Full text
> >> >extraction is hugely expensive.
> >>
> >> My understanding is that Oak currently:
> >> A) runs full text extraction in a separate thread (separate from the
> >> “other” indexer)
> >> B) runs it only once per cluster
> >> If that is correct then the difference to what you mention above would
> be
> >> that you would like the FT indexing not be pinned to one instance but
> >> rather be distributed, say round-robin.
> >> Right?
> >>
> >
> >
> > Yes.
> >
> >
> >>
> >>
> >> >Building the same index on every node doesn't scale for the reasons you
> >> >point out, and eventually hits a brick wall.
> >> >http://lucene.apache.org/core/6_1_0/core/org/apache/
> >> lucene/codecs/lucene60/package-summary.html#Limitations.
> >> >(Int32 on Document ID per index). One of the reasons for the Hybrid
> >> >approach was the number of Oak documents in some repositories will
> exceed
> >> >that limit.
> >>
> >> I am not sure what you are arguing for with this comment…
> >> It sounds like an argument in favour of the current design - which is
> >> probably not what you mean… Could you explain, please?
> >>
> >
> > I didn't communicate that very well.
> >
> > Currently Lucene (6.1) has a limit of Int32 on the number of documents it
> > can store in an index. IIUC there is a long-term desire to increase that
> > to Int64, but no long-term commitment, as it is probably significant
> > work given that arrays in Java are indexed with Int32.
> >
> > The Hybrid approach doesn't avoid the potential Lucene brick wall, but one
> > motivation for looking at it was that the number of Oak documents, including
> > those under /oak:index, is in some cases approaching that limit.
> >
> >
> >
> >>
> >>
> >> Thanks!
> >> Michael
> >>
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Chetan Mehrotra <ch...@gmail.com>.
Couple of points around the motivation, target usecase around Hybrid
Indexing and Oak indexing in general.

Based on my understanding of various deployments, any application
based on Oak has two types of query requirements:

QR1. Application query - These mostly involve some property
restrictions and are invoked by the code itself to perform some
operation. The properties involved here would in most cases be sparse,
i.e. present in a small subset of the whole repository content. Such
queries need to be very fast and might be invoked very frequently.
They should also be accurate, and the results should not lag the
repository state much.

QR2. User-provided query - These queries would consist of property
restrictions, fulltext constraints, or both. The target nodes may form
the majority of the overall repository content. Such queries need to
be fast but, being user driven, need not be very fast.

Note that the speed criteria are very subjective and relative here.
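
To make the distinction concrete, here is a rough illustration of the
two query shapes through the standard JCR query API (the property name
jobStatus and the search text are just placeholders for the example,
not anything in Oak itself):

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    public class QueryShapes {

        // #QR1: application query - sparse property restriction, issued by
        // code, expected to be fast and close to the current repository state.
        static QueryResult applicationQuery(Session session) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query q = qm.createQuery(
                    "SELECT * FROM [nt:base] WHERE [jobStatus] = 'PENDING'",
                    Query.JCR_SQL2);
            return q.execute();
        }

        // #QR2: user provided query - fulltext constraint (possibly combined
        // with property restrictions), may match a large part of the repository.
        static QueryResult userQuery(Session session, String text) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query q = qm.createQuery(
                    "SELECT * FROM [nt:base] WHERE CONTAINS(*, $searchText)",
                    Query.JCR_SQL2);
            q.bindValue("searchText", session.getValueFactory().createValue(text));
            return q.execute();
        }
    }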

Further, Oak needs to support the following deployments:

1. Single-instance setup - for dev, or prod on SegmentNodeStore
2. Cluster setup on premise
3. Deployment in some data center

So Oak should enable deployments where smaller setups do not require
any third-party system, while still allowing a dedicated system like
ES/Solr to be plugged in if the need arises. Both use cases need to be
supported.

Further, even if a deployment has access to such a third-party server,
it might be fine to rely on embedded Lucene for #QR1 and only delegate
queries under #QR2 to the remote system. This would ensure that query
results are still fast for usage falling under #QR1.

Hybrid Index Usecase
-----------------------------

So far, for #QR1 we only had property indexes and, to an extent, the
Lucene-based property index, where results lag the repository state
and the lag might be significant depending on load.

Hybrid indexes aim to support queries under #QR1 and can be seen as a
replacement for the existing non-unique property indexes. Such indexes
would have lower storage requirements and would not put much load on
the remote storage during execution. They are not meant as a
replacement for ES/Solr; rather, they address a different type of usage.
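
As a rough sketch only (the exact definition format is still open, so
the "sync" flag below is illustrative rather than an agreed API), such
an index could look like a normal Lucene property index definition
with the property additionally marked for synchronous, property-index
style updates:

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class HybridIndexSketch {

        // Illustrative only: a Lucene property index for a single property
        // "foo", with the property flagged for synchronous updates so #QR1
        // queries do not lag behind the async indexing cycle.
        static void createIndex(Session session) throws RepositoryException {
            Node idx = session.getNode("/oak:index")
                    .addNode("fooIndex", "oak:QueryIndexDefinition");
            idx.setProperty("type", "lucene");
            idx.setProperty("async", "async");
            idx.setProperty("compatVersion", 2);

            Node foo = idx.addNode("indexRules", "nt:unstructured")
                    .addNode("nt:base", "nt:unstructured")
                    .addNode("properties", "nt:unstructured")
                    .addNode("foo", "nt:unstructured");
            foo.setProperty("name", "foo");
            foo.setProperty("propertyIndex", true);
            foo.setProperty("sync", true); // hypothetical flag for the hybrid part
            session.save();
        }
    }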

Very large Indexes
-------------------------

For deployments having a very large repository, Solr- or ES-based
indexes would be preferable, and there oak-solr can be used (some day
oak-es!)

So, in brief, Oak should be self-sufficient for smaller deployments
while still allowing Solr/ES to be plugged in for large deployments,
and in that case also give the admin a choice to configure a subset of
the indexes for such usage, depending on their size.






Chetan Mehrotra


On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> Hi,
>
> On 11 August 2016 at 09:14, Michael Marth <mm...@adobe.com> wrote:
>
>> Hi Ian,
>>
>> No worries - good discussion.
>>
>> I should point out though that my reply to Davide was based on a
>> comparison of the current design vs the Jackrabbit 2 design (in which
>> indexes were stored locally). Maybe I misunderstood Davide’s comment.
>>
>> I will split my answer to your mail in 2 parts:
>>
>>
>> >
>> >Full text extraction should be separated from indexing, as the DS blobs
>> are
>> >immutable, so is the full text. There is code to do this in the Oak
>> >indexer, but it's not used to write to the DS at present. It should be
>> done
>> >in a Job, distributed to all nodes, run only once per item. Full text
>> >extraction is hugely expensive.
>>
>> My understanding is that Oak currently:
>> A) runs full text extraction in a separate thread (separate from the
>> “other” indexer)
>> B) runs it only once per cluster
>> If that is correct then the difference to what you mention above would be
>> that you would like the FT indexing not be pinned to one instance but
>> rather be distributed, say round-robin.
>> Right?
>>
>
>
> Yes.
>
>
>>
>>
>> >Building the same index on every node doesn't scale for the reasons you
>> >point out, and eventually hits a brick wall.
>> >http://lucene.apache.org/core/6_1_0/core/org/apache/
>> lucene/codecs/lucene60/package-summary.html#Limitations.
>> >(Int32 on Document ID per index). One of the reasons for the Hybrid
>> >approach was the number of Oak documents in some repositories will exceed
>> >that limit.
>>
>> I am not sure what you are arguing for with this comment…
>> It sounds like an argument in favour of the current design - which is
>> probably not what you mean… Could you explain, please?
>>
>
> I didn't communicate that very well.
>
> Currently Lucene (6.1) has a limit of Int32 on the number of documents it
> can store in an index. IIUC there is a long-term desire to increase that
> to Int64, but no long-term commitment, as it is probably significant
> work given that arrays in Java are indexed with Int32.
>
> The Hybrid approach doesn't avoid the potential Lucene brick wall, but one
> motivation for looking at it was that the number of Oak documents, including
> those under /oak:index, is in some cases approaching that limit.
>
>
>
>>
>>
>> Thanks!
>> Michael
>>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 11 August 2016 at 09:14, Michael Marth <mm...@adobe.com> wrote:

> Hi Ian,
>
> No worries - good discussion.
>
> I should point out though that my reply to Davide was based on a
> comparison of the current design vs the Jackrabbit 2 design (in which
> indexes were stored locally). Maybe I misunderstood Davide’s comment.
>
> I will split my answer to your mail in 2 parts:
>
>
> >
> >Full text extraction should be separated from indexing, as the DS blobs
> are
> >immutable, so is the full text. There is code to do this in the Oak
> >indexer, but it's not used to write to the DS at present. It should be
> done
> >in a Job, distributed to all nodes, run only once per item. Full text
> >extraction is hugely expensive.
>
> My understanding is that Oak currently:
> A) runs full text extraction in a separate thread (separate from the
> “other” indexer)
> B) runs it only once per cluster
> If that is correct then the difference to what you mention above would be
> that you would like the FT indexing not be pinned to one instance but
> rather be distributed, say round-robin.
> Right?
>


Yes.
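
Roughly along these lines (a sketch only - the shared store and the
direct Tika call are my assumptions, not existing Oak code): since the
binaries are immutable, the extracted text can be keyed by blob id and
computed at most once, on whichever node first picks up the job:

    import org.apache.tika.Tika;

    import java.io.InputStream;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class ExtractedTextStore {

        // Stand-in for a store reachable by all cluster nodes (it could live
        // next to the DataStore); in-memory here purely for illustration.
        private final ConcurrentMap<String, String> textByBlobId = new ConcurrentHashMap<>();
        private final Tika tika = new Tika();

        // Extraction runs at most once per blob id; if the text is already
        // cached the stream is not read (the caller still owns it in that case).
        public String getText(String blobId, InputStream binary) {
            return textByBlobId.computeIfAbsent(blobId, id -> {
                try (InputStream in = binary) {
                    return tika.parseToString(in);
                } catch (Exception e) {
                    return ""; // extraction failed; index without fulltext
                }
            });
        }
    }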


>
>
> >Building the same index on every node doesn't scale for the reasons you
> >point out, and eventually hits a brick wall.
> >http://lucene.apache.org/core/6_1_0/core/org/apache/
> lucene/codecs/lucene60/package-summary.html#Limitations.
> >(Int32 on Document ID per index). One of the reasons for the Hybrid
> >approach was the number of Oak documents in some repositories will exceed
> >that limit.
>
> I am not sure what you are arguing for with this comment…
> It sounds like an argument in favour of the current design - which is
> probably not what you mean… Could you explain, please?
>

I didn't communicate that very well.

Currently Lucene (6.1) has a limit of Int32 on the number of documents it
can store in an index. IIUC there is a long-term desire to increase that
to Int64, but no long-term commitment, as it is probably significant
work given that arrays in Java are indexed with Int32.

The Hybrid approach doesn't avoid the potential Lucene brick wall, but one
motivation for looking at it was that the number of Oak documents, including
those under /oak:index, is in some cases approaching that limit.
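
For reference, a quick way to check how close an existing index is to
that limit (just a sketch, pointed at an index directory on disk):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class IndexHeadroom {

        // Prints how much of the per-index Int32 document limit
        // (IndexWriter.MAX_DOCS) a Lucene index has already used.
        // Usage: java IndexHeadroom /path/to/index
        public static void main(String[] args) throws Exception {
            try (Directory dir = FSDirectory.open(Paths.get(args[0]));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                int docs = reader.maxDoc();
                System.out.printf("docs=%d limit=%d used=%.2f%%%n",
                        docs, IndexWriter.MAX_DOCS,
                        100.0 * docs / IndexWriter.MAX_DOCS);
            }
        }
    }

The limit (IndexWriter.MAX_DOCS) is just under Integer.MAX_VALUE, i.e.
a bit over 2.1 billion documents per index.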



>
>
> Thanks!
> Michael
>

Re: Oak Indexing. Was Re: Property index replacement / evolution

Posted by Michael Marth <mm...@adobe.com>.
Hi Ian,

No worries - good discussion.

I should point out though that my reply to Davide was based on a comparison of the current design vs the Jackrabbit 2 design (in which indexes were stored locally). Maybe I misunderstood Davide’s comment.

I will split my answer to your mail in 2 parts:


>
>Full text extraction should be separated from indexing, as the DS blobs are
>immutable, so is the full text. There is code to do this in the Oak
>indexer, but it's not used to write to the DS at present. It should be done
>in a Job, distributed to all nodes, run only once per item. Full text
>extraction is hugely expensive.

My understanding is that Oak currently:
A) runs full text extraction in a separate thread (separate from the “other” indexer)
B) runs it only once per cluster
If that is correct then the difference to what you mention above would be that you would like the FT indexing not to be pinned to one instance but rather distributed, say round-robin.
Right?


>Building the same index on every node doesn't scale for the reasons you
>point out, and eventually hits a brick wall.
>http://lucene.apache.org/core/6_1_0/core/org/apache/lucene/codecs/lucene60/package-summary.html#Limitations.
>(Int32 on Document ID per index). One of the reasons for the Hybrid
>approach was the number of Oak documents in some repositories will exceed
>that limit.

I am not sure what you are arguing for with this comment…
It sounds like an argument in favour of the current design - which is probably not what you mean… Could you explain, please?


Thanks!
Michael