Posted to users@solr.apache.org by Norbert Bodnar <bo...@inqool.cz> on 2021/04/15 08:23:45 UTC

Distributed search

Hi.

Hope you are doing well.

I would like to start by introducing the problem we are facing.

Our primary goal (let's call us CompanyB, working on WebB) is to create a
web application that extends the search capabilities of an existing web
application (WebA, developed by CompanyA).

WebA is deployed for multiple clients, and every client has its own set of
documents. On the bigger instances the index is updated almost every minute,
and its size can reach a few hundred GB. WebA was developed by CompanyA and
provides basic search capabilities: keyword search, faceting on authors,
classic stuff.

Now comes CompanyB with a project to build WebB, a web service built on top
of the existing WebA instances. The main goal is to take the content of
WebA, run natural language processing on it, obtain metadata about the
documents from external sources, and index all of that, so that WebB can
provide advanced search capabilities on the added metadata. WebB must be
able to search on fields from the original index, such as author and title,
but also on these new fields, effectively combining the results.

Both WebA and WebB are for the same customer, so a bit of collaboration
between CompanyA and CompanyB is possible; however, we should keep the
amount of work required from CompanyA on WebA to a minimum.

The issue is the following:
The first idea was that WebB would replicate WebA's index and keep its own
copy. When new metadata is gathered for a document, WebB would update that
document in its own index. However, since both projects are for the same
customer, doubling the index was a bit of a problem. This idea was also
based on the assumption that the original index is not updated very often,
which turned out not to be true.
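
For what it's worth, the per-document enrichment in this scenario could have
been a plain atomic update against WebB's own copy. A minimal sketch (the
collection name, document id and field names below are made up, and it
assumes all original fields are stored or have docValues, which atomic
updates require):

  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/webb_copy/update?commit=true' \
    --data-binary '[
      {"id": "doc-42",
       "metadata_one":  {"set": 123},
       "namedEntities": {"set": ["ABC Corp", "Jane Doe"]}}
    ]'

The frequent updates on the source side are what make this painful: every
document re-replicated from WebA arrives without the enrichment, so the
metadata would have to be re-applied again and again.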

The second approach was to use the existing index directly, so the data
would not be duplicated across two indexes. But since the original index is
used by WebA and maintained by a different company, it would mean a lot of
work for them and probably wouldn't be feasible: on a reindex they would not
have access to the new metadata that CompanyB/WebB had posted into the
index, querying from WebB could affect WebA's performance, WebA would need
to filter its results so it does not return the fields added by WebB, and so
on...

The third idea is probably the closest to reality, but maybe not. The idea
was to create a new shard/node in WebA's existing Solr cluster, where only
the new metadata would be indexed. However, since we need to search and
return results from both of these nodes (queries like "author:ABC &&
metadata_one:123"), I believe it would also not be possible. The documents
in the new node, which contain only the additional metadata, should be
merged into the same document in the response, since we want each result to
contain both the fields from the original node (author, title, ...) and the
fields from the new node (metadata1, namedEntities, ...). The results of
querying, faceting and sorting should also somehow be combined.

We also considered CDCR, effectively keeping a synchronized copy of the
original index, but the use case for CDCR is a bit different: the target
cluster is not supposed to receive updates of its own, independently of the
source cluster, and independent updates are exactly what we need. We want an
up-to-date copy of the original index enriched with additional fields.


I hope my explanation is clear enough, and I would appreciate your help.

Thank you for your time, and have a nice day :),
Norbert Bodnar

Re: Distributed search

Posted by Susmit Shukla <sh...@gmail.com>.
Solr streaming may be useful in your case. It can execute "joins" across
different SolrCloud instances and also has a SQL facade.

Parallel SQL
https://solr.apache.org/guide/8_6/parallel-sql-interface.html
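
For example, a single-collection query through the SQL facade could look
roughly like this (collection and field names are only placeholders; as far
as I know the SQL interface does not support cross-collection JOINs, so it
is mostly a convenient facade over a streaming search):

  curl http://localhost:8983/solr/webA/sql --data-urlencode \
    "stmt=SELECT author, title FROM webA WHERE author = 'ABC' LIMIT 10"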

Lower-level streaming interface
https://solr.apache.org/guide/8_6/streaming-expressions.html#streaming-expressions
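
A rough sketch of the kind of streaming join that could combine the two
indexes (all collection, field and handler names here are assumptions, not
your actual schema). Both sides must be sorted on the join key and are read
through /export, which needs docValues on the requested fields; search()
also accepts a zkHost parameter if one of the collections lives in a
different SolrCloud cluster:

  innerJoin(
    search(webA, q="author:ABC",
           fl="id,author,title", sort="id asc", qt="/export"),
    search(webB_metadata, q="metadata_one:123",
           fl="id,metadata_one,namedEntities", sort="id asc", qt="/export"),
    on="id"
  )

Sending that expression to the /stream handler of either collection returns
tuples carrying the fields from both sides, which sounds close to the merged
documents you described.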

Sorting would only be possible on "common" fields shared by both SolrCloud
schemas.


