
Posted to solr-user@lucene.apache.org by Charlie Hull <ch...@flax.co.uk> on 2015/01/20 15:13:18 UTC

Using SolrCloud to implement a kind of federated search

Hi all,

We've been discussing a way of implementing a federated search by
leveraging the distributed query parts of SolrCloud. I've written this up
at
http://www.flax.co.uk/blog/2015/01/20/solr-superclusters-for-improved-federated-search/
and would welcome any comments or feedback. So far, two committers have
failed to see any major flaw in our plan, which makes me slightly nervous :)

cheers

Charlie

Re: Using SolrCloud to implement a kind of federated search

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Tue, 2015-01-20 at 15:41 +0100, "Jürgen Wagner (DVT)" wrote:

[Snip: Valid concerns]

> 3. Cardinality: there may be rather large collections and some smaller
> collections in the federation. If you use SolrCloud to obtain results,
> the ones from smaller collections will get more significance in the
> result mixing than the ones from the larger collections, as relevance
> will be relative to each federated source.

The math might be solvable, or at least approximately solvable:
SOLR-1632 takes care of unifying term stats, and site-specific boosts,
defined in the merger, can compensate somewhat for overall score
differences between the sites.
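
A minimal sketch of the site-specific boost idea, assuming the raw
scores are already roughly comparable (for example after term stats
have been unified). The Hit class, the site names and the boost values
are hypothetical and only illustrate the merging step; this is not how
Solr's own merger is implemented.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    site: str
    score: float


# Hypothetical per-site multipliers chosen by whoever runs the merger.
SITE_BOOSTS = {"large_library": 1.0, "small_archive": 0.6}


def merge_hits(hits_by_site, site_boosts, rows=10):
    """Scale each site's scores by its boost, then rank all hits together."""
    merged = []
    for site, hits in hits_by_site.items():
        boost = site_boosts.get(site, 1.0)  # unknown sites are left unscaled
        for hit in hits:
            merged.append(Hit(hit.doc_id, site, hit.score * boost))
    merged.sort(key=lambda h: h.score, reverse=True)
    return merged[:rows]
```

Without the boost, a hit from the small archive with a raw score of 3.0
would outrank a hit from the large library with a raw score of 2.0; with
the boosts above, the order is reversed.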

> 4. Uniqueness: different systems may index the same documents. The
> idea of having a globally unique identifier should take this into
> account, i.e., it won't suffice to simply prefix each (locally unique)
> document id with a source identifier. The federated sources must be
> aware of being federated and possibly having overlaps. Otherwise, you
> will get multiple occurrences of very popular documents.

Different sources might have different metadata on the same entity.
Some sort of near-duplicate document merge might be preferable.
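
One way such a merge could look, as a rough sketch: documents from
different sources are grouped by a fingerprint, and metadata fields are
combined. The fingerprint used here (normalized title plus year) and the
field names "source", "title" and "year" are assumptions for
illustration; a real system would need a more robust similarity measure.

```python
import re


def fingerprint(doc):
    """Hypothetical near-duplicate key: normalized title plus year."""
    title = re.sub(r"[^a-z0-9]+", " ", doc.get("title", "").lower()).strip()
    return (title, doc.get("year"))


def merge_near_duplicates(docs):
    """Collapse documents sharing a fingerprint, keeping all their sources."""
    merged = {}
    for doc in docs:
        key = fingerprint(doc)
        if key not in merged:
            entry = {k: v for k, v in doc.items() if k != "source"}
            entry["sources"] = [doc["source"]]
            merged[key] = entry
            continue
        entry = merged[key]
        entry["sources"].append(doc["source"])
        # Fill in fields that earlier copies did not provide;
        # the first value seen for a field wins.
        for field, value in doc.items():
            if field != "source" and field not in entry:
                entry[field] = value
    return list(merged.values())
```

The merged record keeps the first value seen for each field and lists
every source that supplied a copy, so the result page can show one entry
with several origins instead of repeated hits.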
> 
> 6. Orchestration: there will be some issues with the orchestration of
> these services. ZooKeeper won't scale to the multiple datacenter
> topology, effectively leaving node discovery to some other mechanism
> yet to be defined.

If the nodes are locally run proxies exposed as a Solr shard, the
connection details will be de-coupled from ZooKeeper. That would also
allow for mapping of field names & values and similar site-specific
adjustments of requests & queries.
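
A small sketch of the kind of field-name mapping such a proxy could do.
The mapping table and the local field names are hypothetical, and the
query rewriting only handles a single "field:value" clause; a real
proxy would need a proper query parser.

```python
# Hypothetical mapping from shared (federation-wide) field names
# to the names used in one site's local schema.
FIELD_MAP = {"title": "dc_title", "author": "creator_s"}


def to_local_params(params, field_map):
    """Rewrite shared field names in 'q' and 'fl' to the site's local names."""
    local = dict(params)
    if "fl" in params:
        fields = params["fl"].split(",")
        local["fl"] = ",".join(field_map.get(f, f) for f in fields)
    if "q" in params and ":" in params["q"]:
        field, _, value = params["q"].partition(":")
        local["q"] = f"{field_map.get(field, field)}:{value}"
    return local


def to_shared_doc(doc, field_map):
    """Rename a local result document's fields back to the shared names."""
    reverse = {local: shared for shared, local in field_map.items()}
    return {reverse.get(name, name): value for name, value in doc.items()}
```

The proxy would apply to_local_params before forwarding a request to
the site and to_shared_doc on each returned document, so the merger only
ever sees the shared field names.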

> In my experience, there is a clear distinction between "technical" 
> federated search (possibly something like the tribe nodes) and 
> "semantic" federated search (requiring special processing of results 
> obtained from different sources, ready to be consolidated).

We have spent a fair amount of time getting semantic federated search
(we call it "integrated search") to work across our sources. The raw
requesting & merging is not too hard: most of the development time has
been spent mapping values and adjusting how the merger should order the
documents.

- Toke Eskildsen, State and University Library, Denmark

Re: Using SolrCloud to implement a kind of federated search

Posted by "Jürgen Wagner (DVT)" <ju...@devoteam.com>.

Hello Charlie,
  theoretically, things may work as you describe them. A few big
HOWEVERs exist as far as I can see:

1. Attributes: as different organisations may use different schemata
(document attributes), the consolidation of results from multiple
sources may present a problem. This may not arise with common attributes
(for which there may be a standardization of some sort, e.g., the
Dublin Core metadata standard), but it will for very specific attributes
that pertain to the different focal work areas of the institutions
running the individual systems you want to federate.

2. Values: different organisations will work on different topics. There
may be large similarities, but as the staff involved is different, there
will be an inherent difference in the actual semantic domain dealt with.
Consequently, it is very likely that you won't have a homogeneous
ontology for all pieces of information across all federated sources.
This makes it hard to consolidate results in a semantically correct way.

3. Cardinality: there may be rather large collections and some smaller
collections in the federation. If you use SolrCloud to obtain results,
the ones from smaller collections will get more significance in the
result mixing than the ones from the larger collections, as relevance
will be relative to each federated source.

4. Uniqueness: different systems may index the same documents. The idea
of having a globally unique identifier should take this into account,
i.e., it won't suffice to simply prefix each (locally unique) document
id with a source identifier. The federated sources must be aware of
being federated and possibly having overlaps. Otherwise, you will get
multiple occurrences of very popular documents.

5. Security: security in SolrCloud is through filtering. If you simply
use the SolrCloud distributed query mechanism, each source would have to
trust each federation instance to properly enforce security filters
through the respective entitlement groups. If one such federation system
does not comply and simply issues unfiltered queries, there won't be any
security.

6. Orchestration: there will be some issues with the orchestration of
these services. ZooKeeper won't scale to the multiple datacenter
topology, effectively leaving node discovery to some other mechanism yet
to be defined.
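
To illustrate point 5, here is a rough sketch of filter-based security
in which the source itself appends the entitlement filter instead of
trusting the federating instance to send it. The field name
"acl_groups" and both helper functions are assumptions made up for this
example; only the "fq" (filter query) request parameter is standard
Solr.

```python
def quote(term):
    """Quote a value for use as a phrase in a Lucene/Solr query."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_acl_filter(groups, field="acl_groups"):
    """Build a filter query matching documents visible to any given group."""
    if not groups:
        # Refuse rather than fall back to an unfiltered query.
        raise PermissionError("caller has no entitlement groups")
    terms = " OR ".join(quote(g) for g in sorted(set(groups)))
    return f"{field}:({terms})"


def enforce_filter(params, groups):
    """Return a copy of params with the ACL filter added to any existing fq."""
    secured = dict(params)
    existing = params.get("fq", [])
    if isinstance(existing, str):
        existing = [existing]
    secured["fq"] = list(existing) + [build_acl_filter(groups)]
    return secured
```

If each source runs enforce_filter on its own side, with groups taken
from an authenticated caller identity, a federation instance that sends
unfiltered queries still only gets documents its caller is entitled to.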

These are the issues that quickly come to my mind. There may be more.

Also have a look at tribe nodes in Elasticsearch, although these don't
fully address all issues I listed above.

In my experience, there is a clear distinction between "technical"
federated search (possibly something like the tribe nodes) and
"semantic" federated search (requiring special processing of results
obtained from different sources, ready to be consolidated). FAST Unity
used to have elaborate (but still limited) mechanisms to handle this,
but they disappeared in the course of the Microsoft takeover.

Best regards,
--Jürgen
