You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/09/08 16:40:37 UTC
[Solr Wiki] Update of "FederatedSearch" by YonikSeeley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by YonikSeeley:
http://wiki.apache.org/solr/FederatedSearch

------------------------------------------------------------------------------
  = Federated Search Design =
  == Motivation and Goals ==
- There is a need to support homogeneous indicies that are larger than will fit on a single machine and still provide acceptable latency.
+ There is a need to support homogeneous indicies that are larger than will fit on a single machine and still provide acceptable latency.  Aggregate throughput can also increase if a larger percent of the index can be held in memory.
  
  Goals:
   * split an index into multiple pieces and be able to search across those pieces as if it were a single index.
@@ -16, +16 @@

   * distributed global idf calculations (a component of scoring factoring in the rareness of a term)
  
  == Simple Federation ==
+ The normal responses from the standard and dismax query handlers provide almost enough information in order to merge multiple responses from sub-searchers into a single client response.
- === Merge current XML ===
- Create an external service to simply combine the current XML results from handlers.
  
+ Document Lists: need sort criteria added in order to merge.  sort fields may not be stored though... may have to retrieve from FieldCache or by other means.
- ==== Merging documents ====
- If sorting by something other than score, modifications would need to be made to always return the sort criteria with the document to enable merging.
  
- Slight problem: the strings that Solr uses to represent integers and floats in a sortable/rangeable representation are *not* text and XML isn't capable of representing all unicode code points.  Higher level escaping would be needed, or the use of another format like JSON.
+ Highlighting info: trivial to merge
  
- If the merger were solr-schema aware, we could use the "external" form of the sort keys in the XML and still merge correctly by translating to index form before comparing.
+ Faceted Browsing info: relatively easy to merge
  
- ==== Merging other data ====
- The information that could be merged would be from a pre-determined set.
-  * highlighting - easily merged
-  * debugging - might need tweaking of the debugging format to more easily pick out specific documents
-  * faceted browsing - doable for simple faceted browsing that is planned for the standard request handlers.
+ Debug info: currently not meant for machine parsing... we could simply include debug info from all sub-searchers
+ 
+ Advantages:
+  * reuse current handlers with a minimum of changes
+  * everything backward compatible
+  * should be least amount of effort
+ Disadvantages:
+  * index view consistency limited to a single request (the index might change inbetween two requests), hence
+    no use of internal document ids is permitted
+  * global idf either impossible, or possibly inconsistent because it requires two RPCs
+  * can't support custom query handlers
+  * uses more network bandwidth on average... instead of just returning doc ids and sort criteria on the
+    first phase, all requested fields must be returned for every document in the return range.
+  * potentially lower scalability (due to increased network bandwidth)
+  * potentially higher CPU load due to parsing sub-responses (depends on transfer syntax)
+ 
+ === Merge XML responses w/o Schema ===
+ Create an external service to make HTTP requests to sub-searchers and simply combine the current XML responses.
+ All operations operate on the XML alone, w/o reliance on the Solr schema.
+ 
+ Notes:
+  * returning sort criteria with the document would be slightly harder, but would more easily allow for streaming.
+  * slight problem: the strings that Solr uses to represent integers and floats in a sortable/rangeable representation are *not* text and XML isn't capable of representing all unicode code points.  Solution: use the external form of the sort keys and don't do string sorts... sort based on the XML type <int> <float> <str> <bool>, etc
+ 
+ Advantages:
+  * Possible to stream (write response as sub-responses are streamed)
+    * only important if requesting very large data sets
+    * sort criteria would need to be bundled with each document to enable effective streaming
+ Disadvantages:
+  * Wouldn't support custom sorters
+  * Wouldn't support sort-missing-first, sort-missing-last
+    * but perhaps that could be passed back as part of sort criteria
+  * Wouldn't support (easily) different output formats w/o a re-implementation of the other response writers (like JSON)
+ 
+ === Merge responses with Schema awareness  ===
+ Have a query handler that makes HTTP requests to sub-searchers and merges responses.
+ 
+ Advantages:
+  * can more easily support other output formats... sub-response is de-serialized and the correct writer is used
+  * more flexible w.r.t. sub-response formats... if JSON format is smaller and faster to parse, it may be
+    used.
+ Disadvantages:
+  * streaming not really possible (or not w/o great effort)
  
  === Stateless request handlers ===
- Have request handlers and APIs that don't use docids, and don't require query consistency.
+ Optional: Have request handlers and APIs that don't use docids, and don't require query consistency.  There may not be as much value in this option...  If we need custom query handlers, complex federation that allows index consistency and docids would probably be a better choice.
  
  == Complex Federation ==
  The big distinction here is index consistency... a single request handler would be guaranteed that the view of the index does not change during a single request (just like non-federated Solr).
@@ -106, +142 @@

  Disadvantages:
   * scalability not as good... if different index parts are committing frequently at different times, the retry rate goes up as the number of sub indicies increases.
  
- Related Idea: allow the optional specification of a specific index version during querying, and
-  delay closing old indicies for a short amount of time (5 seconds... configurable) to allow requests to finish.
+ === Consistency via Specifying Index Version ===
+ Every request returns an index version that created it.  On any request, one may specify the version of
+ the index they want to serve the request.  An older IndexSearcher may be made available for some time
+ after a new IndexSearcher is in place to service requests that started with it.  Every request to a particular version of a searcher could extend the "lease" for another 5 seconds (configurable, or even per-request).
  
+  * if load balanced, a different server offering the same sub-index may be hit and go backward in time (it may not have switched to a new version yet).
+   * implementing our own load-balancing would solve this... stick to the same server for the duration of a request.
+  * if streaming the document list (retrieving the stored fields for each doc as it is sent), it would be
+    possible (though very unlikely) to hit an error in the middle of a response.  If it did happen it would
+    most likely be due to:
+    * a long gc... longer than the lease period between retrieving documents
+    * an upstream client not reading as we write, causing us to block
  
  === High Availability ===
  How can High Availability be obtained on the query side?
   * sub-searchers could be identified by VIPs (top-level-searcher would go through a load-balancer to access sub-searchers).
   * could do it in code via HASolrMultiSearcher that takes a list of sub-servers for each index slice.
-    * would need to implement failover... including not retrying a failed server for a certain amount of time (after a certain number of failures)
+    * would need to implement failover... including not retrying a failed server for a certain amount of time (after a certain number of failures) * offers advantages for complex federation... one can stick to a particular server for better caching effects
  
  == Master ==
  How should the collection be updated?  It would be complex for the client to partition the data themselves, since they would have to ensure that a particular document always went to the same server.  Although user partitioning should be possible, there should be an easier default.
@@ -122, +167 @@

  === Single Master ===
  A single master could partition the data into multiple local indicies and subsearchers would only pull the local index they are configured to have.
   * hash based on unique key field to get target index
+  * commit should be changed so nothing is done if a particular sub-index hasn't been changed
  
  Directory structure for indicies:
  Current: solr/data/index
  OptionA: solr/data/index0, solr/data/index1, solr/data/index2,
+ OptionB: solr/data/index, solr/data2/index, solr/data3/index,
+ OptionC: solr/data/index, another_solr_home/data/index, yet_another_solr_home/data/index
+   Option (C) mimics having multiple masters
+ 
+ Disadvantages:
+  * scalability limits for index distribution at some point... we are limited by the outbound network
+    bandwidth of a single box.
+   * for better bandwidth usage for incremental updates, we could consider alternate update methods...
+     add all documents to a separate index and distribute separately... then merge into main index.
+     This would also require changes to Lucene to not optimize at the start and end of a segment.
  
  === Multiple Masters ===
  There could be a master for each slice of the index.
  An external module could provide the update interface and forward the request to the correct master based on the unique key field.
  
- The directiry structure for indicies would not have to change since each master would have it's own solr home and hence it's own index directory.
+ The directory structure for indicies would not have to change since each master would have it's own solr home and hence it's own index directory.
+ 
+ Advantages:
+  * better network scalability for index distribution
+ 
+ Disadvantages:
+  * complexity of managing multiple masters
  
  === Commits ===
  How to synchronize commits across subsearchers and top-level-searchers?
+ This is probably only needed if one is trying to present a SolrSearcher java interface with a consistent index view.
  
  == Misc ==
  Any realistic way to use Hadoop?