Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/12/16 06:31:20 UTC

[Solr Wiki] Update of "S14ESSAddendum" by DavidSmiley

http://wiki.apache.org/solr/S14ESSAddendum

--------------------------------------------------

New page:
The book [[https://www.packtpub.com/solr-1-4-enterprise-search-server/book|Solr 1.4 Enterprise Search Server]] aimed to cover all the features in Solr 1.4, but some features were overlooked at the time of writing or were implemented after the book was published.  This document lists the missed content, organized by the chapter it would most likely have been added to.  Some other known "features" in Solr are absent from both the book and this document because they are either internal to Solr or of dubious value.

== Chapter 2: Schema and Text Analysis ==

=== Trie based field types ===
The schema.xml used in the book examples declares schema version 1.1 instead of 1.2, which is Solr 1.4's new default; that distinction is fairly trivial.  The bigger difference is that Solr 1.4 defines a set of "Trie" based field types which should be used in preference to the "Sortable" based ones.  For example, there is now a `TrieIntField` class backing a field type named `tint`, to be used in preference to `SortableIntField` with its field type named `sint`.  The trie field types have improved performance characteristics, particularly for range queries, and they are of course sortable.  However, the "Sortable" variants still do one thing that the trie-based fields cannot: they support the `sortMissingLast` and `sortMissingFirst` attributes.  There is further documentation about these field types in the [[http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.0/example/solr/conf/schema.xml?revision=834197&view=markup|Solr 1.4 example schema.xml file]].
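For reference, declarations along the lines of those in the Solr 1.4 example schema.xml look like this (the `precisionStep` value shown is the example schema's default, not a requirement):
{{{
    <!-- Trie-based integer: precisionStep > 0 indexes extra terms that speed up range queries -->
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
        omitNorms="true" positionIncrementGap="0"/>
    <!-- Sortable integer: slower range queries, but supports sortMissingLast/sortMissingFirst -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
}}}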

=== Text Analysis ===
 * !ReverseWildcardFilter
There is support for leading wildcards when using [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html|ReverseWildcardFilterFactory]].  See that link for configuration options.  For example, using this filter allows a query `*book` to match the text `cookbook`.  It essentially works by indexing reversed variants of the words.  Be aware that this can more than double the number of indexed terms for the field and thus increase disk usage proportionally.  For a configuration snippet using this feature, consider this sample field type definition excerpted from the unit tests:
{{{
    <fieldtype name="srev" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
}}}
An interesting under-the-hood detail is that Solr's query parsing code must check for the presence of this filter and change its behavior accordingly -- something true of no other filter.

 * ASCIIFoldingFilter
To map non-ASCII characters to reasonable ASCII equivalents, use `ASCIIFoldingFilterFactory`, which is best documented [[http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/analysis/ASCIIFoldingFilter.html|here]].
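As a sketch, the filter slots into an analyzer chain like any other; the field type name here is illustrative:
{{{
    <fieldtype name="text_folded" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- maps characters like é, ö, ç to e, o, c -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
}}}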

 * !WordDelimiterFilter
There are a couple of extra options for this filter not covered in the book.  One is `stemEnglishPossessive`, which is either 1 to enable (the default) or 0.  When enabled it strips the trailing `'s` from words; for example, "O'Neil's" becomes "O", "Neil".  Another point is that this filter supports the same `protected` attribute as the stemmer filters, so you can exclude certain input tokens, listed in a configuration file, from word delimiter processing.
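A hypothetical filter declaration using both options might look like the following (the `protwords.txt` file name and the other attribute values are illustrative, not defaults you must use):
{{{
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        stemEnglishPossessive="0" protected="protwords.txt"/>
}}}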

=== Misc ===

 * copyField maxChars
The copyField directive in the schema can contain an optional `maxChars` attribute, which caps the number of characters copied.  This is useful when copying potentially large text fields into a catch-all searched field.
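For example, assuming fields named `body` and `text` exist in the schema, the following copies at most the first 10,000 characters:
{{{
    <copyField source="body" dest="text" maxChars="10000"/>
}}}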

 * !ExternalFileField
There is a field type you can use called `ExternalFileField` that only works when referenced in function queries.  As its name suggests, its data is in an external file instead of in the index.  It is only suitable for manipulating boosts of a document without re-indexing the document.  There are some [[http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html|rudimentary javadocs]], but you'll want to search [[http://www.lucidimagination.com/search/?q=ExternalFileField|Solr's mailing list]] for further info.
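A declaration in the spirit of the javadocs might look like this (the type and field names are illustrative; `keyField` should reference your unique key field):
{{{
    <fieldType name="file" class="solr.ExternalFileField"
        keyField="id" defVal="1" stored="false" indexed="false" valType="pfloat"/>
}}}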

== Chapter 3: Indexing Data ==

 * Duplicate detection
In some Solr usage situations you may need to prevent duplication when documents with the same content could get added.  This is called ''deduplication''.  It is not about your unique key field; it applies when some other text field(s) should be unique, perhaps the content of a crawled file.  This feature is [[Deduplication|documented on Solr's wiki]].
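As a sketch of what that wiki page describes, deduplication is configured as an update request processor chain in `solrconfig.xml`; the field names here are illustrative:
{{{
  <updateRequestProcessorChain name="dedupe">
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <!-- the fields whose combined content defines a duplicate -->
      <str name="fields">name,features,cat</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
}}}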

=== Automatically Committing ===
The book discusses how to explicitly commit added data. Solr can also be configured to automatically commit.  This feature is particularly useful when updating the index with changed data as it occurs externally. 

 * autoCommit

In `solrconfig.xml` there is an `<updateHandler>` configuration element.  Within it, the following XML appears commented out in the default configuration:
{{{
    <autoCommit> 
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime> 
    </autoCommit>
}}}
You can specify `maxDocs` and/or `maxTime` depending on your needs.  `maxDocs` simply sets a threshold: a commit happens once this many documents are not yet committed.  Most useful is `maxTime` (milliseconds), which essentially starts a count-down timer at the first document added after the previous commit; when it expires, a commit occurs automatically.  The one problem with these settings is that they can't be disabled on demand, which you might want for bulk index loads.  Instead, consider `commitWithin`, described below.

 * commitWithin

When submitting documents to Solr, you can include a "commitWithin" attribute placed on the `<add/>` XML element.  When >= 0, this tells Solr to perform a commit no later than this number of milliseconds relative to the time Solr finishes processing the data.  Essentially it acts as an override to solrconfig.xml / updateHandler / autoCommit / maxTime.
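For example, the following asks Solr to commit within five seconds of processing this add (the document fields are illustrative):
{{{
  <add commitWithin="5000">
    <doc>
      <field name="id">9885A004</field>
      <field name="name">Example document</field>
    </doc>
  </add>
}}}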

=== Misc ===

I'd like to simply re-emphasize that the book covered the `DataImportHandler` fairly lightly.  For the latest documentation, [[DataImportHandler|go to Solr's wiki]].

 * !ContentStreamDataSource
One unique way to use the `DataImportHandler` is with the `ContentStreamDataSource`.  It is like the `URLDataSource` except that instead of the DIH going out to fetch the XML (pull), the XML can be POSTed to the DIH from some other system (push).  Coupled with the DIH's XSLT support, this is fairly powerful.  The following is a snippet of `solrconfig.xml`, followed by an entire DIH configuration file, referencing this `DataSource` type and using XSL.
{{{
  <requestHandler name="/update/musicbrainz" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-musicbrainz-post-config.xml</str>
      <str name="optimize">false</str>
      <str name="clean">false</str>
      <str name="command">full-import</str>
    </lst>
  </requestHandler>
}}}
{{{
 <dataConfig>
  <dataSource type="ContentStreamDataSource" />
  <document>
    <entity name="mentity"
            xsl="xslt/musicbrains2solr.xsl"
            useSolrAddSchema="true"
            processor="XPathEntityProcessor">
    </entity>
  </document>
</dataConfig>
}}}

== Chapter 5: Enhanced Searching ==

 * QParserPlugin and !LocalParams syntax and subqueries
Another modification Solr makes to Lucene's query syntax is the use of {{{{!qparser name=value name2=value2} yourquery}}}. That is, at the very beginning of a query you can use this syntax to (optionally) indicate a different query parser and to (optionally) specify so-called "local params" name-value pairs, used for certain advanced cases.  [[SolrQuerySyntax|Solr's wiki]] has a bit more information on this. In addition, there is a `_query_` pseudo-field hack in the query syntax to support subqueries, which is useful together with the aforementioned QParserPlugin syntax to change the query type per subquery.  Aside from Solr's wiki, you will also find [[http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/|this blog post by Yonik]] enlightening.
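To illustrate (the field names and parameter values here are hypothetical), the first query switches to the dismax parser with a local param, and the second nests a dismax subquery inside a standard query via `_query_`:
{{{
  q={!dismax qf='t_name^2 t_body'}cherub rock
  q=t_name:rock AND _query_:"{!dismax qf=t_body}electric"
}}}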

=== Function queries ===

The main reference for function queries is [[http://wiki.apache.org/solr/FunctionQuery|here at Solr's wiki]].  The following are the ones not covered in the book:

 * sub(x,y)
Subtracts: x - y

 * query(subquery,default)
This one is a bit tough to understand. It yields the ''score'' for this document as found from the given sub-query, defaulting to the second argument if the document is not matched by that query.  There are some interesting examples on the wiki.

 * ms(), ms(x), ms(x,y)
The `ms` function deals with times in milliseconds since the common 1970 epoch.  Each argument either refers to a date field or is a literal (ex: 2000-01-01T00:00:00Z ).  Without arguments it returns the current time.  One argument returns the time referenced, probably a field reference.  With two, it returns the difference `x-y`. This function is useful for boosting more recent documents higher. There is excellent information on this subject [[SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents|at the wiki]].
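For example, the recency-boosting recipe on that wiki page wraps `ms` in `recip` so that newer documents score higher; the date field name here is illustrative, and the constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year:
{{{
  q={!boost b=recip(ms(NOW,released_date),3.16e-11,1,1)}your query
}}}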

 * Function Range Queries
Function queries can also be used to filter searches.  Using the `frange` QParserPlugin, you specify a numeric range applied to the given function query.  This advanced technique is best described in
[[http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/|Yonik's blog post]] at Lucid Imagination.
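An example in the spirit of that post, filtering to documents whose combined rankings fall within a range (the field names are hypothetical):
{{{
  fq={!frange l=0 u=2.2}sum(user_ranking,editor_ranking)
}}}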

== Chapter 6: Search Components ==

=== Clustering Component ===

This is a Solr "contrib" module and was incorporated into the Solr 1.4 distribution around the time of the book's release.  This component will "cluster" the search results based on statistical similarity of terms.  It uses the [[http://project.carrot2.org|Carrot2]] open-source project as the implementation of the underlying algorithm.  Clustering is useful for large text-heavy indexes, especially when there is little or no structural information for faceting.

More details: [[ClusteringComponent]]

== Chapter 7: Deployment == 

 * XInclude
The `solrconfig.xml` file can be broken up into pieces and then included using the [[http://www.w3.org/TR/xinclude/|XInclude]] spec. An example of this is the following line:
{{{ <xi:include href="solr/conf/solrconfig_master.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/> }}}
This is particularly useful when there are multiple Solr cores that require only slightly different configurations. The common parts could be put into a file that is included into each config.  There is [[SolrConfigXml#XInclude|more information about this]] at Solr's wiki.

== Chapter 8: Integrating Solr ==

 * !VelocityResponseWriter

Solr incorporates a contrib module called [[VelocityResponseWriter]] (AKA Solritas).  By using a special request handler, you can rapidly construct user web front-ends using the [[http://velocity.apache.org/|Apache Velocity]] templating system. It isn't intended for building production sites, just proofs of concept.

 * AJAX-Solr forks from SolrJs
[[http://wiki.github.com/evolvingweb/ajax-solr|AJAX Solr]] is another option for browser-side JavaScript integration with Solr. Unlike SolrJs (from which it derives), AJAX-Solr is not tied to jQuery or, for that matter, any other JavaScript framework.

 * Native PHP support
PHP5 now has a [[http://us3.php.net/manual/en/book.solr.php|client API]] for interacting with Solr.

== Chapter 9: Scaling Solr ==

 * partial optimize
If the index is so large that optimizes take longer than desired, or use more disk space during optimization than you can spare, consider adding the `maxSegments` parameter to the optimize command.  In the XML message this is an attribute; the URL form and SolrJ have a corresponding option too.  By default this parameter is 1, since an optimize results in a single Lucene "segment".  By setting it larger than 1 but less than the `mergeFactor`, you permit a partial optimization down to no more than this many segments.  Of course the index won't be fully optimized, so searches will be slower.
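For example, this XML update message requests a partial optimization down to at most 16 segments:
{{{
  <optimize maxSegments="16"/>
}}}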