Posted to commits@lucene.apache.org by "Steve Rowe (Confluence)" <co...@apache.org> on 2013/10/01 21:37:00 UTC

[CONF] Apache Solr Reference Guide > UIMA Integration

Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: UIMA Integration (https://cwiki.apache.org/confluence/display/solr/UIMA+Integration)

Change Comment:
---------------------------------------------------------------------
apache-solr-* -> solr-*

Edited by Steve Rowe:
---------------------------------------------------------------------
You can integrate the Apache Unstructured Information Management Architecture ([UIMA|https://uima.apache.org/]) with Solr. UIMA lets you define custom pipelines of Analysis Engines that incrementally add metadata to your documents as annotations.

For more information about Solr UIMA integration, see [https://wiki.apache.org/solr/SolrUIMA].

h2. Configuring UIMA

The SolrUIMA UpdateRequestProcessor is a custom update request processor that takes documents being indexed, sends them to a UIMA pipeline, and then returns the documents enriched with the specified metadata. To configure UIMA for Solr, follow these steps:

# Copy {{solr-uima-4.x.y.jar}} (under {{/solr-4.x.y/dist/}}) and its libraries (under {{contrib/uima/lib}}) to a Solr libraries directory, or set {{<lib/>}} tags in {{solrconfig.xml}} appropriately to point to those jar files: \\
\\
{code:xml|borderStyle=solid|borderColor=#666666}
<lib dir="../../contrib/uima/lib" />
<lib dir="../../dist/" regex="solr-uima-\d.*\.jar" />
{code}
# Modify {{schema.xml}} to add the metadata fields you want, specifying appropriate values for the {{type}}, {{indexed}}, {{stored}}, and {{multiValued}} options. For example: \\
\\
{code:xml|borderStyle=solid|borderColor=#666666}
<field name="language" type="string" indexed="true" stored="true" required="false"/>
<field name="concept" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
<field name="sentence" type="text" indexed="true" stored="true" multiValued="true" required="false" />
{code} \\
# Add the following snippet to {{solrconfig.xml}}: \\
\\
{code:xml|borderStyle=solid|borderColor=#666666}
  <updateRequestProcessorChain name="uima">
    <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
      <lst name="uimaConfig">
        <lst name="runtimeParameters">
          <str name="keyword_apikey">VALID_ALCHEMYAPI_KEY</str>
          <str name="concept_apikey">VALID_ALCHEMYAPI_KEY</str>
          <str name="lang_apikey">VALID_ALCHEMYAPI_KEY</str>
          <str name="cat_apikey">VALID_ALCHEMYAPI_KEY</str>
          <str name="entities_apikey">VALID_ALCHEMYAPI_KEY</str>
          <str name="oc_licenseID">VALID_OPENCALAIS_KEY</str>
        </lst>
        <str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str>
        <!-- Set to true if you want to continue indexing even if text processing fails.
             Default is false. That is, Solr throws a RuntimeException and
             none of the documents in your session are indexed. -->
        <bool name="ignoreErrors">true</bool>
        <!-- This is optional. It is used for logging when text processing fails.
             If logField is not specified, uniqueKey will be used as logField.
        <str name="logField">id</str>
        -->
        <lst name="analyzeFields">
          <bool name="merge">false</bool>
          <arr name="fields">
            <str>text</str>
          </arr>
        </lst>
        <lst name="fieldMappings">
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.concept.ConceptFS</str>
            <lst name="mapping">
              <str name="feature">text</str>
              <str name="field">concept</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
            <lst name="mapping">
              <str name="feature">language</str>
              <str name="field">language</str>
            </lst>
          </lst>
          <lst name="type">
            <str name="name">org.apache.uima.SentenceAnnotation</str>
            <lst name="mapping">
              <str name="feature">coveredText</str>
              <str name="field">sentence</str>
            </lst>
          </lst>
        </lst>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>
{code}
\\
{note}
{{VALID_ALCHEMYAPI_KEY}} is your AlchemyAPI Access Key. You need to register an AlchemyAPI Access key to use AlchemyAPI services: [http://www.alchemyapi.com/api/register.html]. \\
\\
{{VALID_OPENCALAIS_KEY}} is your Calais Service Key. You need to register a Calais Service key to use the Calais services: [http://www.opencalais.com/apikey]. \\
\\
{{analysisEngine}} must be the path, within the classpath, of an Analysis Engine (AE) descriptor. \\
\\
{{analyzeFields}} must contain the input fields that need to be analyzed by UIMA. If {{merge=true}} then their content will be merged and analyzed only once. \\
\\
{{fieldMappings}} describes which features of which UIMA types are written to which Solr fields.
{note}
\\
# In your {{solrconfig.xml}} replace the existing default UpdateRequestHandler or create a new UpdateRequestHandler: \\
{code:xml|borderStyle=solid|borderColor=#666666}
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">uima</str>
    </lst>
  </requestHandler>
{code}
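
As a hedged illustration of the {{analyzeFields}} note in step 3, the fragment below sketches a variant that merges two input fields so that the UIMA pipeline analyzes their combined content once. The {{title}} field is an assumption made for this example and does not appear in the configuration above; substitute fields that exist in your own schema.

{code:xml|borderStyle=solid|borderColor=#666666}
<lst name="analyzeFields">
  <!-- true: merge the content of the listed fields and analyze it only once -->
  <bool name="merge">true</bool>
  <arr name="fields">
    <!-- "title" is a hypothetical field used only for illustration -->
    <str>title</str>
    <str>text</str>
  </arr>
</lst>
{code}

This block would take the place of the {{analyzeFields}} element inside {{uimaConfig}}. With {{merge}} left at {{false}}, as in step 3, the content of the listed fields is not merged before analysis.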

Once the configuration is complete, your documents are automatically enriched with the specified fields when you index them.
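
To try the chain, you could post a document such as the following to the {{/update}} handler configured above. This is a minimal sketch in Solr's XML update format. It assumes that {{id}} is your schema's uniqueKey and that the {{text}} field named in {{analyzeFields}} exists in your schema. The field values are invented for illustration.

{code:xml|borderStyle=solid|borderColor=#666666}
<add>
  <doc>
    <!-- "id" is assumed to be the uniqueKey field -->
    <field name="id">uima-example-1</field>
    <!-- "text" is listed under analyzeFields, so its content is sent to the UIMA pipeline -->
    <field name="text">Apache Solr is an open source search platform. It is built on Apache Lucene.</field>
  </doc>
</add>
{code}

If the analysis succeeds, the indexed document should also carry whatever {{language}}, {{concept}}, and {{sentence}} values the pipeline produces. A commit is still needed before the document becomes searchable.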

{scrollbar}

