Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/02/12 05:40:04 UTC

[Solr Wiki] Update of "AnalysisRequestHandler" by GrantIngersoll

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/AnalysisRequestHandler

New page:
<!> ["Solr1.3"]

The AnalysisRequestHandler is a RequestHandler that takes documents as input and returns, for each field, the tokens that analysis produces.

It is available via [https://issues.apache.org/jira/browse/SOLR-477]

Input is very similar to UpdateXmlMessages: a POST can contain one or more <doc> elements, as in

{{{
<docs>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</docs>
}}}

The name of the enclosing docs tag can be anything; it need not be docs.  In fact, you could send an <add> message to the AnalysisRequestHandler and it should work just fine.
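
As a minimal sketch of how a client might assemble such a request body, the Python below serializes documents into the XML shape shown above. The function name `build_analysis_request` and its `root_tag` parameter are hypothetical names chosen for illustration; they are not part of Solr.

```python
import xml.etree.ElementTree as ET


def build_analysis_request(docs, root_tag="docs"):
    """Serialize documents into the XML body described above.

    Each document is a list of (field_name, value) pairs, so a
    multi-valued field such as "skills" can appear more than once.
    root_tag is a hypothetical parameter: the page says the enclosing
    tag name is not significant, so it is left configurable here.
    """
    root = ET.Element(root_tag)
    for fields in docs:
        doc = ET.SubElement(root, "doc")
        for name, value in fields:
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


body = build_analysis_request([[
    ("employeeId", "05991"),
    ("office", "Bridgewater"),
    ("skills", "Perl"),
    ("skills", "Java"),
]])
print(body)
```

The resulting string would then be POSTed to whatever path the handler is registered under in solrconfig.xml; that path is not given on this page, so it is left out of the sketch.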

The output will look something like:
{{{
<lst name="VDBDB1A16">
<arr name="id">
 <token start="0" end="9" positionInc="1" type="word" value="VDBDB1A16"/>
</arr>

<arr name="name">
 <token start="0" end="1" positionInc="1" type="word" value="a"/>
 <token start="2" end="6" positionInc="1" type="word" value="data"/>
 <token start="0" end="6" positionInc="0" type="word" value="adata"/>
 <token start="7" end="8" positionInc="1" type="word" value="v"/>
 <token start="9" end="15" positionInc="1" type="word" value="seri"/>
 <token start="7" end="15" positionInc="0" type="word" value="vseri"/>
 <token start="16" end="17" positionInc="1" type="word" value="1"/>
 <token start="17" end="19" positionInc="1" type="word" value="gb"/>
 <token start="20" end="23" positionInc="1" type="word" value="184"/>
 <token start="24" end="27" positionInc="1" type="word" value="pin"/>
 <token start="28" end="31" positionInc="1" type="word" value="ddr"/>
 <token start="32" end="37" positionInc="1" type="word" value="sdram"/>
 <token start="38" end="48" positionInc="1" type="word" value="unbuff"/>
 <token start="49" end="52" positionInc="1" type="word" value="ddr"/>
 <token start="53" end="56" positionInc="1" type="word" value="400"/>
 <token start="58" end="60" positionInc="1" type="word" value="pc"/>
 <token start="61" end="65" positionInc="1" type="word" value="3200"/>
 <token start="67" end="73" positionInc="1" type="word" value="system"/>
 <token start="74" end="80" positionInc="1" type="word" value="memori"/>
 <token start="83" end="86" positionInc="1" type="word" value="oem"/>
</arr>
<arr name="manu">
 <token start="0" end="1" positionInc="1" type="word" value="a"/>
 <token start="2" end="6" positionInc="1" type="word" value="data"/>
 <token start="0" end="6" positionInc="0" type="word" value="adata"/>
 <token start="7" end="17" positionInc="1" type="word" value="technolog"/>
 <token start="18" end="21" positionInc="1" type="word" value="inc"/>
</arr>
<arr name="cat">
 <token start="0" end="11" positionInc="1" type="word" value="electronics"/>
</arr>
<arr name="cat">
 <token start="0" end="6" positionInc="1" type="word" value="memory"/>
</arr>
<arr name="features">
 <token start="0" end="3" positionInc="1" type="word" value="cas"/>
 <token start="4" end="11" positionInc="1" type="word" value="latenc"/>
 <token start="12" end="13" positionInc="1" type="word" value="3"/>
 <token start="16" end="17" positionInc="1" type="word" value="2"/>
 <token start="18" end="19" positionInc="1" type="word" value="7"/>
 <token start="16" end="19" positionInc="0" type="word" value="27"/>
 <token start="19" end="20" positionInc="1" type="word" value="v"/>
</arr>
<arr name="popularity">
 <token start="0" end="1" positionInc="1" type="word" value="€#0;#5;"/>
</arr>
<arr name="inStock">
 <token start="0" end="1" positionInc="1" type="word" value="T"/>
</arr>
</lst>
}}}
This is wrapped in the various NamedList wrapper elements of the response.  The key point is that the name attribute of each <arr> tag specifies the field that was tokenized.  The name attribute of the top-level <lst> tag contains the value of the document's unique key field, in this case "VDBDB1A16".
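
To make the structure concrete, here is a rough Python sketch that reads one per-document <lst> element and lists the tokens of each <arr>. The function name `parse_analysis_doc` is hypothetical, the sample is a shortened copy of the output above, and counting positions from 0 is a convention assumed for this sketch (a token with positionInc="0" shares the position of the token before it).

```python
import xml.etree.ElementTree as ET


def parse_analysis_doc(lst_xml):
    """Return (unique_key, [(field_name, tokens), ...]) for one <lst>.

    Repeated <arr> elements for a multi-valued field (such as "cat")
    are kept as separate entries, in document order.
    """
    lst = ET.fromstring(lst_xml.strip())
    fields = []
    for arr in lst.findall("arr"):
        position = -1  # assumed convention: first token lands on 0
        tokens = []
        for tok in arr.findall("token"):
            # positionInc="0" leaves the token at the previous position
            position += int(tok.get("positionInc"))
            tokens.append({
                "value": tok.get("value"),
                "start": int(tok.get("start")),
                "end": int(tok.get("end")),
                "position": position,
            })
        fields.append((arr.get("name"), tokens))
    return lst.get("name"), fields


sample = """
<lst name="VDBDB1A16">
<arr name="manu">
 <token start="0" end="1" positionInc="1" type="word" value="a"/>
 <token start="2" end="6" positionInc="1" type="word" value="data"/>
 <token start="0" end="6" positionInc="0" type="word" value="adata"/>
 <token start="7" end="17" positionInc="1" type="word" value="technolog"/>
 <token start="18" end="21" positionInc="1" type="word" value="inc"/>
</arr>
<arr name="cat">
 <token start="0" end="11" positionInc="1" type="word" value="electronics"/>
</arr>
<arr name="cat">
 <token start="0" end="6" positionInc="1" type="word" value="memory"/>
</arr>
</lst>
"""

key, fields = parse_analysis_doc(sample)
for name, tokens in fields:
    print(key, name, [(t["value"], t["position"]) for t in tokens])
```

In the "manu" field, "data" and "adata" end up at the same position, which is how the positionInc="0" entries in the output above should be read.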