Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/02/12 05:40:04 UTC
[Solr Wiki] Update of "AnalysisRequestHandler" by GrantIngersoll
The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/AnalysisRequestHandler
New page:
<!> ["Solr1.3"]
The AnalysisRequestHandler is a RequestHandler that takes documents as input and returns, as output, the tokens produced by analyzing each field.
It is available via [https://issues.apache.org/jira/browse/SOLR-477]
Input is very similar to UpdateXmlMessages: a POST can contain one or more <doc> elements, as in:
{{{
<docs>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</docs>
}}}
The name of the outer <docs> tag is arbitrary; it need not be docs. In fact, you could send an <add> message to the AnalysisRequestHandler and it should work just fine.
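As a rough illustration of the input format, the following Python sketch builds such a message and defines a helper that POSTs it. The URL (including the /analysis path), the Content-Type header and the helper names build_docs_xml and post_for_analysis are assumptions made for this example; the actual path depends on how the handler is registered in solrconfig.xml.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the handler is registered under this path in solrconfig.xml.
ANALYSIS_URL = "http://localhost:8983/solr/analysis"


def build_docs_xml(docs, root_tag="docs"):
    """Build the XML payload: one <doc> per dict, one <field> per value."""
    root = ET.Element(root_tag)
    for doc in docs:
        doc_el = ET.SubElement(root, "doc")
        for name, values in doc.items():
            # A list value becomes repeated <field> elements (multi-valued field).
            if not isinstance(values, (list, tuple)):
                values = [values]
            for value in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(value)
    return ET.tostring(root, encoding="utf-8")


def post_for_analysis(payload, url=ANALYSIS_URL):
    """POST the payload and return the raw response body as bytes."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Same document as the example above; skills is multi-valued.
payload = build_docs_xml([
    {"employeeId": "05991", "office": "Bridgewater", "skills": ["Perl", "Java"]},
])
```

With a Solr instance running, post_for_analysis(payload) would send the message. Passing root_tag="add" to build_docs_xml produces the <add> variant mentioned above.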
The output will look something like:
{{{
<lst name="VDBDB1A16">
<arr name="id">
<token start="0" end="9" positionInc="1" type="word" value="VDBDB1A16"/>
</arr>
<arr name="name">
<token start="0" end="1" positionInc="1" type="word" value="a"/>
<token start="2" end="6" positionInc="1" type="word" value="data"/>
<token start="0" end="6" positionInc="0" type="word" value="adata"/>
<token start="7" end="8" positionInc="1" type="word" value="v"/>
<token start="9" end="15" positionInc="1" type="word" value="seri"/>
<token start="7" end="15" positionInc="0" type="word" value="vseri"/>
<token start="16" end="17" positionInc="1" type="word" value="1"/>
<token start="17" end="19" positionInc="1" type="word" value="gb"/>
<token start="20" end="23" positionInc="1" type="word" value="184"/>
<token start="24" end="27" positionInc="1" type="word" value="pin"/>
<token start="28" end="31" positionInc="1" type="word" value="ddr"/>
<token start="32" end="37" positionInc="1" type="word" value="sdram"/>
<token start="38" end="48" positionInc="1" type="word" value="unbuff"/>
<token start="49" end="52" positionInc="1" type="word" value="ddr"/>
<token start="53" end="56" positionInc="1" type="word" value="400"/>
<token start="58" end="60" positionInc="1" type="word" value="pc"/>
<token start="61" end="65" positionInc="1" type="word" value="3200"/>
<token start="67" end="73" positionInc="1" type="word" value="system"/>
<token start="74" end="80" positionInc="1" type="word" value="memori"/>
<token start="83" end="86" positionInc="1" type="word" value="oem"/>
</arr>
<arr name="manu">
<token start="0" end="1" positionInc="1" type="word" value="a"/>
<token start="2" end="6" positionInc="1" type="word" value="data"/>
<token start="0" end="6" positionInc="0" type="word" value="adata"/>
<token start="7" end="17" positionInc="1" type="word" value="technolog"/>
<token start="18" end="21" positionInc="1" type="word" value="inc"/>
</arr>
<arr name="cat">
<token start="0" end="11" positionInc="1" type="word" value="electronics"/>
</arr>
<arr name="cat">
<token start="0" end="6" positionInc="1" type="word" value="memory"/>
</arr>
<arr name="features">
<token start="0" end="3" positionInc="1" type="word" value="cas"/>
<token start="4" end="11" positionInc="1" type="word" value="latenc"/>
<token start="12" end="13" positionInc="1" type="word" value="3"/>
<token start="16" end="17" positionInc="1" type="word" value="2"/>
<token start="18" end="19" positionInc="1" type="word" value="7"/>
<token start="16" end="19" positionInc="0" type="word" value="27"/>
<token start="19" end="20" positionInc="1" type="word" value="v"/>
</arr>
<arr name="popularity">
<token start="0" end="1" positionInc="1" type="word" value="#0;#5;"/>
</arr>
<arr name="inStock">
<token start="0" end="1" positionInc="1" type="word" value="T"/>
</arr>
</lst>
}}}
This fragment is wrapped in the usual NamedList response wrappers. The key point is that the name attribute of each <arr> tag identifies the field that was tokenized; in the sample above, the cat field appears twice, once for each of its values. The name attribute of the top-level <lst> tag holds the value of the document's unique key field, in this case "VDBDB1A16".
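To make the structure concrete, here is a minimal Python sketch that reads one such <lst> element and groups the tokens by field. The function name tokens_by_field is hypothetical, the sample is an abbreviated copy of the output above, and finding the <lst> element inside the full response wrappers is left out. Positions are accumulated from positionInc, following the Lucene convention that an increment of 0 places a token at the same position as the previous one.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample output above (not a full response).
SAMPLE = """
<lst name="VDBDB1A16">
<arr name="id">
<token start="0" end="9" positionInc="1" type="word" value="VDBDB1A16"/>
</arr>
<arr name="manu">
<token start="0" end="1" positionInc="1" type="word" value="a"/>
<token start="2" end="6" positionInc="1" type="word" value="data"/>
<token start="0" end="6" positionInc="0" type="word" value="adata"/>
<token start="7" end="17" positionInc="1" type="word" value="technolog"/>
<token start="18" end="21" positionInc="1" type="word" value="inc"/>
</arr>
<arr name="cat">
<token start="0" end="11" positionInc="1" type="word" value="electronics"/>
</arr>
<arr name="cat">
<token start="0" end="6" positionInc="1" type="word" value="memory"/>
</arr>
</lst>
"""


def tokens_by_field(lst_element):
    """Return (unique_key, {field: [[(position, value), ...], ...]})."""
    fields = {}
    for arr in lst_element.findall("arr"):
        position = 0
        tokens = []
        for token in arr.findall("token"):
            # positionInc="0" keeps the token at the previous token's position.
            position += int(token.get("positionInc"))
            tokens.append((position, token.get("value")))
        # A field name can repeat (one <arr> per value), so keep a list per name.
        fields.setdefault(arr.get("name"), []).append(tokens)
    return lst_element.get("name"), fields


key, fields = tokens_by_field(ET.fromstring(SAMPLE.strip()))
```

Here key is the unique key taken from the <lst> name attribute, and fields["cat"] holds two token lists, one per <arr name="cat"> entry.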