Posted to solr-dev@lucene.apache.org by "Tricia Williams (JIRA)" <ji...@apache.org> on 2008/04/24 01:45:21 UTC

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591873#action_12591873 ] 

Tricia Williams commented on SOLR-380:
--------------------------------------

After a lengthy absence I've returned to this issue with a bit of a new perspective.  I recognize that what we have described really is a customization of Solr (albeit one I have seen in at least two organizations) and as such should be built as a plug-in (http://wiki.apache.org/solr/SolrPlugins) which can reside in your solr.home lib directory.  Now that Solr uses Lucene 2.3, which supports payloads, my solution is much easier to apply than before.

I'll try to explain it here and then attach the src, deployable jar, and example for your use/reuse.

I assume that your structured document can be represented by xml:

{code:xml}
<book title="One, Two, Three">
   <page label="1">one</page>
   <page label="2">two</page>
   <page label="3">three</page>
</book>
{code}
 
But we don't have a tokenizer that can make sense of XML.  So I wrote XmlPayloadWhitespaceTokenizer, a tokenizer which parallels the existing WhitespaceTokenizer.  XmlPayloadWhitespaceTokenizer extends XmlPayloadCharTokenizer, which does the same thing as CharTokenizer in Lucene but expects the content to be wrapped in XML tags.  The tokenizer keeps track of the xpath associated with each token and stores it as a payload.
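The actual tokenizer lives in the attached patch; purely as an illustration of the bookkeeping involved (the class and method names below are invented for this sketch, not taken from the patch), the xpath-per-token idea boils down to maintaining a stack of element steps while scanning, and pairing each whitespace-delimited word with the current path:

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only: pair each whitespace token with the xpath
// (element names plus attribute predicates) of its enclosing element.
public class XpathTokenSketch {

    // One token plus the xpath of the element that contains it.
    static final class Tok {
        final String text;
        final String xpath;
        Tok(String text, String xpath) { this.text = text; this.xpath = xpath; }
    }

    static List<Tok> tokenize(String xml) {
        List<Tok> out = new ArrayList<Tok>();
        Deque<String> path = new ArrayDeque<String>();
        // Alternation: an opening/closing tag, or a whitespace-delimited word.
        Matcher m = Pattern.compile("<(/?)([a-zA-Z]+)([^>]*)>|([^<\\s]+)").matcher(xml);
        while (m.find()) {
            if (m.group(4) != null) {                 // a word: record it with the current path
                out.add(new Tok(m.group(4), String.join("", path)));
            } else if (m.group(1).isEmpty()) {        // opening tag: push /name[attr='value']...
                StringBuilder step = new StringBuilder("/").append(m.group(2));
                Matcher a = Pattern.compile("(\\w+)=\"([^\"]*)\"").matcher(m.group(3));
                while (a.find()) {
                    step.append("[").append(a.group(1)).append("='").append(a.group(2)).append("']");
                }
                path.addLast(step.toString());
            } else {                                  // closing tag: pop
                path.removeLast();
            }
        }
        return out;
    }
}
{code}

In the real tokenizer the xpath string is serialized into the token's payload bytes rather than kept alongside it, but the stack-driven path tracking is the same idea.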

To use my tokenizer in Solr I add the deployable jar I created containing XmlPayloadWhitespaceTokenizer to my solr.home lib directory and add a structured text field type "text_st" to my schema.xml:
{code:xml}
<!-- A text field that uses the XmlPayloadWhitespaceTokenizer to store xpath info about the structured document -->
  <fieldType name="text_st" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.XmlPayloadWhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
{code}

I also add a field "fulltext_st" of type "text_st".

We can visualize what happens to the input text above using the Solr Admin web-app analysis.jsp modified by [SOLR-522|https://issues.apache.org/jira/browse/SOLR-522].

|term position|1|2|3|
|term text|one|two|three|
|term type|word|word|word|
|source start,end|3,6|7,10|11,16|
|payload|/book[title='One, Two, Three']/page[label='1']|/book[title='One, Two, Three']/page[label='2']|/book[title='One, Two, Three']/page[label='3']|

~Note that I've removed the hex representation of the payload for clarity~

The other side of this problem is how to present the results in a meaningful way.  Taking FacetComponent and HighlightComponent as my muse, I created a pluggable [SearchComponent|http://wiki.apache.org/solr/SearchComponent] called PayloadComponent.  This component recognizes two parameters: "payload" and "payload.fl".  If payload=true, the component will find the terms from your query in the payload.fl field, retrieve the payload from these tokens, and re-combine this information to report the xpath of each search result in a given document and the number of times that term occurs at the given xpath.
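As a rough sketch of just the re-combination step (hypothetical names, not the actual PayloadComponent code), once you have a (document id, xpath payload) pair for each matching term occurrence, producing the counts shown in the response below is a simple grouped tally:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: fold (doc id, xpath) pairs into per-document,
// per-xpath occurrence counts, mirroring the <lst name="payload_context"> layout.
public class PayloadCounts {

    // hits[i][0] = document id, hits[i][1] = xpath payload of one matching term
    static Map<String, Map<String, Integer>> aggregate(String[][] hits) {
        Map<String, Map<String, Integer>> byDoc = new LinkedHashMap<String, Map<String, Integer>>();
        for (String[] hit : hits) {
            byDoc.computeIfAbsent(hit[0], k -> new LinkedHashMap<String, Integer>())
                 .merge(hit[1], 1, Integer::sum);   // count occurrences at each xpath
        }
        return byDoc;
    }
}
{code}

The real component pulls the pairs out of the index via the term positions and their payloads; the aggregation is the easy part.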

Again, to use my SearchComponent in Solr I add the deployable jar I created containing PayloadComponent to my solr.home lib directory and add a search component "payload" to my solrconfig.xml:

{code:xml}
<searchComponent name="payload" class="org.apache.solr.handler.component.PayloadComponent"/>
 
  <requestHandler name="/search" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>payload</str>
    </arr>
  </requestHandler>
{code}

Then the result of http://localhost:8983/solr/search?q=came&payload=true&payload.fl=fulltext_st includes something like this:

{code:xml}
<lst name="payload">
 <lst name="payload_context">
  <lst name="Book.IA.0001">
   <lst name="fulltext_st">
    <int name="/book[title='Crooked Man'][url='http://ia310931.us.archive.org//load_djvu_applet.cgi?file=0/items/crookedmanotherr00newyiala/crookedmanotherr00newyiala.djvu'][author='unknown']/page[id='3']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.37729">
   <lst name="fulltext_st">
    <int name="/book[title='Charles Dicken's A Christmas Carol'][url=''][author='Dickens, Charles']/stave[title='Marley's Ghost'][id='One']/page[id='13']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.0002">
   <lst name="fulltext_st">
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='2']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='4']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='6']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='7']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='13']">1</int>
   </lst>
  </lst>
 </lst>
</lst>
{code}  

~The documents here are borrowed from the [Internet Archive|http://archive.org] and can be found in the xmlpayload-example.zip attached to this issue~

Then you have everything you need to write an XSL transform which takes your normal Solr results and supplements them with context from your structured document.

There may be some issues with filters that aren't payload aware.  The only one that has concerned me so far is the WordDelimiterFilter.  You can find a quick and easy patch at [SOLR-532|https://issues.apache.org/jira/browse/SOLR-532].

The other thing that you might run into if you use curl or post.jar is that the XmlUpdateRequestHandler is strict about well-formed XML, and throws an exception if it finds anything but the expected <doc> and <field> tags.  To work around this, either escape your structured document's xml like this:
{code:xml}
<add>
 <doc>
  <field name="id">0001</field>
  <field name="title">One, Two, Three</field>
  <field name="fulltext_st">
   &lt;book title="One, Two, Three"&gt;
    &lt;page label="1"&gt;one&lt;/page&gt;
    &lt;page label="2"&gt;two&lt;/page&gt;
    &lt;page label="3"&gt;three&lt;/page&gt;
   &lt;/book&gt;
  </field>
 </doc>
</add>
{code}
or hack XmlUpdateRequestHandler to accept your "unexpected XML tag doc/".
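If you script your posts, the escaping is trivial; a minimal sketch (hypothetical helper name, and it only escapes what element content requires, so attribute quotes are left alone):

{code:java}
// Hypothetical helper: escape a structured document so it can sit
// inside a <field> element of a Solr <add> message.
public class XmlEscape {

    static String escape(String s) {
        // Ampersand must be replaced first, or the later entities get mangled.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }
}
{code}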

Cool?

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.