Posted to solr-dev@lucene.apache.org by "Tricia Williams (JIRA)" <ji...@apache.org> on 2007/11/12 08:43:50 UTC

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: lucene-core-2.3-dev.jar
                SOLR-380-XmlPayload.patch

This is a draft.  Note that the Payload and Token classes in particular have changed since lucene-core-2.2.0.jar, so users of this patch will need to replace lucene-core-2.2.0.jar with lucene-core-2.3-dev.jar.  I have created a test for XmlPayloadCharTokenizer but have not attached it here, because LuceneTestCase is not on Solr's classpath in any form and the test would break the build.

 The code works in theory and passes tests to that effect.  However, in practice, when I deploy the war built by the "dist" ant target, several problems appear after adding documents (adding itself seems to work, using post.jar with a <![CDATA[...]]> section to contain the structured document):

 * after adding an XmlPayload tokenized document, q=*:* causes a 500 error:
     HTTP Status 500 - read past EOF
     java.io.IOException: read past EOF
       at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
       at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
       at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
       at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:153)
       at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:408)
       at org.apache.lucene.index.MultiSegmentReader.document(MultiSegmentReader.java:129)
       at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
       at ...
 * use of the highlight fields produces the same error.
 * queries that should match an XmlPayload tokenized document do not (//result[@numFound='0']), though queries matching documents that were not XmlPayload tokenized continue to return the expected results.
 * trying to view the index using Luke (Lucene Index Toolbox, v 0.7.1 (2007-06-20) ) returns: Unknown format version: -4
 * Solr Statistics confirm that all the documents have been added.


I will continue working to finish this functionality, but any suggestions or other input are welcome.  The intended usage is shown in src/test/org/apache/solr/highlight/XmlPayloadTest.java

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>
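
The lookup step in the issue description (mapping a hit's term position back to the page that contains it) can be illustrated with a small, self-contained sketch. This is not code from the attached patch: the class name PageMap and its methods are hypothetical, and the positions for pages 233 and 235 are made-up values for illustration. Only page 234 starting at term 14324, and the hit at position 14325, come from the description above. The sketch assumes the map records the first term position of each page and that a hit belongs to the page with the greatest first-term position not exceeding it.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of SOLR-380-XmlPayload.patch. */
public class PageMap {
    // Key: term position of the first term on a page. Value: that page's id.
    private final TreeMap<Integer, String> firstTermToPage = new TreeMap<>();

    /** Records a page milestone such as <page id="234" firstterm="14324"/>. */
    public void addPage(String pageId, int firstTerm) {
        firstTermToPage.put(firstTerm, pageId);
    }

    /**
     * Returns the id of the page containing the given term position,
     * or null if the position precedes the first recorded page.
     */
    public String pageFor(int termPosition) {
        // floorEntry finds the greatest key less than or equal to termPosition.
        Map.Entry<Integer, String> entry = firstTermToPage.floorEntry(termPosition);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        PageMap map = new PageMap();
        map.addPage("233", 14100); // assumed value, for illustration
        map.addPage("234", 14324); // from the issue description
        map.addPage("235", 14610); // assumed value, for illustration
        System.out.println(map.pageFor(14325)); // prints 234
    }
}
```

A sorted map keeps each lookup logarithmic in the number of pages. How the map would actually be serialized into the unindexed field is left open here, as it is in the description.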

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.