Posted to solr-dev@lucene.apache.org by "Tricia Williams (JIRA)" <ji...@apache.org> on 2007/10/18 00:14:50 UTC
[jira] Issue Comment Edited: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ] 

pgwillia edited comment on SOLR-380 at 10/17/07 3:13 PM:
----------------------------------------------------------------

The discussion at http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (more of a workaround, in my opinion), but I don't think it is practical.  The number of pages in the monographs we index varies greatly (from tens to thousands of pages).  So while specifying each page_* field (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution, because you have to infer page numbers from the highlighted samples.  Furthermore, to get the highlighted samples you need to know every value of the * in the dynamic field, which rather defeats the purpose of a dynamic field.  And if you wanted to use the position numbers themselves (for example, combining positions with OCR information to highlight hits on an original page image), they are not available in the results.
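
To make the workaround concrete, here is a minimal sketch of how a client would have to assemble the highlighting parameters. It assumes per-page fields named fulltext_1 ... fulltext_N and a page count known in advance; the helper name highlight_params is hypothetical, and only the q, hl and hl.fl parameters are shown.

```python
from urllib.parse import urlencode


def highlight_params(query, page_count, prefix="fulltext_"):
    """Build select parameters that highlight every per-page field.

    Hypothetical helper: the field prefix and the page count are assumptions
    about how the documents were indexed.
    """
    # hl.fl is a comma-separated field list, so each page field is named explicitly.
    fields = ",".join(f"{prefix}{n}" for n in range(1, page_count + 1))
    return {"q": query, "hl": "on", "hl.fl": fields}


params = highlight_params("employ", 9)
url = "http://localhost:8080/solr/select?" + urlencode(params)
print(url)
```

The client has to know the page count of every document before it can ask for snippets, which is the impracticality described above for monographs with thousands of pages.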

In answer to your question, Peter: one must enable highlighting and list all the page_* fields in hl.fl to get highlighter snippets.  In the following example I have a dynamic field fulltext_*, a copyField into fulltext, and defaultSearchField=fulltext.  The request
http://tinyurl.com/3xdshk
(the link essentially shows the parameters and their values for this example; pay attention to the hl.fl parameter)
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlighting section.  From this point it would take a little work to parse the name attribute of each arr element (arr/@name) to recover the * from fulltext_* for each matching document.
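
As a rough sketch of that parsing step, the following uses Python's standard xml.etree.ElementTree. The sample response is an abbreviated stand-in for the output above, wrapped in a <response> root; the function name pages_with_hits and the fulltext_ prefix default are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the highlighting section of a response.
RESPONSE = """
<response>
  <lst name="highlighting">
    <lst name="News.EFP.186500">
      <arr name="fulltext_1"><str>was <em>employed</em> on the G. T. R.</str></arr>
      <arr name="fulltext_4"><str>is <em>employed</em> in Windsor</str></arr>
      <arr name="fulltext_6"><str><em>employed</em> at the Walkerville brewery</str></arr>
      <arr name="fulltext_7"><str>agreement to <em>employ</em> the powerful tug</str></arr>
    </lst>
  </lst>
</response>
""".strip()


def pages_with_hits(response_xml, prefix="fulltext_"):
    """Map each document id to the sorted page numbers that have snippets."""
    root = ET.fromstring(response_xml)
    hits = {}
    for doc in root.findall(".//lst[@name='highlighting']/lst"):
        pages = []
        for arr in doc.findall("arr"):
            name = arr.get("name", "")
            suffix = name[len(prefix):]
            # Keep only fields of the form <prefix><number>.
            if name.startswith(prefix) and suffix.isdigit():
                pages.append(int(suffix))
        hits[doc.get("name")] = sorted(pages)
    return hits


print(pages_with_hits(RESPONSE))  # {'News.EFP.186500': [1, 4, 6, 7]}
```

This recovers page numbers only. The term positions needed for highlighting on a page image are still not available this way.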

I agree that the highlighter is a good model of what we want to do.  The difficulty I'm finding is the up-front part: we need to store the position-to-page mapping in one field while, at the same time, analyzing the full page text into another field for searching.

I don't think defining a FieldType will allow us to do this.  A FieldType looks useful for controlling how a field's value is output (write()) and how it is sorted, but not how Fields of that FieldType are indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?

      was (Author: pgwillia):
    The discussion at http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (more of a workaround, in my opinion), but I don't think it is practical.  The number of pages in the monographs we index varies greatly (from tens to thousands of pages).  So while specifying each page_* field (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution, because you have to infer page numbers from the highlighted samples.  Furthermore, to get the highlighted samples you need to know every value of the * in the dynamic field, which rather defeats the purpose of a dynamic field.  And if you wanted to use the position numbers themselves (for example, combining positions with OCR information to highlight hits on an original page image), they are not available in the results.

In answer to your question, Peter: one must enable highlighting and list all the page_* fields in hl.fl to get highlighter snippets.  In the following example I have a dynamic field fulltext_*, a copyField into fulltext, and defaultSearchField=fulltext.  The request
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlighting section.  From this point it would take a little work to parse the name attribute of each arr element (arr/@name) to recover the * from fulltext_* for each matching document.

I agree that the highlighter is a good model of what we want to do.  The difficulty I'm finding is the up-front part: we need to store the position-to-page mapping in one field while, at the same time, analyzing the full page text into another field for searching.

I don't think defining a FieldType will allow us to do this.  A FieldType looks useful for controlling how a field's value is output (write()) and how it is sorted, but not how Fields of that FieldType are indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?
  
> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>
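
The proposal quoted above can be illustrated with a small standalone sketch. This is not Solr code: whitespace splitting stands in for Solr's tokenizers and filters, and the names build_page_map and page_for_position are hypothetical. It only shows the idea of recording the first term position of each page while tokenizing, then mapping hit positions back to page ids with a binary search.

```python
import bisect
import re

# Page milestone as described in the proposal: <page id="234"/>
MILESTONE = re.compile(r'<page id="(\d+)"/>')


def build_page_map(text):
    """Return (tokens, page_map); page_map holds (first_term_position, page_id)."""
    tokens, page_map = [], []
    # Splitting on a pattern with one group alternates text chunks and page ids.
    for i, part in enumerate(MILESTONE.split(text)):
        if i % 2 == 1:
            # A milestone: the next token emitted starts this page.
            page_map.append((len(tokens), int(part)))
        else:
            # Assumption: whitespace tokenization in place of a real analyzer.
            tokens.extend(part.split())
    return tokens, page_map


def page_for_position(page_map, position):
    """Find the page whose first term position is the last one <= position."""
    starts = [first for first, _ in page_map]
    index = bisect.bisect_right(starts, position) - 1
    return page_map[index][1] if index >= 0 else None


text = ('<page id="234"/>the tug was employed '
        '<page id="235"/>on the river '
        '<page id="236"/>employed again')
tokens, page_map = build_page_map(text)
hit_positions = [i for i, tok in enumerate(tokens) if tok == "employed"]
hit_pages = sorted({page_for_position(page_map, p) for p in hit_positions})
print(page_map)   # [(0, 234), (4, 235), (7, 236)]
print(hit_pages)  # [234, 236]
```

Because the map is sorted by first term position, each lookup is a binary search, so resolving the hits returned for one request stays cheap even for documents with thousands of pages.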
