Posted to solr-dev@lucene.apache.org by "Tricia Williams (JIRA)" <ji...@apache.org> on 2007/10/16 05:51:50 UTC

[jira] Created: (SOLR-380) The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: SOLR-380
                 URL: https://issues.apache.org/jira/browse/SOLR-380
             Project: Solr
          Issue Type: New Feature
          Components: search
            Reporter: Tricia Williams
            Priority: Minor


"Paged-Text" FieldType for Solr
> 
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
> 
> The input would contain page milestones: <page id="234"/>. As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page: <page id="234"
> firstterm="14324"/>. This map would be stored in an unindexed field in
> some efficient format.
> 
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
> 
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>
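
The search-time step described above (resolving a hit's term position to a page id through the stored map) can be sketched as a binary search over the page start positions. This is only an illustration in Python, not code from Solr or from any patch on this issue: `PAGE_MAP`, `page_for_position`, and `pages_for_hits` are hypothetical names, and every map entry except page 234 / firstterm 14324 is invented.

```python
from bisect import bisect_right

# Hypothetical page map: (first term position, page id), sorted by position.
# Only the 14324 -> "234" entry comes from the issue text; the rest are made up.
PAGE_MAP = [(0, "1"), (14324, "234"), (14390, "235"), (14460, "236")]


def page_for_position(page_map, pos):
    """Return the id of the last page whose first term is at or before pos."""
    starts = [start for start, _ in page_map]
    idx = bisect_right(starts, pos) - 1
    if idx < 0:
        raise ValueError(f"position {pos} precedes the first page")
    return page_map[idx][1]


def pages_for_hits(page_map, positions):
    """Group hit positions by page id, visiting positions in ascending order."""
    grouped = {}
    for pos in sorted(positions):
        grouped.setdefault(page_for_position(page_map, pos), []).append(pos)
    return grouped
```

For example, `pages_for_hits(PAGE_MAP, [14470, 14325])` groups position 14325 under page "234" and 14470 under page "236", which is the shape of the proposed "pages" and "hitpos" lists.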

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Binkley updated SOLR-380:
-------------------------------

    Description: 
"Paged-Text" FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
  ...
</lst>
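
The index-time half of the proposal (building the structural map while tokens are processed) might look roughly like the following Python sketch. It is an assumption-laden illustration, not the attached implementation: whitespace splitting stands in for Solr's tokenizers and filters, positions are counted from zero, and `build_page_map` and `to_xml` are hypothetical helper names.

```python
import re

# Matches a page milestone such as <page id="234"/>.
MILESTONE = re.compile(r'<page\s+id="([^"]+)"\s*/>')


def build_page_map(text):
    """Tokenize text and record the first term position of each page.

    Returns (tokens, page_map), where page_map is a list of
    (page id, first term position) pairs in document order.
    """
    tokens, page_map = [], []
    cursor = 0
    for match in MILESTONE.finditer(text):
        # Tokens before the milestone belong to the previous page.
        tokens.extend(text[cursor:match.start()].split())
        # The next token to be emitted is the first term of this page.
        page_map.append((match.group(1), len(tokens)))
        cursor = match.end()
    tokens.extend(text[cursor:].split())
    return tokens, page_map


def to_xml(page_map):
    """Serialize the map in the <page id=... firstterm=.../> form used above."""
    return "\n".join(
        f'<page id="{pid}" firstterm="{first}"/>' for pid, first in page_map
    )
```

A real analyzer chain may drop or inject tokens (stop words, synonyms), so the positions recorded here would only line up with the index if the map were built from the same token stream that gets indexed.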


formatted the xml for clarity


[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-380:
---------------------------------------

    Fix Version/s: 1.4

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: xmlpayload-example.zip

xmlpayload-example.zip contains a specialized version of the Solr example to demonstrate the plugins.

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Issue Comment Edited: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ] 

pgwillia edited comment on SOLR-380 at 10/17/07 3:13 PM:
----------------------------------------------------------------

The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (more of a workaround, in my opinion), but I don't think it is practical.  The number of pages in the monographs we index varies greatly (10s to 1000s of pages).  So while specifying each page_* field (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution, because you have to infer page numbers from the highlighted samples.  Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field.  If you wanted to use the position numbers themselves (for example, using positions and OCR information to create highlighting on an original image), they are not available in the results.
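
To make the cost of this workaround concrete, here is a small Python sketch of how a client would have to assemble the request. `hl` and `hl.fl` are Solr's highlighting parameters; the helper names, the base URL, and the `fulltext_` prefix are assumptions taken from the local example in this comment and would differ elsewhere.

```python
from urllib.parse import urlencode


def highlight_params(query, page_count, prefix="fulltext_"):
    """Build select parameters that list every per-page field in hl.fl.

    The caller must already know page_count, which is exactly the
    knowledge a dynamic field is supposed to make unnecessary.
    """
    fields = ",".join(f"{prefix}{n}" for n in range(1, page_count + 1))
    return {"q": query, "hl": "on", "hl.fl": fields}


def select_url(base, params):
    """Append URL-encoded parameters to a (hypothetical) select endpoint."""
    return f"{base}?{urlencode(params)}"
```

For a 1000-page monograph the `hl.fl` value alone would run to several thousand characters, which is part of why this does not scale.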

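In answer to your question, Peter: one must enable highlighting and list all the page_* fields for highlighter snippets.  In the following example I have a dynamic field fulltext_*, a copyField into fulltext, and defaultSearchField=fulltext:
http://tinyurl.com/3xdshk
(essentially shows the parameters and their values for this example -- pay attention to the hl.fl parameter; the full request is
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9)
gives the normal results, with the following at the end: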

<lst name="highlighting">
  <lst name="News.EFP.186500">
    <arr name="fulltext_1">
      <str>
        was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
      </str>
    </arr>
    <arr name="fulltext_4">
      <str>
        ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
      </str>
    </arr>
    <arr name="fulltext_6">
      <str>
        <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
      </str>
    </arr>
    <arr name="fulltext_7">
      <str>
        . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
      </str>
    </arr>
  </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight section.  From this point it would take a little work to parse the /arr[@name] to get the * from fulltext_* for each document match.
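
The "little work" of recovering page numbers from the highlighting section could be sketched in Python as below. This assumes the response fragment is well-formed XML shaped like the example above and that the suffix after `fulltext_` is always an integer; `pages_with_hits` and the abbreviated `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the highlighting section, with snippet text elided.
SAMPLE = """<lst name="highlighting">
  <lst name="News.EFP.186500">
    <arr name="fulltext_1"><str>...</str></arr>
    <arr name="fulltext_4"><str>...</str></arr>
    <arr name="fulltext_6"><str>...</str></arr>
    <arr name="fulltext_7"><str>...</str></arr>
  </lst>
</lst>"""


def pages_with_hits(highlighting_xml, prefix="fulltext_"):
    """Map each document id to the sorted page numbers that have snippets."""
    root = ET.fromstring(highlighting_xml)
    result = {}
    for doc in root.findall("lst"):
        pages = [
            int(arr.get("name")[len(prefix):])  # strip the prefix, keep the *
            for arr in doc.findall("arr")
            if arr.get("name", "").startswith(prefix)
        ]
        result[doc.get("name")] = sorted(pages)
    return result
```

This yields page numbers only; term positions within the page are still unavailable, which is the gap the issue is trying to close.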

I agree that the highlighter is a good model of what we want to do.  But the difficulty I'm finding is the upfront part where we need to store the position to page mapping in a field while at the same time we need to analyze the full page text into another field for searching.  

I don't think defining a FieldType will allow us to do this.  The FieldType looks like it is useful in controlling what the output of your defined field is (write()), and how it is sorted, but not how Fields with your FieldType will be indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?

  

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535296 ] 

Peter Binkley commented on SOLR-380:
------------------------------------

The problem with the page-as-SolrDocument approach is that you then have to group the pages back under their container documents to present a unified result to the user (like this: http://tinyurl.com/yt2a25 ). I want the primary unit of granularity in search results to be the book, and the pages to be only a secondary layer. I also want to be able to do proximity searches that bridge page boundaries, have relevance ranking consider the whole book text and not just that page, etc.: i.e. treat the text as continuous for searching purposes. So I gain a lot by treating the book as the SolrDocument; I just need that extra bit of work to resolve the page positions to have it all.
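
As a rough illustration of the regrouping step that the page-as-SolrDocument approach would push onto the client, consider this Python sketch. The hit dictionaries and their keys (`book_id`, `page`, `score`) are hypothetical, not fields from any schema in this thread.

```python
def group_pages_by_book(page_hits):
    """Collapse page-level hits (ordered by score) into one entry per book.

    Books appear in the order of their first (best-scoring) page hit.
    """
    books = {}
    for hit in page_hits:
        entry = books.setdefault(
            hit["book_id"], {"best_score": hit["score"], "pages": []}
        )
        entry["best_score"] = max(entry["best_score"], hit["score"])
        entry["pages"].append(hit["page"])
    return books
```

Grouping like this restores the book-level presentation, but it cannot restore cross-page proximity matches or whole-book relevance, since each page was scored in isolation.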


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535489 ] 

Erik Hatcher commented on SOLR-380:
-----------------------------------

> The idea was to use dynamic fields (e.g. page_1, page_2, page_3... page_N) to store the text of each page in a single document. The problem is that currently Solr does not support "glob" style field expansion in query parameters (e.g. qf=page_* ) so you would end up having to specify the entire list of page fields in your query, which is impractical. There is already an open issue related to this particular problem (SOLR-247) but nobody has had time to look into it.

In this case, a copyField from page_* into an unstored "contents" would do the trick, which would also facilitate querying across pages.  A position increment gap could also prohibit phrase queries across "pages", optionally.
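
The effect of a position increment gap can be shown with a toy model in Python. This is a simplified simulation, not Lucene code: each page is treated as one value copied into the combined field, whitespace splitting stands in for analysis, and `index_pages` and `phrase_match` are hypothetical names.

```python
def index_pages(pages, gap=0):
    """Assign term positions across consecutive pages.

    gap is added to the running position before the first token of every
    page after the first, mimicking a position increment gap between values.
    Returns a dict mapping each token to its list of positions.
    """
    positions = {}
    pos = 0
    for page_no, text in enumerate(pages):
        if page_no > 0:
            pos += gap
        for token in text.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions


def phrase_match(positions, phrase):
    """True if the words of phrase occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    return any(
        all(start + i in positions.get(word, []) for i, word in enumerate(words))
        for start in positions.get(words[0], [])
    )
```

With `gap=0` the phrase "tug kept" matches across the boundary between the pages "the powerful tug" and "kept the river open"; with `gap=100` it no longer does, while a phrase within a single page still matches either way.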


[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: xmlpayload-src.jar

xmlpayload-src.jar contains the source files, the JUnit test, and the Ant build file for these plugins.
{code}
jar xf xmlpayload-src.jar
{code}
will unpack this.



> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-src.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>
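
The structural map described in the issue can be sketched as a sorted list of (first term position, page id) pairs, with a binary search to turn a hit position into a page id. This is only an illustration of the idea, not the patch's implementation: the map contents, `PAGE_MAP`, `page_for_position` and `pages_for_hits` are all hypothetical names and values, apart from page 234 starting at term 14324, which comes from the description.

```python
from bisect import bisect_right

# Hypothetical structural map: (first term position, page id), sorted by position.
# Only the entry (14324, 234) comes from the issue description.
PAGE_MAP = [(0, 1), (14324, 234), (14810, 235), (15302, 236)]


def page_for_position(page_map, pos):
    """Return the id of the last page whose first term is at or before pos."""
    starts = [first for first, _ in page_map]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError(f"position {pos} precedes the first page")
    return page_map[i][1]


def pages_for_hits(page_map, positions):
    """Group hit positions by page id, in ascending position order."""
    result = {}
    for pos in sorted(positions):
        result.setdefault(page_for_position(page_map, pos), []).append(pos)
    return result
```

Calling `pages_for_hits(PAGE_MAP, [14325])` groups the hit at position 14325 under page 234, which mirrors the `hitpos` entry in the sample response.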

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: SOLR-380-XmlPayload.patch

Functionality is improved.  Tests are more complete.  I have included an example (much like the example included in Solr) which demonstrates the changes needed to solrconfig.xml and schema.xml, as well as some XML documents to start playing with.

TODO: 
 * Still have to track down what happens when filters are applied to the Tokenizer.
 * Implement error handling for bad XML input. 

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch
>
>

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: lucene-core-2.3-dev.jar
                SOLR-380-XmlPayload.patch

This is a draft.  Note that the Payload and Token classes in particular have changed since lucene-core-2.2.0.jar, so users of this patch will need to replace lucene-core-2.2.0.jar with lucene-core-2.3-dev.jar.  I have created a test for XmlPayloadCharTokenizer but have not attached it here, because LuceneTestCase is not on Solr's classpath in any form and the test would break the build.

 The code works in theory and passes tests to that effect.  In practice, however, when I deploy the war created by the "dist" ant target, several problems appear after adding documents (adding itself seems to work, using post.jar with a <![CDATA[...]]> section to contain the structured document):

 * after adding an XmlPayload tokenized document, q=*:* causes a 500 error:
     HTTP Status 500 - read past EOF
     java.io.IOException: read past EOF
       at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
       at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
       at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
       at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:153)
       at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:408)
       at org.apache.lucene.index.MultiSegmentReader.document(MultiSegmentReader.java:129)
       at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
       at ...
 * use of the highlight fields produces the same error.
 * queries that should match an XmlPayload tokenized document do not (//result[@numFound='0']), though queries matching documents that were not XmlPayload tokenized continue to return the expected results.
 * trying to view the index using Luke (Lucene Index Toolbox, v 0.7.1 (2007-06-20)) returns: Unknown format version: -4
 * Solr Statistics confirm that all the documents have been added.


I will continue working on this functionality, but any suggestions or other input are welcome.  The intended usage is shown in src/test/org/apache/solr/highlight/XmlPayloadTest.java

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535288 ] 

Peter Binkley commented on SOLR-380:
------------------------------------

I've been wondering about what's required to get this output added to the response. It appears that a response writer isn't the answer: those are for different formats (xml, json, etc.). Is everything we need included in the FieldType methods (write(), etc.)? The highlighting functionality is probably a good model for what we want to do.



> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535290 ] 

Ryan McKinley commented on SOLR-380:
------------------------------------

I don't totally understand how a field type solves your problem (I'm sure it can... I just don't quite follow).

But - If you want your search results to return pages, why not just index each page as a new SolrDocument?

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Pieter Berkel (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535426 ] 

Pieter Berkel commented on SOLR-380:
------------------------------------

There was a recent discussion surrounding a similar problem on solr-user:
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390

The idea was to use dynamic fields (e.g. page_1, page_2, page_3... page_N) to store the text of each page in a single document.  The problem is that currently Solr does not support "glob" style field expansion in query parameters (e.g. qf=page_* ) so you would end up having to specify the entire list of page fields in your query, which is impractical.  There is already an open issue related to this particular problem (SOLR-247) but nobody has had time to look into it.

Returning term position information seems (albeit loosely) related to highlighting. Is there any way you could use the existing highlighting functionality to achieve your goal? (It would definitely be a hack, though.)


> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment:     (was: lucene-core-2.3-dev.jar)

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837035#action_12837035 ] 

Chris Harris commented on SOLR-380:
-----------------------------------

This is an interesting patch. One current limitation seems to be that proximity search queries (PhraseQueries and SpanQueries) may result in false positives. For example, if I query

bq. "audit trail"~10

then I think I'd expect Solr to return only the page #s where audit and trail are near one another. (Yes, what I just said leaves some wiggle room for implementation details.) The current code, in contrast, looks like it will report all the pages where "audit" and "trail" occur, regardless of proximity to the other term.

Has anyone thought about how to add proximity awareness?
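
One way to picture proximity awareness: rather than reporting every page on which either term occurs, report only the pages touched by a pair of occurrences that lie within the slop of each other. The sketch below is an assumption-laden illustration, not code from the patch. `page_starts`, `page_of` and `proximity_pages` are hypothetical names, and the plain position-distance check is a simplification of how Lucene actually evaluates slop.

```python
from bisect import bisect_right


def page_of(page_starts, pos):
    """page_starts: sorted (first position, page id) pairs; return the page containing pos."""
    starts = [first for first, _ in page_starts]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError(f"position {pos} precedes the first page")
    return page_starts[i][1]


def proximity_pages(page_starts, positions_a, positions_b, slop):
    """Return pages holding an occurrence of A within `slop` positions of an occurrence of B."""
    pages = set()
    for a in positions_a:
        for b in positions_b:
            if abs(a - b) <= slop:  # simplified distance test, not Lucene's exact slop rule
                pages.add(page_of(page_starts, a))
                pages.add(page_of(page_starts, b))
    return sorted(pages)
```

With pages starting at positions 0, 100 and 200, "audit" at positions 5 and 150, and "trail" at positions 12 and 290, a slop of 10 yields only the first page, whereas reporting every occurrence of either term would list all three.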

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665195#action_12665195 ] 

Tricia Williams commented on SOLR-380:
--------------------------------------

Hi Laurent,

    Thanks for your interest in my Solr PayloadComponent plugin.  I want to address all of the questions you pose in your comment, but won't have time until early February.  I apologize for the inconvenience, but my priorities lie elsewhere right now.  Feel free to look at the code and play in the meantime.  The code that's up there is basically proof of concept.  I've been slowly working on improving the robustness and performance of the code, so hopefully there will be an improved version before the end of March.

    I'm sure there would be many people who would appreciate a Wiki page for this topic.  Why don't you go ahead and set that up?  I'll be happy to add my two cents when I'm available.

All the best,
Tricia

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795944#action_12795944 ] 

Lance Norskog commented on SOLR-380:
------------------------------------

Please ask this on solr-user.  Issues are for discussing implementations.

Lucene payloads are supported by Solr, and a rectangle per term can be stored as a payload. This allows the text to be indexed as a text field, and all queries including phrases will work as normal.
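
As a rough illustration of "a rectangle per term stored as a payload", the sketch below packs a bounding box into eight bytes and unpacks it again. The byte layout (x, y, width, height as big-endian unsigned 16-bit integers) and the names `RECT_FORMAT`, `encode_rect` and `decode_rect` are assumptions made for this example; they are not a format defined by Lucene, Solr, or the attached patch.

```python
import struct

# Hypothetical layout: x, y, width, height as big-endian unsigned 16-bit integers.
RECT_FORMAT = ">4H"


def encode_rect(x, y, width, height):
    """Pack a term's bounding box into an 8-byte payload."""
    return struct.pack(RECT_FORMAT, x, y, width, height)


def decode_rect(payload):
    """Unpack an 8-byte payload back into (x, y, width, height)."""
    return struct.unpack(RECT_FORMAT, payload)
```

Each coordinate must fit in 16 bits under this layout; a wider format would be needed for larger page images.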

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Attachment: xmlpayload.jar

xmlpayload.jar is the deployable jar that can be dropped into your solr.home lib directory (it contains only .class files).

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>
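
The lookup step in the description (term position to page id via the stored map) could be pictured roughly as follows. This is only a minimal Python sketch, not anything attached to the issue: the function name is made up, and only the pair (14324, "234") comes from the description, while the other map entries are invented for illustration.

```python
from bisect import bisect_right

# (firstterm, page id) pairs sorted by firstterm. Only (14324, "234") comes
# from the issue description; the other two entries are invented.
PAGE_MAP = [(14324, "234"), (14890, "235"), (15410, "236")]

def page_for_position(page_map, position):
    """Return the id of the page whose firstterm is the largest one <= position."""
    firstterms = [first for first, _ in page_map]
    index = bisect_right(firstterms, position) - 1
    if index < 0:
        raise ValueError("position %d precedes the first page" % position)
    return page_map[index][1]

print(page_for_position(PAGE_MAP, 14325))
```

A binary search keeps the lookup cheap even for monographs with thousands of pages, provided the map is kept sorted by firstterm.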

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535768 ] 

Mike Klaas commented on SOLR-380:
---------------------------------

In my opinion the best solution is to create one solr document per page and denormalize the container data across each page.

If I had to implement it the other way, I would probably index the pages as a multivalued field with a large position increment gap (say 1000), store term vectors, and use the position information from the term vectors to determine the page hits (e.g., pos 4668 -> page 5; pos 668 -> page 1; pos 9999 -> page 10).  Assumes < 1000 tokens per page, of course.
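
The position arithmetic in the examples above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that page k occupies positions (k-1)*gap up to k*gap-1, which is what the three examples imply. Lucene adds the gap on top of the last position of the previous value, so where each page actually starts in a real index would need to be verified.

```python
def page_from_position(position, gap=1000):
    """Map a term position to a 1-based page number.

    Assumes every page starts on a multiple of `gap` and holds fewer
    than `gap` tokens (an assumption taken from the comment's examples).
    """
    if position < 0:
        raise ValueError("positions are non-negative")
    return position // gap + 1

for pos in (4668, 668, 9999):
    print(pos, "-> page", page_from_position(pos))
```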

Incidentally, this discussion doesn't really belong here.  It would be better to sketch out ideas on solr-user, then move to JIRA to track a resulting patch (if it gets that far).  I actually don't think that there is anything to add to Solr here--it seems more of a question of customization.




[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ] 

Tricia Williams commented on SOLR-380:
--------------------------------------

The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (more of a workaround, in my opinion), but I don't think it is practical.  The number of pages in the monographs we index varies greatly (10s to 1000s of pages).  So while specifying each page_* (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution because you have to infer page numbers from the highlighted samples.  Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field.  And if you wanted to use the position numbers themselves (for example, using positions and OCR information to create highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the page_* fields for highlighter snippets.  In the following example I have a dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight section.  From this point it would take a little work to parse the /arr[@name] to get the * from fulltext_* for each document match.
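
The "little work" of recovering the * from each fulltext_* name might look something like this on the client side. It is a minimal Python sketch with a hypothetical function name, and the sample is a shortened copy of the highlighting section above with the OCR noise trimmed.

```python
import xml.etree.ElementTree as ET

def pages_with_hits(highlighting_xml, prefix="fulltext_"):
    """Return {document id: sorted page numbers} from a highlighting section."""
    root = ET.fromstring(highlighting_xml)
    result = {}
    for doc in root.findall("lst"):          # one <lst> per matching document
        pages = []
        for arr in doc.findall("arr"):       # one <arr> per highlighted field
            name = arr.get("name", "")
            suffix = name[len(prefix):]
            if name.startswith(prefix) and suffix.isdigit():
                pages.append(int(suffix))
        result[doc.get("name")] = sorted(pages)
    return result

SAMPLE = """<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1"><str>was <em>employed</em> on the G. T. R.</str></arr>
  <arr name="fulltext_4"><str>is <em>employed</em> in Windsor</str></arr>
  <arr name="fulltext_6"><str><em>employed</em> at the Walkerville brewery</str></arr>
  <arr name="fulltext_7"><str>agreement to <em>employ</em> the powerful tug</str></arr>
 </lst>
</lst>"""

print(pages_with_hits(SAMPLE))
```

This still only yields page numbers, not term positions, which is the limitation noted earlier in this comment.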

I agree that the highlighter is a good model of what we want to do.  But the difficulty I'm finding is the upfront part where we need to store the position to page mapping in a field while at the same time we need to analyze the full page text into another field for searching.  

I don't think defining a FieldType will allow us to do this.  The FieldType looks like it is useful in controlling what the output of your defined field is (write()), and how it is sorted, but not how Fields with your FieldType will be indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591873#action_12591873 ] 

Tricia Williams commented on SOLR-380:
--------------------------------------

After a lengthy absence I've returned to this issue with a bit of a new perspective.  I recognize that what we have described really is a customization of Solr (albeit one I have seen in at least two organizations) and as such should be built as a plug-in (http://wiki.apache.org/solr/SolrPlugins) which can reside in your solr.home lib directory.  Now that Solr has Lucene 2.3 and payloads, my solution is much easier to apply than before.

I'll try to explain it here and then attach the src, deployable jar, and example for your use/reuse.

I assume that your structured document can be represented by xml:

{code:xml}
<book title="One, Two, Three">
   <page label="1">one</page>
   <page label="2">two</page>
   <page label="3">three</page>
</book>
{code}
 
But we don't have a tokenizer that can make sense of xml.  So I wrote a tokenizer which parallels the existing WhitespaceTokenizer called XmlPayloadWhitespaceTokenizer.  XmlPayloadWhitespaceTokenizer extends XmlPayloadCharTokenizer which does the same things as CharTokenizer in Lucene, but expects that the content is wrapped in xml tags.  The tokenizer keeps track of the xpath associated with each token and stores this as a payload.  
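
To illustrate the idea only (the actual tokenizer is Java and lives in the attached jar), here is a minimal Python sketch that splits element text on whitespace and pairs every token with the xpath of its enclosing element. The function name and the exact path format are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def tokens_with_paths(xml_text):
    """Return (token, xpath) pairs for every whitespace-separated token."""
    def step(elem):
        # Render one path step, e.g. /page[label='1']
        attrs = "".join("[%s='%s']" % (k, v) for k, v in elem.attrib.items())
        return "/" + elem.tag + attrs

    def walk(elem, parent_path):
        path = parent_path + step(elem)
        for word in (elem.text or "").split():
            yield word, path
        for child in elem:
            yield from walk(child, path)
            # Text after a child's closing tag belongs to the current element.
            for word in (child.tail or "").split():
                yield word, path

    return list(walk(ET.fromstring(xml_text), ""))

BOOK = """<book title="One, Two, Three">
   <page label="1">one</page>
   <page label="2">two</page>
   <page label="3">three</page>
</book>"""

TOKENS = tokens_with_paths(BOOK)
for token, path in TOKENS:
    print(token, path)
```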

To use my Tokenizer in Solr I add the deployable jar I created containing XmlPayloadWhitespaceTokenizer to my solr.home lib directory and add a structured text field type "text_st" to my schema.xml:
{code:xml}
<!-- A text field that uses the XmlPayloadWhitespaceTokenizer to store xpath info about the structured document -->
  <fieldType name="text_st" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.XmlPayloadWhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
{code}

I also add a field "fulltext_st" of type "text_st".

We can visualize what happens to the input text above using the Solr Admin web-app analysis.jsp modified by [SOLR-522|https://issues.apache.org/jira/browse/SOLR-522].

|term position|1|2|3|
|term text|one|two|three|
|term type|word|word|word|
|source start,end|3,6|7,10|11,16|
|payload|/book[title='One, Two, Three']/page[label='1']|/book[title='One, Two, Three']/page[label='2']|/book[title='One, Two, Three']/page[label='3']|

~Note that I've removed the hex representation of the payload for clarity~

The other side of this problem is how to present the results in a meaningful way.  Taking FacetComponent and HighlightComponent as my muse, I created a pluggable [SearchComponent|http://wiki.apache.org/solr/SearchComponent] called PayloadComponent.  This component recognizes two parameters: "payload" and "payload.fl".  If payload=true, the component will find the terms from your query in the payload.fl field, retrieve the payloads of these tokens, and re-combine this information to display the xpath of a search result in a given document and the number of times that term occurs in the given xpath.

Again, to use my SearchComponent in Solr I add the deployable jar I created containing PayloadComponent to my solr.home lib directory and add a search component "payload" to my solrconfig.xml:

{code:xml}
<searchComponent name="payload" class="org.apache.solr.handler.component.PayloadComponent"/>
 
  <requestHandler name="/search" class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>payload</str>
    </arr>
  </requestHandler>
{code}

Then the result of http://localhost:8983/solr/search?q=came&payload=true&payload.fl=fulltext_st includes something like this:

{code:xml}
<lst name="payload">
 <lst name="payload_context">
  <lst name="Book.IA.0001">
   <lst name="fulltext_st">
    <int name="/book[title='Crooked Man'][url='http://ia310931.us.archive.org//load_djvu_applet.cgi?file=0/items/crookedmanotherr00newyiala/crookedmanotherr00newyiala.djvu'][author='unknown']/page[id='3']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.37729">
   <lst name="fulltext_st">
    <int name="/book[title='Charles Dicken's A Christmas Carol'][url=''][author='Dickens, Charles']/stave[title='Marley's Ghost'][id='One']/page[id='13']">1</int>
   </lst>
  </lst>
  <lst name="Book.IA.0002">
   <lst name="fulltext_st">
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='2']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='4']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='6']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='7']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='13']">1</int>
   </lst>
  </lst>
 </lst>
</lst>
{code}  

~The documents here are borrowed from the [Internet Archive|http://archive.org] and can be found in the xmlpayload-example.zip attached to this issue~

Then you have everything you need to write an xsl which will take your normal Solr results and supplement them with context from your structured document.
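
If you would rather post-process the response in code than in xsl, the page ids can be pulled out of the xpath names with a few lines. This is a minimal Python sketch: the function name is hypothetical, and it assumes the last step of each xpath is a page[id='...'] step, as in the sample above.

```python
import re
import xml.etree.ElementTree as ET

# Matches a trailing step such as /page[id='13'] (assumed format).
PAGE_STEP = re.compile(r"/page\[id='([^']*)'\]$")

def page_hits(payload_xml, field="fulltext_st"):
    """Return {document id: {page id: hit count}} from a payload section."""
    root = ET.fromstring(payload_xml)
    hits = {}
    for doc in root.findall("./lst[@name='payload_context']/lst"):
        counts = {}
        for entry in doc.findall("./lst[@name='%s']/int" % field):
            match = PAGE_STEP.search(entry.get("name", ""))
            if match:
                counts[match.group(1)] = int(entry.text)
        hits[doc.get("name")] = counts
    return hits

SAMPLE = """<lst name="payload">
 <lst name="payload_context">
  <lst name="Book.IA.0002">
   <lst name="fulltext_st">
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='2']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='4']">1</int>
    <int name="/book[title='Jack and Jill and Old Dame Gill']/page[id='13']">1</int>
   </lst>
  </lst>
 </lst>
</lst>"""

print(page_hits(SAMPLE))
```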

There may be some issues with filters that aren't payload aware.  The only one that concerned me to this point is the WordDelimiterFilter.  You can find a quick and easy patch at [SOLR-532|https://issues.apache.org/jira/browse/SOLR-532].

The other thing that you might run into if you use curl or post.jar is that the XmlUpdateRequestHandler is very strict about the XML it accepts, and throws an exception if it finds anything but the expected <doc> and <field> tags.  To work around this, either escape your structured document's XML like this:
{code:xml}
<add>
 <doc>
  <field name="id">0001</field>
  <field name="title">One, Two, Three</field>
  <field name="fulltext_st">
   &lt;book title="One, Two, Three"&gt;
    &lt;page label="1"&gt;one&lt;/page&gt;
    &lt;page label="2"&gt;two&lt;/page&gt;
    &lt;page label="3"&gt;three&lt;/page&gt;
   &lt;/book&gt;
  </field>
 </doc>
</add>
{code}
or hack XmlUpdateRequestHandler to accept your "unexpected XML tag doc/".
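
The escaping does not have to be done by hand. As a minimal sketch, assuming you build the update message in Python (the function name and the fixed set of fields are made up for this example), the standard library's escape() handles &, < and >:

```python
from xml.sax.saxutils import escape

def make_add_doc(doc_id, title, structured_xml, field="fulltext_st"):
    """Build an <add> message with the structured document escaped as text."""
    return (
        "<add>\n <doc>\n"
        '  <field name="id">%s</field>\n'
        '  <field name="title">%s</field>\n'
        '  <field name="%s">%s</field>\n'
        " </doc>\n</add>"
    ) % (escape(doc_id), escape(title), field, escape(structured_xml))

BOOK = '<book title="One, Two, Three"><page label="1">one</page></book>'
MESSAGE = make_add_doc("0001", "One, Two, Three", BOOK)
print(MESSAGE)
```

Because the markup is escaped, the update handler sees the whole book as plain field text, and the tokenizer receives the original tags once the field value is unescaped.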

Cool?

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795943#action_12795943 ] 

Lance Norskog commented on SOLR-380:
------------------------------------

Please ask this on solr-user.  Issues are for discussing implementations.

Lucene payloads are supported by Solr, and a rectangle per term can be stored as a payload. This allows the text to be indexed as a text field, and all queries including phrases will work as normal.

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Tricia Williams (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

    Description: 
"Paged-Text" FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
        <lst name="doc1">
                <int name="pageid">234</int>
                <int name="pageid">236</int>
        </lst>
        <lst name="doc2">
                <int name="pageid">19</int>
        </lst>
</lst>
<lst name="hitpos">
        <lst name="doc1">
                <lst name="234">
                        <int name="pos">14325</int>
                </lst>
        </lst>
        ...
</lst>

  was:
"Paged-Text" FieldType for Solr
> 
> A chance to dig into the guts of Solr. The problem: If we index a
> monograph in Solr, there's no way to convert search results into
> page-level hits. The solution: have a "paged-text" fieldtype which keeps
> track of page divisions as it indexes, and reports page-level hits in the
> search results.
> 
> The input would contain page milestones: <page id="234"/>. As Solr
> processed the tokens (using its standard tokenizers and filters), it would
> concurrently build a structural map of the item, indicating which term
> position marked the beginning of which page: <page id="234"
> firstterm="14324"/>. This map would be stored in an unindexed field in
> some efficient format.
> 
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page
> ids for each term position. The results would imitate the results for
> highlighting, something like:
> 
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>

        Summary: There's no way to convert search results into page-level hits of a "structured document".  (was: The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.)


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Laurent Hoss (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664579#action_12664579 ] 

Laurent Hoss commented on SOLR-380:
-----------------------------------

Hi Tricia
Looks nice, I've been searching for such a feature for years in Lucene (and Solr)!
But before getting too excited, I'd better ask the right questions before doing a real test, as we don't even use Solr yet (though I really want to :)

In fact we currently have our home-grown solution for a similar problem:
In our case we want to restrict boolean searches to paragraphs or sentences of a document, and implemented this (like many others) by indexing extra docs for paragraphs etc. (duplicating many metadata fields of the parent document).
Besides multiplying index size, the mapping from the found paragraphs to their base documents involved a lot of custom coding, and only recently have we at least implemented fast counting of the base docs for the found paragraph docs, by using a 'baseDocId' FieldCache (essentially a 'group by' in SQL lingo).

This leads to the following requirements and questions:
* What is the performance of your PayloadComponent, compared to the standard SearchHandler?
We especially need very fast count(*) functionality, to dynamically compute statistics/charts needing 100s of queries.
For this we just need the hit count of documents/paragraphs without the xpath payload info, which would generate a really big XML response for a result set of 100K docs!

* We want to find only documents where a (boolean) query matches within one of the paragraph_* fields, and not if the query matches over the combined content of multiple paragraphs, as discussed here:
http://www.nabble.com/Redundant-indexing-*-4-only-solution-(for-par-sen-and-case-sensitivity)-td13684315.html#a13685041
and
http://www.nabble.com/What-is-the-best-way-to-index-xml-data-preserving-the-mark-up--td13641104.html#a13657470
> The problem is that a search for sentence:foo AND sentence:bar is matching if foo matches in any sentence of the paragraph, and bar also matches in any sentence of the paragraph. 


Do you think this is a good option for us?
ps: We should probably put up a Wiki page for this topic, as I've seen at least 10 people asking about possible solutions (OK, often with slightly different requirements)!

Another way of solving this would be to use the SpanQuery package together with the nice-looking Qsol (http://myhardshadow.com/about.php), although I'm not sure about its performance, especially with (really) long boolean queries!


> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a "paged-text" fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Shairon Toledo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795386#action_12795386 ] 

Shairon Toledo commented on SOLR-380:
-------------------------------------

I have a project that involves words extracted by OCR: each page has words, and each word has its geometry so that a highlight can be shown to the end user.
I've been trying to represent this document structure in XML:


{code:xml}
<document>
   <page num="1">
    <term top='111' bottom='222' right='333' left='444'>foo</term> 
    <term top='211' bottom='322' right='833' left='944'>bar</term> 
    <term top='311' bottom='422' right='733' left='144'>baz</term> 
    <term top='411' bottom='522' right='633' left='244'>qux</term> 
   </page>
   <page num="2">
	<term .... />
   </page>
   
</document>

{code}

Using the field 'fulltext_st':

{code:xml}
<field name="fulltext_st">
	&lt;document &gt;
	&lt;page top='111' bottom='222' right='333' left='444' word='foo' num='1'&gt;foo&lt;/page&gt;
	&lt;page top='211' bottom='322' right='833' left='944' word='bar' num='1'&gt;bar&lt;/page&gt;
	&lt;page top='311' bottom='422' right='733' left='144' word='baz' num='1'&gt;baz&lt;/page&gt;
	&lt;page top='411' bottom='522' right='633' left='244' word='qux' num='1'&gt;qux&lt;/page&gt;
	&lt;/document&gt;
</field>
{code}

I can get all terms in my search result with their payloads.
But if I search using a phrase query, I can't fetch any result.

Example:

*search?q=foo* 

{code:xml}
<lst name="fulltext_st">
	<int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
</lst>
{code}

*search?q=foo+bar*

{code:xml}
<lst name="fulltext_st">
	<int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
	<int name="/document/page[word='bar'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
</lst>
{code}

*/search?q="foo bar"*
{code:xml}
*nothing*
{code}
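
For reading the geometry back out of result names like the ones above on the client side, a minimal sketch might look as follows. It assumes every predicate has the form [name='value'], as in the examples; the class and method names are hypothetical and are not part of xmlpayload.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical client-side helper; not part of the xmlpayload patch. */
final class PayloadPathSketch {

    // Matches one predicate of the assumed form [name='value'].
    private static final Pattern PREDICATE = Pattern.compile("\\[(\\w+)='([^']*)'\\]");

    /** Collects the predicates of a payload path in the order they appear. */
    static Map<String, String> predicates(String path) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PREDICATE.matcher(path);
        while (m.find()) {
            out.put(m.group(1), m.group(2));
        }
        return out;
    }
}
```

Calling predicates on the name returned for q=foo gives a map with the keys word, num, top, bottom, right and left, so the highlight rectangle can be drawn from the four geometry values.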

I was wondering whether xmlpayload supports this sort of thing, or how easily I could update the code to provide a solution for it.

Thank you in advance.

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535755 ] 

Peter Binkley commented on SOLR-380:
------------------------------------

Thanks for clarifying how the highlighting would let you see the page numbers. On that model, all we would need would be to enhance the highlighting report to make it show the term positions rather than (or as well as) the terms.

But I'm not ready to give up on the map idea yet. I hadn't dug far enough into FieldTypes, evidently. Could we maybe index the text in the normal way, with a token filter that ignores the milestones, and then copyfield the text to a FieldType whose only job is to build and store the map? Provided that the two were tokenizing and filtering in the same way, the position counts would remain in sync; the mapping FieldType would just require a final filter that counted the incoming tokens and took note of the milestones, and generated the map as a series of tokens in whatever format we decide to store the map in.
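
As a rough illustration of that final counting filter, here is a minimal sketch in plain Java. It assumes the milestones reach the filter as marker tokens with a made-up "#page:" prefix and that positions are zero-based; none of the names are Solr or Lucene API, and the storage format of the map is left open.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the page map; not Solr/Lucene API. */
final class PageMapSketch {

    /** Made-up marker for a page milestone token such as <page id="234"/>. */
    static final String MILESTONE_PREFIX = "#page:";

    /**
     * Counts real terms and records, for each milestone, the position of the
     * first term that follows it. Two milestones at the same position (an
     * empty page) overwrite each other in this simplified version.
     */
    static TreeMap<Integer, String> buildMap(List<String> tokens) {
        TreeMap<Integer, String> firstTermToPage = new TreeMap<>();
        int position = 0; // zero-based position of the next real term
        for (String token : tokens) {
            if (token.startsWith(MILESTONE_PREFIX)) {
                firstTermToPage.put(position, token.substring(MILESTONE_PREFIX.length()));
            } else {
                position++;
            }
        }
        return firstTermToPage;
    }

    /** Page id for a hit position, or null if the hit precedes the first milestone. */
    static String pageFor(TreeMap<Integer, String> map, int termPosition) {
        Map.Entry<Integer, String> entry = map.floorEntry(termPosition);
        return entry == null ? null : entry.getValue();
    }
}
```

With the tokens "#page:234 call me ishmael #page:235 some years ago", positions 0 to 2 resolve to page 234 and positions 3 to 5 to page 235. The lookup only stays correct if the indexed field and the mapping field tokenize and filter identically, as noted above.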

(And Tricia, would you mind tinyfying that url, so the page doesn't get stretched?)


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535648 ] 

Peter Binkley commented on SOLR-380:
------------------------------------

Both these methods (page_* fields or unstored "contents" field) would make it difficult to discover from the search results which pages matched the query, though, wouldn't they? They would both need extra work to populate a structure like the "pages" and "hitpos" elements in the sample xml above. Would that extra work be more efficient than the document-map approach we've proposed above? 

The highlighting functionality is definitely the model to follow for handling term positions.


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Peter Binkley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535941 ] 

Peter Binkley commented on SOLR-380:
------------------------------------

OK, taking the discussion to solr-user until we nail down what we're doing.


[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591897#action_12591897 ] 

Erik Hatcher commented on SOLR-380:
-----------------------------------

{quote}Cool?{quote}

Very!   Wow Tricia - thanks for documenting that so thoroughly.  This particular feature is sure to be of great interest to many.
