Posted to solr-user@lucene.apache.org by leonardo2 <le...@gmail.com> on 2012/02/16 15:18:52 UTC

Payload and exact search - 2

Hello,
I already posted this question, but for some reason it was attached to a
thread with a different topic.


Is it possible to perform an 'exact search' on a payload field?

I have to index text with auxiliary information for each word. In particular,
each word is associated with the bounding box that contains it in the original
PDF page (it is used for highlighting the search terms in the PDF). I use the
payload to store that information.

In the schema.xml, the fieldType definition is: 

------------------------------- 
<fieldtype name="wppayloads" stored="false" indexed="true"
           class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
  </analyzer>
</fieldtype>
------------------------------- 

while the field definition is: 

------------------------------- 
<field name="words" type="wppayloads" indexed="true" stored="true"
required="true" multiValued="true"/>
------------------------------- 

When indexing, the field 'words' contains a list of word|box pairs, as in the
following example:

------------------------------- 
doc_id=example 
words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25} 
------------------------------- 

This solution works well except for exact (phrase) searches. For example,
assuming the only indexed doc is the 'example' doc shown above, the query
words:"Comune di Bologna" returns no results.

Does anyone know whether it is possible to perform an 'exact search' on a
payload field?

Thanks in advance, 
Leonardo

--
View this message in context: http://lucene.472066.n3.nabble.com/Payload-and-exact-search-2-tp3750355p3750355.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Payload and exact search - 2

Posted by leonardo2 <le...@gmail.com>.
OK, it works!
Thank you very much.

Leonardo



Re: Payload and exact search - 2

Posted by Erick Erickson <er...@gmail.com>.
As far as I know, you're on the right track. Note that it isn't important
that the payload filter be first, just that nothing that splits the tokens
on your delimiter character (the pipe symbol) comes before it.

Like I said, payloads are a bit of a mystery to me, so don't take my
word for gospel here!

Best
Erick
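
To illustrate the point that the payload filter does not have to come first,
here is a hypothetical variant (the name wppayloads_alt is invented for this
example) in which LowerCaseFilterFactory stays ahead of it. This is an untested
sketch: it assumes that lowercasing the whole word|box token is harmless
because the bounding-box payloads in this thread contain only digits, dots, and
commas. Payloads containing letters would be lowercased as well.

-------------------------------
<fieldtype name="wppayloads_alt" stored="false" indexed="true"
           class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Does not split on the pipe, so it may run before the payload filter -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strips "|box" from each token and stores it as that token's payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
    <!-- Splits tokens, so it must come after the payload filter -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldtype>
-------------------------------

The only ordering constraint this variant relies on is the one stated above:
WordDelimiterFilterFactory, which would break the token apart at the pipe, runs
after the payload has been extracted.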

On Sun, Feb 19, 2012 at 9:54 AM, leonardo2 <le...@gmail.com> wrote:
> Thanks for your reply.
> So, if I apply the <filter class="solr.DelimitedPayloadTokenFilterFactory"
> encoder="identity"/> as the first filter in the chain, it should work.
> In this new configuration, the first filter in the chain intercepts the
> payload. It handles and removes the payload info, and then the subsequent
> filters are applied to the clear text: is that right?
>
> Leonardo

Re: Payload and exact search - 2

Posted by leonardo2 <le...@gmail.com>.
Thanks for your reply.
So, if I apply the <filter class="solr.DelimitedPayloadTokenFilterFactory"
encoder="identity"/> as the first filter in the chain, it should work.
In this new configuration, the first filter in the chain intercepts the
payload. It handles and removes the payload info, and then the subsequent
filters are applied to the clear text: is that right?

Leonardo


Re: Payload and exact search - 2

Posted by Erick Erickson <er...@gmail.com>.
OK, payloads are a bit of a mystery to me, so this may be way off
base.

But...

The ordering of your analysis chain is suspicious; the admin/analysis
page is a life-saver here.

I think WordDelimiterFilterFactory is breaking up your input before it
gets to the payload filter, so your payload information is completely
dissociated from your terms and treated as individual terms in its own
right. At that point, what you get in your index *probably* has no
payloads attached at all!

Use the admin/schema browser link to actually look at the data (or
just go straight to Luke), and I believe you'll see that your position
information is being treated just like any other token in the input stream.

There should be nothing about payloads that prevents a normal
text query on the text part, though.

Best
Erick
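
A minimal sketch of the reordering suggested above, assuming everything else in
the original wppayloads field type stays unchanged: the payload filter is moved
directly after the tokenizer, so it sees each whole word|box token before
WordDelimiterFilterFactory can split it. The delimiter attribute is spelled out
only for clarity (the pipe is its default value). Nothing here has been verified
beyond what the thread itself reports, so check the result on the
admin/analysis page.

-------------------------------
<fieldtype name="wppayloads" stored="false" indexed="true"
           class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Runs first: "Comune|326.29,948.16,349.07,954.25" becomes the term
         "Comune" with the bounding box attached as its payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"
            delimiter="|"/>
    <!-- From here on the filters only see the plain word -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
-------------------------------

With this ordering the box coordinates should no longer be indexed as separate
terms between the words, which is what a phrase query such as
words:"Comune di Bologna" needs in order to match. Existing documents have to be
reindexed after the schema change.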
