Posted to solr-user@lucene.apache.org by Niraj Aswani <N....@dcs.shef.ac.uk> on 2010/03/25 16:21:31 UTC

solr highlighting

Hi,

I am using the following two parameters to highlight the hits.

"hl.simple.pre=" + URLEncoder.encode("<b><u>")
"hl.simple.post=" + URLEncoder.encode("</u></b>")

This seems to work.  However, there is a bit of trouble when the text 
itself contains html markup.
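
For reference, the two parameters above could be assembled as in the following minimal sketch. The class and method names are hypothetical, and the hl=true switch is an assumption about the rest of the request, which the message does not show. The two-argument URLEncoder.encode is used because the one-argument form is deprecated and depends on the platform default encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of Solr or SolrJ.
final class HighlightParams {

    // Builds the highlighting part of a Solr query string.
    // hl=true is assumed here; the original message only shows the pre/post parameters.
    static String highlightParams(String pre, String post) {
        try {
            return "hl=true"
                + "&hl.simple.pre=" + URLEncoder.encode(pre, "UTF-8")
                + "&hl.simple.post=" + URLEncoder.encode(post, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(highlightParams("<b><u>", "</u></b>"));
    }
}
```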

For example, I have indexed a document with the following text in it.
=======
something here...
<choice minOccurs="1" maxOccurs="unbounded">xyz</choice>
something here..
=======

When I search for the keyword choice, Solr inserts "<b><u>" just before 
the word choice and "</u></b>" immediately after it. The result looks 
like this:

<<b><u>choice</u></b> minOccurs="1" 
maxOccurs="unbounded">xyz</<b><u>choice</u></b>>


I would like it to be something like:

&lt;<b><u>choice</u></b> minOccurs="1" 
maxOccurs="unbounded"&gt;xyz&lt;/<b><u>choice</u></b>&gt;

Is there any way to have the highlighted content HTML-encoded while 
leaving the prefix and suffix unencoded?

Thanks,
Niraj



When I issue a query, it returns all the corret

Re: solr highlighting

Posted by Lance Norskog <go...@gmail.com>.

No problem: wrapping and unwrapping escaped text can be very confusing.

On Fri, Mar 26, 2010 at 6:31 AM, Niraj Aswani <N....@dcs.shef.ac.uk> wrote:
> Hi Lance,
>
> apologies.. please ignore my previous mail.  I'll have a look at the
> PatternReplaceFilter.
>
> Thanks,
> Niraj



-- 
Lance Norskog
goksron@gmail.com

Re: solr highlighting

Posted by Niraj Aswani <N....@dcs.shef.ac.uk>.

Hi Lance,

apologies.. please ignore my previous mail.  I'll have a look at the 
PatternReplaceFilter.

Thanks,
Niraj

Niraj Aswani wrote:
> Hi Lance,
>
> Yes, that is one solution but wouldn't it stop people searching for 
> something like "<choice" in the first place?  I mean, if I encode such 
> characters at the index time, one would have to write a query like 
> "&lt;choice".  Am I right?
>
> Thanks,
> Niraj

Re: solr highlighting

Posted by Niraj Aswani <N....@dcs.shef.ac.uk>.

Hi Lance,

Yes, that is one solution, but wouldn't it stop people from searching 
for something like "<choice" in the first place?  I mean, if I encode 
such characters at index time, one would have to write a query like 
"&lt;choice".  Am I right?

Thanks,
Niraj
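
An alternative that leaves the index untouched, and which is not proposed anywhere in this thread, is to escape at display time. The sketch below assumes that hl.simple.pre and hl.simple.post are set to plain-text markers (the "[[HL]]" and "[[/HL]]" strings are hypothetical and must not occur in the indexed text), that the client escapes each returned snippet, and that it then swaps the markers for the real tags. All class and method names are illustrative.

```java
// Hypothetical client-side helper, not part of Solr or SolrJ.
final class SnippetRenderer {

    // Assumed marker values to pass as hl.simple.pre / hl.simple.post.
    static final String PRE_MARK = "[[HL]]";
    static final String POST_MARK = "[[/HL]]";

    // Escapes the characters that are special in HTML text content.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes the snippet first, then replaces the markers with real tags,
    // so only the highlight tags survive as markup.
    static String render(String snippet) {
        return escapeHtml(snippet)
            .replace(PRE_MARK, "<b><u>")
            .replace(POST_MARK, "</u></b>");
    }

    public static void main(String[] args) {
        String snippet = "<[[HL]]choice[[/HL]] minOccurs=\"1\" maxOccurs=\"unbounded\">"
            + "xyz</[[HL]]choice[[/HL]]>";
        System.out.println(render(snippet));
    }
}
```

With this approach a query for "<choice" is analyzed against the raw text as before; whether the "<" itself is searchable still depends on the field's analyzer.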

Lance Norskog wrote:
> To display html-markup in an html page, it has to be in entity-encoded
> form. So, encode the <> as entities in your input application, and
> have it indexed and stored in this format. Then, the <b><u> are
> inserted as normal. This gives you the html text displayable in an
> html page, with all words highlightable. And add gt/lt etc. as
> stopwords.
>
> At this point you have the element names, attribute names and values,
> and text parts searchable and highlightable. If you only want the HTML
> syntax parts shown, the PatternReplaceFilter is your friend: with
> regex patterns you can pull out those values and ignore the text
> parts.
>
> The analysis.jsp page will make it much much easier to debug this.
>
> Good luck!

Re: solr highlighting

Posted by Lance Norskog <go...@gmail.com>.

To display html-markup in an html page, it has to be in entity-encoded
form. So, encode the <> as entities in your input application, and
have it indexed and stored in this format. Then, the <b><u> are
inserted as normal. This gives you the html text displayable in an
html page, with all words highlightable. And add gt/lt etc. as
stopwords.
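
The encoding step in the input application might look like the following minimal sketch. The class and method names are hypothetical; only &, < and > are replaced, since quotes need escaping only inside attribute values. How the escaped text is then sent to Solr is not shown.

```java
// Hypothetical pre-indexing helper, not part of Solr or SolrJ.
final class HtmlEscape {

    // Replaces the characters that are special in HTML text content with entities.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<choice minOccurs=\"1\" maxOccurs=\"unbounded\">xyz</choice>";
        // The escaped form is what would be indexed and stored.
        System.out.println(escapeHtml(raw));
    }
}
```

Once the stored text is in this form, the highlighter's <b><u> and </u></b> tags are the only real markup in the returned snippet.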

At this point you have the element names, attribute names and values,
and text parts searchable and highlightable. If you only want the HTML
syntax parts shown, the PatternReplaceFilter is your friend: with
regex patterns you can pull out those values and ignore the text
parts.

The analysis.jsp page will make it much much easier to debug this.

Good luck!

-- 
Lance Norskog
goksron@gmail.com