Posted to solr-user@lucene.apache.org by Thorsten Scherler <th...@juntadeandalucia.es> on 2007/01/02 18:16:31 UTC

How to tell the highlighter not to escape?

Hi all,

I am playing around with the highlighter and found that all highlight
terms get escaped.

I mean solr will return 
 &lt;em&gt;TERM&lt;/em&gt; and not
<em> TERM </em>

I am not sure where this escaping is happening, but I would need the
highlighting to NOT escape the hl.simple.pre and hl.simple.post tags,
since it is a horror to work with cdata sections in xsl.

I had a look at the lucene highlighter and it seems that it does not
escape the tags.

Can somebody point me to the code that is responsible for the escaping,
and maybe give me a tip on how I can patch it to make this configurable?

TIA

salu2


Re: How to tell the highlighter not to escape?

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 3, 2007, at 7:39 AM, Thorsten Scherler wrote:
> However I still think the highlighter should return unescaped tags for
> highlighting. There is IMO no benefit for the current behavior.

That really isn't practical.  Suppose the prefix were ">>" and the  
suffix were "<<"?   It would return invalid XML.  Escaping is the  
only sensible solution, it seems, in order to have the results  
returned in XML.  Personally, I have the results returned as ruby  
(wt=ruby) :)

	Erik
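
Erik's point can be checked with a few lines of Python. This is only a minimal sketch, not Solr code: `wrap_snippet` and `is_well_formed` are hypothetical helpers, and the `<str>` wrapper is a simplified stand-in for whatever element the response writer emits.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_snippet(snippet: str, do_escape: bool) -> str:
    """Serialize one highlighted snippet as a <str> element (hypothetical helper)."""
    body = escape(snippet) if do_escape else snippet
    return f"<str>{body}</str>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A snippet highlighted with the prefix ">>" and the suffix "<<".
snippet = "some >>TERM<< here"

raw = wrap_snippet(snippet, do_escape=False)
escaped = wrap_snippet(snippet, do_escape=True)

print(is_well_formed(raw))      # the bare "<<" is not legal in XML text
print(is_well_formed(escaped))  # the escaped form parses
print(ET.fromstring(escaped).text == snippet)  # and decodes to the original
```

With escaping the client gets the exact marker strings back after parsing; without it the document cannot be parsed at all.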



Re: How to tell the highlighter not to escape?

Posted by Chris Hostetter <ho...@fucit.org>.
: it sure seems to me that if SOLR is returning XML, it might as well return
: XML with real markup through and through instead of exploiting
: pseudo-markup. if there is concern about introducing validation errors, then
: perhaps you could use namespaces in the XML and put the highlighting markup
: in a non-SOLR namespace???

the problem is that XML is only one of many formats Solr can return, and
the "pseudo-markup" can be chosen by the client completely independent
of the output format -- it might be <em>...</em> or it might be
[HiGhLiGhT"START}...<EnD] -- the choice is entirely up to the user, and
the XmlResponseWriter must ensure that it is properly escaped so it
produces a valid XML document, just as the JSONResponseWriter must ensure
that it's properly escaped to produce a valid JSON document.




-Hoss
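
To illustrate the point about output formats, here is a small Python sketch (not Solr's Java writers) that takes the odd markers from the message above and serializes the same snippet once as XML and once as JSON. The `<str>` wrapper and the `"snippet"` key are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Markers taken from the example in the message; any strings would do.
PRE, POST = '[HiGhLiGhT"START}', "<EnD]"
snippet = f"a {PRE}TERM{POST} b"

# Each writer escapes according to the rules of its own format.
xml_doc = f"<str>{escape(snippet)}</str>"       # "<" becomes "&lt;"
json_doc = json.dumps({"snippet": snippet})     # '"' becomes '\"'

# A conforming parser for each format hands back the identical string.
xml_value = ET.fromstring(xml_doc).text
json_value = json.loads(json_doc)["snippet"]

print(xml_doc)
print(json_doc)
print(xml_value == json_value == snippet)
```

The escaping differs per format, but after decoding the client sees the same marker strings either way.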


Re: How to tell the highlighter not to escape?

Posted by Edward Garrett <he...@gmail.com>.
just to add a note on this, the whole idea of inserting "pseudo-markup" into
XML text elements seems to be pretty much in disrepute, and certainly caused
many complaints about RSS 1.0, see e.g.

http://www.biglist.com/lists/xsl-list/archives/200505/msg00316.html

in xsl, you **can** use disable-output-escaping="yes" to convert
pseudo-markup to markup, but xslt processors are not required to support
this, and so some do not.

it sure seems to me that if SOLR is returning XML, it might as well return
XML with real markup through and through instead of exploiting
pseudo-markup. if there is concern about introducing validation errors, then
perhaps you could use namespaces in the XML and put the highlighting markup
in a non-SOLR namespace???

Re: How to tell the highlighter not to escape?

Posted by Chris Hostetter <ho...@fucit.org>.
: > However I still think the highlighter should return unescaped tags for
: > highlighting. There is IMO no benefit for the current behavior.

the advantage is that the XmlResponseWriter has a duty to ensure that it
produces well-formed XML regardless of configuration, data, or input.

: The problems all stem from the simple highlighter formatter mixing
: highlighting info directly into the string to be highlighted.  It's
: not a 100% bulletproof solution because now you can't tell original
: field value from markup.

right ... the best you can do is pick markup that you are confident
doesn't appear in *your* data (which is why the hl.simple.pre and
hl.simple.post args exist)

Once upon a time, i speculated on a more "advanced" format for returning
highlighter info that was never implemented, but could still be useful --
not just for situations like this, but in general for extracting more info
about the origin of the snippets...

http://www.nabble.com/highlighting-tf1393198.html#a3954083





-Hoss
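
One way to act on the "pick markup that doesn't appear in your data" advice, sketched in Python. The marker strings `@@HL_START@@` and `@@HL_END@@` are hypothetical values one might pass as hl.simple.pre and hl.simple.post, and `snippet_to_html` is an illustrative helper, not part of Solr.

```python
import html

# Hypothetical sentinel values for hl.simple.pre / hl.simple.post,
# chosen because they are unlikely to occur in the indexed text.
HL_PRE = "@@HL_START@@"
HL_POST = "@@HL_END@@"


def snippet_to_html(snippet: str) -> str:
    """Escape the field text for HTML, then swap the sentinels for <em> tags."""
    safe = html.escape(snippet, quote=False)
    return safe.replace(HL_PRE, "<em>").replace(HL_POST, "</em>")


# The snippet as it looks after the client has decoded the response.
decoded = f"if a < b then {HL_PRE}TERM{HL_POST} & done"
print(snippet_to_html(decoded))
# if a &lt; b then <em>TERM</em> &amp; done
```

Because the field text is escaped before the sentinels are replaced, a literal "<" or "&" in the data cannot be mistaken for highlighting markup.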


Re: How to tell the highlighter not to escape?

Posted by Yonik Seeley <yo...@apache.org>.
On 1/3/07, Thorsten Scherler <th...@apache.org> wrote:
> However I still think the highlighter should return unescaped tags for
> highlighting. There is IMO no benefit for the current behavior.

The problems all stem from the simple highlighter formatter mixing
highlighting info directly into the string to be highlighted.  It's
not a 100% bulletproof solution because now you can't tell original
field value from markup.

This simple format was meant to be easy for people to use when they
had data that they knew didn't have special HTML characters.  This is
very straightforward to do programmatically (via JSPs or whatever), so
I'm surprised it can't be done with XSLT.

-Yonik

Re: How to tell the highlighter not to escape?

Posted by Thorsten Scherler <th...@apache.org>.
On Wed, 2007-01-03 at 12:06 +0000, Edward Garrett wrote:
> for what it's worth, i wrote a recursive template in xsl that replaces the
> escaped characters with actual elements. here, the variable $val would be
> the tag, e.g. "em". this has been working okay for me so far.

Yeah, many thanks for posting this template. This is actually
"imitating" a parser. 

However I still think the highlighter should return unescaped tags for
highlighting. There is IMO no benefit for the current behavior.

Thanks again Edward.

salu2

-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: How to tell the highlighter not to escape?

Posted by Edward Garrett <he...@gmail.com>.
for what it's worth, i wrote a recursive template in xsl that replaces the
escaped tags with actual elements. here, the parameter $val is the escaped
snippet text, and the tag ("em") is hard-coded. this has been working okay
for me so far.

<!-- Assumes every "<em>" in $val has a matching "</em>" and that the
     text contains no other "<" characters. -->
<xsl:template name="unescapeEm">
    <xsl:param name="val" select="''"/>
    <!-- Text before the first escaped tag -->
    <xsl:variable name="preEm" select="substring-before($val, '&lt;')"/>
    <xsl:choose>
        <xsl:when test="$preEm or starts-with($val, '&lt;')">
            <!-- Everything up to the first closing tag -->
            <xsl:variable name="insideEm" select="substring-before($val, '&lt;/')"/>
            <!-- +5 skips the four characters of "<em>" (positions are 1-based) -->
            <xsl:value-of select="$preEm"/><em><xsl:value-of select="substring($insideEm, string-length($preEm)+5)"/></em>
            <!-- +6 skips the five characters of "</em>" -->
            <xsl:variable name="leftover" select="substring($val, string-length($insideEm) + 6)"/>
            <xsl:if test="$leftover">
                <xsl:call-template name="unescapeEm">
                    <xsl:with-param name="val" select="$leftover"/>
                </xsl:call-template>
            </xsl:if>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$val"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
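
For comparison, roughly the same idea outside XSLT: a Python sketch that turns a decoded snippet string into real `<em>` child elements. It is not a port of the template line by line; it uses a regular expression instead of recursion, and `snippet_to_element` and the `snippet` wrapper tag are hypothetical names chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches one highlighted span in the decoded (unescaped) snippet string.
EM_SPAN = re.compile(r"<em>(.*?)</em>", re.DOTALL)


def snippet_to_element(snippet: str, tag: str = "snippet") -> ET.Element:
    """Build an element whose <em> children replace the pseudo-markup."""
    root = ET.Element(tag)
    last = None  # most recent <em> child; text after it goes into its tail
    pos = 0
    for match in EM_SPAN.finditer(snippet):
        before = snippet[pos:match.start()]
        if last is None:
            root.text = before
        else:
            last.tail = before
        last = ET.SubElement(root, "em")
        last.text = match.group(1)
        pos = match.end()
    rest = snippet[pos:]
    if last is None:
        root.text = rest
    else:
        last.tail = rest
    return root


elem = snippet_to_element("foo <em>TERM</em> bar <em>X</em>")
print(ET.tostring(elem, encoding="unicode"))
# <snippet>foo <em>TERM</em> bar <em>X</em></snippet>
```

As with the template, this only recognizes the one tag it was written for; any other "<" in the field text stays text and is escaped again on output.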


-- 
Edward Garrett

Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA

Re: How to tell the highlighter not to escape?

Posted by Thorsten Scherler <th...@apache.org>.
On Wed, 2007-01-03 at 02:16 +0000, Edward Garrett wrote:
> thorsten,
> 
> see the following for discussion. your case is indeed an annoyance--the
> thread below discusses motivations for it and ways of working around it. (i
> too confess that i wish it were not so.)
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

Thanks Edward. The problem with the suggestion in the above thread is
that this:
"just create an XSL that
generates XML and unescapes the fields you know will contain wellformed
XML data -- then apply your second transform client side"

is not possible with xsl. See e.g. http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
"> How can I match the Cdata Section?!?
>
You can't, the XPath data model regards CDATA as merely an input shortcut,
not as an information-bearing part of the XML content. In other words,
"<![CDATA[x]]>" and "x" look exactly the same to the XSLT processor.

Mike Kay"

Michael Kay is the xsl guru, and I can say as well from my own experience
that one would need to write a custom parser, since <![CDATA[<em>TERM</em>]]>
is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
would match text()).

IMO the highlighter should really return pure xml and not escape it.
I will have a look at the XmlResponseWriter; maybe I will find a way to change this.

salu2
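
The equivalence described in the quote from Michael Kay is not specific to XSLT; any XML parser behaves the same way. A minimal Python sketch, using a made-up `<str>` element for the three inputs:

```python
import xml.etree.ElementTree as ET

cdata_doc = "<str><![CDATA[<em>TERM</em>]]></str>"
escaped_doc = "<str>&lt;em&gt;TERM&lt;/em&gt;</str>"
markup_doc = "<str><em>TERM</em></str>"

cdata_el = ET.fromstring(cdata_doc)
escaped_el = ET.fromstring(escaped_doc)
markup_el = ET.fromstring(markup_doc)

# CDATA and entity escaping are two spellings of the same text node.
print(cdata_el.text == escaped_el.text)

# Only the third document contains an actual <em> child element.
print(len(cdata_el), len(escaped_el), len(markup_el))
```

In the first two documents the parser reports a single string and no child elements, which is why a stylesheet cannot match on the `em` inside them.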


-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: How to tell the highlighter not to escape?

Posted by Edward Garrett <he...@gmail.com>.
thorsten,

see the following for discussion. your case is indeed an annoyance--the
thread below discusses motivations for it and ways of working around it. (i
too confess that i wish it were not so.)

http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

-edward





Re: How to tell the highlighter not to escape?

Posted by Mike Klaas <mi...@gmail.com>.
Hi Thorsten,

The highlighter does not escape anything itself: you are seeing the
results of solr's automatic escaping of xml data within its xml
response.  This should be transparent (your xml decoder should
un-escape the values on the way out).  I'm not really familiar with
xslt so I'm unsure why that isn't so (perhaps it is automatically
html-escaping the values after un-xml-escaping them?)

Be careful of documents containing html fragments natively.

cheers,
-Mike
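
A small Python sketch of the decoding described above, and of one possible reason the tags come out escaped again. The response fragment is a simplified, assumed shape rather than a verbatim Solr response, and the `<p>` output element is made up for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed fragment of an XML response with one snippet.
response = (
    '<lst name="highlighting">'
    "<str>a &lt;em&gt;TERM&lt;/em&gt; b</str>"
    "</lst>"
)

# The parser un-escapes the value on the way out.
decoded = ET.fromstring(response).find(".//str").text
print(decoded)  # a <em>TERM</em> b

# Writing that string back out as the text of a new element escapes it
# again, because to the serializer it is still just text.
out = ET.Element("p")
out.text = decoded
reserialized = ET.tostring(out, encoding="unicode")
print(reserialized)  # <p>a &lt;em&gt;TERM&lt;/em&gt; b</p>
```

So the value is decoded correctly on the client; it only looks escaped again once it is copied as text into another XML or HTML document, which is presumably what happens when a stylesheet copies the text node to its output.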
