Posted to user@nutch.apache.org by Lyndon Maydwell <ma...@gmail.com> on 2007/08/10 09:25:22 UTC

Snippet contents.

I've noticed that the snippets returned by Nutch's search have the
formatting added to them first and are then escaped into XML strings.
How would I go about changing the process so that the content is
escaped first, then the formatting is added, and then the whole
snippet is escaped?

The reason I want this is so that I can return valid XML with the
formatting as XML entities, but the actual snippet text escaped.

Example of how Nutch does it:
original text:
"red fox & lazy dog"
formatting applied:
"red <span class="highlight">fox</span> & lazy dog"
escaped:
"red &lt;span class="highlight"&gt;fox&lt;/span&gt; &amp; lazy dog"

Example of what I'm after:
original text:
"red fox & lazy dog"
escaped text:
"red fox &amp; lazy dog"
formatting applied:
"red <span class="highlight">fox</span> &amp; lazy dog"
escaped:
"red &lt;span class="highlight"&gt;fox&lt;/span&gt; &amp;amp; lazy dog"