Posted to solr-user@lucene.apache.org by "Mark E. Haase" <me...@gmail.com> on 2010/08/21 02:04:51 UTC
Confused about highlighting
I have highlighting working on my project (indexing content for a web app),
but the idea of highlighting with <em> tags doesn't make sense to me. It
seems that it opens up the system to XSS attacks if you echo search result
data (with highlights) into a web page.
Example: Index the following string:
example of malicious script: <script>alert(1)</script>
Now when I fetch this document from Solr, I will escape it before
outputting it, giving me:
example of malicious script: &lt;script&gt;alert(1)&lt;/script&gt;
But if I turn highlighting on and the highlight is the <em> tag, then when I
search for the word "example" I would get:
&lt;em&gt;example&lt;/em&gt; of malicious script:
&lt;script&gt;alert(1)&lt;/script&gt;
When a browser displays this, it will literally print <em> tags around the
word "example" instead of actually visually emphasizing the word.
Now then, I could escape the text before indexing, but then Solr's index
would include words like "lt", "gt", and "amp". I can't put these words on
the stopword list because "amp" is a real word that a user might want to
search for.
Any errors in my logic? The only thing I can think to do is to change the
highlight "pre" and "post" to some non-HTML string and then parse the
response to replace those with correct HTML tags. But that's definitely
hacky.
Thanks,
Mark
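The workaround Mark describes (non-HTML "pre" and "post" markers, then escaping and substituting on the application side) could be sketched roughly as below. This is only an illustration in Python: the marker strings and the function name render_snippet are hypothetical, and it assumes Solr was asked to wrap matches in those markers instead of <em> tags.

```python
import html

# Hypothetical markers (not from the original post). They contain no
# characters that html.escape() rewrites, so they survive escaping intact.
PRE = "@@HL_START@@"
POST = "@@HL_END@@"


def render_snippet(snippet: str, pre: str = PRE, post: str = POST) -> str:
    """Escape a highlighted snippet, then turn the markers into <em> tags."""
    # Escape first, so any markup stored in the document is neutralized.
    escaped = html.escape(snippet)
    # Only afterwards introduce the one piece of HTML we trust.
    return escaped.replace(pre, "<em>").replace(post, "</em>")
```

If the indexed text itself happens to contain a marker, the worst outcome in this sketch is a stray <em> tag, since everything else has already been escaped.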
Re: Confused about highlighting
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Mark,
You are right. Luckily, the latest Solr in branch_3x and trunk can accept
Lucene's Encoder. Check out SOLR-2021 and example solrconfig.xml:
https://issues.apache.org/jira/browse/SOLR-2021
HtmlEncoder is set as the default in the example solrconfig.xml, so
you'll get the following snippet:
<em>example</em> of malicious script: &lt;script&gt;alert(1)&lt;/script&gt;
Koji
--
http://www.rondhuit.com/en/
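For a client that wants to request the encoder explicitly rather than rely on the solrconfig.xml default, a query might be built along these lines. This is a sketch under assumptions: the base URL, the field name, and the function name highlight_query_url are hypothetical, and it assumes a Solr version that includes SOLR-2021 and accepts the hl.encoder request parameter.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url: str, query: str, field: str) -> str:
    """Build a select URL that asks Solr to HTML-encode highlighted snippets."""
    params = {
        "q": query,
        "hl": "true",               # turn highlighting on
        "hl.fl": field,             # field to highlight (hypothetical name)
        "hl.encoder": "html",       # assumed available, per SOLR-2021
        "hl.simple.pre": "<em>",    # markup wrapped around each match
        "hl.simple.post": "</em>",
    }
    # urlencode percent-encodes the angle brackets in the pre/post values.
    return base_url + "?" + urlencode(params)
```

For example, highlight_query_url("http://localhost:8983/solr/select", "example", "content") yields a URL whose snippets should need no further escaping, provided the server applies the encoder as Koji describes.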