Posted to solr-user@lucene.apache.org by Matt Mitchell <go...@gmail.com> on 2009/04/23 21:13:09 UTC

storing xml - how to highlight hits in response?

Hi,

I'm storing some raw XML in Solr (stored and non-tokenized). I'd like to
highlight hits in the response, but this is problematic because the
highlighting elements are also XML. If I match an attribute value or tag
name, the XML response is messed up. Is there a way to highlight only text
that is not part of an XML tag? As in, only the text content?

Matt

RE: storing xml - how to highlight hits in response?

Posted by Ensdorf Ken <En...@zoominfo.com>.
> Yeah great idea, thanks. Does anyone know if there is code out there
> that
> will do this sort of thing?
>

Perhaps a much simpler option would be to use this:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceFilterFactory.html

with a regex of "<[^>]*>" or something like that - I'm no regex expert.  Of course it could get tricky to handle escaped characters and the like, but it may be a good enough poor man's solution.
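
To make the suggested pattern concrete, here is a minimal standalone sketch in Python (not Solr configuration) that applies the same regex to a string. The function name strip_tags and the sample strings are made up for illustration. It also shows one of the tricky cases: a ">" inside an attribute value ends the match early and leaks markup into the text.

```python
import re

# The pattern suggested above: "<", any run of characters other than ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(xml: str) -> str:
    """Replace each tag-like span with a space so neighbouring words stay separate."""
    return TAG_RE.sub(" ", xml)

# Hypothetical sample document, only for illustration.
sample = '<doc id="solr"><title>Apache Solr</title><body>search server</body></doc>'
print(strip_tags(sample).split())
# ['Apache', 'Solr', 'search', 'server']

# Tricky case: a literal ">" inside an attribute value cuts the match short,
# so part of the attribute survives as if it were text content.
tricky = '<a title="x > y">text</a>'
print(repr(strip_tags(tricky)))
```

For well-formed XML whose attribute values never contain a raw ">", this is probably good enough. Comments, CDATA sections and entity references would need extra handling.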

-Ken


Re: storing xml - how to highlight hits in response?

Posted by Matt Mitchell <go...@gmail.com>.
Yeah great idea, thanks. Does anyone know if there is code out there that
will do this sort of thing?

Matt


On Thu, Apr 23, 2009 at 3:23 PM, Ensdorf Ken <En...@zoominfo.com> wrote:

> > Hi,
> >
> > I'm storing some raw xml in solr (stored and non-tokenized). I'd like
> > to
> > highlight hits in the response, obviously this is problematic as the
> > highlighting elements are also xml. So if I match an attribute value or
> > tag
> > name, the xml response is messed up. Is there a way to highlight only
> > text,
> > that is not part of an xml element? As in, only the text content?
>
> You could create a custom Analyzer or Tokenizer that strips everything but
> the text content.
>
> -Ken
>
>

RE: storing xml - how to highlight hits in response?

Posted by Ensdorf Ken <En...@zoominfo.com>.
> Hi,
>
> I'm storing some raw xml in solr (stored and non-tokenized). I'd like
> to
> highlight hits in the response, obviously this is problematic as the
> highlighting elements are also xml. So if I match an attribute value or
> tag
> name, the xml response is messed up. Is there a way to highlight only
> text,
> that is not part of an xml element? As in, only the text content?

You could create a custom Analyzer or Tokenizer that strips everything but the text content.
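
As a rough illustration of "everything but the text content", here is a small Python sketch of the idea. It is not a Solr Analyzer or Tokenizer, and the names highlight_text_only, pre and post are assumptions for the example. It splits the XML into tags and text, then wraps matches only in the text pieces, so tag names and attribute values are left alone.

```python
import re

# Capturing group: re.split keeps the tags, so the result alternates
# text, tag, text, tag, ... with text at the even indexes.
TAG_SPLIT_RE = re.compile(r"(<[^>]*>)")

def highlight_text_only(xml: str, term: str,
                        pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap case-insensitive matches of term, but only outside of tags."""
    term_re = re.compile(re.escape(term), re.IGNORECASE)
    out = []
    for i, part in enumerate(TAG_SPLIT_RE.split(xml)):
        if i % 2 == 1:
            out.append(part)  # a tag: copy through untouched
        else:
            out.append(term_re.sub(lambda m: pre + m.group(0) + post, part))
    return "".join(out)

# Hypothetical sample: "title" appears as a tag name, an attribute value and as text.
sample = '<title lang="title">The title page</title>'
print(highlight_text_only(sample, "title"))
# <title lang="title">The <em>title</em> page</title>
```

The same simple tag pattern is used here, so the caveats about escaped characters and ">" inside attribute values still apply.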

-Ken