Posted to solr-user@lucene.apache.org by okayndc <bo...@gmail.com> on 2011/09/25 15:00:22 UTC

escaping HTML tags within XML file

Hello,

I was wondering whether it is necessary to escape HTML tags within an XML
file for indexing. If so, it seems like a large XML file with tons of HTML
tags could get really messy (using CDATA).
Has this been your experience? Do you escape the HTML tags? If so, what
technique do you use? Or do you leave the HTML tags in place without
escaping them?

Thanks!

Re: escaping HTML tags within XML file

Posted by pu...@gmail.com.

Yes sir!

On Sep 25, 2011, at 4:06 PM, okayndc <bo...@gmail.com> wrote:

> Here is a representation of the XML file...
> 
> <root>
> <commenter>
> <comment><p>Text here</p><img src="image.gif" /><p>More text
> here....</p></comment>
> </commenter>
> </root>
> 
> I want to keep the HTML tags because it keeps the formatting (paragraph
> tags, etc) intact for the output.  Seems like you're saying that the HTML
> can be kept intact with the use of a HTML field type without having to
> escape the HTML tags?
> 

Re: escaping HTML tags within XML file

Posted by Michael Sokolov <so...@ifactory.com>.

Yes - you can index just the HTML text while keeping the tags in place in 
the stored field by using HTMLCharFilter (or possibly XMLCharFilter).  But 
you will find that embedding HTML inside XML can be problematic, since 
HTML does not have to satisfy the well-formedness constraints that XML 
requires.  For example, old-style paragraph tags in HTML were often not 
closed: just <p> with no </p>.  If you have markup like that, you won't 
be able to embed it in XML without escaping the < character.  You never 
said why you are embedding HTML in XML, though.

-Mike
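
To make the well-formedness point concrete, here is a small illustrative
sketch. The element names follow the sample used in this thread; the HTML
content is made up. HTML with unclosed <p> tags can only travel inside XML
as text, either by escaping the markup characters or by wrapping the value
in a CDATA section:

```xml
<root>
  <commenter>
    <!-- Option 1: escape the markup so the XML parser sees plain text -->
    <comment>&lt;p&gt;First paragraph&lt;p&gt;Second paragraph</comment>
  </commenter>
  <commenter>
    <!-- Option 2: wrap the value in a CDATA section -->
    <comment><![CDATA[<p>First paragraph<p>Second paragraph]]></comment>
  </commenter>
</root>
```

Either way, an XML parser hands back the same string. CDATA is easier to
read, but the content must not contain the sequence ]]>, so escaping is the
more robust choice for arbitrary HTML.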

On 9/25/2011 5:06 PM, okayndc wrote:
> Here is a representation of the XML file...
>
> <root>
> <commenter>
> <comment><p>Text here</p><img src="image.gif" /><p>More text
> here....</p></comment>
> </commenter>
> </root>
>
> I want to keep the HTML tags because it keeps the formatting (paragraph
> tags, etc) intact for the output.  Seems like you're saying that the HTML
> can be kept intact with the use of a HTML field type without having to
> escape the HTML tags?
>

Re: escaping HTML tags within XML file

Posted by okayndc <bo...@gmail.com>.

Here is a representation of the XML file...

<root>
<commenter>
<comment><p>Text here</p><img src="image.gif" /><p>More text
here....</p></comment>
</commenter>
</root>

I want to keep the HTML tags because they keep the formatting (paragraph
tags, etc.) intact for the output.  It seems like you're saying that the
HTML can be kept intact by using an HTML field type, without having to
escape the HTML tags?
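
For reference, here is a sketch of how the sample might look as a Solr XML
update message if the HTML should arrive as the value of a single field.
The field names id and comment are assumptions and would have to match
whatever schema.xml defines. Note that in the sample as written above, an
XML parser treats <p> and <img> as child elements of <comment>, not as
text, so some form of escaping (or CDATA) is still needed to deliver the
markup as a string:

```xml
<add>
  <doc>
    <!-- "id" and "comment" are hypothetical field names -->
    <field name="id">1</field>
    <field name="comment">&lt;p&gt;Text here&lt;/p&gt;&lt;img src="image.gif" /&gt;&lt;p&gt;More text here....&lt;/p&gt;</field>
  </doc>
</add>
```

Most XML libraries apply this escaping automatically when a string is set
as element text, so it rarely has to be done by hand.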

On Sun, Sep 25, 2011 at 2:52 PM, <pu...@gmail.com> wrote:

> Assuming that the XML has the HTML as values inside fully formed tags like
> so:
> <node><HTML></HTML></node> then I think that using the "HTML" field type in
> schema.xml for indexing/storing will allow you to do meaningful searches on
> the content of the HTML without getting confused by the HTML syntax itself.
>
> If you have absolutely no need for the entire stored HTML when presenting
> results to the user then stripping out the syntax at index time makes sense.
> This will adversely affect highlighting of  that document field as well so
> just know your requirements.
>
> If you don't want to present anything at all then don't store, just index
> and use the right field type (HTML) such that search results find the right
> document. Just because a field is helpful in finding the doc, doesn't mean
> folks always want to present it or store it.
>
> With Data Import Handler an HTML stripping transformer is present so that it
> is removed before the indexer gets its hands on things. I can't be sure if
> that is how you get your data into Solr.
>
> - Pulkit
>

Re: escaping HTML tags within XML file

Posted by pu...@gmail.com.

Assuming that the XML has the HTML as values inside fully formed tags like so:
<node><HTML></HTML></node> then I think that using the "HTML" field type in schema.xml for indexing/storing will allow you to do meaningful searches on the content of the HTML without getting confused by the HTML syntax itself.

If you have absolutely no need for the entire stored HTML when presenting results to the user, then stripping out the syntax at index time makes sense. This will adversely affect highlighting of that document field as well, so just know your requirements.

If you don't want to present anything at all, then don't store the field; just index it and use the right field type (HTML) so that searches find the right document. Just because a field is helpful in finding the doc doesn't mean folks always want to present or store it.
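
As a rough sketch of what such a field type could look like in schema.xml: stock Solr ships solr.HTMLStripCharFilterFactory, which is probably what the "HTMLCharFilter" mentioned elsewhere in this thread refers to. The names text_html and comment are made up for illustration, and the tokenizer and filter are just one plausible choice.

```xml
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- removes HTML markup from the character stream before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stored="true": results return the value exactly as submitted, tags included -->
<field name="comment" type="text_html" indexed="true" stored="true"/>

<!-- alternative when the field is only used for matching and never displayed -->
<!-- <field name="comment" type="text_html" indexed="true" stored="false"/> -->
```

The char filter only affects analysis, so the indexed terms come from the text alone, while the stored value (if any) is the original input.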

With Data Import Handler, an HTML-stripping transformer is available, so the markup is removed before the indexer gets its hands on things. I can't be sure whether that is how you get your data into Solr.
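
A minimal sketch of that transformer in a DIH data-config.xml entity, assuming a database source: the table name comments and the columns id and body are hypothetical, and the surrounding dataConfig/dataSource/document elements are omitted.

```xml
<entity name="comment"
        query="SELECT id, body FROM comments"
        transformer="HTMLStripTransformer">
  <field column="id" name="id"/>
  <!-- stripHTML="true" removes the markup from this column before indexing -->
  <field column="body" name="comment" stripHTML="true"/>
</entity>
```

Because the transformer runs before the document reaches the indexer, the stored value loses its tags as well, so this only fits the case where the original HTML is not needed in the results.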

- Pulkit

On Sep 25, 2011, at 8:00 AM, okayndc <bo...@gmail.com> wrote:

> Hello,
> 
> Was wondering if it is necessary to escape HTML tags within an XML file for
> indexing?  If so, seems like a large XML files with tons of HTML tags could
> get really messy (using CDATA).
> Has this been your experience?  Do you escape the HTML tags? If so, what
> technique do you use? Or do you leave the HTML tags in place without
> escaping them?
> 
> Thanks!