Posted to solr-user@lucene.apache.org by "steve.christin@gmail.com" <st...@gmail.com> on 2007/09/25 12:53:07 UTC

Problem with html code inside xml

Hello,

I have a problem with HTML code that is embedded in an XML file.

Sample source:

<content>
	<stories>
		<div class="storyTitle">
			 Les débats
		</div>
		<div class="storyIntroductionText">
			Le premier tour des élections fédérales se déroulera le 21  
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez- 
vous, dont plusieurs grands débats à l'enseigne de Forums.
		</div>
		<div class="paragraph">
			<div class="paragraphTitle"/>
			<div class="paragraphText">
				my para textehere
				<br/>
				<br/>
				Vous trouverez sur cette page toutes les dates et les heures de  
ces différents rendez-vous ainsi que le nom et les partis des  
débatteurs. De plus, vous pourrez également écouter ou réécouter  
l'ensemble de ces émissions.
			</div>
		</div>
....
---------
When I run a query on Solr, I get something like this in the
source of the XML result:

<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...

This is not what I want. I want to keep the HTML tags as they are,
without any added formatting.

The br and a tags come through well formed in the XML and JSON results,
but the div tags are not kept.
---------
In schema.xml I have this for the HTML content:

<fieldType name="html" class="solr.TextField" />

  <field name="storyFullText" type="html" indexed="true" stored="true" multiValued="true"/>

---------

Any help would be appreciated.

Thanks in advance.

S. Christin






Re: Problem with html code inside xml

Posted by Yonik Seeley <yo...@apache.org>.
On Fri, Mar 7, 2008 at 5:11 PM, Latj <jt...@gmail.com> wrote:
>  When I use HTML::Entities to encode my text, I get this error:
>
>  SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
>  named 'para'
>
>  It's complaining about finding   &para;   in my text. Does anyone know why
>  this is a problem?

&para; is an HTML entity, not standard in XML.

-Yonik
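
To illustrate the point above: XML predefines only five entities (&lt;, &gt;, &amp;, &quot; and &apos;), so a named HTML entity such as &para; is rejected unless a DTD declares it. The sketch below is in Python rather than the Perl mentioned in the thread, and shows one possible workaround: decode the HTML entities to real characters first, then escape only what XML requires. The helper names to_xml_text and parses_as_xml are made up for this example.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_xml_text(html_encoded: str) -> str:
    """Decode HTML named entities, then escape only the XML special characters."""
    decoded = html.unescape(html_encoded)  # "&para;" becomes the character U+00B6
    return escape(decoded)                 # escapes &, < and > only


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is well formed inside a <field> element."""
    try:
        ET.fromstring(f"<field>{fragment}</field>")
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    raw = "&lt;p&gt;Section &para; 3&lt;/p&gt;"
    print(parses_as_xml(raw))               # the undeclared entity is rejected
    print(parses_as_xml(to_xml_text(raw)))  # decoded and re-escaped text parses
```

An alternative with the same effect would be to emit numeric character references (for example &#182;), which XML accepts without any declaration.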

Re: Problem with html code inside xml

Posted by Reece <li...@gmail.com>.
Just use CDATA to have the parser ignore the HTML characters.

http://www.w3schools.com/xml/xml_cdata.asp

-Reece
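
A minimal sketch of the CDATA approach, in Python for illustration. The one pitfall is a literal "]]>" inside the content, which would end the section early; the usual trick is to split it across two CDATA sections. The function names cdata and field_xml are hypothetical, and the field name is assumed to be a plain identifier that needs no escaping.

```python
import xml.etree.ElementTree as ET


def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def field_xml(name: str, html_fragment: str) -> str:
    """Build one <field> element whose content is carried as CDATA.

    Assumes `name` is a simple identifier with no quotes or angle brackets.
    """
    return f'<field name="{name}">{cdata(html_fragment)}</field>'


if __name__ == "__main__":
    snippet = '<div class="paragraph">my para text &para;<br/></div>'
    xml_text = field_xml("storyFullText", snippet)
    print(xml_text)
    # A conforming parser hands the original markup back as plain text.
    print(ET.fromstring(xml_text).text == snippet)
```

Inside a CDATA section, &para; is just literal text, so the entity error quoted below does not arise.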



On Fri, Mar 7, 2008 at 5:11 PM, Latj <jt...@gmail.com> wrote:
>
>
>  When I use HTML::Entities to encode my text, I get this error:
>
>  SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
>  named 'para'
>
>  It's complaining about finding   &para;   in my text. Does anyone know why
>  this is a problem?

Re: Problem with html code inside xml

Posted by Latj <jt...@gmail.com>.

When I use HTML::Entities to encode my text, I get this error:

SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
named 'para'

It's complaining about finding   &para;   in my text. Does anyone know why
this is a problem?





Jérôme Etévé-2 wrote:
> 
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
> 
> <field name="storyFullText">
>   <html></html>
> </field>
> 
> I think you should encode your content to protect these xml entities:
> <  ->  &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
> 
> If you use perl, have a look at HTML::Entities.

Re: Problem with html code inside xml

Posted by "steve.christin@gmail.com" <st...@gmail.com>.
Well... the XML output has changed, and I now receive
&lt;strong&gt;hhhhhh&lt;strong&gt;   (sic!)

So the problem is not a problem after all...

Thanks

Steve

On 3 Oct 2007, at 01:09, Chris Hostetter wrote:

> do you have a more specific example of something that doesn't work?


Re: Problem with html code inside xml

Posted by Chris Hostetter <ho...@fucit.org>.
: I created a field type:
: 
: <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">

	...

: Everything works (the div tags, p tags are removed) but some
: <strong>nnn</strong>   or <br/> tags are still in the text after indexing.

i cut/paste that fieldtype into the example schema.xml, and experimented
with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and
both of those examples were correctly stripped.

do you have a more specific example of something that doesn't work?

Hmm... it seems like maybe the problem is examples like this...
	blahblah<strong>nnn</strong>
...if the tag is directly adjacent to other text, it may not get stripped
off ... i'm not sure if that's specific to the
HTMLStripWhitespaceTokenizerFactory.
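
One way to sidestep tokenizer edge cases like the one described above is to strip the markup on the client before the text reaches Solr. The following sketch uses Python's standard html.parser module. It is not how Solr's HTML-stripping tokenizer works internally, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Treat every tag as a word boundary so adjacent words do not merge.
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the fragment's text with tags removed and whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(fragment)
    extractor.close()  # flush any text still buffered by the parser
    return " ".join("".join(extractor.parts).split())


if __name__ == "__main__":
    print(strip_html("blahblah<strong>nnn</strong>"))
    print(strip_html("<div><p>one<br/>two</p></div>"))
```

Treating every tag as a word break is a deliberate simplification: it keeps "blahblah" and "nnn" apart, at the cost of also splitting a word that is only partly wrapped in an inline tag.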




-Hoss

Re: Problem with html code inside xml

Posted by "steve.christin@gmail.com" <st...@gmail.com>.
Thanks

I use this solution:

I put  <![CDATA[  my HTML code here  ]]>  in the XML to be indexed and
it works; nothing needs to change in the XSL.

In the schema I use this fieldType:

<fieldType name="html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

----------
Now a question:
I created a field to index only the text of this HTML code.

I created this field type for it:

<fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Everything works (the div and p tags are removed), but some
<strong>nnn</strong>   or <br/> tags are still in the text after
indexing.

If you have any idea how to solve this problem, that would be great.

Thanks

S. Christin



-------------


On 25 Sep 2007, at 13:14, Thorsten Scherler wrote:

> AFAIR you cannot use tags, they always are getting transformed to
> entities. The solution is to have a xsl transformation after the
> response that transforms the entities back to tags.


Re: Problem with html code inside xml

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
> 
> <field name="storyFullText">
>   <html></html>
> </field>
> 
> I think you should encode your content to protect these xml entities:
> <  ->  &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
> 
> If you use perl, have a look at HTML::Entities.

AFAIR you cannot use raw tags; they are always transformed to
entities. The solution is an XSL transformation applied to the
response that turns the entities back into tags.

Have a look at the thread 
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
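
As an alternative to an XSL step, a client that reads the response with an XML parser gets the entities decoded automatically when it asks for the field text. The sketch below assumes a response shaped like Solr's standard XML output, with a multiValued field rendered as an arr element containing str elements. The sample document and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; a real one would come from an HTTP query to Solr.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="storyFullText">
        <str>&lt;div class="paragraph"&gt;my para text&lt;br/&gt;&lt;/div&gt;</str>
      </arr>
    </doc>
  </result>
</response>"""


def stored_values(response_xml: str, field: str) -> list:
    """Return the decoded string values of a multiValued field across all docs."""
    root = ET.fromstring(response_xml)
    values = []
    for arr in root.iter("arr"):
        if arr.get("name") == field:
            # .text is already entity-decoded by the parser.
            values.extend(s.text or "" for s in arr.findall("str"))
    return values


if __name__ == "__main__":
    for value in stored_values(SAMPLE_RESPONSE, "storyFullText"):
        print(value)
```

The escaped form in the response is only the transport encoding; the stored value itself still contains the original tags.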

HTH

salu2

-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions


Re: Problem with html code inside xml

Posted by Jérôme Etévé <je...@eteve.net>.
If I understand correctly, you want to keep the raw HTML code in Solr
like this (in the XML file you post):

<field name="storyFullText">
  <html></html>
</field>

I think you should encode your content to escape these XML special
characters:
<  ->  &lt;
>  ->  &gt;
"  ->  &quot;
&  ->  &amp;

If you use Perl, have a look at HTML::Entities.
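
The mapping above can be applied in a few lines. This sketch uses Python's standard library instead of Perl's HTML::Entities; the add/doc wrapper follows Solr's XML update format, and the function name solr_add_doc is an assumption made for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def solr_add_doc(fields: dict) -> str:
    """Serialize a flat dict of field values as a Solr <add> update message."""
    parts = ["<add><doc>"]
    for name, value in fields.items():
        # escape() handles &, < and >; the extra entry also covers double quotes.
        body = escape(str(value), {'"': "&quot;"})
        parts.append(f"<field name={quoteattr(name)}>{body}</field>")
    parts.append("</doc></add>")
    return "".join(parts)


if __name__ == "__main__":
    html_fragment = '<div class="paragraph">my para text<br/></div>'
    print(solr_add_doc({"storyFullText": html_fragment}))
```

Only the four characters listed above are rewritten, so no HTML-only named entities are introduced into the posted XML.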




-- 
Jerome Eteve.
jerome@eteve.net
http://jerome.eteve.free.fr/