Posted to solr-user@lucene.apache.org by "steve.christin@gmail.com" <st...@gmail.com> on 2007/09/25 12:53:07 UTC
Problem with html code inside xml
Hello,
I have a problem with HTML code that is embedded in an XML file.
Sample source:
<content>
<stories>
<div class="storyTitle">
Les débats
</div>
<div class="storyIntroductionText">
Le premier tour des élections fédérales se déroulera le 21
octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
vous, dont plusieurs grands débats à l'enseigne de Forums.
</div>
<div class="paragraph">
<div class="paragraphTitle"/>
<div class="paragraphText">
my para textehere
<br/>
<br/>
Vous trouverez sur cette page toutes les dates et les heures de
ces différents rendez-vous ainsi que le nom et les partis des
débatteurs. De plus, vous pourrez également écouter ou réécouter
l'ensemble de ces émissions.
</div>
</div>
....
---------
When I make a query on Solr, I get something like this in the
source of the XML result:
<td xmlns="http://www.w3.org/1999/xhtml">
<span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraph"</span>
<span class="markup">&gt;</span><div class="expander-content">
<div class="indent"><span class="markup">&lt;</span>
<span class="start-tag">div</span>
<span class="attribute-name">class</span>
<span class="markup">=</span>
<span class="attribute-value">"paragraphTitle"</span>
<span class="markup">/&gt;</span></div><table><tr>
<td class="expander">−<div class="spacer"/>
</td><td><span class="markup">&lt;</span>
...
That is not what I want. I want to keep the HTML tags as they are,
without any extra formatting.
The br and a tags are well formed in the XML and JSON results, but
the div tags are not kept.
---------
In schema.xml I have this for the HTML content:
<fieldType name="html" class="solr.TextField" />
<field name="storyFullText" type="html" indexed="true"
stored="true" multiValued="true"/>
---------
Any help would be appreciated.
Thanks in advance.
S. Christin
Re: Problem with html code inside xml
Posted by Yonik Seeley <yo...@apache.org>.
On Fri, Mar 7, 2008 at 5:11 PM, Latj <jt...@gmail.com> wrote:
> When I use HTML::Entities to encode my text, I get this error:
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
> named 'para'
>
> It's complaining about finding &para; in my text. Anyone know why this
> is a problem?
&para; is an HTML entity, not standard in XML.
-Yonik
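The point above can be illustrated with a short sketch. Python is used only for illustration (the thread itself mentions Perl's HTML::Entities), and the helper name `to_xml_text` is hypothetical. XML predefines only five named entities (lt, gt, amp, quot, apos), so an HTML-only name such as para is unknown to an XML parser unless a DTD declares it. One possible approach, assuming the text has already been HTML-entity-encoded, is to decode the HTML entities back to characters and then escape only what XML reserves:

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_xml_text(html_encoded):
    """Turn HTML-entity-encoded text into text that is safe inside XML."""
    # Decode HTML named entities (&para;, &eacute;, ...) to real characters.
    decoded = html.unescape(html_encoded)
    # Escape only the characters XML itself reserves in text: &, < and >.
    return escape(decoded)


# An HTML-only entity inside XML is rejected by Python's parser as well.
broken = '<field name="storyFullText">&para; Les d&eacute;bats</field>'
try:
    ET.fromstring(broken)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

# After conversion the same content parses, and the text survives intact.
fixed = (
    '<field name="storyFullText">'
    + to_xml_text("&para; Les d&eacute;bats &amp; <br/>")
    + "</field>"
)
field = ET.fromstring(fixed)
```

Here `parsed_ok` ends up false for the first document, while `field.text` holds the decoded characters and the literal `<br/>` tag for the second.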
Re: Problem with html code inside xml
Posted by Reece <li...@gmail.com>.
Just use a CDATA section so that the parser ignores the HTML characters.
http://www.w3schools.com/xml/xml_cdata.asp
-Reece
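A minimal sketch of the CDATA approach, in Python for illustration only. The helper name `cdata` and the sample fragment are hypothetical. The one thing a CDATA section cannot contain is the sequence `]]>`, so the sketch splits the section around it:

```python
import xml.etree.ElementTree as ET


def cdata(text):
    """Wrap text in a CDATA section, splitting any literal ']]>' it contains."""
    # ']]>' would end the section early, so close and reopen around it.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


html_fragment = '<div class="paragraph">my text<br/>a &amp; b</div>'
field_xml = '<field name="storyFullText">' + cdata(html_fragment) + "</field>"

# The parser hands the markup back as plain character data, not as elements.
parsed = ET.fromstring(field_xml)
```

Note that nothing inside a CDATA section is interpreted: an entity reference such as `&amp;` in the fragment stays as those five literal characters rather than being resolved.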
On Fri, Mar 7, 2008 at 5:11 PM, Latj <jt...@gmail.com> wrote:
> When I use HTML::Entities to encode my text, I get this error:
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
> named 'para'
>
> It's complaining about finding &para; in my text. Anyone know why this
> is a problem?
Re: Problem with html code inside xml
Posted by Latj <jt...@gmail.com>.
When I use HTML::Entities to encode my text, I get this error:
SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve entity
named 'para'
It's complaining about finding &para; in my text. Anyone know why this
is a problem?
Jérôme Etévé-2 wrote:
>
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
>
> <field name="storyFullText">
> <html></html>
> </field>
>
> I think you should encode your content to protect these xml entities:
> < -> &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
>
> If you use perl, have a look at HTML::Entities.
>
>
Re: Problem with html code inside xml
Posted by "steve.christin@gmail.com" <st...@gmail.com>.
Well... the XML output has changed, and I now receive
<strong>hhhhhh<strong> sic!
So the problem is no longer a problem...
Thanks
Steve
On 3 Oct 2007, at 01:09, Chris Hostetter wrote:
> : I created a field type:
> :
> : <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
>
> ...
>
> : Everything works (the div tags, p tags are removed) but some
> : <strong>nnn</strong> or <br/> tags are still in the text after indexing.
>
> i cut/paste that fieldtype into the example schema.xml, and experimented
> with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and
> both of those examples were correctly stripped.
>
> do you have a more specific example of something that doesn't work?
>
> Hmm... it seems like maybe the problem is examples like this...
> blahblah<string>nnn</strong>
> ...if the tag is directly adjacent to other text, it may not get stripped
> off ... i'm not sure if that's specific to the HtmlWhitespaceTokenizer.
>
> -Hoss
Re: Problem with html code inside xml
Posted by Chris Hostetter <ho...@fucit.org>.
: I created a field type:
:
: <fieldType name="htmlTxt" class="solr.TextField" positionIncrementGap="100">
...
: Everything works (the div tags, p tags are removed) but some
: <strong>nnn</strong> or <br/> tags are still in the text after indexing.
i cut/paste that fieldtype into the example schema.xml, and experimented
with the analysis tool (http://localhost:8983/solr/admin/analysis.jsp) and
both of those examples were correctly stripped.
do you have a more specific example of something that doesn't work?
Hmm... it seems like maybe the problem is examples like this...
blahblah<string>nnn</strong>
...if the tag is directly adjacent to other text, it may not get stripped
off ... i'm not sure if that's specific to the HtmlWhitespaceTokenizer.
-Hoss
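To check independently what text should remain once the tags are gone, including tags that sit directly against a word, a small standalone sketch can help. This is Python's standard `html.parser`, not Solr's tokenizer, so it says nothing about how Solr behaves; the names `TextExtractor` and `strip_tags` are hypothetical. Replacing each tag with a space, rather than with nothing, is a design choice made here so that words on either side of a tag are not glued together:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and put a space wherever a tag occurred."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of markup with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of the input
    return " ".join("".join(parser.parts).split())


adjacent = strip_tags("blahblah<strong>nnn</strong>")
line_break = strip_tags("one<br/>two")
```

Comparing output like this with what the analysis page shows for the same input may help narrow down which inputs are not being stripped.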
Re: Problem with html code inside xml
Posted by "steve.christin@gmail.com" <st...@gmail.com>.
Thanks
I use this solution:
put <![CDATA[ Here my html code ]]> in the XML to be indexed, and
it works; nothing needs to change in the XSL.
In the schema I use this fieldType:
<fieldType name="html" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
----------
Now a question:
I created a field to index only the text of this HTML code,
with this field type:
<fieldType name="htmlTxt" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Everything works (the div tags and p tags are removed), but some
<strong>nnn</strong> or <br/> tags are still in the text after
indexing.
If you have any idea how to solve this problem, it would be great.
Thanks
S. Christin
-------------
On 25 Sep 2007, at 13:14, Thorsten Scherler wrote:
> On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
>> If I understand, you want to keep the raw html code in solr like that
>> (in your posting xml file):
>>
>> <field name="storyFullText">
>> <html></html>
>> </field>
>>
>> I think you should encode your content to protect these xml entities:
>> < -> &lt;
>> > -> &gt;
>> " -> &quot;
>> & -> &amp;
>>
>> If you use perl, have a look at HTML::Entities.
>
> AFAIR you cannot use tags; they always get transformed to
> entities. The solution is to run an XSL transformation on the
> response that turns the entities back into tags.
>
> Have a look at the thread
> http://marc.info/?t=116775837900001&r=1&w=2
> and especially at
> http://marc.info/?l=solr-user&m=116782664828926&w=2
>
> HTH
>
> salu2
Re: Problem with html code inside xml
Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
>
> <field name="storyFullText">
> <html></html>
> </field>
>
> I think you should encode your content to protect these xml entities:
> < -> &lt;
> > -> &gt;
> " -> &quot;
> & -> &amp;
>
> If you use perl, have a look at HTML::Entities.
AFAIR you cannot use tags; they always get transformed to
entities. The solution is to run an XSL transformation on the
response that turns the entities back into tags.
Have a look at the thread
http://marc.info/?t=116775837900001&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
HTH
salu2
--
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
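As a complement to the XSL route described above, a client that parses the response with an XML parser gets the entities turned back into characters by the parser itself. The sketch below is Python for illustration only; `sample_response` is a hand-written, simplified stand-in for a Solr XML response (its exact layout is an assumption), and `stored_values` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a response (an assumption, not real output).
sample_response = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="storyFullText">
        <str>&lt;div class="paragraph"&gt;my text&lt;br/&gt;&lt;/div&gt;</str>
      </arr>
    </doc>
  </result>
</response>"""


def stored_values(response_xml, field_name):
    """Return the stored values of field_name from every doc in the response."""
    root = ET.fromstring(response_xml)
    values = []
    for doc in root.iter("doc"):
        for node in doc:
            if node.get("name") != field_name:
                continue
            if node.tag == "arr":
                # Multi-valued field: one child element per value.
                values.extend(child.text or "" for child in node)
            else:
                values.append(node.text or "")
    return values


values = stored_values(sample_response, "storyFullText")
```

The strings in `values` contain the original markup with real angle brackets, ready to be written into a page.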
Re: Problem with html code inside xml
Posted by Jérôme Etévé <je...@eteve.net>.
If I understand, you want to keep the raw html code in solr like that
(in your posting xml file):
<field name="storyFullText">
<html></html>
</field>
I think you should encode your content to protect these xml entities:
< -> &lt;
> -> &gt;
" -> &quot;
& -> &amp;
If you use perl, have a look at HTML::Entities.
--
Jerome Eteve.
jerome@eteve.net
http://jerome.eteve.free.fr/
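The encoding step suggested in this message does not have to be done by hand if the posting file is produced with an XML library, since the serializer escapes text content itself. A minimal sketch in Python, for illustration only; `build_add_xml`, the `id` field, and the value `story-1` are hypothetical, while `storyFullText` comes from the schema quoted in the thread:

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build an <add><doc>...</doc></add> document from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        # Assigning to .text stores raw characters; escaping happens on output.
        field.text = value
    return ET.tostring(add, encoding="unicode")


raw_html = '<div class="storyTitle">Les débats</div>'
payload = build_add_xml([("id", "story-1"), ("storyFullText", raw_html)])

# Parsing the payload again gives back the original markup unchanged.
round_trip = [f.text for f in ET.fromstring(payload).iter("field")]
```

In `payload` the angle brackets of the HTML appear as `&lt;` and `&gt;`, and `round_trip` shows that a parser restores them.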