Posted to solr-user@lucene.apache.org by Bruno Mannina <bm...@free.fr> on 2016/07/29 21:50:33 UTC

How to index text field with html entities ?

Dear Solr User,

Solr 5.0.1

I have several XML files that contain HTML entities in some fields.

I have an author field (English text) with this kind of text:

Brown &amp; Gammon

If I set my field like this:

<field name="au">Brown &amp; Gammon</field>

Solr reports the error "Undeclared general entity".

If I add CDATA like this:

<field name="au"><![CDATA[Brown &amp; Gammon]]></field>

it seems that I can't search with the &

au:"brown & gammon"

Could you help me find the right syntax?

Thanks a lot,

Bruno




---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus


Re: How to index text field with html entities ?

Posted by Bruno Mannina <bm...@free.fr>.
Thanks, Shawn, for these clarifications.





Re: How to index text field with html entities ?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/29/2016 4:05 PM, Bruno Mannina wrote:
> after checking my log it seems that it concerns only some html entities.
> No problem with &amp; but I have problem with:
>
> &uuml;
> &ldquo;
> etc...

Those are valid *HTML* entities, but they are not valid *XML* entities. 
The list of entities that are valid in XML is quite short -- there are
only five of them.

https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML

When Solr processes XML, it is only going to convert entities that are
valid for XML -- the five already mentioned.  It will fail on the other
247 entities that are only valid for HTML.
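
The difference can be reproduced outside Solr with any standard XML parser. The sketch below uses Python's standard-library parser rather than Solr's own Java parser, so the exact error text will differ from what Solr logs; the helper name parses_as_xml and the "Müller" sample value are only illustrations.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(snippet: str) -> bool:
    """Return True if a plain XML parser accepts the snippet."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        # Raised for undefined entities such as &uuml; (no DTD declares them).
        return False


# &amp; is one of the five entities predefined by XML (&amp; &lt; &gt; &quot; &apos;).
print(parses_as_xml('<field name="au">Brown &amp; Gammon</field>'))

# &uuml; is defined by HTML only, so an XML parser rejects it.
print(parses_as_xml('<field name="au">M&uuml;ller</field>'))
```

Running a check like this over the input files before posting them is one way to find which documents contain HTML-only entities.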

If you are seeing the problem with &amp; (which is one of the five valid
XML entities) then we'll need the Solr version and the full error
message/stacktrace from the solr logfile.

Thanks,
Shawn


Re: How to index text field with html entities ?

Posted by Bruno Mannina <bm...@free.fr>.
Hi Chris,

Thanks for your answer. One thing to add:

After checking my log, it seems that the problem only concerns some HTML entities.
There is no problem with &amp;, but I have problems with:

&uuml;
&ldquo;
etc...

I will look into your suggestions to find a solution.

Thanks !





Re: How to index text field with html entities ?

Posted by Chris Hostetter <ho...@fucit.org>.
: I have several xml files that contains html entities in some fields.

	...

: If I set my field like this:
: 
: <field name="au">Brown &amp; Gammon</field>
: 
: Solr generates error "Undeclared general entity"

...because that's not valid XML...

: if I add CDATA like this:
: 
: <field name="au"><![CDATA[Brown &amp; Gammon]]></field>
: 
: it seems that I can't search with the &

...because that is valid xml, and tells solr you want the literal string 
"Brown &amp; Gammon" to be indexed -- given a typical analyzer you are 
probably getting either "&amp;" or "amp" as a term in your index.
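
To illustrate the CDATA behaviour, here is a small sketch using Python's standard-library XML parser (not Solr itself; the helper name field_text is hypothetical). It shows the text a parser hands on for each form of the field: outside CDATA the entity is decoded, inside CDATA it is kept character for character.

```python
import xml.etree.ElementTree as ET


def field_text(snippet: str) -> str:
    """Return the text content an XML parser extracts from a single element."""
    return ET.fromstring(snippet).text


# Outside CDATA, &amp; is decoded to a plain ampersand.
plain = field_text('<field name="au">Brown &amp; Gammon</field>')

# Inside CDATA, nothing is decoded: the five characters "&amp;" stay literal.
cdata = field_text('<field name="au"><![CDATA[Brown &amp; Gammon]]></field>')

print(plain)
print(cdata)
```

The second value is what the analyzer sees when the field is wrapped in CDATA, which is why a query for "brown & gammon" does not behave as expected.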

: Could you help me to find the right syntax ?

the client code you are using for indexing can either "parse" these HTML 
snippets using an HTML parser, and then send solr the *real* string you 
want to index, or you can configure solr with something like 
HTMLStripFieldUpdateProcessorFactory (if you want both the indexed form 
and the stored form to be plain text) or HTMLStripCharFilterFactory (if 
you want to preserve the html markup in the stored value, but strip it as 
part of the analysis chain for indexing).


http://lucene.apache.org/solr/6_1_0/solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
http://lucene.apache.org/core/6_1_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html
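
For the client-side option, a minimal sketch in Python might look like the following. It assumes the raw values contain HTML entities but no HTML tags, and that the resulting file is posted to Solr as UTF-8; the function name to_solr_field is hypothetical. It decodes every HTML entity to the real character, then re-escapes only what XML requires.

```python
import html
from xml.sax.saxutils import escape, quoteattr


def to_solr_field(name: str, raw_value: str) -> str:
    """Build a <field> element from a value that may contain HTML entities."""
    # Decode HTML entities (&uuml;, &ldquo;, &amp;, ...) to real characters.
    text = html.unescape(raw_value)
    # Re-escape only &, < and > so the result is well-formed XML.
    return "<field name={}>{}</field>".format(quoteattr(name), escape(text))


print(to_solr_field("au", "Brown &amp; Gammon"))
print(to_solr_field("au", "M&uuml;ller &ldquo;Sons&rdquo;"))
```

With this approach Solr receives the actual characters, so the same value ends up in both the stored and the indexed form.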


-Hoss
http://www.lucidworks.com/