Posted to solr-user@lucene.apache.org by Chambeda <ch...@gmail.com> on 2012/04/17 21:11:07 UTC

HTML Indexing error

Hi All,

I am trying to parse some text that contains embedded HTML elements and am
getting the following error:

FATAL: Solr returned an error #400 Unexpected close tag </PROD_DESC>;
expected </br>.

My setup is as follows:

schema.xml

<fieldType name="html_text" class="solr.TextField" indexed="true">
      <analyzer>
       <charFilter class="solr.HTMLStripCharFilterFactory"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
</fieldType>

<field name="PROD_DESC" type="html_text" indexed="false" stored="true"/>
<field name="DOCUMENTID" type="string" indexed="true" stored="true"
required="true"/>

XML snippet:
<PRODUCT><DOCUMENTID>1</DOCUMENTID><PROD_DESC>Bose's best bookshelf speakers
are updated to provide an even more spacious, natural listening experience.
They're great for stereo, or as a front- or rear-channel solution for home
theater.<br><br>Learn more about Bose products and proprietary technologies
in our Bose Store.</PROD_DESC></PRODUCT>

According to the documentation, the <br> tags should be removed correctly.

Anything I am missing?



Re: HTML Indexing error

Posted by Gora Mohanty <go...@mimirtech.com>.
On 18 April 2012 00:41, Chambeda <ch...@gmail.com> wrote:
> Hi All,
>
> I am trying to parse some text that contains embedded HTML elements and am
> getting the following error:
[...]
> According to the documentation the <br> should be removed correctly.
>
> Anything I am missing?

How are you indexing the XML documents? Using DIH? If so, please
show us the DIH configuration file.
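
In the meantime, the message ("Unexpected close tag </PROD_DESC>;
expected </br>") reads like an XML well-formedness error, so it is
probably raised while the document is being parsed, before the
HTMLStripCharFilterFactory in your analyzer ever sees the text. Here is
a minimal sketch in Python to check that, assuming the snippet you
posted is exactly what gets sent. The helper name is_well_formed and
the CDATA wrapping are my own illustration, not something Solr does for
you:

```python
import xml.etree.ElementTree as ET

# The XML snippet from the original post, unchanged.
raw = (
    "<PRODUCT><DOCUMENTID>1</DOCUMENTID><PROD_DESC>Bose's best bookshelf speakers "
    "are updated to provide an even more spacious, natural listening experience. "
    "They're great for stereo, or as a front- or rear-channel solution for home "
    "theater.<br><br>Learn more about Bose products and proprietary technologies "
    "in our Bose Store.</PROD_DESC></PRODUCT>"
)


def is_well_formed(xml_text):
    """Hypothetical helper: True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Wrap the embedded HTML in a CDATA section so that the XML parser
# treats <br> as character data instead of an unclosed element.
wrapped = raw.replace("<PROD_DESC>", "<PROD_DESC><![CDATA[").replace(
    "</PROD_DESC>", "]]></PROD_DESC>"
)

raw_ok = is_well_formed(raw)
wrapped_ok = is_well_formed(wrapped)

# With CDATA, the field value still contains the literal <br> tags,
# which is what an HTML-stripping char filter would then receive.
desc = ET.fromstring(wrapped).find("PROD_DESC").text

print("raw snippet well-formed:    ", raw_ok)
print("CDATA-wrapped well-formed:  ", wrapped_ok)
print("field still contains <br>:  ", "<br><br>" in desc)
```

If the raw snippet fails here as well, then wrapping the description in
CDATA (or escaping it as &lt;br&gt;) on your side should be enough to
get past the 400, whichever way you are indexing.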

Regards,
Gora