Posted to solr-user@lucene.apache.org by Matt Galvin <ma...@gmail.com> on 2011/04/22 19:57:56 UTC

DIH Transform XML?

Hello,

First post here... I spent some time researching this but can't seem
to find the answer I am looking for...

I have a MySQL DB that I have Solr indexing and all is well.

However, one field I need to index is a text field that contains XML
stored in the DB. I read up on DIH Transformers a bit and I am
wondering: is there a way to have Solr's DIH either transform the XML
data or strip the XML out of the field as it indexes it, leaving only
the textual data in Solr's index?

This XML field is the body content of web site articles (don't ask
why, not my choice :-/) and it also has a lot of CDATA sections wrapping
HTML in the XML. I want Solr to index this data, minus all the markup.

Should I be using a RegexTransformer to strip tags (this feels like
the wrong approach) or would HTMLStripTransformer work? Is there an
XMLTransformer I don't know about?

I have been reading this:

http://wiki.apache.org/solr/DataImportHandler

but I feel like I am missing something that would make this work.

My dataConfig is barebones ATM.

Any help is greatly appreciated.

Thanks,

Matt

Re: DIH Transform XML?

Posted by Ahmet Arslan <io...@yahoo.com>.

I'm not sure about the CDATA part, but HTMLStripTransformer behaves like
HTMLStripCharFilterFactory (see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory),
so it can be used to strip XML tags as well.
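
For reference, a minimal data-config.xml sketch of that approach might look
like the following. The table name "articles", the columns "id" and "body",
and the connection details are placeholders and are not taken from the
original post. The stripHTML="true" attribute is the one the DIH wiki
documents for HTMLStripTransformer. How the CDATA wrappers come out is
untested here, so it is worth checking a few rows before running a full
import.

```xml
<dataConfig>
  <!-- Placeholder connection details; substitute your own. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr_user"
              password="secret"/>
  <document>
    <!-- transformer= enables HTMLStripTransformer for this entity. -->
    <!-- "articles", "id" and "body" are hypothetical names. -->
    <entity name="article"
            transformer="HTMLStripTransformer"
            query="SELECT id, body FROM articles">
      <field column="id" name="id"/>
      <!-- stripHTML="true" asks the transformer to remove markup
           from this column before it is indexed. -->
      <field column="body" name="body" stripHTML="true"/>
    </entity>
  </document>
</dataConfig>
```

If some markup survives, the transformer attribute accepts a comma-separated
list, so a RegexTransformer could be chained in as a fallback for the
leftovers.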