Posted to general@lucene.apache.org by keeblerh <ke...@yahoo.com> on 2014/09/22 21:02:01 UTC

Configure TikaEntityProcessor so as NOT to strip xml tags

I posted this on the solr-user list but have received no replies, so I
thought I would try here next.  Thanks.

I'm processing a zip file (a .kmz) that contains an XML file.  The
TikaEntityProcessor opens the zip and reads the file, but it strips the XML
tags even though I have supplied the htmlMapper="identity" attribute.  It
preserves any HTML contained in a CDATA section but seems to strip the other
XML tags.  Is this due to the recursive nature of opening the zip file, so
that the identity value is somehow lost?  My understanding is that this
should work in the version I am using, 4.8.  Thanks.  Below is my config info.

<dataConfig><dataSource type="BinFileDataSource" /><document>
<entity
name="kmlfiles" dataSource="null" rootEntity="false" baseDir="mydirectory"
fileName=".*\.kmz$" onError="skip" processor="FileListEntityProcessor"
recursive="false" >
<field defs........................
/>
<entity name="kmlImport" processor="TikaEntityProcessor"
datasource="kmlfiles" htmlMapper="identity" format="xml"
transformer="TemplateTransformer" url="${kmlfiles.fileAbsolutePath}"
recursive="true">
<more field defs....
/>
  <entity name="xml" processor="XPathEntityProcessor" forEach="/kml"
dataSource="fds"
  dataField="kmlImport.text">
  <field xpath="//name" column="name" />
...more field defs
  </entity>
</entity>
</entity>
</document></dataConfig>

Note that it does wrap my data in HTML, but only after it has stripped all
of my XML tags.  So the data I am interested in parsing, which would be
<name>something</name>
<description>something</description>
<coordinates>12345,12345,0</coordinates>

ends up like <p>\n something \t\n something \n 12345,12345,0 ....etc.
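
For comparison, here is a minimal sketch, outside Solr and Tika, of the
result the XPath stage is meant to produce when the tags survive.  It uses
only the JDK (java.util.zip plus javax.xml XPath, Java 9 or later).  The
class name KmzTagSketch, the entry name doc.kml and the sample content are
made up for illustration, and the sample omits the KML namespace that real
files declare.  It is not what TikaEntityProcessor does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class KmzTagSketch {

    // Build a tiny in-memory KMZ (a zip holding one KML file).
    // The content is hypothetical sample data, not a real KML document.
    public static byte[] sampleKmz() throws IOException {
        String kml = "<kml><Placemark>"
                + "<name>something</name>"
                + "<description>something</description>"
                + "<coordinates>12345,12345,0</coordinates>"
                + "</Placemark></kml>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("doc.kml"));
            zip.write(kml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Read the first .kml entry of the zip and extract three elements by XPath.
    public static Map<String, String> extract(byte[] kmz) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(kmz))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().endsWith(".kml")) {
                    continue;
                }
                // Copy the entry so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml));
                XPath xpath = XPathFactory.newInstance().newXPath();
                Map<String, String> fields = new LinkedHashMap<>();
                for (String tag : new String[] {"name", "description", "coordinates"}) {
                    // evaluate() returns the text of the first matching element.
                    fields.put(tag, xpath.evaluate("//" + tag, doc));
                }
                return fields;
            }
        }
        throw new IllegalArgumentException("no .kml entry found in archive");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(sampleKmz()));
    }
}
```

The XPath lookups only work because the element tags are still present in
the text handed to the parser, which is exactly what is lost in the output
shown above.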


