Posted to solr-user@lucene.apache.org by Anton Shokhrin <an...@me.com> on 2015/01/20 04:36:24 UTC

Can't get the DIH to recurse to index messages in Outlook PST file

Hi List,

My Solr instance is set up to index PST files with DIH, using TikaEntityProcessor and OutlookPSTParser. After running an import, I can see that the index contains the top-level information of the PST file (e.g. the unique ID of each message, the header, the PST file size), but the messages themselves are missing. I suspect that I need to instruct Solr, in the DIH config file, to recurse to the next level during indexing, but I don't know how. My DIH config file looks like this:

<dataSource name="bin" type="BinFileDataSource" />
<document>
	<entity name="files" dataSource="bin" rootEntity="false" processor="FileListEntityProcessor" baseDir="/PST_Path" fileName=".*" onError="abort" recursive="true">
		<entity pk="uri" name="file" dataSource="bin" processor="TikaEntityProcessor" url="${files.fileAbsolutePath}" format="xml" rootEntity="true" onError="skip" recursive="true" parser="org.apache.tika.parser.mbox.OutlookPSTParser">
			<!-- I think I need to insert another entity here to parse/index the actual messages, but I don't know how to craft one -->
		</entity>
	</entity>
</document>

Any ideas?

Thank you,
Anton