Posted to dev@lucene.apache.org by "Dyer, James" <Ja...@ingrambook.com> on 2010/10/25 18:54:36 UTC

Need some DIH Entity Processor development advice...

We have a situation where we have data coming from several long-running queries hitting multiple relational databases.  Other data comes in fixed-width text file feeds, etc.  All of this has to be joined, denormalized and made into nice Solr documents.  I've been wanting to use DIH as it seems to already provide 90% of what we need.  The rest can come in the form of custom transformers and Entity Processors that I can write...

One big need is to have disk-backed caches.  For instance, a child entity that pulls back millions of rows will beat up the db using a regular SqlEntityProcessor, whereas the CachedSqlEntityProcessor puts everything in memory in a HashMap, so it will only scale to a point.  For fixed-width text files, there don't seem to be any cached implementations at all.
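
To make the disk-backed idea concrete, here is a minimal sketch, not the Lucene-based cache described in this post.  It assumes a plain temp file plus an in-memory map of key to file offsets, so only the keys and offsets stay on the heap.  The class name DiskRowCache and its methods are hypothetical and are not part of DIH.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical disk-backed row cache; an illustration only, not a DIH class. */
public class DiskRowCache implements AutoCloseable {
    private final File file;
    private final RandomAccessFile raf;
    // Only keys and file offsets are held in memory; row payloads live on disk.
    private final Map<String, List<Long>> offsets = new HashMap<>();

    public DiskRowCache() throws IOException {
        file = File.createTempFile("rowcache", ".bin");
        raf = new RandomAccessFile(file, "rw");
    }

    /** Appends one row for the given key (writeUTF limits a row to 65535 encoded bytes). */
    public void put(String key, String row) throws IOException {
        long pos = raf.length();
        raf.seek(pos);
        raf.writeUTF(row);
        offsets.computeIfAbsent(key, k -> new ArrayList<>()).add(pos);
    }

    /** Reads back every row stored under the key, in insertion order. */
    public List<String> get(String key) throws IOException {
        List<String> rows = new ArrayList<>();
        for (long pos : offsets.getOrDefault(key, Collections.<Long>emptyList())) {
            raf.seek(pos);
            rows.add(raf.readUTF());
        }
        return rows;
    }

    /** Releases the file handle and deletes the temporary file. */
    @Override
    public void close() throws IOException {
        raf.close();
        file.delete();
    }

    /** Exposed only so callers can check that cleanup happened. */
    public File backingFile() {
        return file;
    }
}
```

The close() method is the clean-up step that needs to run exactly once, after the last lookup.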

So I've written a custom Entity Processor that creates a temporary Lucene index to use as a disk cache.  Initial tests are promising but with one little problem.  I need a place to close the Lucene index reader and then delete the temporary index.  It seemed easy enough to override the "destroy()" method from EntityProcessorBase.  But to my surprise, it seems that both destroy() and init() get called every time a new Primary Key is looked up in the cache (see DocBuilder.buildDocument()).  Just to be sure I wasn't crazy, I added a "destroy()" method to CachedSqlEntityProcessor and found that it does indeed get called every time a new Primary Key is looked up in the cache.  In fact, the first couple of lines in cacheInit() in EntityProcessorBase seem to be there to cope with the fact that both destroy() and init() get called over and over again during the lifecycle of the object.
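
The lifecycle described above can be sketched with stand-in classes.  Nothing here is the real DIH API: DiskCachedProcessor, its close() hook and the run() driver loop are all hypothetical, and the loop only imitates the observed behaviour of init() and destroy() being called once per parent key.  The sketch shows why destroy() cannot safely release the cache, and how a guard in init() plus a separate one-time hook would behave.

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    /** Hypothetical stand-in for an entity processor; not the real DIH class. */
    static class DiskCachedProcessor {
        final List<String> events = new ArrayList<>();
        private boolean cacheOpen = false;

        // Called once per parent key, mirroring what the post observes.
        void init() {
            if (!cacheOpen) {
                // Guard: build or open the expensive cache only on the first call.
                cacheOpen = true;
                events.add("open-cache");
            }
            events.add("init");
        }

        // Also called once per parent key, so it must not release the cache.
        void destroy() {
            events.add("destroy");
        }

        // Hypothetical end-of-import hook: the one-time clean-up the post is asking for.
        void close() {
            if (cacheOpen) {
                cacheOpen = false;
                events.add("close-cache");
            }
        }
    }

    /** Imitates the driver loop: init() and destroy() wrap every parent key. */
    static List<String> run(int parentKeys) {
        DiskCachedProcessor p = new DiskCachedProcessor();
        for (int i = 0; i < parentKeys; i++) {
            p.init();
            p.destroy();
        }
        p.close();
        return p.events;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

With this shape the cache is opened once and closed once no matter how many parent keys are processed, but it depends on something calling close() at the end of the import, which is exactly the missing piece.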

I've also noticed that destroy() isn't actually implemented anywhere in the prepackaged Entity Processors.  This makes me wonder if the repeated calls are a mistake.  Should DocBuilder be changed to call destroy() only once per lifecycle for each EntityProcessor object?  If so, I think I can have a patch in JIRA in short order.

Otherwise, how do I best accomplish my clean-up tasks?  Advice is greatly appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311