Posted to java-user@lucene.apache.org by Eric Hauser <ew...@gmail.com> on 2010/04/10 02:49:50 UTC

Indexing documents generated from a template

Hi,

I'm doing research on indexing some documents that are generated from
templates.  I don't have the exact statistics yet, but I'm estimating that
in the standard case 90% of the document is the same across all instances of
the document and the other 10% is dynamic (although it is certainly possible
for it to be 10/90).  Because these documents can be rather large in size
and there could potentially be millions of instances of a single document, I
don't want to put every instance of the document into the index.

What (I think) I would like to do is index the template and the dynamic
content separately, and then merge the search results afterwards.  This does
not seem too difficult, except when a query spans both the template and the
dynamic content.  Things like proximity queries would also be difficult.
In theory, it seems plausible to split the terms of the original search,
search both indexes in parallel, rebuild each candidate document, build an
in-memory index of the merged document, and run the original search against
it.  Most of the searches are going to be "online", meaning that a Hadoop
job probably isn't appropriate.
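
To make the idea concrete, here is a rough sketch of that flow in plain Java.
It is only an illustration under assumptions of my own: a single template
with {placeholder} tokens, whitespace tokenization, lowercase field names,
and hand-rolled in-memory maps standing in for the two indexes.  The class
and method names (TemplateSearchSketch, candidates, rebuild, searchNear) are
hypothetical and none of this is Lucene API.

```java
import java.util.*;

// Hypothetical sketch: the template is indexed once, each instance stores only its dynamic fields.
class TemplateSearchSketch {
    final List<String> templateTokens;                                     // static text plus {placeholder} tokens
    final Set<String> templateTerms = new HashSet<>();                     // "template index"
    final Map<String, Set<Integer>> dynamicPostings = new HashMap<>();     // "dynamic index": term -> instance ids
    final Map<Integer, Map<String, String>> instances = new HashMap<>();   // stored dynamic fields per instance

    TemplateSearchSketch(String template) {
        templateTokens = tokenize(template);
        for (String tok : templateTokens) {
            if (!tok.startsWith("{")) templateTerms.add(tok);
        }
    }

    // Lowercase and split on whitespace, dropping empty tokens.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Field names are assumed to be lowercase so they match the lowercased placeholders.
    void addInstance(int id, Map<String, String> fields) {
        instances.put(id, fields);
        for (String value : fields.values()) {
            for (String tok : tokenize(value)) {
                dynamicPostings.computeIfAbsent(tok, k -> new TreeSet<>()).add(id);
            }
        }
    }

    // Step 1: split the query terms. A term found in the template matches every
    // instance; any other term must appear in the instance's dynamic content.
    Set<Integer> candidates(List<String> terms) {
        Set<Integer> result = new TreeSet<>(instances.keySet());
        for (String term : terms) {
            if (templateTerms.contains(term)) continue;
            result.retainAll(dynamicPostings.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return result;
    }

    // Step 2: rebuild the full token stream of one instance from template + fields.
    List<String> rebuild(int id) {
        Map<String, String> fields = instances.get(id);
        List<String> tokens = new ArrayList<>();
        for (String tok : templateTokens) {
            if (tok.startsWith("{") && tok.endsWith("}")) {
                String name = tok.substring(1, tok.length() - 1);
                tokens.addAll(tokenize(fields.getOrDefault(name, "")));
            } else {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    // Step 3: run the proximity check against each rebuilt candidate only.
    Set<Integer> searchNear(String a, String b, int maxDistance) {
        Set<Integer> hits = new TreeSet<>();
        for (int id : candidates(Arrays.asList(a, b))) {
            if (near(rebuild(id), a, b, maxDistance)) hits.add(id);
        }
        return hits;
    }

    // True if some occurrence of b lies within maxDistance positions of some occurrence of a.
    static boolean near(List<String> tokens, String a, String b, int maxDistance) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(a)) continue;
            int from = Math.max(0, i - maxDistance);
            int to = Math.min(tokens.size() - 1, i + maxDistance);
            for (int j = from; j <= to; j++) {
                if (j != i && tokens.get(j).equals(b)) return true;
            }
        }
        return false;
    }
}
```

The point of the sketch is that the expensive step (rebuilding and
re-checking positions) only runs on the instances that survive the cheap
term-level intersection.  A real version would presumably replace the maps
with two actual indexes and the near() check with a proper in-memory index
of the rebuilt document.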

I am working on compiling better statistics on the standard deviation of the
static-to-dynamic content ratio, but in the meantime I was just curious
whether anyone else has dealt with a similar scenario or has something to
point me at for additional research.  Thanks.