Posted to solr-user@lucene.apache.org by Mike O'Leary <tm...@uw.edu> on 2012/04/02 22:07:00 UTC

SolrJ updating indexed documents?

I am working on a component for indexing documents from a database that contains medical records. The information is organized across several tables, and I am supposed to index records for patient sets of varying sizes so that others can run IR experiments on them. Each patient record has one or more main documents associated with it, and each main document has zero or more addenda associated with it. (The main documents and addenda are treated alike for the most part, except for a parent record field that is null for main documents and holds the number of a main document for addenda. Addenda cannot have addenda.) Each main document also has one or more diagnosis records. I am trying to figure out the best-performing way to select all of the records for each patient, including the main documents, addenda and diagnoses.

I tried indexing sets of these records using DataImportHandler and nested entity blocks, similar to the Full Import example on the http://wiki.apache.org/solr/DataImportHandler page: an outer select for all patients and main records in a data set, and nested selects that fetch the addenda and the diagnoses for each one. It didn't run very fast. A database resource person who looked into it with me pointed out that this issues a million SQL queries for addenda and another million for diagnoses, one each for the million documents in a typical set of 10,000 patients, which is very inefficient, and suggested I look for a different way of getting the data.
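
In Java terms, the nested entity configuration amounts to something like the following (table and column names here are made up, and the connection details are omitted), which is where the two million sub-queries come from:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NestedEntityPattern {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:..."); // details omitted
            Statement outer = conn.createStatement();
            ResultSet docs = outer.executeQuery(
                "SELECT doc_id, patient_id, body FROM main_documents");
            PreparedStatement addenda = conn.prepareStatement(
                "SELECT body FROM addenda WHERE parent_doc_id = ?");
            while (docs.next()) {                      // ~1,000,000 iterations
                addenda.setLong(1, docs.getLong("doc_id"));
                ResultSet rs = addenda.executeQuery(); // one round trip per document
                while (rs.next()) { /* collect this document's addenda */ }
                rs.close();
                // ... and the same again for diagnoses: ~2,000,000 sub-queries in all
            }
        }
    }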

I switched to using SolrJ, and I am trying to decide between two ways of indexing this data. One would be to use one large SQL statement to get all of the data for a patient set. The results would contain duplication, due to the way the tables are joined, that I would need to sort out in the Java code, but that is doable (sketched below).
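
Here is a rough sketch of what I have in mind for that, assuming the 3.x SolrJ API (SolrServer/CommonsHttpSolrServer), a placeholder Solr URL, and made-up table, column and field names. Ordering the join by document id groups the duplicated rows together, so they can be collapsed on the fly and only one document's worth of data is in memory at a time:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SingleQueryIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            Connection conn = DriverManager.getConnection("jdbc:..."); // details omitted
            Statement stmt = conn.createStatement();
            // One big join; ORDER BY doc_id groups the duplicated rows together.
            ResultSet rs = stmt.executeQuery(
                "SELECT d.doc_id, d.patient_id, d.body, a.body AS addendum, g.code AS diagnosis"
              + "  FROM main_documents d"
              + "  LEFT JOIN addenda a ON a.parent_doc_id = d.doc_id"
              + "  LEFT JOIN diagnoses g ON g.doc_id = d.doc_id"
              + "  ORDER BY d.doc_id");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            SolrInputDocument doc = null;
            // Sets absorb the addenda-by-diagnoses cross product the join produces.
            Set<String> addenda = null;
            Set<String> diagnoses = null;
            long currentId = Long.MIN_VALUE;
            while (rs.next()) {
                long id = rs.getLong("doc_id");
                if (id != currentId) {               // first row of a new document
                    flush(doc, addenda, diagnoses, batch, solr);
                    doc = new SolrInputDocument();
                    doc.addField("id", id);
                    doc.addField("patient_id", rs.getLong("patient_id"));
                    doc.addField("body", rs.getString("body"));
                    addenda = new LinkedHashSet<String>();
                    diagnoses = new LinkedHashSet<String>();
                    currentId = id;
                }
                String a = rs.getString("addendum");
                if (a != null) addenda.add(a);
                String g = rs.getString("diagnosis");
                if (g != null) diagnoses.add(g);
            }
            flush(doc, addenda, diagnoses, batch, solr);
            if (!batch.isEmpty()) solr.add(batch);
            solr.commit();
        }

        private static void flush(SolrInputDocument doc, Set<String> addenda,
                Set<String> diagnoses, List<SolrInputDocument> batch, SolrServer solr)
                throws Exception {
            if (doc == null) return;
            doc.addField("addendum", new ArrayList<String>(addenda));    // multi-valued
            doc.addField("diagnosis", new ArrayList<String>(diagnoses)); // multi-valued
            batch.add(doc);
            if (batch.size() >= 1000) {              // send in chunks, memory stays flat
                solr.add(batch);
                batch.clear();
            }
        }
    }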

The other way would be to:

1. Get all of the main document data with one SQL query, create index documents from the data they contain, and store them in the index.

2. Issue another SQL query that gets all of the addenda for the patients in the data set, along with an id number telling which main document each addendum belongs to; then retrieve those main documents from the index, add the addenda fields to them, and put them back in the index (see the sketch after this list).

3. Do the same with the diagnosis data.
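
Here is a sketch of step 2 as I picture it, again with made-up field names and the 3.x SolrJ API. As far as I can tell, re-adding a document with the same unique key replaces it wholesale, so this only works if every field in the schema is stored; anything unstored would be lost on the round trip:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class AddendumUpdater {
        private final SolrServer solr;

        public AddendumUpdater(SolrServer solr) { this.solr = solr; }

        /** Fetch a main document, append one addendum field, and re-index it. */
        public void addAddendum(long mainDocId, String addendumBody) throws Exception {
            SolrQuery q = new SolrQuery("id:" + mainDocId);
            QueryResponse resp = solr.query(q);
            if (resp.getResults().isEmpty()) return;   // main document not indexed yet
            SolrDocument stored = resp.getResults().get(0);

            // Copy every stored field into a fresh input document.
            SolrInputDocument updated = new SolrInputDocument();
            for (String name : stored.getFieldNames()) {
                updated.addField(name, stored.getFieldValue(name));
            }
            updated.addField("addendum", addendumBody); // multi-valued field assumed
            solr.add(updated);                          // replaces the old version
        }
    }
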
It would be great to keep the main document data retrieved from the database in a hash table, update each of those objects with addenda and diagnoses, and write completely filled-out documents to the index just once. But I don't have enough memory to do this for the patient sets I am working with now, and this indexing process is supposed to scale up to patient sets ten times as large, and eventually much larger than that.

Essentially, for the second approach I am wondering whether a Lucene index can be made to serve as a hash table for storing intermediate results, and whether SolrJ has an API for retrieving individual index documents so that they can be updated. Basically, it would shift the iteration from SQL queries to Lucene index updates. If this is also likely to be slow, or the SolrJ API doesn't provide a way to do it, or there are other problems with it, I can go with selecting all of the data in one large query and dealing with the duplication.
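
If one query-and-re-add round trip per document turns out to be too slow, I imagine the lookups could at least be batched, e.g. fetching a block of main documents with a single query on their ids (another rough sketch):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class BatchedLookup {
        /** Fetch a batch of main documents in one query instead of one per id. */
        public static SolrDocumentList fetch(SolrServer solr, List<Long> ids)
                throws Exception {
            StringBuilder q = new StringBuilder("id:(");
            for (int i = 0; i < ids.size(); i++) {
                if (i > 0) q.append(" OR ");
                q.append(ids.get(i));
            }
            q.append(')');
            SolrQuery query = new SolrQuery(q.toString());
            query.setRows(ids.size());
            return solr.query(query).getResults();
        }
    }
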
Thanks,
Mike