Posted to solr-user@lucene.apache.org by zbindigonzales <sa...@gmail.com> on 2012/11/23 17:03:18 UTC

Solr Near Realtime with denormalized Data

Hi there. 

In our company we use Apache Solr 4 to index data from the database via the
DataImportHandler.
The data we are indexing is a denormalized table (Patient, Visit, Study,
Image).

One requirement is to be near realtime. For that we use soft commits every
second.
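
For reference, per-second soft commits are typically configured along these lines in solrconfig.xml (the values here are illustrative, not necessarily our production settings):

```xml
<!-- open a new searcher over uncommitted docs every second -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
<!-- flush to stable storage less often, without opening a searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```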
The index size is about 3M documents.

The problem now is that if a user modifies an entry in the patient table,
there may be 1000 rows affected (due to denormalization) which can't be
indexed in near realtime.

Has anyone had a similar problem and found a workaround to speed up delta
imports?

Best Regards



  






--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Near Realtime with denormalized Data

Posted by zbindigonzales <sa...@gmail.com>.
Hi Jack 

You are totally right about the "Phantom SQL" thing. This is why I am asking
for help here :)

You said that: 

"In short, I still don't see any problem here - provided that you 
de-normalized only the patient ID and not all the patient metadata."

What we are doing is denormalizing all connected tables and we index 
all the fields (idfields and metadata fields). 

So the metadata fields for patient will be duplicated a lot (which, as far
as I know, is no problem; correct me if I am wrong).

We currently have some issues because a 3rd party interface is updating the
patient and study data. They send updates every 2 seconds with at most 5
changed patients, but the delta update currently takes about 20 seconds to
1 minute! So what happens is that we can never really stop delta updating
in Solr.

I guess we would be a lot faster if I reindexed only the patient meta fields
that have changed in all the affected documents. But I thought there might
be a better solution.
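
A sketch of what such a partial reindex could look like with Solr 4's atomic updates, which resend only the changed fields instead of the whole document (this assumes all fields are stored and a unique key named `id`; the field names below are made up for illustration):

```python
import json

def atomic_update_payload(doc_id, changed_fields):
    # Solr 4 atomic update: send only the changed fields, each wrapped
    # in a {"set": ...} modifier; Solr rebuilds the rest of the document
    # from its stored fields, so nothing else needs to be resent.
    doc = {"id": doc_id}
    for name, value in changed_fields.items():
        doc[name] = {"set": value}
    return json.dumps([doc])

# Only the patient columns changed, so only those go out per document:
payload = atomic_update_payload("image-1", {"patient_name": "Frank"})
```

The payload would be POSTed to the collection's `/update` handler once per affected document; the saving is in not re-fetching and re-sending the visit/study/image fields.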








--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022376.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Near Realtime with denormalized Data

Posted by Jack Krupansky <ja...@basetechnology.com>.
At least from what you have said, it doesn't sound as if a lot of updating 
(as opposed to adding new documents) would be needed. I suspect that a lot 
of users would rather enter a simple patient ID anyway (with a separate name 
lookup capability.)

I mean, once a "visit" is complete, is it really updated that much? Ditto 
for a study, and an image.

In short, I still don't see any problem here - provided that you 
de-normalized only the patient ID and not all the patient metadata. Or maybe 
this is simply a "Phantom SQL" or "grieving" problem - wishing you could do 
things exactly as they were done in SQL. It may simply take time for you to 
finish "grieving" about Solr not being SQL. Focus on exploiting Solr's 
strengths rather than obsessing over its differences from SQL.

I mean, you were going to have more than 4,000 total rows in your visit, 
study, and image tables anyway.

-- Jack Krupansky

-----Original Message----- 
From: zbindigonzales
Sent: Monday, November 26, 2012 9:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Near Realtime with denormalized Data

Hello again

The problem is that the software is used in different fields. The table
schema for hospital software is not the same as in industrial sectors.

Customers usually create their own schema. The worst-case scenario that we
know of is four connected tables:

Table: patient --> visit --> study --> image

So if a patient has 10 visits, each visit has 10 studies, and each study has
40 images, then we would need to update 4,000 documents just because some
values changed in the patient row.

Example queries are difficult to give because they change from customer to
customer. But a typical query would look like:

query:q=(patient___idpatient__:"55" AND ((image_user__:*admin*) AND
((image_double_:{* TO 0.1} OR image_double_:{99.1 TO *})) AND
(image_text:*fiji*)))

But basically our customers can define their search fields on their own.

I already tried out the JOIN capability, but I couldn't find out how to join
over more than one table. I think denormalizing is a better solution than
trying to join the tables during the query.

What I had in mind was some kind of reference field, so that an image
document could refer to the connected patient fields. But I don't know if
something like that exists.

What I am trying now is to reduce the fields that get updated. This will
speed up the delta import time, but I am not sure if this is the "best
practice".

Regards Sandro






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022351.html
Sent from the Solr - User mailing list archive at Nabble.com. 


Re: Solr Near Realtime with denormalized Data

Posted by zbindigonzales <sa...@gmail.com>.
Hello again

The problem is that the software is used in different fields. The table
schema for hospital software is not the same as in industrial sectors.

Customers usually create their own schema. The worst-case scenario that we
know of is four connected tables:

Table: patient --> visit --> study --> image

So if a patient has 10 visits, each visit has 10 studies, and each study has
40 images, then we would need to update 4,000 documents just because some
values changed in the patient row.

Example queries are difficult to give because they change from customer to
customer. But a typical query would look like:

query:q=(patient___idpatient__:"55" AND ((image_user__:*admin*) AND
((image_double_:{* TO 0.1} OR image_double_:{99.1 TO *})) AND
(image_text:*fiji*)))

But basically our customers can define their search fields on their own.

I already tried out the JOIN capability, but I couldn't find out how to join
over more than one table. I think denormalizing is a better solution than
trying to join the tables during the query.

What I had in mind was some kind of reference field, so that an image
document could refer to the connected patient fields. But I don't know if
something like that exists.
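
For completeness, Solr 4's join query parser works within a single index and joins only one level, which matches the limitation described above. A sketch of the syntax, with hypothetical field names:

```python
def join_query(from_field, to_field, inner_query):
    # Solr JoinQParser: documents matching inner_query contribute their
    # from_field values, and the query returns the documents whose
    # to_field matches one of those values.
    return "{{!join from={0} to={1}}}{2}".format(from_field, to_field, inner_query)

# e.g. find image documents belonging to patients named Peter:
q = join_query("idpatient", "image_idpatient", 'patient_name:"Peter"')
```

Joining patient -> visit -> study -> image would need this nested several levels deep, which is exactly where it stops being practical compared to denormalizing.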

What I am trying now is to reduce the fields that get updated. This will
speed up the delta import time, but I am not sure if this is the "best
practice".

Regards Sandro






--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022351.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Near Realtime with denormalized Data

Posted by Jack Krupansky <ja...@basetechnology.com>.
Denormalization IS the best practice, generally.

You still haven't demonstrated an "exponential" increase. If there is any 
"exponential" increase, that is your own doing and you should simply NOT do 
that!

The total number of documents would be the number of rows in your second 
table. Denormalization simply means that each of those documents would also 
hold some patient data - denormalized.

If you want, you can store some separate, sparse documents with patient data 
only, and then store only the patient ID in the data documents. Or, you can 
have a second Solr collection that parallels your SQL tables. It is a simple 
efficiency question of what operations you perform most commonly.

Do patients really change their names with any great frequency?

The bottom line is that each Solr document needs to have the fields that 
users will be searching on. There is some limited JOIN capability, but it is 
limited and may be more trouble than it is worth.

You might be better off doing a two-stage lookup - query one Solr table by 
patient name to get the patient ID, and then query the second table using 
the ID and other fields.
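
That two-stage lookup can be sketched as plain query strings (the core layout and field names here are hypothetical, purely to show the shape of the two queries):

```python
def patient_lookup_query(name):
    # Stage 1: run against a small "patients" core to resolve name -> ID.
    return 'patient_name:"%s"' % name

def data_query(patient_ids, user_filter):
    # Stage 2: run against the large denormalized core, which then only
    # needs to carry the patient ID, not the patient metadata.
    ids = " OR ".join("patient_id:%d" % pid for pid in patient_ids)
    return "(%s) AND (%s)" % (ids, user_filter)

q1 = patient_lookup_query("Peter")
q2 = data_query([55], "image_user:*admin* AND image_text:*fiji*")
```

The trade-off is one extra round trip per search, in exchange for patient renames touching only the small patients core.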

Maybe you simply need to give us some example queries - then we can suggest 
structures to make those queries efficient.

NRT doesn't mean that occasional bulk updates can't take longer than NRT, 
simply that common queries will usually be NRT.

Maybe you need an app layer above Solr that manages patient names and does 
the two-part query so that data documents can be queried with patient ID.

-- Jack Krupansky

-----Original Message----- 
From: zbindigonzales
Sent: Monday, November 26, 2012 6:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Near Realtime with denormalized Data

Hello Erick

Thanks for your response.

The main problem we have is that the data is denormalized. And this
increases the number of documents to index exponentially.

Let's say we have the following two tables.

----------------------      ----------------------
Table: Patient              Table: Image
----------------------      ----------------------
Field: IDPatient            Field: IDImage
Field: Name                 Field: IDPatient
                            Field: Name


Then we create a view to denormalize the data.

create or replace view solrview as select * from patient p left join image i
on p.idpatient = i.idpatient

If we select the view we will get the following rows

Patient_IDPatient | Patient_Name | Image_IDImage | Image_IDPatient | Image_Name
                1 | Peter        |             1 |               1 | Dog
                2 | Peter        |             2 |               1 | Cat
                3 | Peter        |             3 |               1 | Ant
                4 | Josh         |             4 |               1 | Deer
                5 | Kent         |             5 |               2 | Dog
                6 | Martin       |             6 |               2 | Dog
                7 | David        |               |                 |
                8 | Eric         |               |                 |


This is the configuration we use in the data-config.xml

<entity name="solrEntity" pk="solrview_PK"
        query="SELECT * FROM solrview"
        deltaQuery="SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE PATIENT_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'
                    UNION
                    SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE IMAGE_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM solrview
                          WHERE PATIENT_IDPATIENT = '${dataimporter.delta.PATIENT_IDPATIENT}'
                          AND IMAGE_IDIMAGE = '${dataimporter.delta.IMAGE_IDIMAGE}'"
        transformer="ClobTransformer">


If a user now updates the Patient_Name "Peter" to "Frank", then there are
already 3 rows affected.

So even if we manage to index 6k documents/second, there is still the
problem that 80k documents can be affected by a small change like updating
the patient_name field.


What I am trying now is to reduce the fields that get updated.
For example, if there are only changes in the patient table, I only need to
reindex the fields from the patient table.

Writing our own SolrJ program isn't an option because changes to the
database can come from different services. It would be too error-prone to
integrate this SolrJ program everywhere data gets written.

The reason for posting this issue is to see if somebody has already run into
this situation and has a better "best practice" :-)


Best Regards Sandro











--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022323.html
Sent from the Solr - User mailing list archive at Nabble.com. 


Re: Solr Near Realtime with denormalized Data

Posted by zbindigonzales <sa...@gmail.com>.
Hello Erick

Thanks for your response.

The main problem we have is that the data is denormalized. And this
increases the number of documents to index exponentially.

Let's say we have the following two tables.

----------------------      ----------------------
Table: Patient              Table: Image
----------------------      ----------------------
Field: IDPatient            Field: IDImage
Field: Name                 Field: IDPatient
                            Field: Name


Then we create a view to denormalize the data.

create or replace view solrview as select * from patient p left join image i
on p.idpatient = i.idpatient

If we select the view we will get the following rows

Patient_IDPatient | Patient_Name | Image_IDImage | Image_IDPatient | Image_Name
                1 | Peter        |             1 |               1 | Dog
                2 | Peter        |             2 |               1 | Cat
                3 | Peter        |             3 |               1 | Ant
                4 | Josh         |             4 |               1 | Deer
                5 | Kent         |             5 |               2 | Dog
                6 | Martin       |             6 |               2 | Dog
                7 | David        |               |                 |
                8 | Eric         |               |                 |


This is the configuration we use in the data-config.xml

<entity name="solrEntity" pk="solrview_PK"
        query="SELECT * FROM solrview"
        deltaQuery="SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE PATIENT_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'
                    UNION
                    SELECT PATIENT_IDPATIENT, IMAGE_IDIMAGE FROM solrview
                    WHERE IMAGE_IAMODIFICATIONDATE > '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM solrview
                          WHERE PATIENT_IDPATIENT = '${dataimporter.delta.PATIENT_IDPATIENT}'
                          AND IMAGE_IDIMAGE = '${dataimporter.delta.IMAGE_IDIMAGE}'"
        transformer="ClobTransformer">


If a user now updates the Patient_Name "Peter" to "Frank", then there are
already 3 rows affected.

So even if we manage to index 6k documents/second, there is still the
problem that 80k documents can be affected by a small change like updating
the patient_name field.


What I am trying now is to reduce the fields that get updated.
For example, if there are only changes in the patient table, I only need to
reindex the fields from the patient table.

Writing our own SolrJ program isn't an option because changes to the
database can come from different services. It would be too error-prone to
integrate this SolrJ program everywhere data gets written.

The reason for posting this issue is to see if somebody has already run into
this situation and has a better "best practice" :-)


Best Regards Sandro











--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072p4022323.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr Near Realtime with denormalized Data

Posted by Erick Erickson <er...@gmail.com>.
Why can't they be? I've seen indexing rates on the order of 6k docs/second.
Is the problem with getting the data out of the DB (i.e. how long does the
SQL take to execute) or on the Solr indexing side? If the former, you could
consider writing a SolrJ program that accessed the database and sent the
docs to Solr. If you can write the program such that you could run N copies
(each updating only a portion of the data) you could run them in parallel.
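
The partitioning step for such parallel copies could be sketched like this (assuming a numeric primary key; each worker would then SELECT its slice from the view and post the resulting docs to Solr):

```python
def id_ranges(min_id, max_id, n_workers):
    # Split the inclusive primary-key space [min_id, max_id] into
    # contiguous slices so that N indexer copies each handle a
    # disjoint range, e.g. WHERE id BETWEEN lo AND hi.
    total = max_id - min_id + 1
    size = -(-total // n_workers)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        ranges.append((lo, min(lo + size - 1, max_id)))
        lo += size
    return ranges

# e.g. a 3M-document index split across 4 parallel workers:
parts = id_ranges(1, 3000000, 4)
```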

But I suspect that the issue is probably your SQL, and before looking at
Solr solutions I'd try to verify that.

Best,
Erick


On Fri, Nov 23, 2012 at 11:03 AM, zbindigonzales <sa...@gmail.com>wrote:

> Hi there.
>
> In our company we use Apache Solr 4 to index data from the database via the
> DataImportHandler.
> The data we are indexing is a denormalized table (Patient, Visit, Study,
> Image).
>
> One requirement is to be near realtime. For that we use soft commits every
> second.
> The index size is about 3M documents.
>
> The problem now is that if a user modifies an entry in the patient table,
> there may be 1000 rows affected (due to denormalization) which can't be
> indexed in near realtime.
>
> Has anyone had a similar problem and found a workaround to speed up delta
> imports?
>
> Best Regards
>
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Near-Realtime-with-denormalized-Data-tp4022072.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>