You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Stefan Burkard <sb...@gmail.com> on 2012/08/15 14:12:57 UTC

How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one "real" record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
(=offered services) <-> Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy to have an index entry of an
employee/employer with its services for any given point in time. But
there are much more entries than now since every involved
validfrom/validuntil period (if they overlap) produces more entries.
But I am not sure if this is a good strategy, or if it would be better
to try to index the whole datastructure in an other way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan

Re: How to design index for related versioned database records

Posted by Stefan Burkard <sb...@gmail.com>.

@Erik: the cross product is about 300k records. Since we use Solr
1.4.0 I try other options first.

@Lance: sounds interesting, I have to check this out

In the meanwhile I tried to use sub-entities to solve my problem. This
means I started with the "employer" as entity and defined "services"
and "employees" as subentities with their validity daterange (the
validity of the link-table).
In the schema I defined all fields of the sub-entities as multivalued
(since they are 1:n). So in the index I have employer "records" with
multi-value fields "validFrom", "validUntil" and others.

But now, when I query for a date range in the sub-entities
(validFrom:[* TO refDate] AND validUntil:[refDate TO *]) I get every
employer record with ANY of the date entries matching the query.
This means: if an employer has three services where one service
fulfills the validFrom-condition and another service matches the
validUntil-condition, the employer is a match and returned despite the
fact that none of the services fulfill both date conditions.

A similar problem was asked (but not answered) here by somebody else:
http://stackoverflow.com/questions/6889009/solr-range-queries-on-multivalued-fields-in-sub-entities

So I gues that I cannot query sub-entities (that are scattered across
multiple multi-value-fields) independently?

Thanks
Stefan




On Wed, Aug 22, 2012 at 8:47 AM, Lance Norskog <go...@gmail.com> wrote:
> Another option is to take the minimum time interval and record every
> active interval during an employee record. Make a compound key of the
> employee and the time range. (Look at the SignatureUpdateProcessor for
> how to do this.) Add one multi-valued field that contains all of the
> time intervals for which this record is active.
>
> If you make this multi-valued field indexed and not stored, the index
> will store one copy of each interval for all of the documents, and
> with each interval a list of all documents containing it. This takes a
> surprisingly small amount of memory. You do not have to do range
> searches or joins, you can just do an OR of all of the intervals you
> are looking for.
>
> On Tue, Aug 21, 2012 at 5:20 AM, Erick Erickson <er...@gmail.com> wrote:
>> Hmmm, how many employees/services/dates are we talking about
>> here? Is the cross product 1M? 1B? 1G records?
>>
>> You could try the Solr join stuff (Solr 4x), be aware that it performs
>> best on join fields with a limited number of unique values.
>>
>> Best
>> Erick
>>
>> On Tue, Aug 21, 2012 at 4:05 AM, Stefan Burkard <sb...@gmail.com> wrote:
>>> Hi Jack
>>>
>>> Thanks for your answer. Do I understand that correctly that I must
>>> create a "merge-entity" that contains all the different
>>> validFrom/validUntil dates as fields (and of course the other
>>> search-related fields).
>>>
>>> This would mean that the number of index entries is equal to the
>>> number of all possible combinations of from/until date-ranges in a
>>> "record-chain" (all records with all their individual versions
>>> connected by foreign keys) since every combination creates a new
>>> record in a query across all tables. That also means that I will have
>>> a lot of entries with the same values in the other search-related
>>> fields - the only difference will be most of the time one of the
>>> from/until-ranges.
>>> Perhaps the query can be optimized so that irrelevant combinations can
>>> be avoided (for example if two date-ranges do not overlap).
>>>
>>> Then, when I have built that index I can query it with the reference
>>> date as argument to compare it with every from/until range in the
>>> chain. And so I get only the relevant entries where the reference date
>>> is between all from/until ranges.
>>>
>>> Is this correct?
>>>
>>> Thanks and regards
>>> Stephan
>>>
>>>
>>>
>>> On Wed, Aug 15, 2012 at 2:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
>>>> The date checking can be implemented using range query as a filter query,
>>>> such as
>>>>
>>>> &fq=startDate:[* TO NOW] AND endDate:[NOW TO *]
>>>>
>>>> (You can also use an "frange" query.)
>>>>
>>>> Then you will have to flatten the database tables. Your Solr schema would
>>>> have a single "merged" record type. You will have to decide whether the
>>>> different record types (tables) will have common fields versus static
>>>> qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name"
>>>> and "employer_name". The latter has the advantage that you do not have to
>>>> separately specify a table "type" field since the fields would be empty for
>>>> records of other types.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Stefan Burkard
>>>> Sent: Wednesday, August 15, 2012 8:12 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: How to design index for related versioned database records
>>>>
>>>>
>>>> Hi solr-users
>>>>
>>>> I have a case where I need to build an index from a database.
>>>>
>>>> ***Data structure***
>>>> The data is spread across multiple tables and in each table the
>>>> records are versioned - this means that one "real" record can exist
>>>> multiple times in a table, each with different validFrom/validUntil
>>>> dates. Therefore it is possible to query the valid version of a record
>>>> for a given point in time.
>>>>
>>>> The relations of the data are something like this:
>>>> Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
>>>> (=offered services) <-> Service
>>>>
>>>> That means I have data across 5 relations, each of them with versioned
>>>> records.
>>>>
>>>> ***Search needs***
>>>> Now I need to be able to search for employees and employers based on
>>>> the services they offer for a given point in time.
>>>>
>>>> Therefore I have built an index of all employees and employers with
>>>> their services as subentity. So I have one index entry for every
>>>> version of every employee/employer and each version collects the
>>>> offered services for the given timeframe of the employee/employer
>>>> version.
>>>>
>>>> Problem: The offered services of an employee/employer can change
>>>> during its validity period. That means I do not only need to take the
>>>> version timespan of the employee/employer into account but also the
>>>> version timespans of services and the link-tables.
>>>>
>>>> ***Question***
>>>> I think I could continue with my strategy to have an index entry of an
>>>> employee/employer with its services for any given point in time. But
>>>> there are much more entries than now since every involved
>>>> validfrom/validuntil period (if they overlap) produces more entries.
>>>> But I am not sure if this is a good strategy, or if it would be better
>>>> to try to index the whole datastructure in an other way.
>>>>
>>>> Are there any recommendations how to handle such a case?
>>>>
>>>> Thanks for any help
>>>> Stephan
>
>
>
> --
> Lance Norskog
> goksron@gmail.com

Re: How to design index for related versioned database records

Posted by Lance Norskog <go...@gmail.com>.

Another option is to take the minimum time interval and record every
active interval during an employee record. Make a compound key of the
employee and the time range. (Look at the SignatureUpdateProcessor for
how to do this.) Add one multi-valued field that contains all of the
time intervals for which this record is active.

If you make this multi-valued field indexed and not stored, the index
will store one copy of each interval for all of the documents, and
with each interval a list of all documents containing it. This takes a
surprisingly small amount of memory. You do not have to do range
searches or joins, you can just do an OR of all of the intervals you
are looking for.

On Tue, Aug 21, 2012 at 5:20 AM, Erick Erickson <er...@gmail.com> wrote:
> Hmmm, how many employees/services/dates are we talking about
> here? Is the cross product 1M? 1B? 1G records?
>
> You could try the Solr join stuff (Solr 4x), be aware that it performs
> best on join fields with a limited number of unique values.
>
> Best
> Erick
>
> On Tue, Aug 21, 2012 at 4:05 AM, Stefan Burkard <sb...@gmail.com> wrote:
>> Hi Jack
>>
>> Thanks for your answer. Do I understand that correctly that I must
>> create a "merge-entity" that contains all the different
>> validFrom/validUntil dates as fields (and of course the other
>> search-related fields).
>>
>> This would mean that the number of index entries is equal to the
>> number of all possible combinations of from/until date-ranges in a
>> "record-chain" (all records with all their individual versions
>> connected by foreign keys) since every combination creates a new
>> record in a query across all tables. That also means that I will have
>> a lot of entries with the same values in the other search-related
>> fields - the only difference will be most of the time one of the
>> from/until-ranges.
>> Perhaps the query can be optimized so that irrelevant combinations can
>> be avoided (for example if two date-ranges do not overlap).
>>
>> Then, when I have built that index I can query it with the reference
>> date as argument to compare it with every from/until range in the
>> chain. And so I get only the relevant entries where the reference date
>> is between all from/until ranges.
>>
>> Is this correct?
>>
>> Thanks and regards
>> Stephan
>>
>>
>>
>> On Wed, Aug 15, 2012 at 2:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
>>> The date checking can be implemented using range query as a filter query,
>>> such as
>>>
>>> &fq=startDate:[* TO NOW] AND endDate:[NOW TO *]
>>>
>>> (You can also use an "frange" query.)
>>>
>>> Then you will have to flatten the database tables. Your Solr schema would
>>> have a single "merged" record type. You will have to decide whether the
>>> different record types (tables) will have common fields versus static
>>> qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name"
>>> and "employer_name". The latter has the advantage that you do not have to
>>> separately specify a table "type" field since the fields would be empty for
>>> records of other types.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Stefan Burkard
>>> Sent: Wednesday, August 15, 2012 8:12 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: How to design index for related versioned database records
>>>
>>>
>>> Hi solr-users
>>>
>>> I have a case where I need to build an index from a database.
>>>
>>> ***Data structure***
>>> The data is spread across multiple tables and in each table the
>>> records are versioned - this means that one "real" record can exist
>>> multiple times in a table, each with different validFrom/validUntil
>>> dates. Therefore it is possible to query the valid version of a record
>>> for a given point in time.
>>>
>>> The relations of the data are something like this:
>>> Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
>>> (=offered services) <-> Service
>>>
>>> That means I have data across 5 relations, each of them with versioned
>>> records.
>>>
>>> ***Search needs***
>>> Now I need to be able to search for employees and employers based on
>>> the services they offer for a given point in time.
>>>
>>> Therefore I have built an index of all employees and employers with
>>> their services as subentity. So I have one index entry for every
>>> version of every employee/employer and each version collects the
>>> offered services for the given timeframe of the employee/employer
>>> version.
>>>
>>> Problem: The offered services of an employee/employer can change
>>> during its validity period. That means I do not only need to take the
>>> version timespan of the employee/employer into account but also the
>>> version timespans of services and the link-tables.
>>>
>>> ***Question***
>>> I think I could continue with my strategy to have an index entry of an
>>> employee/employer with its services for any given point in time. But
>>> there are much more entries than now since every involved
>>> validfrom/validuntil period (if they overlap) produces more entries.
>>> But I am not sure if this is a good strategy, or if it would be better
>>> to try to index the whole datastructure in an other way.
>>>
>>> Are there any recommendations how to handle such a case?
>>>
>>> Thanks for any help
>>> Stephan



-- 
Lance Norskog
goksron@gmail.com

Re: How to design index for related versioned database records

Posted by Erick Erickson <er...@gmail.com>.

Hmmm, how many employees/services/dates are we talking about
here? Is the cross product 1M? 1B? 1G records?

You could try the Solr join stuff (Solr 4x), be aware that it performs
best on join fields with a limited number of unique values.

Best
Erick

On Tue, Aug 21, 2012 at 4:05 AM, Stefan Burkard <sb...@gmail.com> wrote:
> Hi Jack
>
> Thanks for your answer. Do I understand that correctly that I must
> create a "merge-entity" that contains all the different
> validFrom/validUntil dates as fields (and of course the other
> search-related fields).
>
> This would mean that the number of index entries is equal to the
> number of all possible combinations of from/until date-ranges in a
> "record-chain" (all records with all their individual versions
> connected by foreign keys) since every combination creates a new
> record in a query across all tables. That also means that I will have
> a lot of entries with the same values in the other search-related
> fields - the only difference will be most of the time one of the
> from/until-ranges.
> Perhaps the query can be optimized so that irrelevant combinations can
> be avoided (for example if two date-ranges do not overlap).
>
> Then, when I have built that index I can query it with the reference
> date as argument to compare it with every from/until range in the
> chain. And so I get only the relevant entries where the reference date
> is between all from/until ranges.
>
> Is this correct?
>
> Thanks and regards
> Stephan
>
>
>
> On Wed, Aug 15, 2012 at 2:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
>> The date checking can be implemented using range query as a filter query,
>> such as
>>
>> &fq=startDate:[* TO NOW] AND endDate:[NOW TO *]
>>
>> (You can also use an "frange" query.)
>>
>> Then you will have to flatten the database tables. Your Solr schema would
>> have a single "merged" record type. You will have to decide whether the
>> different record types (tables) will have common fields versus static
>> qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name"
>> and "employer_name". The latter has the advantage that you do not have to
>> separately specify a table "type" field since the fields would be empty for
>> records of other types.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Stefan Burkard
>> Sent: Wednesday, August 15, 2012 8:12 AM
>> To: solr-user@lucene.apache.org
>> Subject: How to design index for related versioned database records
>>
>>
>> Hi solr-users
>>
>> I have a case where I need to build an index from a database.
>>
>> ***Data structure***
>> The data is spread across multiple tables and in each table the
>> records are versioned - this means that one "real" record can exist
>> multiple times in a table, each with different validFrom/validUntil
>> dates. Therefore it is possible to query the valid version of a record
>> for a given point in time.
>>
>> The relations of the data are something like this:
>> Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
>> (=offered services) <-> Service
>>
>> That means I have data across 5 relations, each of them with versioned
>> records.
>>
>> ***Search needs***
>> Now I need to be able to search for employees and employers based on
>> the services they offer for a given point in time.
>>
>> Therefore I have built an index of all employees and employers with
>> their services as subentity. So I have one index entry for every
>> version of every employee/employer and each version collects the
>> offered services for the given timeframe of the employee/employer
>> version.
>>
>> Problem: The offered services of an employee/employer can change
>> during its validity period. That means I do not only need to take the
>> version timespan of the employee/employer into account but also the
>> version timespans of services and the link-tables.
>>
>> ***Question***
>> I think I could continue with my strategy to have an index entry of an
>> employee/employer with its services for any given point in time. But
>> there are much more entries than now since every involved
>> validfrom/validuntil period (if they overlap) produces more entries.
>> But I am not sure if this is a good strategy, or if it would be better
>> to try to index the whole datastructure in an other way.
>>
>> Are there any recommendations how to handle such a case?
>>
>> Thanks for any help
>> Stephan

Re: How to design index for related versioned database records

Posted by Stefan Burkard <sb...@gmail.com>.

Hi Jack

Thanks for your answer. Do I understand that correctly that I must
create a "merge-entity" that contains all the different
validFrom/validUntil dates as fields (and of course the other
search-related fields).

This would mean that the number of index entries is equal to the
number of all possible combinations of from/until date-ranges in a
"record-chain" (all records with all their individual versions
connected by foreign keys) since every combination creates a new
record in a query across all tables. That also means that I will have
a lot of entries with the same values in the other search-related
fields - the only difference will be most of the time one of the
from/until-ranges.
Perhaps the query can be optimized so that irrelevant combinations can
be avoided (for example if two date-ranges do not overlap).

Then, when I have built that index I can query it with the reference
date as argument to compare it with every from/until range in the
chain. And so I get only the relevant entries where the reference date
is between all from/until ranges.

Is this correct?

Thanks and regards
Stephan



On Wed, Aug 15, 2012 at 2:32 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> The date checking can be implemented using range query as a filter query,
> such as
>
> &fq=startDate:[* TO NOW] AND endDate:[NOW TO *]
>
> (You can also use an "frange" query.)
>
> Then you will have to flatten the database tables. Your Solr schema would
> have a single "merged" record type. You will have to decide whether the
> different record types (tables) will have common fields versus static
> qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name"
> and "employer_name". The latter has the advantage that you do not have to
> separately specify a table "type" field since the fields would be empty for
> records of other types.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Stefan Burkard
> Sent: Wednesday, August 15, 2012 8:12 AM
> To: solr-user@lucene.apache.org
> Subject: How to design index for related versioned database records
>
>
> Hi solr-users
>
> I have a case where I need to build an index from a database.
>
> ***Data structure***
> The data is spread across multiple tables and in each table the
> records are versioned - this means that one "real" record can exist
> multiple times in a table, each with different validFrom/validUntil
> dates. Therefore it is possible to query the valid version of a record
> for a given point in time.
>
> The relations of the data are something like this:
> Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
> (=offered services) <-> Service
>
> That means I have data across 5 relations, each of them with versioned
> records.
>
> ***Search needs***
> Now I need to be able to search for employees and employers based on
> the services they offer for a given point in time.
>
> Therefore I have built an index of all employees and employers with
> their services as subentity. So I have one index entry for every
> version of every employee/employer and each version collects the
> offered services for the given timeframe of the employee/employer
> version.
>
> Problem: The offered services of an employee/employer can change
> during its validity period. That means I do not only need to take the
> version timespan of the employee/employer into account but also the
> version timespans of services and the link-tables.
>
> ***Question***
> I think I could continue with my strategy to have an index entry of an
> employee/employer with its services for any given point in time. But
> there are much more entries than now since every involved
> validfrom/validuntil period (if they overlap) produces more entries.
> But I am not sure if this is a good strategy, or if it would be better
> to try to index the whole datastructure in an other way.
>
> Are there any recommendations how to handle such a case?
>
> Thanks for any help
> Stephan

Re: How to design index for related versioned database records

Posted by Jack Krupansky <ja...@basetechnology.com>.

The date checking can be implemented using range query as a filter query, 
such as

&fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

(You can also use an "frange" query.)

Then you will have to flatten the database tables. Your Solr schema would 
have a single "merged" record type. You will have to decide whether the 
different record types (tables) will have common fields versus static 
qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name" 
and "employer_name". The latter has the advantage that you do not have to 
separately specify a table "type" field since the fields would be empty for 
records of other types.

-- Jack Krupansky

-----Original Message----- 
From: Stefan Burkard
Sent: Wednesday, August 15, 2012 8:12 AM
To: solr-user@lucene.apache.org
Subject: How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one "real" record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
(=offered services) <-> Service

That means I have data across 5 relations, each of them with versioned 
records.

***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy to have an index entry of an
employee/employer with its services for any given point in time. But
there are much more entries than now since every involved
validfrom/validuntil period (if they overlap) produces more entries.
But I am not sure if this is a good strategy, or if it would be better
to try to index the whole datastructure in an other way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan