You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by gary tam <gu...@gmail.com> on 2011/09/15 20:55:15 UTC

query for point in time

Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith             web site bug fix       2010-01-01       2010-01-03
                             unit testing              2010-01-04
2010-01-06
                             QA support             2010-01-07
2010-01-12
                             implementation       2010-01-13
 2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks

Re: query for point in time

Posted by Jonathan Rochkind <ro...@jhu.edu>.
I think there's something wrong with your database then, but okay.

You still haven't said what your Solr schema looks like -- that list of 
values doesn't say what the solr field names or types are. I think this 
is maybe because you don't actually have a Solr database and have no 
idea how Solr works, you're just asking in theory? On the other hand, 
you just said you have better performance with solr -- I'm not sure how 
you were able to tell the performance of solr in answering these queries 
if you don't even know how to make them!

But, again, assuming your data is set up like i'm guessing it is, it's 
quite similar to what you'd do with an rdbms.

What does 'most current' mean? Can jobs be overlapping? To find the 
project with the latest start date for a given person, just limit to 
documents with that current person in a 'q' or 'fq', and then sort by 
"start_date desc". Perhaps limit to 1 if you really only want one hit.  
Same principle as you would in an rdbms.

Again, this requires setting up your solr index in such a way to answer 
these sorts of questions. Each "document" in Solr will represent a 
person-project pair.  It'll have fields for person (or multiple fields, 
personID, personFirst, personLast, etc), project name, project start 
date, project end date.  This will make it easy/possible to answer 
questions like your examples with Solr, but will make it hard to answer 
many other sorts of questions -- unlike an rdbms, it is difficult to set 
up a Solr index that can flexibly answer just about any question you 
through at it, particularly when you have hieararchical or otherwise 
multi-entity data.

If you are interested, the standard Solr tutorial is pretty good: 
http://lucene.apache.org/solr/tutorial.html




On 9/15/2011 6:39 PM, gary tam wrote:
> Thanks for the reply.  We had the search within the database initially, but
> it proven to be too slow.  With solr we have much better performance.
>
> One more question, how could I find the most current job for each employee
>
> My data looks like
>
>
> John Smith  department A       web site bug fix       2010-01-01
> 2010-01-03
>                                               unit testing
>   2010-01-04       2010-01-06
>                                               QA support
> 2010-01-07       2010-01-12
>                                               implementation       2010-01-13
>         2010-01-22
>
> Jane Doe  department A          QA support             2010-01-01
> 2010-05-01
>                                               implementation       2010-05-02
>         2010-09-28
>
> Joe Doe  department A            PHP development  2011-01-01
> 2011-08-31
>                                               Java Development  2011-09-01
>      2011-09-15
>
> I would like to return this as my search result
>
> John Smith       department A            implementation          2010-01-13
>        2010-01-22
> Jane Doe          department A            implementation          2010-05-02
>        2010-09-28
> Joe Doe            department A            Java Development      2011-09-01
>        2011-09-15
>
>
> Thanks in advance
> Gary
>
>
>
> On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind<ro...@jhu.edu>  wrote:
>
>> You didn't tell us what your schema looks like, what fields with what types
>> are involved.
>>
>> But similar to how you'd do it in your database, you need to find
>> 'documents' that have a start date before your date in question, and an end
>> date after your date in question, to find the ones whose range includes your
>> date in question.
>>
>> Something like this:
>>
>> q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]
>>
>> Of course, you need to add on your restriction to just documents about
>> 'John Smith', through another AND clause or an 'fq'.
>>
>> But in general, if you've got a db with this info already, and this is all
>> you need, why not just use the db?  Multi-hieararchy data like this is going
>> to give you trouble in Solr eventually, you've got to arrange the solr
>> indexes/schema to answer your questions, and eventually you're going to have
>> two questions which require mutually incompatible schema to answer.
>>
>> An rdbms is a great general purpose question answering tool for structured
>> data.  lucene/Solr is a great indexing tool for text matching.
>>
>>
>> On 9/15/2011 2:55 PM, gary tam wrote:
>>
>>> Hi
>>>
>>> I have a scenario that I am not sure how to write the query for.
>>>
>>> Here is the scenario - have an employee record with multi value for
>>> project,
>>> started date, end date.
>>>
>>> looks something like
>>>
>>>
>>> John Smith             web site bug fix       2010-01-01       2010-01-03
>>>                               unit testing              2010-01-04
>>> 2010-01-06
>>>                               QA support             2010-01-07
>>> 2010-01-12
>>>                               implementation       2010-01-13
>>>   2010-01-22
>>>
>>> I want to find what project John Smith was working on 2010-01-05
>>>
>>> Is this possible or I have to back to my database ?
>>>
>>>
>>> Thanks
>>>
>>>

Re: query for point in time

Posted by gary tam <gu...@gmail.com>.
Thanks for the reply.  We had the search within the database initially, but
it proven to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith  department A       web site bug fix       2010-01-01
2010-01-03
                                             unit testing
 2010-01-04       2010-01-06
                                             QA support
2010-01-07       2010-01-12
                                             implementation       2010-01-13
       2010-01-22

Jane Doe  department A          QA support             2010-01-01
2010-05-01
                                             implementation       2010-05-02
       2010-09-28

Joe Doe  department A            PHP development  2011-01-01
2011-08-31
                                             Java Development  2011-09-01
    2011-09-15

I would like to return this as my search result

John Smith       department A            implementation          2010-01-13
      2010-01-22
Jane Doe          department A            implementation          2010-05-02
      2010-09-28
Joe Doe            department A            Java Development      2011-09-01
      2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:

> You didn't tell us what your schema looks like, what fields with what types
> are involved.
>
> But similar to how you'd do it in your database, you need to find
> 'documents' that have a start date before your date in question, and an end
> date after your date in question, to find the ones whose range includes your
> date in question.
>
> Something like this:
>
> q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]
>
> Of course, you need to add on your restriction to just documents about
> 'John Smith', through another AND clause or an 'fq'.
>
> But in general, if you've got a db with this info already, and this is all
> you need, why not just use the db?  Multi-hieararchy data like this is going
> to give you trouble in Solr eventually, you've got to arrange the solr
> indexes/schema to answer your questions, and eventually you're going to have
> two questions which require mutually incompatible schema to answer.
>
> An rdbms is a great general purpose question answering tool for structured
> data.  lucene/Solr is a great indexing tool for text matching.
>
>
> On 9/15/2011 2:55 PM, gary tam wrote:
>
>> Hi
>>
>> I have a scenario that I am not sure how to write the query for.
>>
>> Here is the scenario - have an employee record with multi value for
>> project,
>> started date, end date.
>>
>> looks something like
>>
>>
>> John Smith             web site bug fix       2010-01-01       2010-01-03
>>                              unit testing              2010-01-04
>> 2010-01-06
>>                              QA support             2010-01-07
>> 2010-01-12
>>                              implementation       2010-01-13
>>  2010-01-22
>>
>> I want to find what project John Smith was working on 2010-01-05
>>
>> Is this possible or I have to back to my database ?
>>
>>
>> Thanks
>>
>>

Re: query for point in time

Posted by Jonathan Rochkind <ro...@jhu.edu>.
You didn't tell us what your schema looks like, what fields with what 
types are involved.

But similar to how you'd do it in your database, you need to find 
'documents' that have a start date before your date in question, and an 
end date after your date in question, to find the ones whose range 
includes your date in question.

Something like this:

q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

Of course, you need to add on your restriction to just documents about 
'John Smith', through another AND clause or an 'fq'.

But in general, if you've got a db with this info already, and this is 
all you need, why not just use the db?  Multi-hieararchy data like this 
is going to give you trouble in Solr eventually, you've got to arrange 
the solr indexes/schema to answer your questions, and eventually you're 
going to have two questions which require mutually incompatible schema 
to answer.

An rdbms is a great general purpose question answering tool for 
structured data.  lucene/Solr is a great indexing tool for text matching.

On 9/15/2011 2:55 PM, gary tam wrote:
> Hi
>
> I have a scenario that I am not sure how to write the query for.
>
> Here is the scenario - have an employee record with multi value for project,
> started date, end date.
>
> looks something like
>
>
> John Smith             web site bug fix       2010-01-01       2010-01-03
>                               unit testing              2010-01-04
> 2010-01-06
>                               QA support             2010-01-07
> 2010-01-12
>                               implementation       2010-01-13
>   2010-01-22
>
> I want to find what project John Smith was working on 2010-01-05
>
> Is this possible or I have to back to my database ?
>
>
> Thanks
>