Posted to solr-user@lucene.apache.org by Mark Allan <ma...@ed.ac.uk> on 2010/06/22 15:53:06 UTC

Searching across multiple repeating fields

Hi all,

Firstly, I apologise for the length of this email but I need to  
describe properly what I'm doing before I get to the problem!

I'm working on a project just now which requires the ability to store
and search on temporal coverage data - i.e. a field which specifies a
date range during which a certain event took place.

I hunted around for a few days and couldn't find anything which seemed  
to fit, so I had a go at writing my own field type based on  
solr.PointType.  It's used as follows:
   schema.xml
	<fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
	<field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
   data.xml
	<add>
	<doc>
	...
	<field name="daterange">1940,1945</field>
	</doc>
	</add>

Internally, this gets stored as:
     <arr name="daterange"><str>1940,1945</str></arr>
     <int name="daterange_0_i">19400000</int>
     <int name="daterange_1_i">19450000</int>

In due course, I'll declare the subfields as a proper date type, but  
in the meantime, this works absolutely fine.  I can search for an  
individual date and Solr will check (queryDate > daterange_0 AND  
queryDate < daterange_1 ) and the correct documents are returned.  My  
code also allows the user to input a date range in the query but I  
won't complicate matters with that just now!

The problem arises when a document has more than one "daterange" field  
(imagine a news broadcast which covers a variety of topics and hence  
time periods).

A document with two daterange fields
	<doc>
	...
	<field name="daterange">19820402,19820614</field>
	<field name="daterange">1990,2000</field>
	</doc>
gets stored internally as
     <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
     <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
     <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>

In this situation, searching for 1985 should yield zero results because
it falls within neither daterange.  However, the above document is
returned in the result set.  What Solr is doing is checking that the
queryDate (1985) is greater than *any* of the values in daterange_0
AND that queryDate is less than *any* of the values in daterange_1.

How can I get Solr to respect the positions of each item in the  
daterange_0 and _1 arrays?  Ideally I'd like the search to use the  
following logic, thus preventing the above document from being  
returned in a search for 1985:
	(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR  
(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
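
A minimal sketch of the difference between the two behaviours, written
in plain Python rather than against Solr itself.  The function names
are made up for illustration, and the query year 1985 is assumed to be
normalised to 19850000 in the same way that 1940 is stored as 19400000
above.

```python
from typing import List, Tuple


def matches_flattened(query: int, starts: List[int], ends: List[int]) -> bool:
    """Each multi-valued subfield is tested on its own, so the start and
    end values that satisfy the query may come from different ranges."""
    return any(query > s for s in starts) and any(query < e for e in ends)


def matches_pairwise(query: int, ranges: List[Tuple[int, int]]) -> bool:
    """Start and end are kept together, so the query must fall inside
    at least one complete range."""
    return any(start < query < end for start, end in ranges)


# The two dateranges from the example document.
ranges = [(19820402, 19820614), (19900000, 20000000)]
starts = [s for s, _ in ranges]   # what ends up in daterange_0_i
ends = [e for _, e in ranges]     # what ends up in daterange_1_i

query = 19850000  # assumed normalised form of "1985"

print(matches_flattened(query, starts, ends))  # True: the unwanted match
print(matches_pairwise(query, ranges))         # False: the desired result
```

The unwanted match comes from pairing the start of the first range
(19820402) with the end of the second (20000000).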

Someone else had a very similar problem recently on the mailing list  
with a multiValued PointType field but the thread went cold without a  
final solution.

While I could filter the results when they get back to my application  
layer, it seems like it's not really the right place to do it.

Any help getting Solr to respect the positions of items in arrays  
would be very gratefully received.

Many thanks,
Mark


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: remove from list

Posted by Susan Rust <su...@achieveinternet.com>.
Will do -- but wasn't selling -- trying to donate!

Susan Rust
VP of Client Services

If you wish to travel quickly, go alone
If you wish to travel far, go together
------------------------------------------------
Achieve Internet
1767 Grand Avenue, Suite 2
San Diego, CA 92109

800-618-8777 x106
858-453-5760 x106

Susan-Rust (skype)
@Susan_Rust (twitter)
@Achieveinternet (twitter)
@drupalsandiego (San Diego Drupal Users' Group Twitter)



This message contains confidential information and is intended only  
for the individual named. If you are not the named addressee you  
should not disseminate, distribute or copy this e-mail. Please notify  
the sender immediately by e-mail if you have received this e-mail by  
mistake and delete this e-mail from your system. E-mail transmission  
cannot be guaranteed to be secure or error-free as information could  
be intercepted, corrupted, lost, destroyed, arrive late or incomplete,  
or contain viruses. The sender therefore does not accept liability for  
any errors or omissions in the contents of this message, which arise  
as a result of e-mail transmission. If verification is required please  
request a hard-copy version.

On Jun 23, 2010, at 9:30 AM, Markus Jelsma wrote:

> If you want to unsubscribe, then you can do so [1] without trying to  
> sell something ;)
>
>
>
> [1]: http://lucene.apache.org/solr/mailing_lists.html
>
>
>
> Cheers!
>
> On Jun 23, 2010, at 1:52 AM, Mark Allan wrote:
>
>> Cheers, Geert-Jan, that's very helpful.
>>
>> We won't always be searching with dates and we wouldn't want
>> duplicates to show up in the results, so your second suggestion
>> looks like a good workaround if I can't solve the actual problem.
>> I didn't know about FieldCollapsing, so I'll definitely keep it in
>> mind.
>>
>> Thanks
>> Mark
>>
>> On 22 Jun 2010, at 3:44 pm, Geert-Jan Brits wrote:
>>
>>> Perhaps my answer is useless, because I don't have an answer to
>>> your direct question, but you *might* want to consider whether
>>> your concept of a Solr document is at the correct level of
>>> granularity, i.e.:
>>>
>>> The problem you posted could be tackled (afaik) by defining a
>>> document as a 'sub-event' with only 1 daterange.  Each event-doc
>>> you have now is replaced by several sub-event docs.
>>>
>>> Additionally, each sub-event doc gets an extra field
>>> 'parent-eventid' which maps to something like an event-id (which
>>> you're probably using).  So several sub-event docs can point to
>>> the same event-id.
>>>
>>> Lastly, all sub-event docs belonging to a particular event carry
>>> all the other fields that you may have stored in that particular
>>> event-doc.
>>>
>>> Now you can query for events based on date ranges like you
>>> envisioned, but instead of returning events you return sub-event
>>> docs.  However, since all data of the original event (except the
>>> multiple dateranges) is available in the sub-event doc, this
>>> shouldn't really bother the client.  If you need to display all
>>> dates of an event (the only info missing from the returned Solr
>>> doc) you could easily store them in an RDB and fetch them using
>>> the defined parent-eventid.
>>>
>>> The only caveat I see is that multiple sub-events with the same
>>> 'parent-eventid' might get returned for a particular query.  This
>>> depends on the type of queries you envision, i.e.:
>>> 1)  If you always issue queries with date filters, and *assuming*
>>> that sub-events of a particular event don't temporally overlap,
>>> you will never get multiple sub-events returned.
>>> 2)  If 1) doesn't hold, and assuming you *do* mind multiple
>>> sub-events of the same actual event, you could try to use Field
>>> Collapsing on 'parent-eventid' to only return the first sub-event
>>> per parent-eventid that matches the rest of your query.  (Note,
>>> however, that Field Collapsing is a patch at the moment:
>>> http://wiki.apache.org/solr/FieldCollapsing)
>>>
>>> Not sure if this helped you at all, but at the very least it was a
>>> nice conceptual exercise ;-)
>>>
>>> Cheers,
>>> Geert-Jan
>>>
>
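
Geert-Jan's sub-event suggestion quoted above can be sketched roughly
as follows.  This is plain Python operating on dictionaries, not Solr
client code: the "parent-eventid" field name comes from his message,
while the document layout, the "title" field, the id scheme and the
helper names are assumptions made purely for illustration.

```python
def split_event(event):
    """Turn one event with several dateranges into one sub-event doc per
    range, copying every other field and recording the parent id."""
    shared = {k: v for k, v in event.items() if k not in ("id", "dateranges")}
    return [
        {
            "id": f"{event['id']}-{n}",        # hypothetical id scheme
            "parent-eventid": event["id"],
            "daterange": date_range,
            **shared,
        }
        for n, date_range in enumerate(event["dateranges"])
    ]


def in_range(doc, query):
    """Single-range check: unambiguous because each doc has one range."""
    start, end = doc["daterange"]
    return start < query < end


def collapse_by_parent(docs):
    """Keep only the first matching sub-event per parent, which is the
    effect hoped for from collapsing on 'parent-eventid'."""
    first = {}
    for doc in docs:
        first.setdefault(doc["parent-eventid"], doc)
    return list(first.values())


event = {
    "id": "broadcast-1",
    "title": "Evening news",
    "dateranges": [(19820402, 19820614), (19900000, 20000000)],
}
sub_events = split_event(event)

hits_1985 = [d for d in sub_events if in_range(d, 19850000)]
hits_1995 = [d for d in sub_events if in_range(d, 19950000)]
print(len(hits_1985), len(hits_1995))          # 0 1
print(len(collapse_by_parent(sub_events)))     # 1
```

Because every sub-event doc carries exactly one range, the start and
end values can no longer be mixed up across ranges.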


RE: remove from list

Posted by Markus Jelsma <ma...@buyways.nl>.
If you want to unsubscribe, then you can do so [1] without trying to sell something ;)

 

[1]: http://lucene.apache.org/solr/mailing_lists.html

 

Cheers!
 


remove from list

Posted by Susan Rust <su...@achieveinternet.com>.
Hey SOLR folks -- There's too much info for me to digest, so please  
remove me from the email threads.

However, if we can build you a forum, bulletin board or other web- 
based tool, please let us know. For that matter, we would be happy to  
build you a new website.

Bill O'Connor is our CTO and the Drupal.org SOLR Redesign Lead. So we  
love SOLR! Let us know how we can support your efforts.

Susan Rust
VP of Client Services



Re: Searching across multiple repeating fields

Posted by Mark Allan <ma...@ed.ac.uk>.
In case anyone's interested (and I know at least one person is because  
they asked me where to find the solr.TemporalCoverage class - sorry  
that was my fault, I shouldn't have used the default package name),  
here's how I got around the problem.

It's not the neatest solution in the world, but it does work and  
performance doesn't seem to take a hit when I do it this way.  That  
said, I've only tested it with approximately 55,000 documents, so your  
mileage may vary.

I'm defining daterange as a dynamic field with the pattern  
"daterange*". If any document should have more than one daterange  
field, my script which generates appropriately formatted XML will  
append subsequent fieldnames with a counter like so:

   <field name="daterange">19820402,19820614</field>
   <field name="daterange1">1990,2000</field>

However, the problem with this approach is that the subfields end up  
getting called daterange_0_i and daterange_1_i and these in turn also  
match the dynamicField pattern for the main daterange field. So to  
avoid this, I modified a copy of AbstractSubTypeFieldType.java to use  
a substring of the main fieldname when naming the internal subfields.   
They now come out as aterange_0_i and aterange_1_i.
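
The name clash can be illustrated with a small sketch.  It uses
Python's fnmatch as a stand-in for Solr's dynamicField matching, which
is an assumption, and the subfield naming is inferred only from the
names quoted above (field name, then "_0"/"_1", then the "_i" suffix).
The skip_chars parameter is a hypothetical way of expressing "use a
substring of the main fieldname".

```python
from fnmatch import fnmatchcase

DYNAMIC_FIELD_PATTERN = "daterange*"  # pattern from the dynamicField


def subfield_names(field_name, dimension=2, suffix="_i", skip_chars=0):
    """Build the internal subfield names for a field, optionally dropping
    leading characters of the field name (hypothetical parameter)."""
    base = field_name[skip_chars:]
    return [f"{base}_{i}{suffix}" for i in range(dimension)]


def clashes(names):
    """Return the names that would also match the dynamic field pattern."""
    return [n for n in names if fnmatchcase(n, DYNAMIC_FIELD_PATTERN)]


original = subfield_names("daterange")
renamed = subfield_names("daterange", skip_chars=1)

print(original, clashes(original))  # both subfields match "daterange*"
print(renamed, clashes(renamed))    # neither subfield matches
```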

Next, in order to ensure that all daterange fields (eg daterange,  
daterange1, daterange2 etc) get used in a search, I implemented a  
crude query parser which expands the user's query to include all  
daterange* fields.  It uses a "maxtempcoveragefields" default setting  
in solrconfig.xml to determine at runtime how many times the user's  
query should be expanded before passing it on to the default parser.
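
As a rough sketch of the expansion idea only (not the actual
TemporalCoverageQParserPlugin code): the field names follow the
daterange, daterange1, daterange2 scheme described above, while the
"field:value" clause syntax, the OR joining and the reading of the
limit as "total number of fields to include" are all assumptions.

```python
def expand_fields(base="daterange", max_fields=1):
    """List the field names a query should be expanded over:
    base, base1, base2, ... up to max_fields names in total."""
    return [base] + [f"{base}{i}" for i in range(1, max_fields)]


def expand_query(value, base="daterange", max_fields=1):
    """Repeat the user's clause once per field and OR the clauses together
    (illustrative syntax, assumed rather than taken from the plugin)."""
    return " OR ".join(f"{field}:{value}" for field in expand_fields(base, max_fields))


print(expand_query("1985", max_fields=3))
# daterange:1985 OR daterange1:1985 OR daterange2:1985
```

Each clause is then evaluated against a single, unambiguous pair of
subfields, which is what avoids the cross-range matching.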

Here are snippets of how everything looks:
solrconfig.xml
	<requestHandler name="standard" class="solr.SearchHandler" default="true">
	     <lst name="defaults">
	       <int name="maxtempcoveragefields">1</int>
	....
	<queryParser name="temporalcoverageqparser" class="uk.ac.edina.solr.search.TemporalCoverageQParserPlugin" />

schema.xml
	<fieldType name="temporal" class="uk.ac.edina.solr.schema.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
	<dynamicField name="daterange*" type="temporal" indexed="true" stored="true" />

update.xml
<doc>
   ...
   <field name="daterange">19820402,19820614</field>
   <field name="daterange1">1990,2000</field>
</doc>

If anyone wants the code as it is just now, I can happily provide it.  
Alternatively, if you think it might be of use to others, I can roll  
it back into the org.apache.solr packages and submit it to the  
repository so that those with more Solr experience than I can see if  
it could be better implemented another way.

Cheers,

Mark



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


Re: Searching across multiple repeating fields

Posted by Mark Allan <ma...@ed.ac.uk>.
Cheers, Geert-Jan, that's very helpful.

We won't always be searching with dates and we wouldn't want  
duplicates to show up in the results, so your second suggestion looks  
like a good workaround if I can't solve the actual problem.  I didn't  
know about FieldCollapsing, so I'll definitely keep it in mind.

Thanks
Mark





Re: Searching across multiple repeating fields

Posted by Geert-Jan Brits <gb...@gmail.com>.
Perhaps my answer is useless, because I don't have an answer to your direct
question, but:
You *might* want to consider whether your concept of a Solr document is at
the correct level of granularity, i.e.:

The problem you posted could be tackled (afaik) by defining a document as a
'sub-event' with only 1 daterange.
So each event-doc you have now is replaced by several sub-event
docs in this proposed situation.

Additionally each sub-event doc gets an additional field 'parent-eventid'
which maps to something like an event-id (which you're probably using).
So several sub-event docs can point to the same event-id.

Lastly, all sub-event docs belonging to a particular event carry all the
other fields that you may have stored in that particular event-doc.

Now you can query for events based on date ranges like you envisioned, but
instead of returning events you return sub-event docs. However, since all
data of the original event (except the multiple dateranges) is available in
the sub-event doc, this shouldn't really bother the client. If you need to
display all dates of an event (the only info missing from the returned
Solr doc) you could easily store them in an RDB and fetch them using the
defined parent-eventid.
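
A rough sketch of the split, in Python for brevity. The document layout, the "title" field and the helper names are assumptions made up for the example; only 'parent-eventid' and the daterange values come from this thread. Because each sub-event doc carries exactly one range, a start/end check can no longer pair the start of one range with the end of another.

```python
def split_into_sub_events(event):
    """Turn one event with several date ranges into one doc per range."""
    shared = {k: v for k, v in event.items() if k not in ("id", "dateranges")}
    sub_events = []
    for n, (start, end) in enumerate(event["dateranges"]):
        sub = dict(shared)                    # copy the other event fields
        sub["id"] = f"{event['id']}-{n}"      # hypothetical id scheme
        sub["parent-eventid"] = event["id"]
        sub["daterange"] = (start, end)
        sub_events.append(sub)
    return sub_events


def matches(sub_event, query_date):
    """Same strict comparison as in the original post."""
    start, end = sub_event["daterange"]
    return start < query_date < end


event = {
    "id": "ev1",
    "title": "News broadcast",
    "dateranges": [(19820402, 19820614), (19900000, 20000000)],
}
subs = split_into_sub_events(event)
hits_1985 = [s["id"] for s in subs if matches(s, 19850000)]
hits_1995 = [s["id"] for s in subs if matches(s, 19950000)]
print(hits_1985, hits_1995)
```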

The only caveat I see is that multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This, however, depends on the type of queries you envision, i.e.:
1)  If you always issue queries with date-filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will never
get multiple sub-events returned.
2)  If 1) doesn't hold, and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid that
matches the rest of your query. (Note, however, that Field Collapsing is a
patch at the moment. http://wiki.apache.org/solr/FieldCollapsing)
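
To show the effect being aimed for in 2), here is a small sketch of keeping only the first hit per 'parent-eventid'. It only mimics the outcome on a result list that has already been fetched; it is not how the Field Collapsing patch works internally, and collapse_on is a made-up name.

```python
def collapse_on(docs, field="parent-eventid"):
    """Keep the first doc seen for each distinct value of `field`."""
    seen = set()
    collapsed = []
    for doc in docs:
        key = doc[field]
        if key not in seen:
            seen.add(key)
            collapsed.append(doc)
    return collapsed


results = [
    {"id": "ev1-0", "parent-eventid": "ev1"},
    {"id": "ev2-0", "parent-eventid": "ev2"},
    {"id": "ev1-1", "parent-eventid": "ev1"},
]
collapsed_ids = [doc["id"] for doc in collapse_on(results)]
print(collapsed_ids)
```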

Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)

Cheers,
Geert-Jan

