Posted to solr-dev@lucene.apache.org by Peter S <pe...@hotmail.com> on 2010/01/14 22:07:51 UTC
Date Facet duplicate counts
I saw some previous threads related to this subject, but on a slightly different use case, so starting a new thread...
For reference, a related thread topic can be found here:
http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#2025d6670004838b
This has to do with date facets setting double counts across adjacent date facets, if the documents' time is 'on the cusp'.
In fact, I found this problem because I was testing date facets where the gap is +1SECOND. In this case many/most/all document counts can be duplicated, because as a general rule in my case, milliseconds are set to 0, and there is 'No logic for milliseconds' in the DateMathParser. This behaviour can sometimes be observed in general date faceting -- in the +1SECOND scenario, it is much more likely to occur (because these values are more likely to be quantized).
I had a look at the date math relating to this (in SimpleFacets.java : getFacetDateCounts()), and I noticed the following line of code (~line 622):
resInner.add(label, rangeCount(sf,low,high,true,true));
The two 'true' booleans mean: 'include at start of range' *AND* 'include at end of range'. Any documents that live on the border will match in date.facet[n] and date.facet[n+1], because of the 'double-sided' inclusive range search.
By convention, a time value of '0' (00:00) belongs to the next period, rather than the previous, so I changed the *first* boolean to false, and voila! no more duplications! I believe this will be the case for other gap values, not just +1SECOND.
As there's no need to read any '[' or '{' because date faceting doesn't have/need these, the patch couldn't be simpler.
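To make the cusp problem concrete, here is a minimal standalone sketch (not Solr's actual rangeCount code) showing that a timestamp sitting exactly on a facet boundary matches two adjacent ranges when both ends are inclusive, and only one range once the start is made exclusive, as in the change described above:

```python
from datetime import datetime, timedelta

# Toy illustration: does a timestamp fall inside [low, high] given
# per-end inclusiveness flags, mirroring the two booleans in rangeCount?
def in_range(ts, low, high, incl_low, incl_high):
    above = ts >= low if incl_low else ts > low
    below = ts <= high if incl_high else ts < high
    return above and below

t0 = datetime(2010, 1, 14, 12, 0, 0)   # on the cusp: milliseconds == 0
gap = timedelta(seconds=1)

# +1SECOND facets, inclusive on both ends -> the doc matches both
# adjacent ranges and is counted twice:
both = [in_range(t0, t0 - gap, t0, True, True),
        in_range(t0, t0, t0 + gap, True, True)]   # [True, True]

# Exclusive at the start of each range -> counted exactly once:
half = [in_range(t0, t0 - gap, t0, False, True),
        in_range(t0, t0, t0 + gap, False, True)]  # [True, False]
```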
My question to the experts of this code is:
Was this done for a reason - are there any implications somewhere else for having a Lucene-double-sided-inclusive search?
I can't think of any reason, but perhaps someone knows differently?
If interested parties are in agreement, I can create an issue for it and the associated fix.
Many thanks,
Peter
RE: Date Facet duplicate counts
Posted by Chris Hostetter <ho...@fucit.org>.
: > should we be inclusive of the lower or the upper? ... even if we make it
: > an option, how should it apply to the "first" and "last" ranges computed?
: > do the answers change if facet.date.other includes "before" and/or "after"
: > should the "between" option be inclusive of both end points as well?
: I guess to be consistent, the 'inclusiveness' should be intrinsically handled in
: detection of '[', '{' and/or ']', '}' -- i.e. match the facet.date field with the corresponding field
: in the query - e.g.: q= *:* AND timestamp:[then TO now}&date.facet=timestamp .
...that only works if the datefield being faceted on is included in the query --
which is frequently not the case, particularly on the "first" request of a
session, where you want to facet on date, but the user has not yet made any
attempt to restrict by any of those facets.
: If no such token exists in the query, perhaps the date.facet token parsing could process
: an option ala: date.facet=[timestamp} to explicitly set the edge behaviour, or to override a match
: in the query parser tokenization.
:
: This way, there's no new explicit option; it would work with existing queries (no extra []{}'s = default behaviour);
: and people could easily add it if they need custom edge behaviour.
I suppose ... but it still doesn't address some of the outstanding
questions i pointed out before (handling the first/last range in the block
... ie: "i want inclusive of the lower, exclusive of the upper, except for
the last range which should be inclusive of both). Personally i think
adding a new option is just as clear as adding markup to the "date.facet"
param parsing ... the less we make assumptions about what "special
characters" people have in their fieldnames the better.
: Another way to deal with it is to add MILLISECOND logic to the DateMathParser. Then the '1ms' adjustment
: at one and/or the other could be done by the caller at query time, leaving the stored data intact, and leaving
: the server-side date faceting as it is. In fact, this can be done today using SECOND, but can be a problem if:
: - You're using HOURS or DAYS and don't want to convert to SECONDS each time, or
: - You need granularity to the SECOND
I don't follow you at all ... yes this can be done today, but i don't
understand what you mean about needing to convert to seconds, or requiring
second granularity.
If you don't index with millisecond precision, then no matter what
precision you index with, this example would let you get ranges including
the "lower" bound, but not the upper bound of each range using a 1ms
"fudge" ...
facet.date=timestamp
facet.date.start=NOW/DAY-5DAYS-1MILLI
facet.date.end=NOW/DAY+1DAY-1MILLI
facet.date.gap=+1DAY
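A small standalone sketch (not Solr code) shows why the 1ms fudge works: each bucket is still inclusive on both ends, but shifting start and end back by a millisecond means documents indexed with second (or coarser) precision can never land exactly on a bucket boundary:

```python
from datetime import datetime, timedelta

# Count docs per bucket with both ends inclusive, mimicking the
# double-sided inclusive ranges used by date faceting.
def bucket_counts(docs, start, end, gap):
    counts = []
    low = start
    while low < end:
        high = low + gap
        counts.append(sum(1 for d in docs if low <= d <= high))
        low = high
    return counts

midnight = datetime(2010, 1, 14)
docs = [midnight, midnight + timedelta(hours=12)]  # one doc exactly on a day boundary

# Without the fudge, the midnight doc is counted in two adjacent buckets:
naive = bucket_counts(docs, midnight - timedelta(days=1),
                      midnight + timedelta(days=1), timedelta(days=1))

# With start/end pulled back 1ms, each doc lands in exactly one bucket:
ms = timedelta(milliseconds=1)
fudged = bucket_counts(docs, midnight - timedelta(days=1) - ms,
                       midnight + timedelta(days=1) - ms, timedelta(days=1))
```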
Brainstorming a bit...
I think the semantics that might make the most sense is to add a
multivalued "facet.date.include" param that supports the following
options: all, lower, upper, edge, outer
- "all" is shorthand for lower,upper,edge,outer and is the default
(for back compat)
- if "lower" is specified, then all ranges include their lower bound
- if "upper" is specified, then all ranges include their upper bound
- if "edge" is specified, then the first and last ranges include
their edge bounds (ie: lower for the first one, upper for the last
one) even if the corresponding "upper"/"lower" option is not
specified.
- the "between" count is inclusive of each of the start and end
bounds iff the first and last range are inclusive of them
- the "before" and "after" ranges are inclusive of their respective
bounds if:
- "outer" is specified ... OR ...
- the first and last ranges don't already include them
so assuming you started with something like...
facet.date.start=1 facet.date.end=3 facet.date.gap=+1 facet.date.other=all
...your ranges would be...
[1 TO 2], [2 TO 3] and [* TO 1], [1 TO 3], [3 TO *]
w/ facet.date.include=lower ...
[1 TO 2}, [2 TO 3} and [* TO 1}, [1 TO 3}, [3 TO *]
w/ facet.date.include=upper ...
{1 TO 2], {2 TO 3] and [* TO 1], {1 TO 3], {3 TO *]
w/ facet.date.include=lower&facet.date.include=edge ...
[1 TO 2}, [2 TO 3] and [* TO 1}, [1 TO 3], {3 TO *]
w/ facet.date.include=upper&facet.date.include=edge ...
[1 TO 2], {2 TO 3] and [* TO 1}, [1 TO 3], {3 TO *]
w/ facet.date.include=upper&facet.date.include=outer ...
{1 TO 2], {2 TO 3] and [* TO 1], {1 TO 3], [3 TO *]
...etc.
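The proposed semantics above can be sketched as a small standalone function (this is a hypothetical option, not an existing Solr parameter; integer bounds stand in for dates, and '[' / ']' mean inclusive while '{' / '}' mean exclusive, as in the examples):

```python
# Sketch of the proposed facet.date.include semantics.
# include is a set drawn from: all, lower, upper, edge, outer.
def facet_ranges(start, end, gap, include):
    if not include or "all" in include:
        include = {"lower", "upper", "edge", "outer"}
    lo = "[" if "lower" in include else "{"
    hi = "]" if "upper" in include else "}"
    bounds = list(range(start, end + 1, gap))
    ranges = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        l, h = lo, hi
        if "edge" in include and i == 0:
            l = "["            # first range always includes its lower edge
        if "edge" in include and i == last:
            h = "]"            # last range always includes its upper edge
        ranges.append("%s%d TO %d%s" % (l, bounds[i], bounds[i + 1], h))
    first_lower = ranges[0][0] == "["
    last_upper = ranges[-1][-1] == "]"
    # before/after include their bound if "outer" is set OR the
    # adjacent range doesn't already claim it:
    before = "[* TO %d%s" % (start, "]" if "outer" in include or not first_lower else "}")
    after = "%s%d TO *]" % ("[" if "outer" in include or not last_upper else "{", end)
    # between is inclusive of an endpoint iff the first/last range is:
    between = "%s%d TO %d%s" % ("[" if first_lower else "{", start, end,
                                "]" if last_upper else "}")
    return ranges, [before, between, after]
```

Running it with start=1, end=3, gap=1 reproduces the bracket patterns listed above for each combination of options.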
what do you think?
-Hoss
RE: Date Facet duplicate counts
Posted by Peter S <pe...@hotmail.com>.
> should we be inclusive of the lower or the upper? ... even if we make it
> an option, how should it apply to the "first" and "last" ranges computed?
> do the answers change if facet.date.other includes "before" and/or "after"
> should the "between" option be inclusive of both end points as well?
>
I guess to be consistent, the 'inclusiveness' should be intrinsically handled in
detection of '[', '{' and/or ']', '}' -- i.e. match the facet.date field with the corresponding field
in the query - e.g.: q= *:* AND timestamp:[then TO now}&date.facet=timestamp .
If no such token exists in the query, perhaps the date.facet token parsing could process
an option ala: date.facet=[timestamp} to explicitly set the edge behaviour, or to override a match
in the query parser tokenization.
This way, there's no new explicit option; it would work with existing queries (no extra []{}'s = default behaviour);
and people could easily add it if they need custom edge behaviour.
> In practice: people either don't notice, don't care, or find it easy
> enough to add/subtract 1 millisecond to their times to get the behavior
> they want.
>
The main problem here is that this time change would need to be done at index time - and is essentially 'tampering' with
the data (e.g. if the timestamp is extracted from a security field that needs to be stored unmodified, or is needed/used
for another purpose).
Another way to deal with it is to add MILLISECOND logic to the DateMathParser. Then the '1ms' adjustment
at one and/or the other could be done by the caller at query time, leaving the stored data intact, and leaving
the server-side date faceting as it is. In fact, this can be done today using SECOND, but can be a problem if:
- You're using HOURS or DAYS and don't want to convert to SECONDS each time, or
- You need granularity to the SECOND
I've not looked at the DateMathParser code in great detail, but maybe adding MILLIS logic could be the most
straightforward option, as then the 'nuances' of query parser changes (current and future), 'before', 'after' et al.
are handled as they are now. Anyone wishing to make use of such new behaviour, well, they'd have to use MILLIS.
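As a rough illustration of the idea (a hypothetical sketch, not Solr's actual DateMathParser, and ignoring its rounding syntax like NOW/DAY), extending the unit table with MILLI/MILLIS would let callers apply a 1ms fudge purely at query time:

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table: the existing units plus MILLI/MILLIS.
UNITS = {
    "DAYS": timedelta(days=1),      "DAY": timedelta(days=1),
    "HOURS": timedelta(hours=1),    "HOUR": timedelta(hours=1),
    "SECONDS": timedelta(seconds=1), "SECOND": timedelta(seconds=1),
    "MILLIS": timedelta(milliseconds=1), "MILLI": timedelta(milliseconds=1),
}

def apply_date_math(base, expr):
    """Apply add/subtract expressions like '-5DAYS-1MILLI' to a base time."""
    for sign, count, unit in re.findall(r"([+-])(\d+)([A-Z]+)", expr):
        delta = int(count) * UNITS[unit]
        base = base + delta if sign == "+" else base - delta
    return base
```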
Peter
Re: Date Facet duplicate counts
Posted by Chris Hostetter <ho...@fucit.org>.
: For reference, a related thread topic can be found here:
: http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#2025d6670004838b
...
: Was this done for a reason - are there any implications somewhere else
A major reason was mentioned in my second reply to the thread you
mentioned...
http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#f9fa1b56803c68c4
...we wanted to make sure the counts accurately represented what you would
get if you then filtered on that date range -- and since the query parser
only supported ranges that were inclusive on both ends we wound up with
this. Some improvements to the QueryParser to support mixed use of [] and
{} (ie: "date:[A TO B}") would help, but that leads to another small
complexity...
should we be inclusive of the lower or the upper? ... even if we make it
an option, how should it apply to the "first" and "last" ranges computed?
do the answers change if facet.date.other includes "before" and/or "after"
should the "between" option be inclusive of both end points as well?
...lots of little nuances and subtleties that ultimately led to the decision
that for the time being it was simple, easy, and straightforward to just
always be inclusive, and add support for more complexities later.
In practice: people either don't notice, don't care, or find it easy
enough to add/subtract 1 millisecond to their times to get the behavior
they want.
: If interested parties are in agreement, I can create an issue for it and the associated fix.
If you can suggest some semantics for a new option to control the
inclusion/exclusion of the endpoints on all of the various edge cases,
that is straightforward and easy to understand, that would certainly be a
nice addition. we can worry about the query parser aspect of filtering on
those ranges later and people who want to use the new option at the
expense of being able to have consistent counts when filtering can turn
it on.
-Hoss