Posted to solr-dev@lucene.apache.org by Peter S <pe...@hotmail.com> on 2010/01/14 22:07:51 UTC
Date Facet duplicate counts
I saw some previous threads related to this subject, but on a slightly different use case, so starting a new thread...
For reference, a related thread topic can be found here:
http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#2025d6670004838b
This has to do with date facets setting double counts across adjacent date facets, if the documents' time is 'on the cusp'.
In fact, I found this problem because I was testing date facets where the gap is +1SECOND. In this case many/most/all document counts can be duplicated, because as a general rule in my case, milliseconds are set to 0, and there is 'No logic for milliseconds' in the DateMathParser. This behaviour can sometimes be observed in general date faceting -- in the +1SECOND scenario, it is much more likely to occur (because these values are more likely to be quantized).
I had a look at the date math relating to this (in SimpleFacets.java : getFacetDateCounts()), and I noticed the following line of code (~line 622):
resInner.add(label, rangeCount(sf,low,high,true,true));
The two 'true' booleans mean: 'include at start of range' *AND* 'include at end of range'. Any documents that live on the border will match in date.facet[n] and date.facet[n+1], because of the 'double-sided' inclusive range search.
By convention, a time value of '0' (00:00) belongs to the next period, rather than the previous, so I changed the *first* boolean to false, and voila! no more duplications! I believe this will be the case for other gap values, not just +1SECOND.
As there's no need to read any '[' or '{' because date faceting doesn't have/need these, the patch couldn't be simpler.
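To make the cusp problem concrete, here is a minimal standalone sketch (not Solr's actual rangeCount code) showing that a timestamp sitting exactly on a facet boundary matches two adjacent ranges when both ends are inclusive, and only one range once the start is made exclusive, as in the change described above:

```python
from datetime import datetime, timedelta

# Toy illustration: does a timestamp fall inside [low, high] given
# per-end inclusiveness flags, mirroring the two booleans in rangeCount?
def in_range(ts, low, high, incl_low, incl_high):
    above = ts >= low if incl_low else ts > low
    below = ts <= high if incl_high else ts < high
    return above and below

t0 = datetime(2010, 1, 14, 12, 0, 0)   # on the cusp: milliseconds == 0
gap = timedelta(seconds=1)

# +1SECOND facets, inclusive on both ends -> the doc matches both
# adjacent ranges and is counted twice:
both = [in_range(t0, t0 - gap, t0, True, True),
        in_range(t0, t0, t0 + gap, True, True)]   # [True, True]

# Exclusive at the start of each range -> counted exactly once:
half = [in_range(t0, t0 - gap, t0, False, True),
        in_range(t0, t0, t0 + gap, False, True)]  # [True, False]
```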
My question to the experts of this code is:
Was this done for a reason - are there any implications somewhere else for having a Lucene-double-sided-inclusive search?
I can't think of any reason, but perhaps someone knows differently?
If interested parties are in agreement, I can create an issue for it and the associated fix.
Many thanks,
Peter
RE: Date Facet duplicate counts
Posted by Chris Hostetter <ho...@fucit.org>.
: > should we be inclusive of the lower or the upper? ... even if we make it
: > an option, how should it apply to the "first" and "last" ranges computed?
: > do the answers change if facet.date.other includes "before" and/or "after"
: > should the "between" option be inclusive of both end points as well?
: I guess to be consistent, the 'inclusiveness' should be intrinsically handled in
: detection of '[', '{' and/or ']', '}' -- i.e. match the facet.date field with the corresponding field
: in the query - e.g.: q= *:* AND timestamp:[then TO now}&date.facet=timestamp .
...that only works if the datefield being faceted on is included in the query --
which is frequently not the case, particularly on the "first" request of a
session, where you want to facet on date, but the user has not yet made any
attempt to restrict by any of those facets.
: If no such token exists in the query, perhaps the date.facet token parsing could process
: an option ala: date.facet=[timestamp} to explicitly set the edge behaviour, or to override a match
: in the query parser tokenization.
:
: This way, there's no new explicit option; it would work with existing queries (no extra []{}'s = default behaviour);
: and people could easily add it if they need custom edge behaviour.
I suppose ... but it still doesn't address some of the outstanding
questions i pointed out before (handling the first/last range in the block
... ie: "i want inclusive of the lower, exclusive of the upper, except for
the last range which should be inclusive of both). Personally i think
adding a new option is just as clear as adding markup to the "date.facet"
param parsing ... the less we make assumptions about what "special
characters" people have in their fieldnames the better.
: Another way to deal with it is to add MILLISECOND logic to the DateMathParser. Then the '1ms' adjustment
: at one and/or the other could be done by the caller at query time, leaving the stored data intact, and leaving
: the server-side date faceting as it is. In fact, this can be done today using SECOND, but can be a problem if:
: - You're using HOURS or DAYS and don't want to convert to SECONDS each time, or
: - You need granularity to the SECOND
I don't follow you at all ... yes this can be done today, but i don't
understand what you mean about needing to convert to seconds, or requiring
second granularity.
If you don't index with millisecond precision, then no matter what
precision you index with, this example would let you get ranges including
the "lower" bound, but not the upper bound of each range using a 1ms
"fudge" ...
facet.date=timestamp
facet.date.start=NOW/DAY-5DAYS-1MILLI
facet.date.end=NOW/DAY+1DAY-1MILLI
facet.date.gap=+1DAY
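A small standalone sketch (not Solr code) shows why the 1ms fudge works: each bucket is still inclusive on both ends, but shifting start and end back by a millisecond means documents indexed with second (or coarser) precision can never land exactly on a bucket boundary:

```python
from datetime import datetime, timedelta

# Count docs per bucket with both ends inclusive, mimicking the
# double-sided inclusive ranges used by date faceting.
def bucket_counts(docs, start, end, gap):
    counts = []
    low = start
    while low < end:
        high = low + gap
        counts.append(sum(1 for d in docs if low <= d <= high))
        low = high
    return counts

midnight = datetime(2010, 1, 14)
docs = [midnight, midnight + timedelta(hours=12)]  # one doc exactly on a day boundary

# Without the fudge, the midnight doc is counted in two adjacent buckets:
naive = bucket_counts(docs, midnight - timedelta(days=1),
                      midnight + timedelta(days=1), timedelta(days=1))

# With start/end pulled back 1ms, each doc lands in exactly one bucket:
ms = timedelta(milliseconds=1)
fudged = bucket_counts(docs, midnight - timedelta(days=1) - ms,
                       midnight + timedelta(days=1) - ms, timedelta(days=1))
```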
Brainstorming a bit...
I think the semantics that might make the most sense is to add a
multivalued "facet.date.include" param that supports the following
options: all, lower, upper, edge, outer
- "all" is shorthand for lower,upper,edge,outer and is the default
(for back compat)
- if "lower" is specified, then all ranges include their lower bound
- if "upper" is specified, then all ranges include their upper bound
- if "edge" is specified, then the first and last ranges include
their edge bounds (ie: lower for the first one, upper for the last
one) even if the corresponding "upper"/"lower" option is not
specified.
- the "between" count is inclusive of each of the start and end
bounds iff the first and last range are inclusive of them
- the "before" and "after" ranges are inclusive of their respective
bounds if:
- "outer" is specified ... OR ...
- the first and last ranges don't already include them
so assuming you started with something like...
facet.date.start=1 facet.date.end=3 facet.date.gap=+1 facet.date.other=all
...your ranges would be...
[1 TO 2], [2 TO 3] and [* TO 1], [1 TO 3], [3 TO *]
w/ facet.date.include=lower ...
[1 TO 2}, [2 TO 3} and [* TO 1}, [1 TO 3}, [3 TO *]
w/ facet.date.include=upper ...
{1 TO 2], {2 TO 3] and [* TO 1], {1 TO 3], {3 TO *]
w/ facet.date.include=lower&facet.date.include=edge ...
[1 TO 2}, [2 TO 3] and [* TO 1}, [1 TO 3], {3 TO *]
w/ facet.date.include=upper&facet.date.include=edge ...
[1 TO 2], {2 TO 3] and [* TO 1}, [1 TO 3], {3 TO *]
w/ facet.date.include=upper&facet.date.include=outer ...
{1 TO 2], {2 TO 3] and [* TO 1], {1 TO 3], [3 TO *]
...etc.
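The proposed semantics above can be sketched as a small standalone function (this is a hypothetical option, not an existing Solr parameter; integer bounds stand in for dates, and '[' / ']' mean inclusive while '{' / '}' mean exclusive, as in the examples):

```python
# Sketch of the proposed facet.date.include semantics.
# include is a set drawn from: all, lower, upper, edge, outer.
def facet_ranges(start, end, gap, include):
    if not include or "all" in include:
        include = {"lower", "upper", "edge", "outer"}
    lo = "[" if "lower" in include else "{"
    hi = "]" if "upper" in include else "}"
    bounds = list(range(start, end + 1, gap))
    ranges = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        l, h = lo, hi
        if "edge" in include and i == 0:
            l = "["            # first range always includes its lower edge
        if "edge" in include and i == last:
            h = "]"            # last range always includes its upper edge
        ranges.append("%s%d TO %d%s" % (l, bounds[i], bounds[i + 1], h))
    first_lower = ranges[0][0] == "["
    last_upper = ranges[-1][-1] == "]"
    # before/after include their bound if "outer" is set OR the
    # adjacent range doesn't already claim it:
    before = "[* TO %d%s" % (start, "]" if "outer" in include or not first_lower else "}")
    after = "%s%d TO *]" % ("[" if "outer" in include or not last_upper else "{", end)
    # between is inclusive of an endpoint iff the first/last range is:
    between = "%s%d TO %d%s" % ("[" if first_lower else "{", start, end,
                                "]" if last_upper else "}")
    return ranges, [before, between, after]
```

Running it with start=1, end=3, gap=1 reproduces the bracket patterns listed above for each combination of options.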
what do you think?
-Hoss
RE: Date Facet duplicate counts
Posted by Peter S <pe...@hotmail.com>.
> should we be inclusive of the lower or the upper? ... even if we make it
> an option, how should it apply to the "first" and "last" ranges computed?
> do the answers change if facet.date.other includes "before" and/or "after"
> should the "between" option be inclusive of both end points as well?
>
I guess to be consistent, the 'inclusiveness' should be intrinsically handled in
detection of '[', '{' and/or ']', '}' -- i.e. match the facet.date field with the corresponding field
in the query - e.g.: q= *:* AND timestamp:[then TO now}&date.facet=timestamp .
If no such token exists in the query, perhaps the date.facet token parsing could process
an option ala: date.facet=[timestamp} to explicitly set the edge behaviour, or to override a match
in the query parser tokenization.
This way, there's no new explicit option; it would work with existing queries (no extra []{}'s = default behaviour);
and people could easily add it if they need custom edge behaviour.
> In practice: people either don't notice, don't care, or find it easy
> enough to add/subtract 1 millisecond to their times to get the behavior
> they want.
>
The main problem here is that this time change would need to be done at index time - and is essentially 'tampering' with
the data (e.g. if the timestamp is extracted from a security field that needs to be stored unmodified, or is needed/used
for another purpose).
Another way to deal with it is to add MILLISECOND logic to the DateMathParser. Then the '1ms' adjustment
at one and/or the other could be done by the caller at query time, leaving the stored data intact, and leaving
the server-side date faceting as it is. In fact, this can be done today using SECOND, but can be a problem if:
- You're using HOURS or DAYS and don't want to convert to SECONDS each time, or
- You need granularity to the SECOND
I've not looked at the DateMathParser code in great detail, but maybe adding MILLIS logic could be the most
straightforward option, as then the 'nuances' of query parser changes (current and future), 'before', 'after' et al.
are handled as they are now. Anyone wishing to make use of such new behaviour, well, they'd have to use MILLIS.
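As a rough illustration of the idea (a hypothetical sketch, not Solr's actual DateMathParser, and ignoring its rounding syntax like NOW/DAY), extending the unit table with MILLI/MILLIS would let callers apply a 1ms fudge purely at query time:

```python
import re
from datetime import datetime, timedelta

# Hypothetical unit table: the existing units plus MILLI/MILLIS.
UNITS = {
    "DAYS": timedelta(days=1),      "DAY": timedelta(days=1),
    "HOURS": timedelta(hours=1),    "HOUR": timedelta(hours=1),
    "SECONDS": timedelta(seconds=1), "SECOND": timedelta(seconds=1),
    "MILLIS": timedelta(milliseconds=1), "MILLI": timedelta(milliseconds=1),
}

def apply_date_math(base, expr):
    """Apply add/subtract expressions like '-5DAYS-1MILLI' to a base time."""
    for sign, count, unit in re.findall(r"([+-])(\d+)([A-Z]+)", expr):
        delta = int(count) * UNITS[unit]
        base = base + delta if sign == "+" else base - delta
    return base
```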
Peter
Re: Date Facet duplicate counts
Posted by Chris Hostetter <ho...@fucit.org>.
: For reference, a related thread topic can be found here:
: http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#2025d6670004838b
...
: Was this done for a reason - are there any implications somewhere else
A major reason was mentioned in my second reply to the thread you
mentioned...
http://www.lucidimagination.com/search/document/2025d6670004838b/date_faceting_and_double_counting#f9fa1b56803c68c4
...we wanted to make sure the counts accurately represented what you would
get if you then filtered on that date range -- and since the query parser
only supported ranges that were inclusive on both ends we wound up with
this. Some improvements to the QueryParser to support mixed use of [] and
{} (ie: "date:[A TO B}") would help, but that leads to another small
complexity...
should we be inclusive of the lower or the upper? ... even if we make it
an option, how should it apply to the "first" and "last" ranges computed?
do the answers change if facet.date.other includes "before" and/or "after"
should the "between" option be inclusive of both end points as well?
...lots of little nuances and subtleties that ultimately led to the decision
that for the time being it was simple, easy, and straightforward to just
always be inclusive, and add support for more complexities later.
In practice: people either don't notice, don't care, or find it easy
enough to add/subtract 1 millisecond to their times to get the behavior
they want.
: If interested parties are in agreement, I can create an issue for it and the associated fix.
If you can suggest some semantics for a new option to control the
inclusion/exclusion of the endpoints on all of the various edge cases,
that is straightforward and easy to understand, that would certainly be a
nice addition. we can worry about the query parser aspect of filtering on
those ranges later and people who want to use the new option at the
expense of being able to have consistent counts when filtering can turn
it on.
-Hoss