Posted to dev@lucene.apache.org by Joshua Harness <jk...@gmail.com> on 2011/08/04 22:59:22 UTC

Question about LUCENE-3097 - Post Group Faceting

Hello -

     Please let me know if this question is more appropriate for the user
list. I had assumed the developer list was the better fit since the ticket
is still open.  I was reading the comments on
LUCENE-3097 <https://issues.apache.org/jira/browse/LUCENE-3097> and had
a couple of questions.

     A comment <https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13033953&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13033953>
started a small thread that mentioned that all documents in a given group
would need to be contiguous and in the same segment. A statement was also
made that ' The app would have to ensure this'. I was unclear about the
outcome of this conversation. It sounded like this may have turned out not
to be the case. What is the status of this? Does my application have to ensure
that all the documents in a group are in the same segment? If so, how would
one accomplish this?

     Another comment <https://issues.apache.org/jira/browse/LUCENE-3097?focusedCommentId=13038297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13038297>
mentioned that 'we pick only the head doc...as long as the head doc is
guaranteed to have the same value for field X, it safe to use that doc to
represent the entire group for facet counting'.  Does this mean that there
is a restriction placed on me that the head document must have field values
that match the rest of the documents in the same group? Or is it simply an
implementation detail: the head document is used when this condition holds,
and another strategy is chosen when it does not?

     I am very interested in adopting this patch. However - I am attempting
to understand any limitations/conditions so that I may use it correctly. Any
advice would be greatly appreciated.

Thanks!

Josh Harness

Re: Question about LUCENE-3097 - Post Group Faceting

Posted by Martijn v Groningen <ma...@gmail.com>.
The facet result for field productType will show the following counts:
BOOK: 1
DVD: 0

So yes, with post group faceting you'll miss the second facet value, because
only the group head (order number 1) is counted.
This is basically the same example I described in LUCENE-3097.

I've also described three ways of calculating facet counts in combination
with grouping.
The third way, which I've named matrix counts (counting each field value &
group value combination), would give the result that you expect.
However, this isn't implemented yet. In Solr it would require changes in
the FacetComponent.
I hope this explains it a bit!
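
To make the difference concrete, here is a minimal plain-Java sketch that counts the three example orders in three ways. It is only an illustration: FacetCountSketch, Order and the method names are made up and are not Lucene or Solr classes. It assumes that "pre group" means every matching document is counted, that "post group" means only the most relevant document of each group is counted, and that list order stands in for relevance.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration only; none of these types are Lucene or Solr classes.
class FacetCountSketch {

    static final class Order {
        final int orderNumber;
        final int customerNumber;
        final int totalInCents;
        final String productType;

        Order(int orderNumber, int customerNumber, int totalInCents, String productType) {
            this.orderNumber = orderNumber;
            this.customerNumber = customerNumber;
            this.totalInCents = totalInCents;
            this.productType = productType;
        }
    }

    // The three orders from the example. Assumption: list order is relevance order.
    static List<Order> sampleOrders() {
        List<Order> orders = new ArrayList<>();
        orders.add(new Order(1, 1, 1500, "BOOK"));
        orders.add(new Order(2, 1, 500, "BOOK"));
        orders.add(new Order(3, 1, 1000, "DVD"));
        return orders;
    }

    // Stand-in for the query: keep orders with totalInCents >= minCents.
    static List<Order> matching(List<Order> orders, int minCents) {
        List<Order> hits = new ArrayList<>();
        for (Order o : orders) {
            if (o.totalInCents >= minCents) {
                hits.add(o);
            }
        }
        return hits;
    }

    // Pre group counts: every matching document contributes to its facet value.
    static Map<String, Integer> preGroupCounts(List<Order> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Order o : hits) {
            counts.merge(o.productType, 1, Integer::sum);
        }
        return counts;
    }

    // Post group counts: only the first (most relevant) hit of each group contributes.
    // A facet value that is missing from the map has a count of 0.
    static Map<String, Integer> postGroupCounts(List<Order> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<Integer> groupsSeen = new HashSet<>();
        for (Order o : hits) {
            if (groupsSeen.add(o.customerNumber)) {
                counts.merge(o.productType, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Matrix counts: each distinct (group value, field value) pair contributes once.
    static Map<String, Integer> matrixCounts(List<Order> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        Set<String> pairsSeen = new HashSet<>();
        for (Order o : hits) {
            if (pairsSeen.add(o.customerNumber + "|" + o.productType)) {
                counts.merge(o.productType, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Order> hits = matching(sampleOrders(), 1000);
        System.out.println("pre group:  " + preGroupCounts(hits));
        System.out.println("post group: " + postGroupCounts(hits));
        System.out.println("matrix:     " + matrixCounts(hits));
    }
}
```

With the threshold of 1000 cents, the post group map contains only BOOK, while the matrix map contains both BOOK and DVD with a count of 1 each. Lowering the threshold to 0 shows where pre group and matrix counts diverge: pre group counts BOOK twice, matrix counts it once for customer 1.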

Martijn

-- 
Kind regards,

Martijn van Groningen

Re: Question about LUCENE-3097 - Post Group Faceting

Posted by Joshua Harness <jk...@gmail.com>.
Martijn -

     Thanks for the reply. I understand your answer about the segments.
However, I'm still cloudy about faceting with respect to the group head.
Perhaps an example will clarify my confusion.  Suppose I have 3 order
documents with the following data:

orderNumber: 1
customerNumber: 1
totalInCents: 1500
productType: 'BOOK'

orderNumber: 2
customerNumber: 1
totalInCents: 500
productType: 'BOOK'

orderNumber: 3
customerNumber: 1
totalInCents: 1000
productType: 'DVD'

     Imagine I perform a search for orders with a total greater than or equal
to 1000 cents, grouped by customer number. I would expect to get order numbers
1 and 3 back, grouped under customer number 1.  Let's assume that order number
1 is considered the most relevant document (the group head in your scenario).
Will post group faceting miss that I actually have two facet values for
productType, BOOK and DVD?

Thanks!

Josh

Re: Question about LUCENE-3097 - Post Group Faceting

Posted by Martijn v Groningen <ma...@gmail.com>.
Hi Josh,

For post grouping the documents don't need to reside in the same segment.
Lucene's grouping module has a collector (TermAllGroupHeadsCollector) that
can collect the most relevant document for each group (the group head). This
collector can produce an int[] or a FixedBitSet that can be used during
faceting to produce post group facets (the patch in SOLR-2665 uses this).
During faceting only the group heads are known. Because of this, field values
that differ in documents less relevant than the group head aren't taken into
account. This is the same as in the example in the description of
LUCENE-3097.
Hope this helps!
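
As a rough sketch of the idea (not the actual collector), the plain-Java code below picks one head document per group from a relevance-ordered hit list, exposes the heads as an int[] and as a bit set, and then counts facet values over the heads only. GroupHeadSketch and its methods are hypothetical names, java.util.BitSet stands in for Lucene's FixedBitSet, and the arrays indexed by doc id stand in for indexed field values. Documents of the same group are deliberately not adjacent, since grouping here depends only on the group value.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not the Lucene grouping API.
class GroupHeadSketch {

    // rankedDocs: matching doc ids, most relevant first.
    // groupOf[doc]: group value of each document.
    // Returns the doc id of the most relevant document of every group.
    static int[] collectGroupHeads(int[] rankedDocs, String[] groupOf) {
        Map<String, Integer> headByGroup = new LinkedHashMap<>();
        for (int doc : rankedDocs) {
            // The first document seen for a group is its head.
            headByGroup.putIfAbsent(groupOf[doc], doc);
        }
        int[] heads = new int[headByGroup.size()];
        int i = 0;
        for (int doc : headByGroup.values()) {
            heads[i++] = doc;
        }
        return heads;
    }

    // Same information as a bit set over doc ids (stand-in for FixedBitSet).
    static BitSet toBitSet(int[] heads, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : heads) {
            bits.set(doc);
        }
        return bits;
    }

    // Post group faceting: count the facet value of group heads only.
    static Map<String, Integer> facet(BitSet heads, String[] facetValueOf) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = heads.nextSetBit(0); doc >= 0; doc = heads.nextSetBit(doc + 1)) {
            counts.merge(facetValueOf[doc], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Five documents in two groups; members of a group are not contiguous.
        String[] groupOf = {"c1", "c2", "c1", "c2", "c1"};
        String[] productType = {"BOOK", "DVD", "DVD", "BOOK", "BOOK"};
        int[] rankedDocs = {2, 3, 0, 1, 4};

        int[] heads = collectGroupHeads(rankedDocs, groupOf);
        BitSet headBits = toBitSet(heads, groupOf.length);
        System.out.println("group heads: " + headBits);
        System.out.println("post group facets: " + facet(headBits, productType));
    }
}
```

In this sketch the heads are documents 2 and 3, so the BOOK values of documents 0 and 4 never reach the facet counts, which is the limitation described above.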

Martijn

-- 
Kind regards,

Martijn van Groningen