You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by dontspamterry <do...@yahoo.com> on 2007/05/16 01:13:02 UTC

Multi-field distinct query

Hi all,

I know this whole distinct query has been discussed a bunch of times for
various scenarios because I've been scouring the forums trying to find a
clue as to how I could solve my problem. I'm indexing a large set of
parent-child term relations (~1 million). The number of unique terms is
about ~570,000. Each relation is a document. Each term in a relation
contains all of the term's attributes. Effectively, a term's attributes will
be duplicated "x" number of times for the "x" number of relations it
participates in. For example, say I have the following term tree:

A
|--B
|--E
|--H
|--F
|--C
|--G
|--D

I would then have documents for:
A->B, A->C, A->D, B->E, (and so forth...)

For all relations involving A, A's attributes will be duplicated in 3
separate documents.
For all relations involving B, B's attributes will be duplicated in 3
separate documents.
(you get the picture...)

This index structure works great for queries which traverse up and down the
tree. However, I have a requirement where I would also like to do a distinct
query which returns the data for each unique term satisfying the query. For
example, say I have a query which returns all relations where A or B is the
parent (that would be 5 documents in total),
but do a distinct on the parent such that I get 2 documents back, one for A
as the parent (any 1 of the 3 matching docs) and the other where B is the
parent (any 1 of the 2 matching docs). For this query, I don't care about
the child information since I'm only interested in retrieving the distinct
parent terms. This query is analogous to a 'select distinct <set of parent
term attributes>' . I played around with caching BitSets for the fields
which I'd like to do a distinct on, but given the amount of data, I run out
of memory. I also took the approach where I retrieve the bitset using a
queryfilter and then process each document id, hashing the field values on
which I'm doing a distinct to construct my distinct set. Problem with this
is that I have tree structures where a parent has over 100K children.
Retrieving each doc for this size is too time- and memory- consuming. Since
I don't really want to return that much data, I thought that I could use
paging. The problem I faced is that I do not know if a distinct value in the
current query was actually returned in some previous query for a previous
page.

Sorry for the long description, but wanted to make sure I explained it as
clearly as I could.

-Terry
--
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10633050
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Multi-field distinct query

Posted by Paul Elschot <pa...@xs4all.nl>.

Terry,

On Wednesday 16 May 2007 01:13, dontspamterry wrote:
> 
> ... I played around with caching BitSets for the fields
> which I'd like to do a distinct on, but given the amount of data, I run out
> of memory.

I don't know whether your final solution will require filtering,
but if it does, decoupling filters from bitsets might help:
http://issues.apache.org/jira/browse/LUCENE-584

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Multi-field distinct query

Posted by dontspamterry <do...@yahoo.com>.

Thanks for the tip. I've re-posted on Lucene-Java Users.

-Terry


Grant Ingersoll-6 wrote:
> 
> I suggest you ask on the user mailing list (java- 
> user@lucene.apache.org) as you are likely to get a lot more interest  
> from others.  java-dev is for discussing the internals of how Lucene  
> works.
> 
> 
> 
> Thanks,
> Grant
> 
> On May 15, 2007, at 7:13 PM, dontspamterry wrote:
> 
>>
>> Hi all,
>>
>> I know this whole distinct query has been discussed a bunch of  
>> times for
>> various scenarios because I've been scouring the forums trying to  
>> find a
>> clue as to how I could solve my problem. I'm indexing a large set of
>> parent-child term relations (~1 million). The number of unique  
>> terms is
>> about ~570,000. Each relation is a document. Each term in a relation
>> contains all of the term's attributes. Effectively, a term's  
>> attributes will
>> be duplicated "x" number of times for the "x" number of relations it
>> participates in. For example, say I have the following term tree:
>>
>> A
>> |--B
>>     |--E
>>         |--H
>>     |--F
>> |--C
>>     |--G
>> |--D
>>
>> I would then have documents for:
>> A->B, A->C, A->D, B->E, (and so forth...)
>>
>> For all relations involving A, A's attributes will be duplicated in 3
>> separate documents.
>> For all relations involving B, B's attributes will be duplicated in 3
>> separate documents.
>> (you get the picture...)
>>
>> This index structure works great for queries which traverse up and  
>> down the
>> tree. However, I have a requirement where I would also like to do a  
>> distinct
>> query which returns the data for each unique term satisfying the  
>> query. For
>> example, say I have a query which returns all relations where A or  
>> B is the
>> parent (that would be 5 documents in total),
>> but do a distinct on the parent such that I get 2 documents back,  
>> one for A
>> as the parent (any 1 of the 3 matching docs)  and the other where B  
>> is the
>> parent (any 1 of the 2 matching docs). For this query, I don't care  
>> about
>> the child information since I'm only interested in retrieving the  
>> distinct
>> parent terms. This query is analogous to a 'select distinct <set of  
>> parent
>> term attributes>' . I played around with caching BitSets for the  
>> fields
>> which I'd like to do a distinct on, but given the amount of data, I  
>> run out
>> of memory. I also took the approach where I retrieve the bitset  
>> using a
>> queryfilter and then process each document id, hashing the field  
>> values on
>> which I'm doing a distinct to construct my distinct set. Problem  
>> with this
>> is that I have tree structures where a parent has over 100K children.
>> Retrieving each doc for this size is too time- and memory-  
>> consuming. Since
>> I don't really want to return that much data, I thought that I  
>> could use
>> paging. The problem I faced is that I do not know if a distinct  
>> value in the
>> current query was actually returned in some previous query for a  
>> previous
>> page.
>>
>> Sorry for the long description, but wanted to make sure I explained  
>> it as
>> clearly as I could.
>>
>> -Terry
>> -- 
>> View this message in context: http://www.nabble.com/Multi-field- 
>> distinct-query-tf3761682.html#a10633050
>> Sent from the Lucene - Java Developer mailing list archive at  
>> Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
> 
> --------------------------
> Grant Ingersoll
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
> 
> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
> LuceneFAQ
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10645430
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Multi-field distinct query

Posted by Grant Ingersoll <gs...@apache.org>.

I suggest you ask on the user mailing list (java- 
user@lucene.apache.org) as you are likely to get a lot more interest  
from others.  java-dev is for discussing the internals of how Lucene  
works.



Thanks,
Grant

On May 15, 2007, at 7:13 PM, dontspamterry wrote:

>
> Hi all,
>
> I know this whole distinct query has been discussed a bunch of  
> times for
> various scenarios because I've been scouring the forums trying to  
> find a
> clue as to how I could solve my problem. I'm indexing a large set of
> parent-child term relations (~1 million). The number of unique  
> terms is
> about ~570,000. Each relation is a document. Each term in a relation
> contains all of the term's attributes. Effectively, a term's  
> attributes will
> be duplicated "x" number of times for the "x" number of relations it
> participates in. For example, say I have the following term tree:
>
> A
> |--B
>     |--E
>         |--H
>     |--F
> |--C
>     |--G
> |--D
>
> I would then have documents for:
> A->B, A->C, A->D, B->E, (and so forth...)
>
> For all relations involving A, A's attributes will be duplicated in 3
> separate documents.
> For all relations involving B, B's attributes will be duplicated in 3
> separate documents.
> (you get the picture...)
>
> This index structure works great for queries which traverse up and  
> down the
> tree. However, I have a requirement where I would also like to do a  
> distinct
> query which returns the data for each unique term satisfying the  
> query. For
> example, say I have a query which returns all relations where A or  
> B is the
> parent (that would be 5 documents in total),
> but do a distinct on the parent such that I get 2 documents back,  
> one for A
> as the parent (any 1 of the 3 matching docs)  and the other where B  
> is the
> parent (any 1 of the 2 matching docs). For this query, I don't care  
> about
> the child information since I'm only interested in retrieving the  
> distinct
> parent terms. This query is analogous to a 'select distinct <set of  
> parent
> term attributes>' . I played around with caching BitSets for the  
> fields
> which I'd like to do a distinct on, but given the amount of data, I  
> run out
> of memory. I also took the approach where I retrieve the bitset  
> using a
> queryfilter and then process each document id, hashing the field  
> values on
> which I'm doing a distinct to construct my distinct set. Problem  
> with this
> is that I have tree structures where a parent has over 100K children.
> Retrieving each doc for this size is too time- and memory-  
> consuming. Since
> I don't really want to return that much data, I thought that I  
> could use
> paging. The problem I faced is that I do not know if a distinct  
> value in the
> current query was actually returned in some previous query for a  
> previous
> page.
>
> Sorry for the long description, but wanted to make sure I explained  
> it as
> clearly as I could.
>
> -Terry
> -- 
> View this message in context: http://www.nabble.com/Multi-field- 
> distinct-query-tf3761682.html#a10633050
> Sent from the Lucene - Java Developer mailing list archive at  
> Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Multi-field distinct query

Posted by Steven Rowe <sa...@syr.edu>.

[Moving the discussion from java-dav to java-user]

Hi Terry,

The index I suggested would be in addition to the one you now have, so
the requirements you mentioned would not be affected (unless of course
you have a single-index requirement :) ).

Given the narrow use for this additional index, you might be able to
address your memory issues either by not storing the "children" field
terms, or by using Lazy Field Loading - see the api docs for
IndexReader.document(int, FieldSelector):

<http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/index/IndexReader.html#document(int,%20org.apache.lucene.document.FieldSelector)>

Grant Ingersoll's "Advanced Lucene" talk at ApacheCon Europe earlier
this month mentioned Lazy Field Loading - see the second to last slide
in the slide presentation, available on the Lucene wiki Resources page,
under Presentations:

<http://wiki.apache.org/lucene-java/Resources#head-538bbb76157074df20f3a9a30195c74d68772f4b>

Steve

dontspamterry wrote:
> Hi Steve,
> 
> We originally had documents which were term-centric, i.e. what you described
> - document for the parent and all of its children. We changed it to model a
> single, parent-child relation as one document due to requirements and the
> fact that we were having memory issues for cases where a parent had an
> extremely large number of children (~200,000).
> 
> -Terry
> 
> 
> Steven Rowe wrote:
>> Hi Terry,
>>
>> Why not have another index in which a document has one field for the
>> parent and another field containing all of its children.  An OR query
>> over the "children" field would return you exactly what you want - one
>> document for each distinct parent.
>>
>> Steve
>>
>> dontspamterry wrote:
>>> Hi all,
>>>
>>> I know this whole distinct query has been discussed a bunch of times for
>>> various scenarios because I've been scouring the forums trying to find a
>>> clue as to how I could solve my problem. I'm indexing a large set of
>>> parent-child term relations (~1 million). The number of unique terms is
>>> about ~570,000. Each relation is a document. Each term in a relation
>>> contains all of the term's attributes. Effectively, a term's attributes
>>> will
>>> be duplicated "x" number of times for the "x" number of relations it
>>> participates in. For example, say I have the following term tree:
>>>
>>> A
>>> |--B
>>>     |--E
>>>         |--H
>>>     |--F
>>> |--C
>>>     |--G
>>> |--D
>>>
>>> I would then have documents for:
>>> A->B, A->C, A->D, B->E, (and so forth...)
>>>
>>> For all relations involving A, A's attributes will be duplicated in 3
>>> separate documents.
>>> For all relations involving B, B's attributes will be duplicated in 3
>>> separate documents.
>>> (you get the picture...)
>>>
>>> This index structure works great for queries which traverse up and down
>>> the
>>> tree. However, I have a requirement where I would also like to do a
>>> distinct
>>> query which returns the data for each unique term satisfying the query.
>>> For
>>> example, say I have a query which returns all relations where A or B is
>>> the
>>> parent (that would be 5 documents in total),
>>> but do a distinct on the parent such that I get 2 documents back, one for
>>> A
>>> as the parent (any 1 of the 3 matching docs)  and the other where B is
>>> the
>>> parent (any 1 of the 2 matching docs). For this query, I don't care about
>>> the child information since I'm only interested in retrieving the
>>> distinct
>>> parent terms. This query is analogous to a 'select distinct <set of
>>> parent
>>> term attributes>' . I played around with caching BitSets for the fields
>>> which I'd like to do a distinct on, but given the amount of data, I run
>>> out
>>> of memory. I also took the approach where I retrieve the bitset using a
>>> queryfilter and then process each document id, hashing the field values
>>> on
>>> which I'm doing a distinct to construct my distinct set. Problem with
>>> this
>>> is that I have tree structures where a parent has over 100K children.
>>> Retrieving each doc for this size is too time- and memory- consuming.
>>> Since
>>> I don't really want to return that much data, I thought that I could use
>>> paging. The problem I faced is that I do not know if a distinct value in
>>> the
>>> current query was actually returned in some previous query for a previous
>>> page.
>>>
>>> Sorry for the long description, but wanted to make sure I explained it as
>>> clearly as I could.
>>>
>>> -Terry


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multi-field distinct query

Posted by dontspamterry <do...@yahoo.com>.

Hi Steve,

We originally had documents which were term-centric, i.e. what you described
- document for the parent and all of its children. We changed it to model a
single, parent-child relation as one document due to requirements and the
fact that we were having memory issues for cases where a parent had an
extremely large number of children (~200,000).

-Terry


Steven Rowe wrote:
> 
> Hi Terry,
> 
> Why not have another index in which a document has one field for the
> parent and another field containing all of its children.  An OR query
> over the "children" field would return you exactly what you want - one
> document for each distinct parent.
> 
> Steve
> 
> dontspamterry wrote:
>> Hi all,
>> 
>> I know this whole distinct query has been discussed a bunch of times for
>> various scenarios because I've been scouring the forums trying to find a
>> clue as to how I could solve my problem. I'm indexing a large set of
>> parent-child term relations (~1 million). The number of unique terms is
>> about ~570,000. Each relation is a document. Each term in a relation
>> contains all of the term's attributes. Effectively, a term's attributes
>> will
>> be duplicated "x" number of times for the "x" number of relations it
>> participates in. For example, say I have the following term tree:
>> 
>> A
>> |--B
>>     |--E
>>         |--H
>>     |--F
>> |--C
>>     |--G
>> |--D
>> 
>> I would then have documents for:
>> A->B, A->C, A->D, B->E, (and so forth...)
>> 
>> For all relations involving A, A's attributes will be duplicated in 3
>> separate documents.
>> For all relations involving B, B's attributes will be duplicated in 3
>> separate documents.
>> (you get the picture...)
>> 
>> This index structure works great for queries which traverse up and down
>> the
>> tree. However, I have a requirement where I would also like to do a
>> distinct
>> query which returns the data for each unique term satisfying the query.
>> For
>> example, say I have a query which returns all relations where A or B is
>> the
>> parent (that would be 5 documents in total),
>> but do a distinct on the parent such that I get 2 documents back, one for
>> A
>> as the parent (any 1 of the 3 matching docs)  and the other where B is
>> the
>> parent (any 1 of the 2 matching docs). For this query, I don't care about
>> the child information since I'm only interested in retrieving the
>> distinct
>> parent terms. This query is analogous to a 'select distinct <set of
>> parent
>> term attributes>' . I played around with caching BitSets for the fields
>> which I'd like to do a distinct on, but given the amount of data, I run
>> out
>> of memory. I also took the approach where I retrieve the bitset using a
>> queryfilter and then process each document id, hashing the field values
>> on
>> which I'm doing a distinct to construct my distinct set. Problem with
>> this
>> is that I have tree structures where a parent has over 100K children.
>> Retrieving each doc for this size is too time- and memory- consuming.
>> Since
>> I don't really want to return that much data, I thought that I could use
>> paging. The problem I faced is that I do not know if a distinct value in
>> the
>> current query was actually returned in some previous query for a previous
>> page.
>> 
>> Sorry for the long description, but wanted to make sure I explained it as
>> clearly as I could.
>> 
>> -Terry
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Multi-field-distinct-query-tf3761682.html#a10645416
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Multi-field distinct query

Posted by Steven Rowe <sa...@syr.edu>.

Hi Terry,

Why not have another index in which a document has one field for the
parent and another field containing all of its children.  An OR query
over the "children" field would return you exactly what you want - one
document for each distinct parent.

Steve

dontspamterry wrote:
> Hi all,
> 
> I know this whole distinct query has been discussed a bunch of times for
> various scenarios because I've been scouring the forums trying to find a
> clue as to how I could solve my problem. I'm indexing a large set of
> parent-child term relations (~1 million). The number of unique terms is
> about ~570,000. Each relation is a document. Each term in a relation
> contains all of the term's attributes. Effectively, a term's attributes will
> be duplicated "x" number of times for the "x" number of relations it
> participates in. For example, say I have the following term tree:
> 
> A
> |--B
>     |--E
>         |--H
>     |--F
> |--C
>     |--G
> |--D
> 
> I would then have documents for:
> A->B, A->C, A->D, B->E, (and so forth...)
> 
> For all relations involving A, A's attributes will be duplicated in 3
> separate documents.
> For all relations involving B, B's attributes will be duplicated in 3
> separate documents.
> (you get the picture...)
> 
> This index structure works great for queries which traverse up and down the
> tree. However, I have a requirement where I would also like to do a distinct
> query which returns the data for each unique term satisfying the query. For
> example, say I have a query which returns all relations where A or B is the
> parent (that would be 5 documents in total),
> but do a distinct on the parent such that I get 2 documents back, one for A
> as the parent (any 1 of the 3 matching docs)  and the other where B is the
> parent (any 1 of the 2 matching docs). For this query, I don't care about
> the child information since I'm only interested in retrieving the distinct
> parent terms. This query is analogous to a 'select distinct <set of parent
> term attributes>' . I played around with caching BitSets for the fields
> which I'd like to do a distinct on, but given the amount of data, I run out
> of memory. I also took the approach where I retrieve the bitset using a
> queryfilter and then process each document id, hashing the field values on
> which I'm doing a distinct to construct my distinct set. Problem with this
> is that I have tree structures where a parent has over 100K children.
> Retrieving each doc for this size is too time- and memory- consuming. Since
> I don't really want to return that much data, I thought that I could use
> paging. The problem I faced is that I do not know if a distinct value in the
> current query was actually returned in some previous query for a previous
> page.
> 
> Sorry for the long description, but wanted to make sure I explained it as
> clearly as I could.
> 
> -Terry


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org