Posted to java-user@lucene.apache.org by Kristofer Karlsson <kr...@spotify.com> on 2013/09/05 09:28:29 UTC

Lucene handling of duplicate terms

I have a use case where some of my documents have duplicate terms in
various fields or within the same field.

For example, I may have a million documents with just the term "foo" in
field A, and one particular document with the term "foo" in both field A
and B, or have two terms "foo" in the same field.

If I search for "foo foo" I would like to filter out all the documents with
only one matching term - is this possible?

Re: Lucene handling of duplicate terms

Posted by Kristofer Karlsson <kr...@spotify.com>.
On Thu, Sep 5, 2013 at 9:46 AM, Adrien Grand <jp...@gmail.com> wrote:

> Hi,
>
> On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <kr...@spotify.com> wrote:
> > I have a use case where some of my documents have duplicate terms in
> > various fields or within the same field.
> >
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or have two terms "foo" in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents with
> > only one matching term - is this possible?
>
> I don't think we have existing queries that allow for doing it
> efficiently (if someone reads this and knows it is wrong, please
> correct!). However, it should be doable to implement such a query
> rather easily by iterating over the postings lists of the 'foo' term
> in all the fields you are interested in, summing up frequencies (the
> index must have been created with IndexOptions.DOCS_AND_FREQS or
> higher) and only keeping documents whose sum of frequencies is at
> least 2.
>
> --
> Adrien
>
Thanks for the quick reply!
So I'd have to manually count each term after tokenizing the search query
and keep a map of term to count. I will definitely try this.
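
The term counting described above could look roughly like the sketch below. It is only an illustration: the class and method names are made up, and splitting on whitespace with lowercasing stands in for whatever analyzer the index really uses.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class QueryTermCounts {
    // Counts how often each term occurs in the raw query string.
    // Assumption: whitespace splitting plus lowercasing approximates the
    // tokenization applied at index time; a real setup should reuse the
    // same analysis chain as the index.
    static Map<String, Integer> countTerms(String query) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "foo" is required twice, "bar" once.
        System.out.println(countTerms("foo foo bar")); // prints {foo=2, bar=1}
    }
}
```

Each entry in the resulting map then gives the minimum number of occurrences a document needs for that term.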

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene handling of duplicate terms

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Thu, Sep 5, 2013 at 9:28 AM, Kristofer Karlsson <kr...@spotify.com> wrote:
> I have a use case where some of my documents have duplicate terms in
> various fields or within the same field.
>
> For example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
>
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

I don't think we have existing queries that allow for doing it
efficiently (if someone reads this and knows it is wrong, please
correct!). However, it should be doable to implement such a query
rather easily by iterating over the postings lists of the 'foo' term
in all the fields you are interested in, summing up frequencies (the
index must have been created with IndexOptions.DOCS_AND_FREQS or
higher) and only keeping documents whose sum of frequencies is at
least 2.

-- 
Adrien
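
For readers who want to see the summing idea spelled out, here is a minimal sketch in plain Java. It does not use the Lucene API: the nested map is a hypothetical in-memory stand-in for the postings lists of one term (field name to document id to term frequency), and the class and method names are invented for the example. With a real index, the per-document frequencies would be read from the postings of each field instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MinTotalFreqFilter {
    // postingsPerField: for a single term, field name -> (doc id -> frequency
    // of the term in that field). Hypothetical model, not a Lucene structure.
    static List<Integer> docsWithAtLeast(
            Map<String, Map<Integer, Integer>> postingsPerField, int minTotal) {
        // Sum the term frequency of every document over all fields.
        Map<Integer, Integer> totalFreq = new TreeMap<>();
        for (Map<Integer, Integer> postings : postingsPerField.values()) {
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                totalFreq.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        // Keep only documents whose summed frequency reaches the threshold.
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : totalFreq.entrySet()) {
            if (e.getValue() >= minTotal) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> foo = Map.of(
                "A", Map.of(1, 1, 2, 1, 3, 2), // doc 3 has "foo" twice in field A
                "B", Map.of(2, 1));            // doc 2 has "foo" in both A and B
        System.out.println(docsWithAtLeast(foo, 2)); // prints [2, 3]
    }
}
```

Document 1 contains "foo" only once and is dropped, while documents 2 and 3 each reach a total frequency of 2.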



Re: Lucene handling of duplicate terms

Posted by Kristofer Karlsson <kr...@spotify.com>.
On Thu, Sep 5, 2013 at 3:40 PM, Toke Eskildsen <te...@statsbiblioteket.dk> wrote:

> On Thu, 2013-09-05 at 09:28 +0200, Kristofer Karlsson wrote:
> > For example, I may have a million documents with just the term "foo" in
> > field A, and one particular document with the term "foo" in both field A
> > and B, or have two terms "foo" in the same field.
> >
> > If I search for "foo foo" I would like to filter out all the documents with
> > only one matching term - is this possible?
>
> A bit of creative querying should do it:
>
> For the "only one foo-field"-case, you could do
>   (A:foo NOT B:foo) OR (B:foo NOT A:foo)
>
> To avoid two foo's in the same field, you could do
>   NOT field:"foo foo"~1000
>
> Combining those we get
>   ((A:foo NOT B:foo) OR (B:foo NOT A:foo)) NOT A:"foo foo"~1000 NOT B:"foo foo"~1000
>
>
> Or did I misunderstand? Do you want to keep the documents that have at
> least two foo's and discard the ones that only have one? That is simpler:
>   (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000
>
>
> This all works under the assumption that you have fewer than 1000 terms
> in each instance of your fields. Adjust the slop accordingly.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
Yes, I meant the latter part: getting rid of hits that do not actually
have as many occurrences of the term as the search query.
The query generation sort of works if I just have two fields. With more
fields and more search terms it quickly gets more complicated; it would be
a combinatorial explosion.


Re: Lucene handling of duplicate terms

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Thu, 2013-09-05 at 09:28 +0200, Kristofer Karlsson wrote:
> For example, I may have a million documents with just the term "foo" in
> field A, and one particular document with the term "foo" in both field A
> and B, or have two terms "foo" in the same field.
> 
> If I search for "foo foo" I would like to filter out all the documents with
> only one matching term - is this possible?

A bit of creative querying should do it:

For the "only one foo-field"-case, you could do
  (A:foo NOT B:foo) OR (B:foo NOT A:foo)

To avoid two foo's in the same field, you could do
  NOT field:"foo foo"~1000

Combining those we get
  ((A:foo NOT B:foo) OR (B:foo NOT A:foo)) NOT A:"foo foo"~1000 NOT B:"foo foo"~1000


Or did I misunderstand? Do you want to keep the documents that have at
least two foo's and discard the ones that only have one? That is simpler:
  (A:foo AND B:foo) OR A:"foo foo"~1000 OR B:"foo foo"~1000


This all works under the assumption that you have fewer than 1000 terms
in each instance of your fields. Adjust the slop accordingly.

- Toke Eskildsen, State and University Library, Denmark
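
As an illustration of how the "at least two foo's" query above extends to more than two fields, the following sketch generates the query string for an arbitrary field list. The class and method names are hypothetical, the term is assumed to need no query-syntax escaping, and only the case of exactly one term required twice is covered.

```java
import java.util.ArrayList;
import java.util.List;

public class AtLeastTwoQueryBuilder {
    // Builds a query string matching documents with at least two occurrences
    // of `term` across `fields`. Assumption: `term` contains no characters
    // that need escaping in the query syntax.
    static String build(List<String> fields, String term, int slop) {
        List<String> clauses = new ArrayList<>();
        // One occurrence in each of two different fields: one clause per field pair.
        for (int i = 0; i < fields.size(); i++) {
            for (int j = i + 1; j < fields.size(); j++) {
                clauses.add("(" + fields.get(i) + ":" + term
                        + " AND " + fields.get(j) + ":" + term + ")");
            }
        }
        // Two occurrences inside the same field: one sloppy phrase per field.
        for (String field : fields) {
            clauses.add(field + ":\"" + term + " " + term + "\"~" + slop);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // Reproduces the two-field query from the message above.
        System.out.println(build(List.of("A", "B"), "foo", 1000));
    }
}
```

For n fields this yields n(n-1)/2 pair clauses plus n phrase clauses, so three fields already need six clauses. Requiring more occurrences, or several repeated terms, makes the clause count grow much faster.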


