Posted to java-user@lucene.apache.org by Stäbler, "Christoph (IT/I4Z)" <ch...@ww-informatik.de> on 2012/08/30 12:58:44 UTC

DuplicateFilter filters not only duplicates

Hey,

I have an index containing the documentation for our products. The document fields are:

group
name
version
description

Because most of the documentation spans several pages, I create one document in the index for each page. So when I search for a product by group, name and version, I get several results. But sometimes I want only one result for this combination (group, name and version), regardless of how many documents exist for the product.

Therefore I use the DuplicateFilter:

Because this filter can only be used on a single field (and not on field combinations), I created another field (productkey). In this field I store an ID for the product (the MD5 hash of the combination of the group, name and version fields). Then I told the DuplicateFilter to use this field to filter duplicates.
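
For readers following along, here is a minimal sketch of how such a key might be derived. The post does not say how the three fields were combined before hashing, so the "|" separator, the UTF-8 encoding, and the ProductKey class name are assumptions made for illustration; the keys it produces need not match the hashes shown in the tables below.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Lucene or of the original post.
class ProductKey {

    // Returns a 32-character lowercase hex MD5 digest of group, name and version.
    static String of(String group, String name, String version) {
        // Assumed separator: keeps ("ab", "c", ...) distinct from ("a", "bc", ...).
        String combined = group + "|" + name + "|" + version;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(combined.getBytes(StandardCharsets.UTF_8));
            // Zero-pad to 32 hex digits so leading zero bytes are not lost.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Every page of the same product yields the same key.
        System.out.println(ProductKey.of("a", "one", "1.0"));
        System.out.println(ProductKey.of("a", "two", "1.0"));
    }
}
```

The resulting string would then be stored in the productkey field of every document that belongs to the product, so that all pages of one product share a single value.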

But now I do not get all the expected search results. For example:

All documents without filter:

group | name  | version | productkey                       | description
a     | one   | 1.0     | 808d8f96138b7dec7cc69c2769176424 | ...
a     | two   | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a     | two   | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a | ...
zz    | one   | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | one   | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | one   | 1.0     | b610a470c9a7d2cc928725e1fb1a577a | ...
zz    | two   | 1.0     | f5bb84453af30dd5f229d04cdb787dec | ...
zz    | three | 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz    | three | 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz    | three | 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e | ...

Results with filter:

group | name  | version | productkey
a     | two   | 1.0     | 0225635fc76ed8b88c65c7eb9f2ec1f9
a     | three | 1.0     | 621e2597b189ee8d9448f6bfb26c5a8f
zz    | two   | 1.0     | f5bb84453af30dd5f229d04cdb787dec

So I am missing these results:

group | name  | version | productkey
a     | one   | 1.0     | 808d8f96138b7dec7cc69c2769176424
a     | four  | 1.0     | 3d03056a0d0f29f63477ee1f130b7ae8
a     | five  | 1.0     | b2d49bc320325007e1466a38e41ce69a
zz    | one   | 1.0     | b610a470c9a7d2cc928725e1fb1a577a
zz    | three | 1.0     | 4b86d91feded953e57fb3d1ccbf0fc6e


Here is my code to instantiate the filter:

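import org.apache.lucene.search.DuplicateFilter; // from the lucene-queries contrib jar

// De-duplicate on productkey, keeping the first document seen for each key.
DuplicateFilter filter = new DuplicateFilter("productkey");
filter.setKeepMode(DuplicateFilter.KM_USE_FIRST_OCCURRENCE);
filter.setProcessingMode(DuplicateFilter.PM_FULL_VALIDATION);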

I am using lucene-core version 3.6 and lucene-queries (which contains the DuplicateFilter class) version 3.6.1.
No analyzer is applied to the productkey field.

Did I make a mistake, or is this a bug in DuplicateFilter (maybe the field values are too long, etc.)?

Thanks for your help
*******************************************************************************
This email may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this email 
in error) please notify the sender immediately and destroy this email. 
Any unauthorized copying, disclosure or distribution of the material 
in this email is strictly forbidden.
*******************************************************************************


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: DuplicateFilter filters not only duplicates

Posted by mark harwood <ma...@yahoo.co.uk>.
DuplicateFilter has been mostly broken since Lucene's switch to segment-level filtering.

Since v2.9 the calls to Filter.getDocIdSet no longer pass a "top-level" reader for accessing the whole index and instead pass a reader restricted to only accessing a single segment's contents.

Because the DuplicateFilter logic relied on having a global view of the index, the de-dup logic is invalid unless your index happens to consist of only one segment.
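
To illustrate the point about segment-level filtering, here is a simplified model in plain Java. It is not Lucene's code: the class and method names (SegmentDedupSketch, keepFirst, perSegment, global) are made up for this sketch, and segments are modelled as plain lists of productkey values. It only shows that a keep-first decision made one segment at a time differs from the same decision made over the whole index; the exact symptoms inside DuplicateFilter (such as the missing results reported above) may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of de-duplication with and without a global view.
class SegmentDedupSketch {

    // Keeps the first occurrence of each key within a single list.
    static List<String> keepFirst(List<String> keys) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (seen.add(key)) {   // add() returns false for a repeated key
                kept.add(key);
            }
        }
        return kept;
    }

    // Segment-level view: each segment is filtered on its own,
    // with no knowledge of keys that occur in other segments.
    static List<String> perSegment(List<List<String>> segments) {
        List<String> kept = new ArrayList<>();
        for (List<String> segment : segments) {
            kept.addAll(keepFirst(segment));
        }
        return kept;
    }

    // Top-level view: all segments are treated as one index.
    static List<String> global(List<List<String>> segments) {
        List<String> all = new ArrayList<>();
        for (List<String> segment : segments) {
            all.addAll(segment);
        }
        return keepFirst(all);
    }

    public static void main(String[] args) {
        // Key "k2" occurs in both segments.
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("k1", "k2", "k2"),
                Arrays.asList("k2", "k3", "k3"));

        System.out.println(global(segments));     // prints [k1, k2, k3]
        System.out.println(perSegment(segments)); // prints [k1, k2, k2, k3]
    }
}
```

With a single segment the two methods return the same list, which matches the observation that the filter only behaves correctly on a one-segment index.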

This issue is referenced here: https://issues.apache.org/jira/browse/LUCENE-2348





Re: DuplicateFilter filters not only duplicates

Posted by Ian Lea <ia...@gmail.com>.
https://issues.apache.org/jira/browse/LUCENE-2348 suggests there are
long-standing and probably still current issues with DuplicateFilter
and multiple segments.  I'm not sure if this could explain what you
are seeing.  You could try calling optimize(1) on your index writer
and see if that makes a difference.


--
Ian.

