Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/08/19 18:41:46 UTC

[jira] [Commented] (LUCENE-6747) FingerprintFilter - a TokenFilter for clustering/linking purposes

    [ https://issues.apache.org/jira/browse/LUCENE-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703316#comment-14703316 ] 

Adrien Grand commented on LUCENE-6747:
--------------------------------------

If you could tolerate that these fingerprints are not reliable identifiers of your input, I'm wondering whether we could make this more efficient by just using a hash function that doesn't depend on the order of its inputs?
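
A minimal sketch of that idea (hypothetical names, not part of the attached patch), assuming per-token hashes combined with addition, which is commutative and therefore order-independent:

{code:java}
import java.util.List;

class OrderIndependentHash {
  // Combining per-token hashes with addition (commutative) makes the
  // result independent of token order, so no sorting is required.
  // Collisions are possible, which is why such a fingerprint is not a
  // reliable identifier of the input.
  static long fingerprint(List<String> tokens) {
    long h = 0;
    for (String token : tokens) {
      h += token.hashCode(); // a stronger 64-bit hash would reduce collisions
    }
    return h;
  }
}
{code}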

Otherwise this looks rather good to me. Instead of taking the min offset and the max offset as offsets for the final token, I'm wondering whether it might make more sense to use 0 and the final offset (the one returned after end() has been called), so that we don't treat characters differently depending on whether they occur before the first token, after the last one, or in between. By the way, even with the current approach we don't need to call Math.min/max: since tokens are supposed to be emitted in order, the start offset would be the start offset of the first token and the end offset would be the end offset of the last token.
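
For illustration, the offset bookkeeping could look like this (a sketch with hypothetical names, not the patch itself):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

// Sketch only: because tokens arrive in order, the first token's start
// offset and the last token's end offset bound the whole span, with no
// Math.min/Math.max calls needed.
abstract class OffsetBoundsSketch extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

  protected OffsetBoundsSketch(TokenStream input) {
    super(input);
  }

  protected int[] consumeAndGetBounds() throws IOException {
    int start = -1, end = -1;
    while (input.incrementToken()) {
      if (start == -1) {
        start = offsetAtt.startOffset(); // first token
      }
      end = offsetAtt.endOffset(); // last token seen so far
    }
    return new int[] { start, end };
  }
}
{code}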

> FingerprintFilter - a TokenFilter for clustering/linking purposes
> -----------------------------------------------------------------
>
>                 Key: LUCENE-6747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6747
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Priority: Minor
>         Attachments: fingerprintv1.patch
>
>
> A TokenFilter that emits a single token which is a sorted, de-duplicated set of the input tokens.
> This approach to normalizing text is used in tools like OpenRefine[1] and elsewhere [2] to help in clustering or linking texts.
> The implementation proposed here has an upper limit on the size of the combined token that is output.
> [1] https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
> [2] https://rajmak.wordpress.com/2013/04/27/clustering-text-map-reduce-in-python/
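
For readers skimming the thread, a minimal illustration of the normalization described above (hypothetical names; the real implementation is in the attached patch):

{code:java}
import java.util.Arrays;
import java.util.TreeSet;

class FingerprintSketch {
  // Sort and de-duplicate the input tokens, then join them into a single
  // combined token, subject to an upper limit on its size.
  static String fingerprint(String[] tokens, int maxLength) {
    String joined = String.join(" ", new TreeSet<>(Arrays.asList(tokens)));
    return joined.length() <= maxLength ? joined : null; // over the limit: emit nothing
  }

  public static void main(String[] args) {
    // "b a b c" normalizes to "a b c"
    System.out.println(fingerprint(new String[] {"b", "a", "b", "c"}, 100));
  }
}
{code}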



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org