Posted to java-user@lucene.apache.org by Aaron Schon <aa...@yahoo.com> on 2008/11/14 22:59:57 UTC

Extremely Large Strings Comparison (slightly off-topic)

Hi, I need to compare two Base64-encoded strings of some MIME content that I am storing in a Lucene index. After running a Lucene query, I need to efficiently compare the results to find the closest match to a query Base64 string.

I am not sure of the best way to approach this. Could I compare hashes of the strings and compute their similarity? Levenshtein distance seems impractical given the size of the strings. Is there any other method you could suggest?

N.B.: The idea is not to determine whether there is an exact match, but to compute a similarity metric. For example:

John & Johnson (closer)
vs.
John & Jimmy (farther)

tia,
Aaron



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Extremely Large Strings Comparison (slightly off-topic)

Posted by Aaron Schon <aa...@yahoo.com>.
Thanks for responding, Jonathan. I will look into the k-grams approach.

The objects could differ by small local changes. To provide some business context: the application indexes email messages and attachments. If an attachment differs from a previous version by more than some threshold (users making edits/reviews), the attachment needs to be flagged and major/minor versioned.

Thanks
AS


----- Original Message ----
From: Jonathan Young <JY...@attivio.com>
To: Aaron Schon <Aa...@yahoo.com>
Sent: Friday, November 14, 2008 5:52:17 PM
Subject: RE: Extremely Large Strings Comparison (slightly off-topic)

Aaron - Although a naïve implementation of the Levenshtein distance metric takes O(n*m) time, if you are willing to bound the maximum distance by k << n, m (i.e. you aren't interested in distances greater than k), then the calculation can be done in O(k*min(n,m)) time.
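
If it helps, here is a minimal sketch of that banded dynamic program (the class/method names and the "return k + 1 when over budget" convention are mine, not from any library). Only DP cells within k of the main diagonal are computed:

public class BoundedLevenshtein {
    // Banded Levenshtein distance in O(k * min(n, m)) time, O(m) space.
    // Returns the exact distance if it is <= k, otherwise k + 1.
    public static int distance(String a, String b, int k) {
        if (a.length() > b.length()) { String t = a; a = b; b = t; } // outer loop over the shorter string
        int n = a.length(), m = b.length();
        if (m - n > k) return k + 1;            // the length gap alone exceeds the bound
        final int OVER = k + 1;                 // any value > k just means "over budget"

        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= Math.min(m, k); j++) prev[j] = j;

        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - k);
            int hi = Math.min(m, i + k);
            curr[lo - 1] = (lo == 1) ? i : OVER;   // left edge of the band
            int rowBest = OVER;
            for (int j = lo; j <= hi; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int del = (j < i + k) ? prev[j] + 1 : OVER;  // cell above may lie outside the band
                int ins = curr[j - 1] + 1;
                curr[j] = Math.min(sub, Math.min(del, ins));
                rowBest = Math.min(rowBest, curr[j]);
            }
            if (rowBest > k) return OVER;       // whole band is over budget: give up early
            int[] t = prev; prev = curr; curr = t;
        }
        return Math.min(prev[m], OVER);
    }
}

The early exit means that two attachments which are clearly different get rejected after only a few rows of the table, which matters when n and m are as large as yours.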

Another approach would be to use shingles or k-grams from the original document.  I believe Lucene has some support for that.
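
As a quick sketch of the shingling idea in plain Java (the class and method names here are made up for illustration, not a Lucene API), you can take the set of character k-grams of each string and score the overlap with a Jaccard coefficient:

import java.util.HashSet;
import java.util.Set;

public class KGramSimilarity {
    // Jaccard coefficient of the two strings' sets of character k-grams:
    // |intersection| / |union|, a similarity score in [0, 1].
    public static double jaccard(String a, String b, int k) {
        Set<String> ga = kgrams(a, k);
        Set<String> gb = kgrams(b, k);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0; // both too short to shingle
        Set<String> both = new HashSet<String>(ga);
        both.retainAll(gb);                           // intersection
        int union = ga.size() + gb.size() - both.size();
        return (double) both.size() / union;
    }

    // Every contiguous substring of length k.
    private static Set<String> kgrams(String s, int k) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + k <= s.length(); i++) {
            grams.add(s.substring(i, i + k));
        }
        return grams;
    }
}

On your example, jaccard("John", "Johnson", 3) comes out to 0.4 while jaccard("John", "Jimmy", 3) is 0.0, which is the ordering you want. Unlike edit distance, set-based shingle comparison is insensitive to where in the string the differences occur, so it degrades gracefully on long inputs.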

It depends a lot on the ways in which you expect the two strings to differ. The fact that it is Base64-encoded MIME content doesn't matter in itself - the real question is whether the objects will have small local changes, or changes that cascade through the rest of the string (e.g. if the data is compressed, a small edit can change everything that follows it).

I hope this helps,

--- Jonathan


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org