Posted to dev@lucene.apache.org by Bob Carpenter <ca...@alias-i.com> on 2006/06/08 16:54:13 UTC

Re: Edit-distance strategy (slicing and one vs. all algorithms)

Here's a thoughtful discussion of Hirschberg's
edit-distance algorithm:

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/

In general, it appears to be *slower* than Levenshtein, although
it only uses space proportional to the shorter string (vs. Levenshtein's
product of the string lengths).

They conclude it's only useful for space if the strings are long.
Looks reasonable, but I have no experience with
this algorithm.

If only edit distance is needed and not alignment,
which is how I understand FuzzyQuery, then
you can also get along with space proportional
to the smaller of the strings by only allocating one
slice of the N x M dynamic programming table at
a time.  You can do this because each slice only
depends on the previous slice, with the boundary
conditions easily defined.  The resulting algorithm
looks a lot more like the well-known implementation
than Hirschberg's does.

You can re-use the slices, too, in a hand-over-hand
fashion, so you do less allocation and garbage collection.
You'll find a fairly clean Java implementation of this
technique in LingPipe's WeightedEditDistance class.
(http://www.alias-i.com/lingpipe)  It even does transpose
and allows weights on the edits (aka Smith-Waterman).
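The slicing and hand-over-hand reuse described above can be sketched roughly as follows. This is an illustrative standalone class, not LingPipe's actual WeightedEditDistance code; all names here are made up, and it computes plain unit-cost Levenshtein distance rather than weighted edits:

```java
// Classic Levenshtein distance using only two slices of the N x M table,
// swapped hand-over-hand so no per-iteration allocation is needed.
public final class SlicedEditDistance {
    public static int distance(CharSequence a, CharSequence b) {
        // Keep the shorter string along the slices so they stay minimal.
        if (a.length() < b.length()) { CharSequence t = a; a = b; b = t; }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // boundary: edits vs. empty prefix
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // boundary: deleting i characters
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                curr[j] = Math.min(subst, Math.min(insert, delete));
            }
            int[] t = prev; prev = curr; curr = t; // hand-over-hand slice reuse
        }
        return prev[b.length()];
    }
}
```

Each slice depends only on the one before it, so the two arrays are simply swapped rather than reallocated, which keeps garbage collection out of the inner loop.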

There are even more space efficient algorithms if you're
only looking to find things within a given edit distance.
The standard Levenshtein, etc., computes the edit distance --
it doesn't just answer "are they within distance k".
The trick here is to prune the search when the edits get
too large.  This requires all edit costs to be positive.
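The pruning trick can be sketched like this: abandon the computation as soon as every cell in the current slice exceeds the threshold k, answering only "are they within distance k". Method names are illustrative; the early exit is sound only because all edit costs are positive, so distances never shrink as the slices advance:

```java
// Threshold test for edit distance: prune when the whole slice exceeds k.
public final class BoundedEditDistance {
    public static boolean withinDistance(CharSequence a, CharSequence b, int k) {
        if (Math.abs(a.length() - b.length()) > k) return false; // length gap alone exceeds k
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst, Math.min(curr[j - 1], prev[j]) + 1);
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > k) return false; // with positive costs, later slices only grow
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()] <= k;
    }
}
```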

The real gain would be to do something like the
edit-distance generalization of Aho-Corasick.  The
basic idea is that instead of n iterations of string vs. string,
you do one iteration of string vs. trie.  This improvement
makes a *huge* difference in run time for typical tries
composed of natural language terms.  It's not that
expensive to compute the trie, either -- it might even
be worth doing on the fly.  We have an implementation
in the LingPipe class ApproximateDictionaryChunker, which
is used to pull fuzzy matches against a dictionary (e.g. of
protein names, product names or transliterated names,
all of which have significant string variation) out of running
text. 
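The string-vs-trie idea can be sketched roughly as below. This is not LingPipe's ApproximateDictionaryChunker, just an illustration with made-up names: dictionary terms sharing a prefix share the dynamic-programming slices for that prefix, and whole subtries are pruned once every cell exceeds the distance bound:

```java
import java.util.*;

// One pass of query vs. trie: each trie node extends the parent's DP slice
// by one character, so common prefixes are scored exactly once.
public final class TrieFuzzyMatcher {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String term; // non-null iff a dictionary term ends at this node
    }

    private final Node root = new Node();

    public void add(String term) {
        Node n = root;
        for (int i = 0; i < term.length(); i++)
            n = n.children.computeIfAbsent(term.charAt(i), c -> new Node());
        n.term = term;
    }

    /** Collects dictionary terms within maxDist edits of the query. */
    public List<String> search(String query, int maxDist) {
        int[] row = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) row[j] = j;
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Character, Node> e : root.children.entrySet())
            search(e.getValue(), e.getKey(), query, row, maxDist, hits);
        return hits;
    }

    private void search(Node node, char ch, String query, int[] prev,
                        int maxDist, List<String> hits) {
        int[] curr = new int[query.length() + 1];
        curr[0] = prev[0] + 1;
        int rowMin = curr[0];
        for (int j = 1; j <= query.length(); j++) {
            int subst = prev[j - 1] + (query.charAt(j - 1) == ch ? 0 : 1);
            curr[j] = Math.min(subst, Math.min(curr[j - 1], prev[j]) + 1);
            rowMin = Math.min(rowMin, curr[j]);
        }
        if (node.term != null && curr[query.length()] <= maxDist) hits.add(node.term);
        if (rowMin <= maxDist) // prune the whole subtrie once the bound is exceeded
            for (Map.Entry<Character, Node> e : node.children.entrySet())
                search(e.getValue(), e.getKey(), query, curr, maxDist, hits);
    }
}
```

For a dictionary of natural language terms, the shared-prefix scoring plus subtrie pruning is where the big run-time win comes from.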

There's an awesome book covering all of these algorithms:

Dan Gusfield.  Algorithms on Strings, Trees and Sequences.
Cambridge University Press.

Chapter 12 is all about approximate matching of sets of
strings vs. a string using edit distance.  You'll have to learn
their notation for suffix trees along the way, though you
won't need full suffix trees for matching against a dictionary.

A bonus is that you get to learn about really cool biology
sequencing applications and the *sub-linear* matching
possible with exclusion techniques (mind-bogglingly
clever algorithms).

- Bob Carpenter    http://www.colloquial.com/carp
  Alias-i                  http://www.alias-i.com/lingpipe

eks dev wrote:
> What could show measurably faster results is more in the Jaro distance direction, but even then, the fact that you need to scan all terms in the dictionary makes it prohibitive for large collections. For small collections this can be packed into some smart data structures to restrict the number of distance calculations...
> good luck with it
>   
>> I'm about to replace the edit-distance algorithm in FuzzyQuery from
>> Levenshtein to Hirschberg to save a couple of clock ticks.
>>     
>
> Have you already confirmed the Hirschberg algorithm to be faster than the current implementation of edit distance? I am not convinced it really helps. Hirschberg and the standard DP algorithm both take O(|s1|*|s2|) time. The only saving is that Hirschberg needs O(|s2|) space using binary recursion (the classic algorithm needs O(|s1|*|s2|) space). 
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by karl wettin <ka...@snigel.net>.
Thanks for the great replies everybody! 

On Thu, 2006-06-08 at 10:54 -0400, Bob Carpenter wrote:
> 
> Here's a thoughtful discussion of Hirschberg's
> edit-distance algorithm:
> 
> http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Hirsch/
> 
> In general, it appears to be *slower* than Levenshtein, although
> it only uses space proportional to the shorter string (vs.
> Levenshtein's product of the string lengths).
> 
> They conclude it's only useful for space if the strings are long.
> Looks reasonable, but I have no experience with this algorithm. 

That is actually the article that made me want to run Hirschberg in the
first place.

> the edit-distance generalization of Aho-Corasick. The basic idea is
> that instead of n iterations of string vs. string, you do one
> iteration of string vs. trie.

Cool. I really have to take a look at that.




Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by Yonik Seeley <ys...@gmail.com>.
On 6/10/06, eks dev <ek...@yahoo.co.uk> wrote:
> On the other hand, I am convinced Bob won't mind if we learn something from LingPipe (must say, on first look, the thing has some extremely clever solutions).

Oh sure... descriptions in an email are fine, and I'm glad for Bob's
participation here!
But looking at the actual code is another matter... with the usual
disclaimer IANAL, but I try to stay away from the too-grey areas.

-Yonik



Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by eks dev <ek...@yahoo.co.uk>.
No worries at all Yonik, LingPipe is too big to be included in Lucene, and nobody plans to go the shadow route by stealing somebody's hard work :)
 
On the other hand, I am convinced Bob won't mind if we learn something from LingPipe (must say, on first look, the thing has some extremely clever solutions).




----- Original Message ----
From: Yonik Seeley <ys...@gmail.com>
To: java-dev@lucene.apache.org
Sent: Saturday, 10 June, 2006 3:43:20 AM
Subject: Re: Edit-distance strategy (slicing and one vs. all algorithms)

On 6/9/06, Bob Carpenter <ca...@alias-i.com> wrote:
>  > Can you share how you implemented the Trie in your App? The especially
> interesting part for me is how you go about memory consumption.
>  > Have you tried really large dictionaries (1Mio+)?
>
> Sure can.  You could slog through the source,
> which is available from:

The license isn't ASF compatible though, so please don't look at the
code if you plan on contributing anything to Lucene.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server








Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by Yonik Seeley <ys...@gmail.com>.
On 6/9/06, Bob Carpenter <ca...@alias-i.com> wrote:
>  > Can you share how you implemented the Trie in your App? The especially
> interesting part for me is how you go about memory consumption.
>  > Have you tried really large dictionaries (1Mio+)?
>
> Sure can.  You could slog through the source,
> which is available from:

The license isn't ASF compatible though, so please don't look at the
code if you plan on contributing anything to Lucene.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by Bob Carpenter <ca...@alias-i.com>.
 > Can you share how you implemented the Trie in your App? The especially
interesting part for me is how you go about memory consumption.
 > Have you tried really large dictionaries (1Mio+)?

Sure can.  You could slog through the source,
which is available from:

     http://www.alias-i.com/lingpipe

There are several trie implementations.  The
approximate dictionary trie's implemented very crudely.
We've run it on 1M gene names derived from
EntrezGene (anywhere from 2 to 60 characters long).
I don't know the speed numbers offhand, but nothing
to write home about.  The real problem is that it too
easily matched short terms to short terms within a
given edit distance.  So a gene name AT was
matching GT or IT or whatever at edit-distance 1,
whereas alpha-ACT wasn't matching AACT like
we wanted.  Still haven't solved this problem -- we
went back to using the method we described in
our case study in Lucene in Action.

The spelling checker trie for known tokens is also very crude, as it's not
a bottleneck in that processing (not even 0.1% of the profiled time).
We've scaled that to several million entries in a large customer
application.  The language model tries were a bottleneck, though.

The trie underlying our language models is implemented
pretty space-efficiently in-memory.  It only stores counts
at each node, not general records or even terminal/non-terminal
information as you'd need in a word list.  It codes trie nodes to an
interface with PAT implementations, shared terminal node implementations,
etc., with unfolding all the way down to elements so that
nodes with 1-3 daughters don't allocate arrays, counts below
256 are represented with bytes, etc.  It uses a factory to sort out
all the object creation/sharing.  I wouldn't
normally be so aggressive optimizing this kind of thing, but memory
and speed were a real problem with the naive implementation, and
this sits in most of our inner loops for model training.  

The tries can be compiled into a form where they can be easily
written to disk and read back in for efficient run-time use
for language modeling (estimating probability of a string
given a topic, language, sentiment, etc.).  That's the critical
step in most of our run-time operations.

Anyway, there's a paper on the scalable LM tries that covers everything
in pretty excruciating detail (even more than this message):
  
http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf

In LingPipe 2.3, which should be out next week, there are methods
to write character tries to disk and to merge on-disk tries with
and without pruning.  This uses some of the same bit-level
coding techniques (gamma-coding, specifically) as the inverted
indexes of search engines, and yields a very tight memory
representation with good merging properties.
I've basically adopted Lucene's strategy of on-disk representations
writing out to a mergeable streaming format that can scale on disk
(or in memory).
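For concreteness, the gamma-coding mentioned above works like this: a positive integer n is written as floor(log2 n) zero bits followed by the binary representation of n, so small counts take very few bits. A bit-string sketch (class and method names are illustrative, and real implementations pack bits rather than use strings):

```java
// Elias gamma coding over readable bit strings, for illustration only.
public final class GammaCode {
    /** Encodes n >= 1: a unary length prefix of zeros, then n in binary. */
    public static String encode(int n) {
        String bin = Integer.toBinaryString(n);
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < bin.length(); i++) sb.append('0');
        return sb.append(bin).toString();
    }

    /** Decodes a single gamma code from the start of the bit string. */
    public static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;
        // the next zeros + 1 bits are n in binary
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }
}
```

The code is prefix-free, which is what makes the on-disk format streamable and mergeable without extra length fields.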

- Bob Carpenter

eks dev wrote:
> Hi Bob, 
>
> really nice answer!
>
>   
>> The real gain would be to do something like the
>> edit-distance generalization of Aho-Corasick.  The
>> basic idea is that instead of n iterations of string vs. string,
>> you do one iteration of string vs. trie. 
>>     
>  
> I was experimenting a bit with a ternary trie as it has some nice properties, e.g. being 40% faster than the standard Java or Trove HashMap for exact matches, but never got to finish path compression and null node reduction (which saves a huge amount of memory). Must do it one day. 
>
>
> thanks!
>
>
>
>



Re: Edit-distance strategy (slicing and one vs. all algorithms)

Posted by eks dev <ek...@yahoo.co.uk>.
Hi Bob, 

really nice answer!

>The real gain would be to do something like the
>edit-distance generalization of Aho-Corasick.  The
>basic idea is that instead of n iterations of string vs. string,
>you do one iteration of string vs. trie. 
 
I was experimenting a bit with a ternary trie as it has some nice properties, e.g. being 40% faster than the standard Java or Trove HashMap for exact matches, but never got to finish path compression and null node reduction (which saves a huge amount of memory). Must do it one day. 
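A minimal ternary search trie of the kind being discussed looks roughly like this (an illustrative sketch with made-up names, deliberately without the path compression and null-node reduction mentioned above): each node holds one character plus low/equal/high links, so exact-match lookup is a character-by-character walk with no hashing.

```java
// Bare-bones ternary search trie supporting add and exact-match contains.
public final class TernaryTrie {
    private static final class Node {
        final char c;
        Node lo, eq, hi;
        boolean terminal; // true iff an inserted string ends at this node
        Node(char c) { this.c = c; }
    }

    private Node root;

    public void add(String s) { root = add(root, s, 0); }

    private Node add(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c) n.lo = add(n.lo, s, i);
        else if (c > n.c) n.hi = add(n.hi, s, i);
        else if (i + 1 < s.length()) n.eq = add(n.eq, s, i + 1);
        else n.terminal = true;
        return n;
    }

    public boolean contains(String s) {
        Node n = root;
        int i = 0;
        while (n != null) {
            char c = s.charAt(i);
            if (c < n.c) n = n.lo;
            else if (c > n.c) n = n.hi;
            else if (++i == s.length()) return n.terminal;
            else n = n.eq;
        }
        return false;
    }
}
```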

Can you share how you implemented the Trie in your App? The especially interesting part for me is how you go about memory consumption. Have you tried really large dictionaries (1Mio+)?

thanks!



