You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Thomas Neidhart <th...@gmail.com> on 2012/03/12 21:20:08 UTC

[lang] Longest common substring / Suffix Tree

Hi,

on the weekend, I started to work on issue LANG-680
(https://issues.apache.org/jira/browse/LANG-680), which is about adding
support for finding the longest common substring of a set of Strings.

Suffix Trees are a standard data structure to efficiently solve this
problem, and I created a variant of Ukkonen's algorithm to construct
such a tree in linear time.

After some research to create a generalized version of it (to support
multiple strings), I found a very nice implementation from Alessandro
Bahgat (https://github.com/abahgat/suffixtree), which is very similar to
my own, and contacted the author if he is interested to put his code
under apache license (which he did already).

So the question basically is how to proceed, as adding a data structure
like a Suffix Tree to commons-lang may be out-of-scope. On the other
hand there are several string related problems that can be efficiently
solved with a Suffix Tree:

 * longest common substring
 * longest repeated substring
 * exact string set matching
 * palindrom finding
 * finding frequent substrings

Implementing a longest common substring for two strings is trivial
(using dynamic programming ~20 lines of code), but for a set of strings
a suffix tree would be needed.

Henri already outlined a possible API as comment to the issue, which I
like, so what do other people think about adding a suffix tree to
commons-lang?

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [lang] Longest common substring / Suffix Tree

Posted by Thomas Neidhart <th...@gmail.com>.
Hi all,

as a follow-up to the discussion, a first implementation of the lcs with
SuffixTree is available at

https://github.com/netomi/suffixtree

see also the associated issue:

https://issues.apache.org/jira/browse/LANG-680

Feedback would be very welcome, especially if we should go forward with
this approach and include a SuffixTree into commons-lang.

btw. to be memory-effective, I have added a CharHashmap, which is derived
from an IntHashmap that was part of commons-lang in prior versions.

Thanks in advance,

Thomas

Re: [lang] Longest common substring / Suffix Tree

Posted by James Carman <jc...@carmanconsulting.com>.
It would have come in handy for me while doing bioinformatics work in the
past.  But, you're right, they have very cool tools out there.  I was
amazed at some of the stuff they can do.
On Mar 13, 2012 7:08 PM, "Thomas Neidhart" <th...@gmail.com>
wrote:

> On 03/13/2012 08:55 AM, Luc Maisonobe wrote:
> > Le 13/03/2012 00:53, James Carman a écrit :
> >> A lot of bioinformaticians would love us if we added this!
>
> I picked this topic up as I find it interesting to myself and it would
> be a useful addition for many other people too I guess, but from what I
> have seen so far, bioinformaticians wouldn't be necessarily impressed by
> that ;-). Afaik they have pretty good tools, and there exist special
> algorithms to compute suffix trees for really large strings in clusters
> or on disk as they wont fit in memory anymore.
>
> > In the same spirit, I know an implementation of the Myers difference
> > algorithm that runs on any object implementing equals and also provides
> > an API for browsing the "edit script" resulting from the comparison.
> > This allows for example to retrieve only the shared elements, or only
> > the ones in the first or the second sequence, or "running" the script,
> > or whatever.
> >
> > If you consider this could be a good addition to [lang] or another
> > component ([graph] ?) I can ask for a grant for this.
>
> this would be a perfect companion for the longest common substring
> problem, the o.a.c.l.text package looks like a good fit for these things
> imho.
>
> Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [lang] Longest common substring / Suffix Tree

Posted by Thomas Neidhart <th...@gmail.com>.
On 03/13/2012 08:55 AM, Luc Maisonobe wrote:
> Le 13/03/2012 00:53, James Carman a écrit :
>> A lot of bioinformaticians would love us if we added this!

I picked this topic up as I find it interesting to myself and it would
be a useful addition for many other people too I guess, but from what I
have seen so far, bioinformaticians wouldn't be necessarily impressed by
that ;-). Afaik they have pretty good tools, and there exist special
algorithms to compute suffix trees for really large strings in clusters
or on disk as they wont fit in memory anymore.

> In the same spirit, I know an implementation of the Myers difference
> algorithm that runs on any object implementing equals and also provides
> an API for browsing the "edit script" resulting from the comparison.
> This allows for example to retrieve only the shared elements, or only
> the ones in the first or the second sequence, or "running" the script,
> or whatever.
> 
> If you consider this could be a good addition to [lang] or another
> component ([graph] ?) I can ask for a grant for this.

this would be a perfect companion for the longest common substring
problem, the o.a.c.l.text package looks like a good fit for these things
imho.

Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [lang] Longest common substring / Suffix Tree

Posted by Luc Maisonobe <Lu...@free.fr>.
Le 13/03/2012 00:53, James Carman a écrit :
> A lot of bioinformaticians would love us if we added this!

In the same spirit, I know an implementation of the Myers difference
algorithm that runs on any object implementing equals and also provides
an API for browsing the "edit script" resulting from the comparison.
This allows for example to retrieve only the shared elements, or only
the ones in the first or the second sequence, or "running" the script,
or whatever.

If you consider this could be a good addition to [lang] or another
component ([graph] ?) I can ask for a grant for this.

Luc

> On Mar 12, 2012 4:20 PM, "Thomas Neidhart" <th...@gmail.com>
> wrote:
> 
>> Hi,
>>
>> on the weekend, I started to work on issue LANG-680
>> (https://issues.apache.org/jira/browse/LANG-680), which is about adding
>> support for finding the longest common substring of a set of Strings.
>>
>> Suffix Trees are a standard data structure to efficiently solve this
>> problem, and I created a variant of Ukkonen's algorithm to construct
>> such a tree in linear time.
>>
>> After some research to create a generalized version of it (to support
>> multiple strings), I found a very nice implementation from Alessandro
>> Bahgat (https://github.com/abahgat/suffixtree), which is very similar to
>> my own, and contacted the author if he is interested to put his code
>> under apache license (which he did already).
>>
>> So the question basically is how to proceed, as adding a data structure
>> like a Suffix Tree to commons-lang may be out-of-scope. On the other
>> hand there are several string related problems that can be efficiently
>> solved with a Suffix Tree:
>>
>>  * longest common substring
>>  * longest repeated substring
>>  * exact string set matching
>>  * palindrom finding
>>  * finding frequent substrings
>>
>> Implementing a longest common substring for two strings is trivial
>> (using dynamic programming ~20 lines of code), but for a set of strings
>> a suffix tree would be needed.
>>
>> Henri already outlined a possible API as comment to the issue, which I
>> like, so what do other people think about adding a suffix tree to
>> commons-lang?
>>
>> Thomas
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [lang] Longest common substring / Suffix Tree

Posted by James Carman <jc...@carmanconsulting.com>.
A lot of bioinformaticians would love us if we added this!
On Mar 12, 2012 4:20 PM, "Thomas Neidhart" <th...@gmail.com>
wrote:

> Hi,
>
> on the weekend, I started to work on issue LANG-680
> (https://issues.apache.org/jira/browse/LANG-680), which is about adding
> support for finding the longest common substring of a set of Strings.
>
> Suffix Trees are a standard data structure to efficiently solve this
> problem, and I created a variant of Ukkonen's algorithm to construct
> such a tree in linear time.
>
> After some research to create a generalized version of it (to support
> multiple strings), I found a very nice implementation from Alessandro
> Bahgat (https://github.com/abahgat/suffixtree), which is very similar to
> my own, and contacted the author if he is interested to put his code
> under apache license (which he did already).
>
> So the question basically is how to proceed, as adding a data structure
> like a Suffix Tree to commons-lang may be out-of-scope. On the other
> hand there are several string related problems that can be efficiently
> solved with a Suffix Tree:
>
>  * longest common substring
>  * longest repeated substring
>  * exact string set matching
>  * palindrom finding
>  * finding frequent substrings
>
> Implementing a longest common substring for two strings is trivial
> (using dynamic programming ~20 lines of code), but for a set of strings
> a suffix tree would be needed.
>
> Henri already outlined a possible API as comment to the issue, which I
> like, so what do other people think about adding a suffix tree to
> commons-lang?
>
> Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [lang] Longest common substring / Suffix Tree

Posted by Benedikt Ritter <be...@googlemail.com>.
Am 12. März 2012 21:20 schrieb Thomas Neidhart <th...@gmail.com>:
> Hi,
>
> on the weekend, I started to work on issue LANG-680
> (https://issues.apache.org/jira/browse/LANG-680), which is about adding
> support for finding the longest common substring of a set of Strings.
>
> Suffix Trees are a standard data structure to efficiently solve this
> problem, and I created a variant of Ukkonen's algorithm to construct
> such a tree in linear time.
>
> After some research to create a generalized version of it (to support
> multiple strings), I found a very nice implementation from Alessandro
> Bahgat (https://github.com/abahgat/suffixtree), which is very similar to
> my own, and contacted the author if he is interested to put his code
> under apache license (which he did already).
>
> So the question basically is how to proceed, as adding a data structure
> like a Suffix Tree to commons-lang may be out-of-scope. On the other
> hand there are several string related problems that can be efficiently
> solved with a Suffix Tree:
>
>  * longest common substring
>  * longest repeated substring
>  * exact string set matching
>  * palindrom finding
>  * finding frequent substrings
>

Maybe it would make sense to create an new class "SubstringUtils"
instead of adding all that logic to StringUtils (which already has a
fair amount of funtionality).

> Implementing a longest common substring for two strings is trivial
> (using dynamic programming ~20 lines of code), but for a set of strings
> a suffix tree would be needed.
>
> Henri already outlined a possible API as comment to the issue, which I
> like, so what do other people think about adding a suffix tree to
> commons-lang?
>
> Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org