You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oro-user@jakarta.apache.org by Ka...@trilogy.com on 2002/05/16 19:04:30 UTC

greedy matching of subgroups/alternatives

Hello -

I am trying to solve the following problem:

Given a list of regular expressions and a string, find the longest 
expression that matches.

For example, I have the following properties file which defines my regular 
expressions

Explorer=10
Explorer.Sport=11
Explorer.Sport.Trac=12

So, what I do is, I create a single large regexp string as follows:
(Explorer)|(Explorer.Sport)|(Explorer.Sport.Trac)

And I create a mapping as follows:
Group 1 = 10
Group 2 = 11
Group 3 = 12

Then I compile the pattern, and call match.contains() on this pattern, 
then iterate through the groups to find the subgroup that matches. 
However, I have 
a couple of concerns about this algorithm

1) it is not very efficient, in that I have to iterate through all of the 
groups potentially when I'm really just interested in the first subgroup 
that matches
2) Unfortunately, b/c of the perl-inspired implementation, it will fail to 
correctly match the longest string.  For example, a match on
'Explorer Sport Trac' will give me a match on Explorer, *unless* I put 
'Explorer.Sport.Trac' *before* 'Explorer' in the RE.

Does anyone have any suggestions on how to solve these two problems, i.e. 
having to iterate over all of the subgroups to find the matching subgroup, 
and
having to put the REs in a certain order to guarantee the match I'm 
looking for?  Also, I don't know much about the implementation details of 
the Perl5Compiler, but is the representation I've chosen going to be 
efficient? (i.e a bunch of regexps ORed together)  There will be probably 
75-100 regular expressions all ORed together for our final implementation.

thanks for any help/advice,
Karl Brown