You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by "Daniel F. Savarese" <df...@savarese.org> on 2003/09/04 17:45:40 UTC

Re: Wanted: Regex[Input|Output]Stream

In message <NB...@devtech.com>, "Noel J. Bergman" w
rites:
>Does anyone know of an implementation of the titular classes?
>
>Basically, to compile multiple regex patterns into a stream, and then as I
>read through the stream, the data should be optimally checked.  This is a
>classic FSA situation.  If a match occurs, I think that I'd like to receive
>an exception at that point in the stream, but the ata would still be valid,
>and I can continue processing.

The org.apache.oro.text.awk package will perform matches on streams
(see http://jakarta.apache.org/oro/api/org/apache/oro/text/awk/AwkMatcher.html#contains(org.apache.oro.text.awk.AwkStreamInput,%20org.apache.oro.text.regex.Pattern)
,
return offset information, and allow continued processing.  It doesn't
throw an exception though and it only searches for a single pattern, so
you have to use alternation to match multiple patterns (which, unlike
Perl, does not require backtracking with Awk patterns).  The downside
is the Awk package is limited to 8-bit input to keep the DFA tables it
builds (in lazy fashion) small.  The functionality was removed from the
jakarta-oro Perl classes on account of:

"On the Use of Regular Expressions for Searching Text", Clark and Cormack,
  ACM Transactions on Programming Languages and Systems, Vol 19, No. 3,
  pp 413-426.

In short, the behavior of Perl regexes is such that in most cases you can't
determine a match in a stream without reading the entire stream.  TCL
regexes have the same problem and Expect used to (still does?) handle the
problem by looking for a match in a lookahead buffer (2000 characters
sounds familiar) and giving up if not found, so you could lose matches
if boundary conditions came into play (e.g., a match ran off the end of
the buffer).  If you can bound the length of a match, then it's a problem
that can be solved efficiently.  It may be worth reintroducing the
functionality as a standard part of jakarta-oro, but requiring the
programmer to specify a limit on the size of a match.

daniel



RE: Wanted: Regex[Input|Output]Stream

Posted by "Noel J. Bergman" <no...@devtech.com>.
Daniel,

Sorry for the time gap, but I've been consumed on some other projects, and
am just getting back to this topic.  I've two issues that I need to resolve.

  (1) I have data arriving on a feed that I want to
      read and process, but I would like to check it
      against multiple expressions (see 2) in real-
      time and recognize when one has been matched.

  (2) I need to have multiple expressions, and am
      not seeing support for that in ORO or any of
      the other Java regex packages.  Processing a
      gigabyte of data for every 1 megabyte of real
      data because I have 1K expressions does not
      seem overly efficient.

If there is a ready-made solution to (2), please let me know.  Or is the
easy way to do avoid searching for:

  R1, R2, R3, ... Rn

to do:

  (R1)|(R2)|(R3)|...|(Rn)?

(yes, I realize that a limitation is that I wouldn't get to know which Rn
has matched) and what is ORO's limit?  Or do you have a better way that I'm
missing?

An example of (1) is doing real-time pattern matching in the incoming stream
of a mail server while we process the data.  Now, another way of doing it
would be for me to use push-processing (think java.nio) and push blocks of
data into the regex matcher before pushing the data into the protocol
handler.  Yes, to answer one of your other messages, I would want matching
to continue from where it left off.  If a Listener approach is used, I just
want it to fire off RegexNotificationEvent as a pattern is recognized.

The FSA should assume that there is more data to arrive until told
otherwise.  And, yes, the matcher would have to preserve the state of its
FSA.  With respect to NFA vs DFA, for all NFA there exists an equivalent
DFA, albeit with quite a few more states, but perhaps I am missing your
point.

One other stipulation.  I am assuming a long-lived recognizer with a lot of
data, so I am quite willing to expend more cycles up front to compile a
relatively optimized FSA in exchange for efficient processing during the
main duty cycle.

Does this better explain what I'm looking for, and where I'm coming from?

	--- Noel

http://camino.rutgers.edu/ut/utsa/cs1713-1/assign3/ramble.html
http://sources.redhat.com/ml/bug-glibc/2002-03/msg00295.html
http://www.cs.duke.edu/~rodger/tools/jflap/jflap99/flap/JFLAP11.html
http://www.cs.duke.edu/~rodger/tools/jflap/


---------------------------------------------------------------------
To unsubscribe, e-mail: oro-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: oro-dev-help@jakarta.apache.org


RE: Wanted: Regex[Input|Output]Stream

Posted by "Noel J. Bergman" <no...@devtech.com>.
Daniel,

Sorry for the time gap, but I've been consumed on some other projects, and
am just getting back to this topic.  I've two issues that I need to resolve.

  (1) I have data arriving on a feed that I want to
      read and process, but I would like to check it
      against multiple expressions (see 2) in real-
      time and recognize when one has been matched.

  (2) I need to have multiple expressions, and am
      not seeing support for that in ORO or any of
      the other Java regex packages.  Processing a
      gigabyte of data for every 1 megabyte of real
      data because I have 1K expressions does not
      seem overly efficient.

If there is a ready-made solution to (2), please let me know.  Or is the
easy way to do avoid searching for:

  R1, R2, R3, ... Rn

to do:

  (R1)|(R2)|(R3)|...|(Rn)?

(yes, I realize that a limitation is that I wouldn't get to know which Rn
has matched) and what is ORO's limit?  Or do you have a better way that I'm
missing?

An example of (1) is doing real-time pattern matching in the incoming stream
of a mail server while we process the data.  Now, another way of doing it
would be for me to use push-processing (think java.nio) and push blocks of
data into the regex matcher before pushing the data into the protocol
handler.  Yes, to answer one of your other messages, I would want matching
to continue from where it left off.  If a Listener approach is used, I just
want it to fire off RegexNotificationEvent as a pattern is recognized.

The FSA should assume that there is more data to arrive until told
otherwise.  And, yes, the matcher would have to preserve the state of its
FSA.  With respect to NFA vs DFA, for all NFA there exists an equivalent
DFA, albeit with quite a few more states, but perhaps I am missing your
point.

One other stipulation.  I am assuming a long-lived recognizer with a lot of
data, so I am quite willing to expend more cycles up front to compile a
relatively optimized FSA in exchange for efficient processing during the
main duty cycle.

Does this better explain what I'm looking for, and where I'm coming from?

	--- Noel

http://camino.rutgers.edu/ut/utsa/cs1713-1/assign3/ramble.html
http://sources.redhat.com/ml/bug-glibc/2002-03/msg00295.html
http://www.cs.duke.edu/~rodger/tools/jflap/jflap99/flap/JFLAP11.html
http://www.cs.duke.edu/~rodger/tools/jflap/