Posted to oro-dev@jakarta.apache.org by "Daniel F. Savarese" <df...@savarese.org> on 2001/06/09 01:07:14 UTC

Re: End Anchor bug on non-Unix platforms

I'll look at Ed's report this weekend, but it would help if someone
else could look at it too and offer their view (Takashi? Mark?).  On first
glance, I don't think it's really a bug because Perl is very particular
about '\n' being the newline for matching $, not the platform-specific
file-based end of line delimiter.

One can debate what the proper way to map this Perl behavior to Java is,
but I would suggest it is in the I/O stage, not the matching stage.
In other words, write a class that filters the input stream converting the
platform-specific end of line representation to a Java newline.  It
may in fact be that Perl dodges the issue by using ANSI C text-mode I/O
that does the translation.
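
As a rough illustration of the I/O-stage approach described above, here is
a minimal sketch of such a filter.  The class name NewlineNormalizingReader
and its normalize() helper are hypothetical and are not part of ORO or the
JDK.  It only handles \r\n and a lone \r, it does not override skip() or
mark(), and it assumes the input has already been decoded to characters.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical filter: rewrites \r\n and a lone \r to \n as characters are read. */
class NewlineNormalizingReader extends FilterReader {
    NewlineNormalizingReader(Reader in) {
        // One character of pushback is enough to peek past a '\r'.
        super(new PushbackReader(in, 1));
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        if (c != '\r') {
            return c;                 // ordinary character, '\n', or -1 at end of input
        }
        int next = in.read();         // peek at the character after '\r'
        if (next != '\n' && next != -1) {
            ((PushbackReader) in).unread(next);  // lone '\r': keep what follows it
        }
        return '\n';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    /** Convenience helper for the example: normalizes an in-memory string. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new NewlineNormalizingReader(new StringReader(text))) {
            for (int c; (c = r.read()) != -1; ) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With a filter like this in front of the matcher, the pattern itself can keep
treating \n as the only newline, whatever the source of the text was.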

One reason for doing this is that the platform-specific line separator in
no way correlates with the line separator of an arbitrary input source.
For example, I'm on a Mac and want to parse an HTML page pulled from a URL
in NetASCII.  However, the principal rationale, and the argument I offer,
is that the rules of regular expression matching have nothing to do with
platform-specific text file representation formats.  They have to do with
programming-language-specific string representation formats.  $ matches
after a newline but before the start of the next line.  In the Java language,
just as in C and Perl, \n (character value 0x0A) is the representation for a
newline.

Also, given that a platform-specific end of line delimiter may be an
arbitrary sequence of characters (not just \n, \r, and \r\n), it is not
possible to implement efficiently in the matching phase (it's easier to
implement in an object-based NFA implementation, but it will be slow like
everything else in object-based NFAs).  You'll notice that the JDK 1.4
java.util.regex package chooses to recognize multiple line terminators,
which is actually worse because now \r, \r\n, \n, \f, \u2028, etc. will
all simultaneously represent the end of a line, which is generally not what
you want (at least not in Perl).  Supporting an arbitrary multi-character
end of line marker would be more complicated than just adding a helper
function.  For every potential sequence-terminating character, you
have to verify that the sequence prefix was already encountered, but you
also have to make sure it wasn't matched as part of the preceding parts of
the pattern.  That means for every sequence prefix encountered you have to
look ahead to see if the sequence suffix is present and account for
interactions with the search pattern.  It's really quite a mess when you
get down to it, slows things down, and requires a respecification of
matching semantics.
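
To make the java.util.regex remark concrete, here is a small sketch using
the package as it eventually shipped.  The class and method names are made
up for the example.  In MULTILINE mode $ stops before any of the documented
line terminators, and the UNIX_LINES flag restricts that to \n only, which
is closer to the Perl behavior described above.

```java
import java.util.regex.Pattern;

/** Hypothetical demo class; only Pattern and its flags come from the JDK. */
class EndAnchorDemo {
    /** Returns true if /hello$/ finds a match in the input under the chosen line mode. */
    static boolean helloAtLineEnd(String input, boolean unixLines) {
        int flags = Pattern.MULTILINE;
        if (unixLines) {
            flags |= Pattern.UNIX_LINES;   // only '\n' counts as a line terminator
        }
        return Pattern.compile("hello$", flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String lf = "hello\nworld";
        String cr = "hello\rworld";

        // Default MULTILINE: both \n and \r end a line.
        System.out.println(helloAtLineEnd(lf, false)); // true
        System.out.println(helloAtLineEnd(cr, false)); // true

        // MULTILINE plus UNIX_LINES: only \n ends a line.
        System.out.println(helloAtLineEnd(lf, true));  // true
        System.out.println(helloAtLineEnd(cr, true));  // false
    }
}
```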

daniel



Re: End Anchor bug on non-Unix platforms

Posted by "Mark F. Murphy" <ma...@tyrell.com>.
At 7:07 PM -0400 6/8/01, Daniel F. Savarese wrote:
>I'll look at Ed's report this weekend, but it would help if someone
>else could look at it too and offer their view (Takashi? Mark?).  On first
>glance, I don't think it's really a bug because Perl is very particular
>about '\n' being the newline for matching $, not the platform-specific
>file-based end of line delimiter.

I double-checked on this and Daniel is correct.

I fired up MacPerl... which is known for making use of the Mac's line 
endings when reading files (input separator $/ is 0x0D).

I ran a test under MacPerl to see if it would do the same for regex:

$test1 = "hello\nworld";
$test2 = "hello\rworld";

$result[0] = "Failed";
$result[1] = "Passed";

print "Start test...\n\n";

print "Test1 " . $result[($test1 =~ /hello$/m)] . "...\n";
print "Test2 " . $result[($test2 =~ /hello$/m)] . "...\n";


Results under MacPerl:


Start test...

Test1 Passed...
Test2 Failed...


Results under perl on Unix:


Start test...

Test1 Passed...
Test2 Failed...


I didn't get a chance to test under Win32... but I'd be surprised if 
it worked any differently.

>One can debate what the proper way to map this Perl behavior to Java is,
>but I would suggest it is in the I/O stage, not the matching stage.
>In other words, write a class that filters the input stream converting the
>platform-specific end of line representation to a Java newline.  It
>may in fact be that Perl dodges the issue by using ANSI C text-mode I/O
>that does the translation.

The other thing to do is adjust the regex for the particular target file.

When reading the file in, check for line ending type.  Then adjust 
the regex as needed.

So on a Mac I might do the following in perl:

$test2 =~ /hello\r/m


Regular expressions can be built dynamically.  So changing the regex is 
probably less expensive than changing the entire input line or buffer.
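
Here is one way the detect-then-adjust idea could look in Java.  This is
only a sketch: it uses java.util.regex rather than ORO, the class and method
names are invented for the example, and it guesses the line ending from the
first terminator it sees.  It uses a lookahead instead of a literal \r so
the terminator is not consumed by the match.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: builds a pattern whose end-of-line test fits the input. */
class LineEndingAwarePattern {
    /** Guesses the line terminator from the first one seen; defaults to "\n". */
    static String detectTerminator(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return "\n";
    }

    /** Wraps the pattern body and appends an end-of-line assertion for the terminator. */
    static Pattern endOfLine(String body, String terminator) {
        // Match just before the terminator, or at the very end of the input.
        String eol = "(?=" + Pattern.quote(terminator) + "|\\z)";
        return Pattern.compile("(?:" + body + ")" + eol);
    }

    public static void main(String[] args) {
        String macText = "hello\rworld";
        Pattern p = endOfLine("hello", detectTerminator(macText));
        System.out.println(p.matcher(macText).find());          // true
        System.out.println(p.matcher("hello\nworld").find());   // false
    }
}
```

The pattern is compiled once per input source, so the cost is paid up front
rather than on every line or buffer.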

mark

-- 
---------------------------------------------------------------------------
  Mark F. Murphy, Director Software Development   <ma...@tyrell.com>
  Tyrell Software Corp                            <http://www.tyrell.com>
  PowerPerl(tm), Add Power To Your Webpage!       <http://www.powerperl.com>
---------------------------------------------------------------------------
  Families Against Internet Censorship:        http://www.netfamilies.org/