Posted to oro-user@jakarta.apache.org by Aarti Halleppanavar <ec...@yahoo.com> on 2002/12/30 10:01:32 UTC

[oro] Some Unicode doubts

Hi Daniel,

I am developing a project that requires searching an
HTTP response for user-entered strings. Because the
response can be very large, I need an efficient
pattern-matching mechanism. Using AwkStreamInput with
AwkMatcher works extremely well for me, except that it
does not support Unicode. Do you plan to add Unicode
support for it, and if so, when?

I also tried MatchActionProcessor, but I am unable to
get match results for Unicode strings. In a demo
program, I search a 'unicode.html' file for a string
read from 'unicoderegex.txt'. Both files are stored in
UTF-8 encoding. The unicode.html file contains a mix
of English and Japanese characters. I changed the
contents of unicoderegex.txt from plain English text,
to mixed content, to Japanese-only content. I also
tried specifying the encoding when creating the
InputStreamReader, but still found no matches. I also
tried commenting and uncommenting the line
regex = getStringAsCodes(regex);
However, when I initialize the regex as
'regex = "Unicode";', I am able to find matches for
the string 'Unicode'. I went through the mailing list
archives but could not find any examples. Could you
please tell me whether I am doing anything wrong here,
or whether ORO does not support Unicode at all? Or do
I have to set some flags to enable Unicode?
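
One possible cause worth ruling out, sketched here under
assumptions rather than as a confirmed diagnosis: if a UTF-8
file is decoded with the platform default charset, neither
the regex string nor the searched text contains the intended
characters. The example value in the getStringAsCodes comment
in the demo program begins with \u00ef\u00bb\u00bf, which is
what the UTF-8 byte order mark (bytes EF BB BF) looks like
when decoded with a single-byte Latin charset. The sketch
below reads the first line as UTF-8 and drops a leading BOM.
The class and method names are hypothetical, and it uses
java.nio.charset.StandardCharsets, which needs Java 7 or
later rather than JDK 1.3.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ORO or of the demo program.
final class Utf8FirstLine {
  // Decodes the stream as UTF-8 and returns its first line without a leading BOM.
  static String readFirstLine(InputStream in) throws IOException {
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    try {
      String line = reader.readLine();
      // Java's UTF-8 decoder keeps the BOM as the char 0xFEFF, so remove it by hand.
      if (line != null && line.length() > 0 && line.charAt(0) == 0xFEFF) {
        line = line.substring(1);
      }
      return line;
    } finally {
      reader.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // BOM (EF BB BF) followed by the UTF-8 bytes of two Japanese characters.
    byte[] utf8 = {
      (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
      (byte) 0xE6, (byte) 0x97, (byte) 0xA5,
      (byte) 0xE6, (byte) 0x9C, (byte) 0xAC
    };
    String right = readFirstLine(new ByteArrayInputStream(utf8));
    // Decoding the same bytes with a single-byte charset yields one char per byte.
    String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
    System.out.println(right.length()); // 2
    System.out.println(wrong.length()); // 9
  }
}
```

The same consideration would apply to the stream passed to
the matcher: both the pattern and the input need to be
decoded with the encoding the files were saved in.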

Another question: is multiline matching possible with
MatchActionProcessor?
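
For context on the multiline question: the ORO documentation
describes MatchActionProcessor as AWK-like, line-by-line
filtering, so a pattern is only ever tried against one line
at a time. The sketch below illustrates the difference
between per-line and whole-text matching. It deliberately
uses java.util.regex (JDK 1.4 and later) instead of ORO so
that it runs without extra jars; the class and method names
are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; uses java.util.regex, not ORO.
final class MultilineSketch {
  // Counts matches when the pattern sees the whole text at once.
  static int countWholeText(String text, String regex) {
    Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
    int count = 0;
    while (m.find()) {
      count++;
    }
    return count;
  }

  // Counts lines containing a match when each line is matched in isolation.
  static int countPerLine(String text, String regex) {
    Pattern p = Pattern.compile(regex, Pattern.DOTALL);
    int count = 0;
    for (String line : text.split("\n")) {
      if (p.matcher(line).find()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String html = "<b>Uni\ncode</b>";
    String regex = "<b>.*?</b>";
    System.out.println(countWholeText(html, regex)); // 1
    System.out.println(countPerLine(html, regex));   // 0
  }
}
```

A match that spans a line break is only found when the
matcher is given the text across that break.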

Below is the source code for my demo program. For
reference, I am using JDK 1.3 and
jakarta-oro-2.0.7-dev-1.jar. Any help is greatly
appreciated.

Thanks a lot,
Aarti H.

===================================== CODE =================================
import java.io.*;

import org.apache.oro.text.*;
import org.apache.oro.text.regex.*;

public final class UnicodeDemo
{
  public static final void main(String[] args) throws Exception
  {
    //init the regex
    FileInputStream fis = new FileInputStream("C:\\unicoderegex.txt");
    BufferedReader bf = new BufferedReader(new InputStreamReader(fis/*, "UTF-8"*/));
    String regex = bf.readLine();
    regex = getStringAsCodes(regex);
    //regex = "Unicode";
    System.out.println("regex = "+regex);
    bf.close();

    MatchActionProcessor processor = new MatchActionProcessor();
    processor.addAction(regex, new MatchAction() {
        //if a match is found, show it on console.
        public void processMatch(MatchActionInfo info)
        {
          info.output.println("match found = " + info.line);
        }
      });
    processor.processMatches(new FileInputStream("c:\\unicode.html"), System.out);
  }

  /**
   * Takes a string which may contain Unicode chars and returns a string
   * with the Unicode chars replaced by their Unicode codes.
   * Example return value: "\u00ef\u00bb\u00bf\u00e6\u2014"
   */
  private static String getStringAsCodes(String sName)
  {
    if (sName == null || sName.trim().length() == 0)
    {
      return sName;
    }

    final char [] chArray = sName.toCharArray();
    String sReturnName = "";
    for (int i = 0; i < chArray.length; i ++)
    {
      if (Character.UnicodeBlock.of(chArray[i]) != Character.UnicodeBlock.BASIC_LATIN)
      {
        sReturnName += getUnicodeRepresentationOfChar(chArray[i]);
      }
      else
      {
        char cc [] = {chArray[i]};
        sReturnName += new String(cc);
      }
    }
    return sReturnName;
  }

  private static String getUnicodeRepresentationOfChar(char ch)
  {
    String s = Integer.toHexString(ch);
    final int iLen = s.length();
    if (iLen < 4)
    {
      for (int i = 0; i < 4 - iLen; i ++)
      {
        s = "0" + s;
      }
    }
    return "\\u" + s;
  }
}
===================================== CODE =================================
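
A point that may matter for getStringAsCodes above: the Java
compiler translates \uXXXX escapes only in source code. A
string assembled at run time from a backslash, the letter u,
and four hex digits is six ordinary characters, so the
pattern handed to the regex engine is different text from
the original character. Whether ORO's pattern compiler then
interprets such a sequence is not something this sketch
claims; it only shows the string-level difference. The class
name is hypothetical.

```java
// Hypothetical illustration, not part of ORO or of the demo program.
final class EscapeSketch {
  // Same idea as getUnicodeRepresentationOfChar in the demo:
  // builds a backslash, the letter u, and four hex digits.
  static String toEscapeText(char ch) {
    String hex = Integer.toHexString(ch);
    while (hex.length() < 4) {
      hex = "0" + hex;
    }
    return "\\u" + hex;
  }

  public static void main(String[] args) {
    char original = (char) 0x65E5; // a single Japanese character
    String escaped = toEscapeText(original);
    System.out.println(escaped.length());                         // 6
    System.out.println(escaped.equals(String.valueOf(original))); // false
    System.out.println(escaped.charAt(0) == '\\');                // true
  }
}
```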
