Posted to derby-dev@db.apache.org by Kathey Marsden <km...@sbcglobal.net> on 2007/07/20 20:15:52 UTC

Single character does not match high value unicode character with collation TERRITORY_BASED. Is this a bug

With TERRITORY_BASED collation, '_' does not match the character
\uFA2D. The result is the same for the English (en_US) and Norwegian
(no_NO) territories. With UCS_BASIC collation it matches fine. Could
you tell me if this is a bug? Here is a program to reproduce it.


Kathey


import java.sql.*;

public class HighCharacter {

    public static void main(String[] args) throws Exception {
        // Load the embedded Derby driver.
        Class.forName("org.apache.derby.jdbc.EmbeddedDriver");

        System.out.println("\n Territory no_NO");
        Connection conn = DriverManager.getConnection(
            "jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
        testLikeWithHighestValidCharacter(conn);
        conn.close();

        System.out.println("\n Territory en_US");
        conn = DriverManager.getConnection(
            "jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
        testLikeWithHighestValidCharacter(conn);
        conn.close();

        // No collation attribute: the database uses the default, UCS_BASIC.
        System.out.println("\n Collation UCS_BASIC");
        conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
        testLikeWithHighestValidCharacter(conn);
        conn.close();
    }

    public static void testLikeWithHighestValidCharacter(Connection conn)
            throws SQLException {
        Statement stmt = conn.createStatement();
        try {
            stmt.executeUpdate("drop table t1");
        } catch (SQLException se) {
            // Drop failure is OK: the table may not exist yet.
        }
        stmt.executeUpdate("create table t1(c11 int)");
        stmt.executeUpdate("insert into t1 values 1");
        stmt.close();

        // \uFA2D - the highest valid character according to
        // Character.isDefined() of JDK 1.4.
        PreparedStatement ps =
            conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");

        // Each pattern is expected to match the single character \uFA2D.
        String[] match = { "%", "_", "\uFA2D" };

        for (int i = 0; i < match.length; i++) {
            System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
            ps.setString(1, match[i]);
            ResultSet rs = ps.executeQuery();
            if (rs.next() && rs.getString(1).equals("1"))
                System.out.println("PASS");
            else
                System.out.println("FAIL: no match");
            rs.close();
        }
        ps.close();
    }
}

Re: Single character does not match high value unicode character with collation TERRITORY_BASED. Is this a bug

Posted by Daniel John Debrunner <dj...@apache.org>.

Mamta Satoor wrote:

> The method above uses the passed RuleBasedCollator to find the collation 
> element for '_'. For our specific example, in Norwegian, '_' translates 
> into only one collation element (vs 2 elements for '\uFA2D'). When 
> looking for '_', we eliminate only 1 collation element from the array 
> created for '\uFA2D' because '_' got translated into 1 collation 
> element. 

That in itself looks like a bug. _ means match any single character; at
no point should the code be translating _ into a collation element. The
use of _ as an 'any character' wildcard has no relationship to the
collation value of the underscore character.
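
To make the point concrete, here is a minimal sketch of a matcher in which '_' and '%' are handled purely as wildcards and never looked up in the collator. It is not Derby's Like.like and not a proposed patch: the class PerCharacterLike and its methods are hypothetical names, the matcher works one Java char at a time, and it ignores escape characters, surrogate pairs, and contractions (several characters that collate as one unit). Literal characters are compared by the full list of collation elements each one produces.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Derby code: LIKE matching where wildcards are
// never translated into collation elements.
final class PerCharacterLike {

    // Returns the locale's collator, assuming the JDK supplies a RuleBasedCollator.
    static RuleBasedCollator collatorFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        if (!(collator instanceof RuleBasedCollator)) {
            throw new IllegalStateException("No RuleBasedCollator for " + locale);
        }
        return (RuleBasedCollator) collator;
    }

    // Collects every collation element the collator produces for the text.
    static List<Integer> elementsOf(RuleBasedCollator collator, String text) {
        List<Integer> elements = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(text);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            elements.add(e);
        }
        return elements;
    }

    static boolean like(RuleBasedCollator collator, String value, String pattern) {
        return match(collator, value, 0, pattern, 0);
    }

    private static boolean match(RuleBasedCollator c, String v, int vi, String p, int pi) {
        if (pi == p.length()) {
            return vi == v.length();          // pattern used up: value must be too
        }
        char pc = p.charAt(pi);
        if (pc == '%') {
            // '%' matches any run of characters, including an empty one.
            for (int k = vi; k <= v.length(); k++) {
                if (match(c, v, k, p, pi + 1)) {
                    return true;
                }
            }
            return false;
        }
        if (vi == v.length()) {
            return false;                     // pattern needs a character, none left
        }
        if (pc == '_') {
            // '_' consumes exactly one character, however many elements it has.
            return match(c, v, vi + 1, p, pi + 1);
        }
        // Literal: compare the collation elements of the two single characters.
        List<Integer> valueElems = elementsOf(c, v.substring(vi, vi + 1));
        List<Integer> patternElems = elementsOf(c, p.substring(pi, pi + 1));
        return valueElems.equals(patternElems) && match(c, v, vi + 1, p, pi + 1);
    }

    public static void main(String[] args) {
        RuleBasedCollator norwegian = collatorFor(Locale.forLanguageTag("no-NO"));
        for (String pattern : new String[] { "%", "_", "\uFA2D" }) {
            System.out.println(like(norwegian, "\uFA2D", pattern));
        }
    }
}
```

Because '_' advances by one character rather than by a number of collation elements, the result does not depend on how many elements the matched character happens to produce.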

Dan.

Re: Single character does not match high value unicode character with collation TERRITORY_BASED. Is this a bug

Posted by Mamta Satoor <ms...@gmail.com>.

Hi Kathey,

I debugged the code below, and it looks like '_' not matching '\uFA2D'
might be a bug.
The actual comparison happens in existing code that was left over from
the national character types. In SQLChar and in the newly introduced
classes for collation, there are two methods:

public BooleanDataValue like(DataValueDescriptor pattern)
public BooleanDataValue like(DataValueDescriptor pattern,
DataValueDescriptor escape) throws StandardException

In SQLChar, we check whether we are dealing with national character types
and, if so, use special code for the LIKE implementation. The same special
code is used by collation-related classes such as CollatorSQLChar.

The special processing involves getting the collation elements for the
character string from the RuleBasedCollator, using
RuleBasedCollator.getCollationElementIterator(characterString.getString()).
Taking the specific example of Norwegian, '\uFA2D' converts into 2 collation
elements (not 1, and this is the cause of the problem). These collation
elements are passed as an int array to the following method in the
iapi.types.Like class:

public static Boolean like(int[] value, int valueLength, int[] pattern,
int patternLength, RuleBasedCollator collator)
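
The element counts described above can be checked outside Derby with the plain JDK classes. The following is a small sketch, not Derby code: the class name CollationElementCount and its helper methods are made up for illustration, and it assumes that Collator.getInstance returns a RuleBasedCollator for the locale, which the standard JDK normally does. The counts it prints may differ between JDK versions.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical helper: counts the collation elements a string produces.
final class CollationElementCount {

    // Returns the locale's collator, assuming the JDK supplies a RuleBasedCollator.
    static RuleBasedCollator collatorFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        if (!(collator instanceof RuleBasedCollator)) {
            throw new IllegalStateException("No RuleBasedCollator for " + locale);
        }
        return (RuleBasedCollator) collator;
    }

    // Walks the iterator until NULLORDER and counts the elements seen.
    static int countElements(RuleBasedCollator collator, String text) {
        CollationElementIterator it = collator.getCollationElementIterator(text);
        int count = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        RuleBasedCollator norwegian = collatorFor(Locale.forLanguageTag("no-NO"));
        System.out.println("'_'       -> " + countElements(norwegian, "_"));
        System.out.println("'\\uFA2D' -> " + countElements(norwegian, "\uFA2D"));
    }
}
```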

The method above uses the passed RuleBasedCollator to find the collation
element for '_'. In our specific example, in Norwegian, '_' translates into
only one collation element (vs. 2 elements for '\uFA2D'). When matching
'_', we skip only 1 collation element of the array created for '\uFA2D',
because '_' was translated into 1 collation element. The following code is
copied from Like.like:
   else if (matchSpecial(pat, pLoc, pEnd, anyCharInts))
   {
    // regardless of the char, it matches
    vLoc += anyCharInts.length;
    pLoc += anyCharInts.length;

    result = checkLengths(vLoc, vEnd, pLoc, pat, pEnd, anyStringInts);
    if (result != null)
     return result;
   }

So it seems that the code above can't assume that every character in, say,
Norwegian maps to exactly 1 collation element just because '_' maps to
1 element.
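
The arithmetic behind the failure can be shown with a stripped-down sketch. This is not the Derby method: FixedAdvanceDemo is a hypothetical class that handles only the pattern '_', and the int values are made-up placeholders rather than real collation elements. It only mirrors the fixed advance by anyCharInts.length seen in the snippet.

```java
// Hypothetical illustration of advancing by a fixed element count for '_'.
final class FixedAdvanceDemo {

    // Matches a value (as collation elements) against the single pattern '_',
    // advancing the value by the element count of '_' itself.
    static boolean singleWildcardMatches(int[] valueElements, int[] anyCharInts) {
        int vLoc = 0;
        int pLoc = 0;

        // Same step as the quoted code: both positions move by anyCharInts.length.
        vLoc += anyCharInts.length;
        pLoc += anyCharInts.length;

        // A match requires both the value and the pattern to be fully consumed.
        return vLoc == valueElements.length && pLoc == anyCharInts.length;
    }

    public static void main(String[] args) {
        int[] anyCharInts = { 7 };            // '_' as one element (placeholder value)
        int[] oneElementChar = { 101 };       // a character with one element
        int[] twoElementChar = { 101, 102 };  // a character with two elements

        System.out.println(singleWildcardMatches(oneElementChar, anyCharInts));
        System.out.println(singleWildcardMatches(twoElementChar, anyCharInts));
    }
}
```

With a two-element character, one element is left over after the advance, so the value is not fully consumed and the match is reported as failed.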

I think we should go ahead and open a JIRA entry for this. I would like to
hear if anyone has comments on this.

thanks,
Mamta


Re: Single character does not match high value unicode character with collation TERRITORY_BASED. Is this a bug

Posted by Mamta Satoor <ms...@gmail.com>.

Kathey, let me take a look at this.

thanks,
Mamta

