Posted to derby-dev@db.apache.org by "Andreas Korneliussen (JIRA)" <de...@db.apache.org> on 2006/09/18 14:04:23 UTC

[jira] Updated: (DERBY-1862) Simple hash improves performance

     [ http://issues.apache.org/jira/browse/DERBY-1862?page=all ]

Andreas Korneliussen updated DERBY-1862:
----------------------------------------

    Attachment: DERBY-1862.diff

Attached is a patch that takes another approach to improving the SQLEqualsIgnoreCase method. The patch checks the identity and length of the strings to be compared before converting them to upper case with the English locale.

String.toUpperCase(..) with the English locale should return a string with the same number of characters, so it should be valid to compare the number of characters before doing any conversions.
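
A minimal sketch of the identity and length check described above. It is not the contents of DERBY-1862.diff: the class and method names are hypothetical, and treating null as never matching is an assumption. The sketch also shows where the length shortcut stops being equivalent to upper-casing both strings, because "ß" (U+00DF) upper-cases to the two-character string "SS".

```java
import java.util.Locale;

// Hypothetical sketch; not the code attached to DERBY-1862.
final class IgnoreCaseCheckSketch {

    // Reference behaviour: upper-case both strings, then compare.
    static boolean equalsIgnoreCasePlain(String s1, String s2) {
        if (s1 == null || s2 == null) {
            return false; // assumption: null never matches
        }
        return s1.toUpperCase(Locale.ENGLISH).equals(s2.toUpperCase(Locale.ENGLISH));
    }

    // Variant with cheap checks before any conversion.
    static boolean equalsIgnoreCaseFast(String s1, String s2) {
        if (s1 == null || s2 == null) {
            return false; // assumption: null never matches
        }
        if (s1 == s2) {
            return true;  // same object, nothing to convert
        }
        if (s1.length() != s2.length()) {
            return false; // relies on upper-casing preserving length
        }
        return s1.toUpperCase(Locale.ENGLISH).equals(s2.toUpperCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // the German sharp s
        System.out.println(equalsIgnoreCaseFast("name", "NAME"));
        System.out.println(equalsIgnoreCasePlain(sharpS, "SS"));
        System.out.println(equalsIgnoreCaseFast(sharpS, "SS"));
    }
}
```

For plain ASCII identifiers both variants agree; they differ only for characters whose upper-case form has a different length.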

The patch posted as part of the description will leak memory, since strings are never removed from the uppercaseMap.

> Simple hash improves performance
> --------------------------------
>
>                 Key: DERBY-1862
>                 URL: http://issues.apache.org/jira/browse/DERBY-1862
>             Project: Derby
>          Issue Type: Improvement
>          Components: Performance
>    Affects Versions: 10.1.3.1, 10.1.2.1
>         Environment: WinXp, JRE 1.5_6., Hibernate 3.1
>            Reporter: Tore Andre Olmheim
>         Attachments: DERBY-1862.diff
>
>
> We are currently developing a system where we load between 1000 and 5000 objects in one go. The user can load different chunks of objects at any time as he/she is navigating.
> The system consists of a Java application which accesses Derby via Hibernate.
> During profiling we discovered that org.apache.derby.iapi.util.StringUtil is the biggest bottleneck in the system.
> The method SQLEqualsIgnoreCase(String s1, String s2) upper-cases both s1 and s2 on every call.
> By putting the upper-cased value into a Hashtable and using the input string as the key, we increased performance by about 40%.
> Our test users report that the system now seems to run at "double speed".
> The class calling StringUtil.SQLEqualsIgnoreCase in this case is
> org.apache.derby.impl.jdbc.EmbedResultSet
> This class should also be checked, as it seems to do a lot of looping.
> It might be a candidate for hashing, as it is stated in the code:
> "// REVISIT: we might want to cache our own info..."
> Here is a diff against the 10.1.3.1 source for org.apache.derby.iapi.util.StringUtil
> 22a23
> > import java.util.Hashtable;
> 319c320,326
> < 			return s1.toUpperCase(Locale.ENGLISH).equals(s2.toUpperCase(Locale.ENGLISH));
> ---
> >       {
> >          String s1Up = (String) uppercaseMap.get(s1);
> >          if (s1Up == null)
> >          {
> >             s1Up = s1.toUpperCase(Locale.ENGLISH);
> >             uppercaseMap.put(s1,s1Up);
> >          }
> 320a328,332
> >          String s2Up = (String) uppercaseMap.get(s2);
> >          if (s2Up == null)
> >          {
> >             s2Up = s2.toUpperCase(Locale.ENGLISH);
> >             uppercaseMap.put(s2,s2Up);
> 321a334
> >          return s1Up.equals(s2Up);
> 322a336,339
> >          //return s1.toUpperCase(Locale.ENGLISH).equals(s2.toUpperCase(Locale.ENGLISH));
> >       }
> >    }
> >    private static Hashtable uppercaseMap = new Hashtable();


Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
Andreas Korneliussen (JIRA) wrote:

> 
> Attached is a patch which uses another approach to improve the SQLEqualsIgnoreCase method. The patch check the identity and length of the strings to be compared, before doing conversions to uppercase with english locale. 
> 
> String.toUpperCase(..) with english locale, should return a string with the same number of characters, and it should therefore be valid to do a check of number of characters before doing any conversions.

Still, nothing beats using column indexes :-)

Another optimization would be to store the column
descriptors in a normalized form.  (Maybe they already are.)  Then only
one of the strings would need to be upper-cased for each comparison.
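
A minimal sketch of that idea, assuming the column names are upper-cased once when the descriptor is built. The class NormalizedColumnsSketch and its findColumn method are hypothetical and are not Derby's EmbedResultSet code. Here the lookup string is upper-cased once per call and the stored names are never converted again.

```java
import java.util.Locale;

// Hypothetical sketch; names and structure are assumptions, not Derby code.
final class NormalizedColumnsSketch {

    // Column names stored in normalized (upper-case, English locale) form.
    private final String[] normalizedNames;

    NormalizedColumnsSketch(String... columnNames) {
        normalizedNames = new String[columnNames.length];
        for (int i = 0; i < columnNames.length; i++) {
            normalizedNames[i] = columnNames[i].toUpperCase(Locale.ENGLISH);
        }
    }

    // Returns the 1-based index of the first matching column, or -1 if none.
    int findColumn(String name) {
        String key = name.toUpperCase(Locale.ENGLISH); // only this side is converted
        for (int i = 0; i < normalizedNames.length; i++) {
            if (normalizedNames[i].equals(key)) {
                return i + 1;
            }
        }
        return -1;
    }
}
```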

--
Øystein

Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by "Bernt M. Johnsen" <Be...@Sun.COM>.
Andreas Korneliussen wrote:
> <....>
> 
>> As far as I remember from my high-school German, is that even if all "ß" may be
>> converted to uppercase "SS", not all "SS" in uppercase may be converted to the
>> lowercase "ß". If the "SS" appears in a combined word (in German, words are
>> combined by concatenating them, as in Norwegian) where one word ends with "S"
>> and the second word starts with "S", the result when converted to lowercase
>> should be "ss" (I am trying to construct an example, but my German is very, very
>> rusty....... ;-)
>>
> 
> There is of course no logic in String.toLowerCase() to make "SS" be
> converted to "ß" based on German grammar rules, since it does a
> character by character conversion.
> 
> So "ICH HEISSE BERNT".toLowerCase() will be "ich heisse bernt", and not
> "ich heiße bernt" ;-)

Which is perfectly legal in modern German, I think ("ss" being an optional way
of writing "ß").

> 
> Regards
> Andreas
> 
> 


-- 
Bernt Marius Johnsen, Database Technology Group,
Staff Engineer, Technical Lead Derby/Java DB
Sun Microsystems, Trondheim, Norway

Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Andreas Korneliussen <An...@Sun.COM>.
<....>

> As far as I remember from my high-school German, is that even if all "ß" may be
> converted to uppercase "SS", not all "SS" in uppercase may be converted to the
> lowercase "ß". If the "SS" appears in a combined word (in German, words are
> combined by concatenating them, as in Norwegian) where one word ends with "S"
> and the second word starts with "S", the result when converted to lowercase
> should be "ss" (I am trying to construct an example, but my German is very, very
> rusty....... ;-)
> 

There is of course no logic in String.toLowerCase() to make "SS" be
converted to "ß" based on German grammar rules, since it does a
character by character conversion.
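
A small sketch of this behaviour, using only String.toLowerCase(Locale) and String.toUpperCase(Locale). The class and method names are hypothetical. Passing Locale.ENGLISH explicitly is a choice made for this sketch, since the no-argument toLowerCase() uses the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical sketch; only standard java.lang.String calls are used.
final class LowerCaseSketch {

    static String lower(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    // Upper-case and then lower-case again; not always the identity.
    static String roundTrip(String s) {
        return s.toUpperCase(Locale.ENGLISH).toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        System.out.println(lower("ICH HEISSE BERNT")); // "SS" stays two letters
        System.out.println(roundTrip("\u00DF"));       // sharp s does not come back
    }
}
```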

So "ICH HEISSE BERNT".toLowerCase() will be "ich heisse bernt", and not
"ich heiße bernt" ;-)

Regards
Andreas



Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by "Bernt M. Johnsen" <Be...@Sun.COM>.
Andreas Korneliussen wrote:
> String.toUpperCase() is locale dependent, however I am not sure that
> String.equalsIgnoreCase() is locale dependend (does not seem so when
> reading the code and javadoc).
> 
> I did find an issue with the German double s: ß.
> 
> "ß".toUpperCase() returns "SS".

That is according to the Unicode standard.

> However "ß".equalsIgnoreCase("SS") returns false.

As far as I remember from my high-school German, even though every "ß" may be
converted to uppercase "SS", not every uppercase "SS" may be converted to the
lowercase "ß". If the "SS" appears in a compound word (in German, words are
combined by concatenating them, as in Norwegian) where one word ends with "S"
and the next word starts with "S", the result when converted to lowercase
should be "ss" (I am trying to construct an example, but my German is very, very
rusty....... ;-)

> 
> So basically, "ß".toUpperCase().equalsIgnoreCase("ß") returns false.
> 
> The Derby method: SQLUtil.SQLIgnoreCase("ß", "SS") returns true (however
> the patch which I attached, will make it return false and therefore is
> not as intended).
> 
> If my column name is "classnames", should it be accessible by using the
> string "claßnames" ?
> 
> Regards
> Andreas
> 


-- 
Bernt Marius Johnsen, Database Technology Group,
Staff Engineer, Technical Lead Derby/Java DB
Sun Microsystems, Trondheim, Norway

Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Andreas Korneliussen <An...@Sun.COM>.
Daniel John Debrunner wrote:
> Andreas Korneliussen wrote:
> 
>> Øystein Grøvlen wrote:
>>
>>> Andreas Korneliussen (JIRA) wrote:
>>>
>>>
>>>> String.toUpperCase(..) with english locale, should return a string
>>>> with the same number of characters, and it should therefore be valid
>>>> to do a check of number of characters before doing any conversions.
>>> Is it correct to always use English locale in this case?  Ref the
>>> reference guide on SQL identifiers:
>>>
>>>      An ordinary identifier must begin with a letter and contain
>>>      only letters, underscore characters (_), and digits. The
>>>      permitted letters and digits include all Unicode letters and
>>>      digits, but Derby does not attempt to ensure that the
>>>      characters in identifiers are valid in the database's
>>>      locale.
>>>
>>> Should not it be possible to match column names in any locale?
>>>
> 
> No, see below.
> 
>> Your question is a valid question to ask about this method, however my
>> intention was to make the method keep its current behavior. The patch
>> simply preserves the current behaviour (which is to use english locale).
>> So any sets of strings s1 and s2 should make the method return the same
>> values as before the patch. If this is not the case, the patch is not as
>> intended.
>>
>> When looking deeper into the String class, my understanding is that the
>> only Locale which has different semantics than other Locales when it
>> comes to toUpperCase(Locale..), is Turkish, so maybe Derby does not work
>> correctly in Turkish locale.
> 
> I think the changes were made to use a single locale (English) for the
> SQL language so that Derby would work in Turkish. Having the name
> matching in SQL be dependent on the locale of the client or engine would
> mean that the potential exists for a SQL statement from a single
> application to have different meanings in different locales. That is not
> the expected behaviour when working against a programming language.
> 
> When the SQL parser upper cased items in the engine's locale an
> application using 'insert' would fail in Turkish, as it does not upper
> case to "INSERT".
> 
>> I also wondered why Derby has its own SQLIgnoreCase method, instead of
>> simply using String.equalsIgnoreCase(). The Derby implementation is very
>> inefficient compared to the String.equalsIgnoreCase() method, since you
>> risk creating two new string objects before doing the comparison.
> 
> I think because String.equalsIgnoreCase() is dependent on the current
> locale.
> 

String.toUpperCase() is locale dependent; however, I am not sure that
String.equalsIgnoreCase() is locale dependent (it does not seem so when
reading the code and javadoc).

I did find an issue with the German double s: ß.

"ß".toUpperCase() returns "SS".

However "ß".equalsIgnoreCase("SS") returns false.

So basically, "ß".toUpperCase().equalsIgnoreCase("ß") returns false.

The Derby method StringUtil.SQLEqualsIgnoreCase("ß", "SS") returns true
(however, the patch which I attached will make it return false and is
therefore not as intended).
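
The observations above can be reproduced with a short sketch. The class SharpSSketch and the helper upperCaseBoth are hypothetical; upperCaseBoth mirrors the comparison shown in the diff in the issue description (upper-case both sides with Locale.ENGLISH, then equals) and is not Derby's actual class.

```java
import java.util.Locale;

// Hypothetical sketch; upperCaseBoth imitates the pre-patch comparison.
final class SharpSSketch {

    static final String SHARP_S = "\u00DF"; // the German sharp s

    static boolean upperCaseBoth(String a, String b) {
        return a.toUpperCase(Locale.ENGLISH).equals(b.toUpperCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        // Upper-casing expands one character into two.
        System.out.println(SHARP_S.toUpperCase(Locale.ENGLISH)); // SS
        // equalsIgnoreCase compares lengths first, so this cannot match.
        System.out.println(SHARP_S.equalsIgnoreCase("SS"));      // false
        // Upper-casing both sides does match.
        System.out.println(upperCaseBoth(SHARP_S, "SS"));        // true
        System.out.println(upperCaseBoth("classnames", "cla" + SHARP_S + "names")); // true
    }
}
```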

If my column name is "classnames", should it be accessible by using the
string "claßnames" ?

Regards
Andreas


Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Daniel John Debrunner <dj...@apache.org>.
Andreas Korneliussen wrote:

> Øystein Grøvlen wrote:
> 
>>Andreas Korneliussen (JIRA) wrote:
>>
>>
>>>String.toUpperCase(..) with english locale, should return a string
>>>with the same number of characters, and it should therefore be valid
>>>to do a check of number of characters before doing any conversions.
>>
>>Is it correct to always use English locale in this case?  Ref the
>>reference guide on SQL identifiers:
>>
>>      An ordinary identifier must begin with a letter and contain
>>      only letters, underscore characters (_), and digits. The
>>      permitted letters and digits include all Unicode letters and
>>      digits, but Derby does not attempt to ensure that the
>>      characters in identifiers are valid in the database's
>>      locale.
>>
>>Should not it be possible to match column names in any locale?
>>

No, see below.

> Your question is a valid question to ask about this method, however my
> intention was to make the method keep its current behavior. The patch
> simply preserves the current behaviour (which is to use english locale).
> So any sets of strings s1 and s2 should make the method return the same
> values as before the patch. If this is not the case, the patch is not as
> intended.
> 
> When looking deeper into the String class, my understanding is that the
> only Locale which has different semantics than other Locales when it
> comes to toUpperCase(Locale..), is Turkish, so maybe Derby does not work
> correctly in Turkish locale.

I think the changes were made to use a single locale (English) for the
SQL language so that Derby would work in Turkish. Having the name
matching in SQL be dependent on the locale of the client or engine would
mean that the potential exists for a SQL statement from a single
application to have different meanings in different locales. That is not
the expected behaviour when working against a programming language.

When the SQL parser upper-cased items in the engine's locale, an
application using 'insert' would fail in Turkish, as it does not upper
case to "INSERT".
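
A minimal illustration of the Turkish case, using only java.util.Locale and String.toUpperCase(Locale). The class name is hypothetical. In the Turkish locale the lower-case "i" upper-cases to the dotted capital "İ" (U+0130), so the result is not the keyword "INSERT".

```java
import java.util.Locale;

// Hypothetical sketch; only standard library calls are used.
final class TurkishUpperCaseSketch {

    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Fixed English locale: the keyword is recognised.
        System.out.println("INSERT".equals(upper("insert", Locale.ENGLISH))); // true
        // Turkish locale: the first letter becomes U+0130, so no match.
        System.out.println("INSERT".equals(upper("insert", TURKISH)));        // false
    }
}
```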

> I also wondered why Derby has its own SQLIgnoreCase method, instead of
> simply using String.equalsIgnoreCase(). The Derby implementation is very
> inefficient compared to the String.equalsIgnoreCase() method, since you
> risk creating two new string objects before doing the comparison.

I think because String.equalsIgnoreCase() is dependent on the current
locale.

Dan.




Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Andreas Korneliussen <An...@Sun.COM>.
Øystein Grøvlen wrote:
> Andreas Korneliussen (JIRA) wrote:
> 
>> String.toUpperCase(..) with english locale, should return a string
>> with the same number of characters, and it should therefore be valid
>> to do a check of number of characters before doing any conversions.
> 
> Is it correct to always use English locale in this case?  Ref the
> reference guide on SQL identifiers:
> 
>       An ordinary identifier must begin with a letter and contain
>       only letters, underscore characters (_), and digits. The
>       permitted letters and digits include all Unicode letters and
>       digits, but Derby does not attempt to ensure that the
>       characters in identifiers are valid in the database's
>       locale.
> 
> Should not it be possible to match column names in any locale?
> 

Your question is a valid one to ask about this method; however, my
intention was to make the method keep its current behavior. The patch
simply preserves the current behavior (which is to use the English locale),
so any pair of strings s1 and s2 should make the method return the same
value as before the patch. If this is not the case, the patch is not as
intended.

When looking deeper into the String class, my understanding is that the
only Locale which has different semantics than other Locales when it
comes to toUpperCase(Locale..), is Turkish, so maybe Derby does not work
correctly in Turkish locale.

I also wondered why Derby has its own SQLEqualsIgnoreCase method, instead of
simply using String.equalsIgnoreCase(). The Derby implementation is very
inefficient compared to the String.equalsIgnoreCase() method, since you
risk creating two new string objects before doing the comparison.

Andreas


> -- 
> Øystein



Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Daniel John Debrunner <dj...@apache.org>.
Knut Anders Hatlen wrote:

> Øystein Grøvlen <Oy...@Sun.COM> writes:
> 
> 
>>Andreas Korneliussen (JIRA) wrote:
>>
>>
>>>String.toUpperCase(..) with english locale, should return a string
>>>with the same number of characters, and it should therefore be valid
>>>to do a check of number of characters before doing any conversions.
>>
>>Is it correct to always use English locale in this case?  Ref the
>>reference guide on SQL identifiers:
> 
> 
> And is it correct to upcase the identifiers before comparing them in
> findColumnName()?

I believe so, based upon this text in the javadoc of ResultSet.

"Column names used as input to getter methods are case insensitive. When
a getter method is called with a column name and several columns have
the same name, the value of the first matching column will be returned."

Dan.



Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Øystein Grøvlen <Oy...@Sun.COM> writes:

> Andreas Korneliussen (JIRA) wrote:
>
>> String.toUpperCase(..) with english locale, should return a string
>> with the same number of characters, and it should therefore be valid
>> to do a check of number of characters before doing any conversions.
>
> Is it correct to always use English locale in this case?  Ref the
> reference guide on SQL identifiers:

And is it correct to upcase the identifiers before comparing them in
findColumnName()?

  ResultSet rs = stmt.executeQuery("select \"x\", x from t");
  rs.next();
  int smallX = rs.getInt("x"); // which x is this?
  int bigX = rs.getInt("X");   // and this?

In Derby both smallX and bigX get the value of column 1, whereas I
would expect them to get the values from column 1 and 2,
respectively. I haven't checked what the spec says.
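
The behaviour described above can be modelled with a small sketch. It assumes a lookup that upper-cases both names and returns the first match; the class FirstMatchSketch and its findColumn method are hypothetical and are not Derby's implementation. The array stands in for the result columns of the query: the quoted "x" keeps its lower case, the unquoted x is stored as X.

```java
import java.util.Locale;

// Hypothetical model of a case-insensitive, first-match column lookup.
final class FirstMatchSketch {

    // Returns the 1-based index of the first column whose name matches
    // when both names are upper-cased, or -1 if there is no match.
    static int findColumn(String[] columnNames, String name) {
        String key = name.toUpperCase(Locale.ENGLISH);
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i].toUpperCase(Locale.ENGLISH).equals(key)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] columns = { "x", "X" }; // select "x", x from t
        System.out.println(findColumn(columns, "x")); // 1
        System.out.println(findColumn(columns, "X")); // 1
    }
}
```

With this kind of lookup, column 2 can only be reached by its index.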

-- 
Knut Anders

Re: [jira] Updated: (DERBY-1862) Simple hash improves performance

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
Andreas Korneliussen (JIRA) wrote:

 > String.toUpperCase(..) with english locale, should return a string
 > with the same number of characters, and it should therefore be valid
 > to do a check of number of characters before doing any conversions.

Is it correct to always use English locale in this case?  Ref the
reference guide on SQL identifiers:

	  An ordinary identifier must begin with a letter and contain
	  only letters, underscore characters (_), and digits. The
	  permitted letters and digits include all Unicode letters and
	  digits, but Derby does not attempt to ensure that the
	  characters in identifiers are valid in the database's
	  locale.

Should not it be possible to match column names in any locale?

--
Øystein