You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by arina-ielchiieva <gi...@git.apache.org> on 2017/04/27 12:29:38 UTC

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/821

    DRILL-5450: Fix initcap function to convert upper case characters cor\u2026

    \u2026rectly

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-5450

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/821.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #821
    
----
commit 22f5a8afb11cf6bbe7282b70404c7e30b7bc2c8e
Author: Arina Ielchiieva <ar...@gmail.com>
Date:   2017-04-27T11:44:18Z

    DRILL-5450: Fix initcap function to convert upper case characters correctly

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/821


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r115026331
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java ---
    @@ -308,4 +308,58 @@ public void testReverseLongVarChars() throws Exception {
           FileUtils.deleteQuietly(path);
         }
       }
    +
    +  @Test
    +  public void testLower() throws Exception {
    +    testBuilder()
    +        .sqlQuery("select\n" +
    +            "lower('ABC') col_upper,\n" +
    +            "lower('abc') col_lower,\n" +
    --- End diff --
    
    Created Jira DRILL-5477 and added appropriate unit test which is ignored for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r114249116
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java ---
    @@ -144,41 +144,28 @@ public static int varTypesToInt(final int start, final int end, DrillBuf buffer)
         return result;
       }
     
    -  // Assumes Alpha as [A-Za-z0-9]
    -  // white space is treated as everything else.
    +  /**
    +   * Capitalizes first letter in each word.
    +   * Any symbol except digits and letters is considered as word delimiter.
    +   *
    +   * @param start start position in input buffer
    +   * @param end end position in input buffer
    +   * @param inBuf buffer with input characters
    +   * @param outBuf buffer with output characters
    +   */
       public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf outBuf) {
    -    boolean capNext = true;
    +    boolean capitalizeNext = true;
         int out = 0;
         for (int id = start; id < end; id++, out++) {
    -      byte currentByte = inBuf.getByte(id);
    -
    -      // 'A - Z' : 0x41 - 0x5A
    -      // 'a - z' : 0x61 - 0x7A
    -      // '0-9' : 0x30 - 0x39
    -      if (capNext) { // curCh is whitespace or first character of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          capNext = false;
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          capNext = false;
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          capNext = false;
    -          currentByte -= 0x20; // Uppercase this character
    -        }
    -        // else {} whitespace
    -      } else { // Inside of a word or white space after end of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          // noop
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          // noop
    -        } else { // whitespace
    -          capNext = true;
    -        }
    +      int currentByte = inBuf.getByte(id);
    --- End diff --
    
    This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte code that requires special encoding/decoding to convert to Unicode characters. Without that encoding, this method won't work for Cyrillic, Greek or any other character set with upper/lower distinctions.
    
    Since this method never worked, it is probably OK to make it a bit less broken than before: at least now it works for ASCII. Please add unit tests below, then file a JIRA, for the fact that this function does not work with UTF-8 despite the fact that Drill claims it supports UTF-8.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #821: DRILL-5450: Fix initcap function to convert upper case cha...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/821
  
    +1 so we can commit this. But Paul is right. This could be a lot better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by Ben-Zvi <gi...@git.apache.org>.
Github user Ben-Zvi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r113813564
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java ---
    @@ -169,7 +169,7 @@ public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf outBuf)
             if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
               // noop
             } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    +          currentByte += 0x20; // Lowercase this character
    --- End diff --
    
    toLowerCase() is implemented as a big switch statement for the Unicode, so very little cost ....


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r114249334
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestStringFunctions.java ---
    @@ -308,4 +308,58 @@ public void testReverseLongVarChars() throws Exception {
           FileUtils.deleteQuietly(path);
         }
       }
    +
    +  @Test
    +  public void testLower() throws Exception {
    +    testBuilder()
    +        .sqlQuery("select\n" +
    +            "lower('ABC') col_upper,\n" +
    +            "lower('abc') col_lower,\n" +
    --- End diff --
    
    Please add tests for Greek and Cyrillic. Our source encoding is UTF-8, so you can enter the characters directly. Or, if that does not work, you can instead use the Java Unicode encoding: U1234.
    
    If the tests fail because of parsing of SQL, please file a bug. If they fail because the function above does not support UTF-8, please file a different bug.
    
    In either case, you can then comment out the test cases and add a comment that says that they fail due to DRILL-xxxx, whatever your bug number turns out to be.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #821: DRILL-5450: Fix initcap function to convert upper case cha...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/821
  
    @paul-rogers 
    I have changed customer lower / upper implementation in favor of `Character` methods.
    Made changes in lower, upper and initcap functions.
    Please review when possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r113900140
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java ---
    @@ -169,7 +169,7 @@ public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf outBuf)
             if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
               // noop
             } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    +          currentByte += 0x20; // Lowercase this character
    --- End diff --
    
    I did not notice any significant difference in performance, so will replace to `Character` methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r113791308
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java ---
    @@ -169,7 +169,7 @@ public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf outBuf)
             if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
               // noop
             } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    +          currentByte += 0x20; // Lowercase this character
    --- End diff --
    
    currentByte = Character.toLowerCase( currentByte )
    
    The above handles all the Unicode complexity -- no need for us to reimplement it here.
    
    A concern might be performance. Try calling the above 10K times in a loop and this function 10K times. Is there a difference in cost?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---