You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by sachouche <gi...@git.apache.org> on 2017/12/13 23:04:28 UTC

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

GitHub user sachouche opened a pull request:

    https://github.com/apache/drill/pull/1072

    DRILL-5879: Improved SQL Pattern Contains Performance

    **BACKGROUND**
    - JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) goal is to improve the "Contains" pattern performance
    - [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) was created subsequently to avoid the ASCII / Unicode pre-processing overhead.
    - This pull-request addresses the algorithmic part of this functionality
    
    **ANALYSIS**
    - Contains has O(n*m) complexity
    - There are two ways to optimize the associated runtime
    1) Minimize the number of instructions, pipelining, and CPU stalls
    2) Use smarter algorithms (compared to the naive one); for example, the Boyer-Moore algorithm (which is implemented by several popular Open Source tools such as grep)
    
    **IMPLEMENTATION**
    Our approach contains both suggestions (listed in the analysis)
    - Created five matchers that are chosen based on the pattern length
    - The first three, are based on pattern 1) and target patterns with length [1..4[ 
    - The forth one, has a similar runtime then the current implementation and targets patterns with length [4..10[
    - We use the the Boyer-Moore algorithm for patterns with a length larger than 9 bytes
    NOTE - the JDK doesn't use this algorithm because of two main  reasons
    - Two extra arrays are necessary with a size relative to the supported character-set and the pattern length (this would be particularly costly for unicode as this would require 64k entries)
    - Each Contains (or indexOf) invocation would require memory and initialization overhead
    - Drill doesn't have this issue as a) initialization overhead is amortized since the pattern will be matched against many input values and b) our Contains logic is centered around bytes so the memory overhead is around 256 integers per fragment
    
    **PERFORMANCE IMPROVEMENTS**
    - We observe at least 25% improvement per Contains operation for matchers with pattern length lower than 4; 100% for negative cases (as the code never accesses a secondary for-loop)
    - It was noticed the Boyer-Moor algorithm performs poorly for small patterns as lookup accesses erase the associated optimizations; this algorithm performs extremely well when the pattern length increases as unlike the naive implementation it is able to use the lookup table to jump beyond the next character (since the matching phase has already gained so insight).
    - One popular introduction to this algorithm is the following use-case
    - Input: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    - Pattern: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaax
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-5879

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1072
    
----
commit 50ac5140420057692cac8a33f181eb16580a28e6
Author: Salim Achouche <sa...@gmail.com>
Date:   2017-12-13T22:24:45Z

    DRILL-5879: Improved SQL Pattern Contains Performance

----


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161906070
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
     
    -    final int txtLength = end - start;
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    --- End diff --
    
    space between + and 1


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158160358
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    --- End diff --
    
    In this code, the `lengthToProcess` is used only in the `for` loop, so a negative value causes the loop to never execute. The termination condition works OK. But, since the termination index is to compare the pattern length and index, and both will be 0 in the empty-pattern case, there is no reason for the special length == 0 check above.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161899883
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
     
    -    final int txtLength = end - start;
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length three */
    +  private final class MatcherThree extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private MatcherThree() {
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    --- End diff --
    
    space between - and 2


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158086368
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    --- End diff --
    
    Yes, it would have come here; but for simplicity I guess I can handle that use-case within the switch case.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158080247
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    --- End diff --
    
    same answer.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162124388
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
     
    -    final int txtLength = end - start;
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length three */
    +  private final class MatcherThree extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private MatcherThree() {
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    --- End diff --
    
    fixed


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161900034
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
     
    -    final int txtLength = end - start;
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length three */
    +  private final class MatcherThree extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private MatcherThree() {
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    --- End diff --
    
    space between + and 1


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158025793
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    --- End diff --
    
    switch instead of if 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158085723
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    --- End diff --
    
    will do.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158161881
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java ---
    @@ -446,6 +446,61 @@ public void testSqlPatternComplex() {
         assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), drillBuf)); // should match
       }
     
    +  @Test
    +  public void testSqlPatternContainsMUltipleMatchers() {
    --- End diff --
    
    These tests are incomplete. We do not test:
    
    * Pattern length = input length, match
    * Pattern length = input length, no match
    * Pattern length < input length
    * Pattern length = 0
    * Non-ASCII paterns and/or data
    * Data longer than the internal table sizes
    * Patterns of each of the special lengths for each of the above cases: match, not match, shorter, equals, longer


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1072


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158030234
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    --- End diff --
    
    same comment as above. No need to save in local variables


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160754694
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    --- End diff --
    
    Good point Paul!


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160764681
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int inputLength = end - start;
    +
    +      for (int idx1 = patternLength - 1, idx2; idx1 < inputLength;) {
    +        for (idx2 = patternLength - 1; patternArray[idx2] == drillBuf.getByte(start + idx1); --idx1, --idx2) {
    +          if (idx2 == 0) {
    +            return 1;
    +          }
    +        }
    +        // i += pattern.length - j; // For naive method
    +        idx1 += Math.max(offsetTable[patternLength - 1 - idx2], characterTable[drillBuf.getByte(start + idx1) & 0xFF]);
    +      }
    +      return 0;
    +    }
    +
    +    /** Build the jump table based on the mismatched character information **/
    +    private int[] makeCharTable() {
    +      final int TABLE_SIZE = 256; // This implementation is based on byte comparison
    +      int[] resultTable    = new int[TABLE_SIZE];
    +
    +      for (int idx = 0; idx < resultTable.length; ++idx) {
    +        resultTable[idx] = patternLength;
    +      }
    +
    +      for (int idx = 0; idx < patternLength - 1; ++idx) {
    +        final int patternValue    = ((int) patternArray[idx]) & 0xFF;
    --- End diff --
    
    Assume val1 is a byte value equal to 129, since java's byte DT is signed, then val1 will be negative. The bitwise AND with 0xFF will have a the side effect of turning off the negative flag.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162123690
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    --- End diff --
    
    Good idea; done this for all the matchers.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162124359
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
     
    -    final int txtLength = end - start;
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length three */
    +  private final class MatcherThree extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private MatcherThree() {
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    --- End diff --
    
    fixed


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161899357
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    --- End diff --
    
    redundant ?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160765882
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java ---
    @@ -446,6 +446,61 @@ public void testSqlPatternComplex() {
         assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), drillBuf)); // should match
       }
     
    +  @Test
    +  public void testSqlPatternContainsMUltipleMatchers() {
    --- End diff --
    
    Paul, the test-suite already has many other tests for SQL matching. But I agree, they now might hit only few of these matcher. I will add more tests.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160759375
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    --- End diff --
    
    Paul, this is more as a convention (within constructor) to use "this" in case the member variable and input is the same.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162119757
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    --- End diff --
    
    cut-and-paste issue :-)


---

[GitHub] drill issue #1072: DRILL-5879: Improved SQL Pattern Contains Performance

Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1072
  
    @paul-rogers Is this ready for commit?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158158949
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    --- End diff --
    
    But switch does not work for the last case: < 10


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158029512
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    --- End diff --
    
    This is a problem. It can become negative if end is same as start i.e. null text. 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161895158
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    --- End diff --
    
     length zero.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158160928
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int inputLength = end - start;
    +
    +      for (int idx1 = patternLength - 1, idx2; idx1 < inputLength;) {
    --- End diff --
    
    Where is `idx2` initialized?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158160610
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    --- End diff --
    
    Not needed. Call to no-argument `super()` is implicit.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158084806
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    --- End diff --
    
    This will not happen because of the outer-loop condition; idx < lengthToProcess which guarantees that we always have enough bytes to fully match the pattern; example:
    - input-len = 6; pattern = 4 this means lengthToProcess = 3 (6-4+1)
    - index 0, 1, and 2 are allowed since we have at least 4 bytes we can match (1 within the outer-loop and 3 in the inner one); index 3 will be rejected since we have only three bytes left (b3, b4, and b5)
    - If no match, then 



---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160763490
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int inputLength = end - start;
    +
    +      for (int idx1 = patternLength - 1, idx2; idx1 < inputLength;) {
    --- End diff --
    
    idx2 is initialized in the inner loop; it is defined before because its value is needed immediately after the inner loop. 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158032627
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    --- End diff --
    
    is this true for null pattern ?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158074289
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    --- End diff --
    
    HeapByteBuffer will always create an internal buffer. Empty string will have a byte array with zero bytes.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158160669
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    --- End diff --
    
    `this.` not really needed; there is no name conflict.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162120115
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    --- End diff --
    
    removed.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158031007
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    --- End diff --
    
    same problem. can become negative.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158033448
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    --- End diff --
    
    you might end up accessing drillBuf out of bound if idx+patternIndex > length of drillBuf.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158072762
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    --- End diff --
    
    will do.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158161425
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int inputLength = end - start;
    +
    +      for (int idx1 = patternLength - 1, idx2; idx1 < inputLength;) {
    +        for (idx2 = patternLength - 1; patternArray[idx2] == drillBuf.getByte(start + idx1); --idx1, --idx2) {
    +          if (idx2 == 0) {
    +            return 1;
    +          }
    +        }
    +        // i += pattern.length - j; // For naive method
    +        idx1 += Math.max(offsetTable[patternLength - 1 - idx2], characterTable[drillBuf.getByte(start + idx1) & 0xFF]);
    +      }
    +      return 0;
    +    }
    +
    +    /** Build the jump table based on the mismatched character information **/
    +    private int[] makeCharTable() {
    +      final int TABLE_SIZE = 256; // This implementation is based on byte comparison
    +      int[] resultTable    = new int[TABLE_SIZE];
    +
    +      for (int idx = 0; idx < resultTable.length; ++idx) {
    +        resultTable[idx] = patternLength;
    +      }
    +
    +      for (int idx = 0; idx < patternLength - 1; ++idx) {
    +        final int patternValue    = ((int) patternArray[idx]) & 0xFF;
    --- End diff --
    
    The cast may be unnecessary. `0xFF` is an `int` which will case the `byte` `patternArray[idx]` to be promoted to `int` implicitly.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158029113
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    --- End diff --
    
    since inputByte is being used only once and re initialized every time in for loop, we can avoid overhead of reading and saving that.  
    if (firstPattByte != drillBuf.getByte(start + idx)) {


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162121115
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    --- End diff --
    
    renamed all such variables to XXXPatternBytes


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158080071
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    --- End diff --
    
    This is not a problem as the for-loop condition (index = 0; index < lengthToProcess; ..) will prevent any processing to take place unless the correct condition is met. The key point here is that the input length is always provided by the "end - start" formula. 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r162517870
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java ---
    @@ -446,6 +446,61 @@ public void testSqlPatternComplex() {
         assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), drillBuf)); // should match
       }
     
    +  @Test
    +  public void testSqlPatternContainsMUltipleMatchers() {
    --- End diff --
    
    If tests exist, and you have verified that each of the lengths is, in fact, tested by those tests, then I'm fine with adding a comment identifying the cases and where to find the existing tests. The point is to make sure all variations are fully tested -- one way or another.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160758370
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    --- End diff --
    
    Ok, removed it (it was mainly got code clarity).


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158160058
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    --- End diff --
    
    Should there be a special case for an empty pattern? Just return 1 always. Else, seems odd for this to be the length = 0 || length > n case.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158030478
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    --- End diff --
    
    same problem as above. This can become negative.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158033645
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    +          }
    +        }
    +      }
    +      return patternIndex == patternLength ? 1 : 0;
    +    }
    +  }
    +
    +  /**
    +   * Boyer-Moore matcher algorithm; excellent for large patterns and for prefix patterns which appear
    +   * frequently in the input.
    +   */
    +  private final class BoyerMooreMatcher extends MatcherFcn {
    +    private final int[] offsetTable;
    +    private final int[] characterTable;
    +
    +    private BoyerMooreMatcher() {
    +      super();
    +
    +      this.offsetTable    = makeOffsetTable();
    +      this.characterTable = makeCharTable();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected int match(int start, int end, DrillBuf drillBuf)  {
    +      if (patternLength == 0) {
    --- End diff --
    
    do we even come here if patternLength is 0 ?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158034838
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    +        return 1;
    +      }
    +
    +      final int lengthToProcess = end - start - patternLength + 1;
    +      int patternIndex          = 0;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte == inputByte) {
    +          for (patternIndex = 1; patternIndex < patternLength; ++patternIndex) {
    +            final byte currInByte   = drillBuf.getByte(start + idx + patternIndex);
    +            final byte currPattByte = patternArray[patternIndex];
    +
    +            if (currInByte != currPattByte) {
    +              break;
    +            }
    +          }
    +
    +          if (patternIndex == patternLength) {
    +            break;
    --- End diff --
    
    can we just return from here ?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158159524
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    --- End diff --
    
    Is it faster to say `secondInByte = thirdInByte`; shuffle values between variables in each iteration rather than diving into direct memory multiple types for the same byte?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161905865
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length two */
    +  private final class MatcherTwo extends MatcherFcn {
    +
    +    private MatcherTwo() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
    --- End diff --
    
    can you initialize them in the constructor for matcher function ? that way you can initialize once and reuse for each match. 


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r158074965
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    --- End diff --
    
    These optimizations are useless with modern compilers. This would have been useful if the inputByte value was the same.. then we would save few instructions.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r161895458
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,286 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 0) {
    +      matcherFcn = new MatcherZero();
    +    } else if (patternLength == 1) {
    +      matcherFcn = new MatcherOne();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new MatcherTwo();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new MatcherThree();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherZero extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    +    private MatcherZero() {
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
           return 1;
         }
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class MatcherOne extends MatcherFcn {
    +
    +    private MatcherOne() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
    --- End diff --
    
    can we name it firstPatternByte ?


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160753323
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    --- End diff --
    
    That should help; though how do you suggest we implement such logic knowing that we're not sure when the else statement is encountered. I could try to unroll the loop but this would complicate the logic.


---

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1072#discussion_r160755097
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -19,44 +19,283 @@
     
     import io.netty.buffer.DrillBuf;
     
    -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +/** SQL Pattern Contains implementation */
    +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +  private final MatcherFcn matcherFcn;
     
       public SqlPatternContainsMatcher(String patternString) {
         super(patternString);
    +
    +    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
    +    // that there is no single implementation that can do it all well. So, we use multiple implementations
    +    // chosen based on the pattern length.
    +    if (patternLength == 1) {
    +      matcherFcn = new Matcher1();
    +    } else if (patternLength == 2) {
    +      matcherFcn = new Matcher2();
    +    } else if (patternLength == 3) {
    +      matcherFcn = new Matcher3();
    +    } else if (patternLength < 10) {
    +      matcherFcn = new MatcherN();
    +    } else {
    +      matcherFcn = new BoyerMooreMatcher();
    +    }
       }
     
       @Override
       public int match(int start, int end, DrillBuf drillBuf) {
    +    return matcherFcn.match(start, end, drillBuf);
    +  }
    +
    +  //--------------------------------------------------------------------------
    +  // Inner Data Structure
    +  // --------------------------------------------------------------------------
    +
    +  /** Abstract matcher class to allow us pick the most efficient implementation */
    +  private abstract class MatcherFcn {
    +    protected final byte[] patternArray;
    +
    +    protected MatcherFcn() {
    +      assert patternByteBuffer.hasArray();
    +
    +      patternArray = patternByteBuffer.array();
    +    }
    +
    +    /**
    +     * @return 1 if the pattern was matched; 0 otherwise
    +     */
    +    protected abstract int match(int start, int end, DrillBuf drillBuf);
    +  }
    +
    +  /** Handles patterns with length one */
    +  private final class Matcher1 extends MatcherFcn {
     
    -    if (patternLength == 0) { // Everything should match for null pattern string
    -      return 1;
    +    private Matcher1() {
    +      super();
         }
     
    -    final int txtLength = end - start;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start;
    +      final byte firstPattByte  = patternArray[0];
     
    -    // no match if input string length is less than pattern length
    -    if (txtLength < patternLength) {
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        }
    +        return 1;
    +      }
           return 0;
         }
    +  }
     
    +  /** Handles patterns with length two */
    +  private final class Matcher2 extends MatcherFcn {
     
    -    final int outerEnd = txtLength - patternLength;
    +    private Matcher2() {
    +      super();
    +    }
     
    -    outer:
    -    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess = end - start - 1;
    +      final byte firstPattByte  = patternArray[0];
    +      final byte secondPattByte = patternArray[1];
     
           // simplePattern string has meta characters i.e % and _ and escape characters removed.
           // so, we can just directly compare.
    -      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
    -        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
    -          continue outer;
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte firstInByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != firstInByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +
    +          if (secondInByte == secondPattByte) {
    +            return 1;
    +          }
             }
           }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with length three */
    +  private final class Matcher3 extends MatcherFcn {
     
    -      return 1;
    +    private Matcher3() {
    +      super();
         }
     
    -    return  0;
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +      final int lengthToProcess  = end - start -2;
    +      final byte firstPattByte   = patternArray[0];
    +      final byte secondPattByte  = patternArray[1];
    +      final byte thirdPattByte   = patternArray[2];
    +
    +      // simplePattern string has meta characters i.e % and _ and escape characters removed.
    +      // so, we can just directly compare.
    +      for (int idx = 0; idx < lengthToProcess; idx++) {
    +        final byte inputByte = drillBuf.getByte(start + idx);
    +
    +        if (firstPattByte != inputByte) {
    +          continue;
    +        } else {
    +          final byte secondInByte = drillBuf.getByte(start + idx +1);
    +          final byte thirdInByte  = drillBuf.getByte(start + idx +2);
    +
    +          if (secondInByte == secondPattByte && thirdInByte == thirdPattByte) {
    +            return 1;
    +          }
    +        }
    +      }
    +      return 0;
    +    }
    +  }
    +
    +  /** Handles patterns with arbitrary length */
    +  private final class MatcherN extends MatcherFcn {
    +
    +    private MatcherN() {
    +      super();
    +    }
    +
    +    /** {@inheritDoc} */
    +    @Override
    +    protected final int match(int start, int end, DrillBuf drillBuf) {
    +
    +      if (patternLength == 0) {
    --- End diff --
    
    Agree; I moved the special case ZERO length in the top.


---