You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by ppadma <gi...@git.apache.org> on 2017/10/28 04:04:26 UTC

[GitHub] drill pull request #1015: DRILL-5889: Simple pattern matchers can work with ...

GitHub user ppadma opened a pull request:

    https://github.com/apache/drill/pull/1015

    DRILL-5889: Simple pattern matchers can work with DrillBuf directly

    For the 4 simple patterns we have i.e. startsWith, endsWith, contains and constant,, we do not need the overhead of charSequenceWrapper. We can work with DrillBuf directly. This will save us from doing isAscii check and UTF8 decoding for each row.
    UTF-8 encoding ensures that no UTF-8 character is a prefix of any other valid character. So, instead of decoding varChar from each row we are processing, encode the patternString once during setup and do raw byte comparison. 
    This improved overall performance for filter operator by around 20%.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-5899

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1015.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1015
    
----
commit ebe2c6f9110f0501a85cb5ae6e119e05254b9f3e
Author: Padma Penumarthy <pp...@yahoo.com>
Date:   2017-10-25T20:37:25Z

    DRILL-5889: Simple pattern matchers can work with DrillBuf directly

----


---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1015#discussion_r149552453
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/AbstractSqlPatternMatcher.java ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.expr.fn.impl;
    +
    +import com.google.common.base.Charsets;
    +import org.apache.drill.common.exceptions.UserException;
    +import java.nio.ByteBuffer;
    +import java.nio.CharBuffer;
    +import java.nio.charset.CharacterCodingException;
    +import java.nio.charset.CharsetEncoder;
    +import static org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.logger;
    +
    +// To get good performance for most commonly used pattern matches
    --- End diff --
    
    Javadoc?
    
    ```
    /**
     * This is a Javadoc comment and appears in generated documentation.
     */
    // This is a plain comment and does not appear in documentation.
    ```


---

[GitHub] drill issue #1015: DRILL-5899: Simple pattern matchers can work with DrillBu...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    @paul-rogers Thanks a lot for the review. Updated the PR with code review comments. Please take a look. 
    
    Overall, good improvement with this change. Here are the numbers.
    
    select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a' 
    1.4 sec vs 7 sec
    
    select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like '%a%' 
    6.5 sec vs 13.5 sec
    
    select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a%' 
    1.4 sec vs 5.8 sec
    
    select count(*) from `/Users/ppenumarthy/MAPRTECH/padma/testdata` where l_comment like 'a' 
    1.1.65 sec vs 5.8 sec
    
    
    I think for "contains", improvement is not as much as others, probably because of nested for loops. @sachouche changes on top of these changes can improve further. 



---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1015#discussion_r149554356
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternEndsWithMatcher.java ---
    @@ -17,33 +17,30 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    -public class SqlPatternEndsWithMatcher implements SqlPatternMatcher {
    -  final String patternString;
    -  CharSequence charSequenceWrapper;
    -  final int patternLength;
    -
    -  public SqlPatternEndsWithMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.charSequenceWrapper = charSequenceWrapper;
    -    this.patternString = patternString;
    -    this.patternLength = patternString.length();
    +import io.netty.buffer.DrillBuf;
    +
    +public class SqlPatternEndsWithMatcher extends AbstractSqlPatternMatcher {
    +
    +  public SqlPatternEndsWithMatcher(String patternString) {
    +    super(patternString);
       }
     
       @Override
    -  public int match() {
    -    int txtIndex = charSequenceWrapper.length();
    -    int patternIndex = patternLength;
    -    boolean matchFound = true; // if pattern is empty string, we always match.
    +  public int match(int start, int end, DrillBuf drillBuf) {
    +
    +    if ( (end - start) < patternLength) { // No match if input string length is less than pattern length.
    --- End diff --
    
    `( (end - start)` --> `(end - start`


---

[GitHub] drill issue #1015: DRILL-5899: Simple pattern matchers can work with DrillBu...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    Second long list of detailed comments sent privately to keep dev list traffic down.


---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1015#discussion_r149555002
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternEndsWithMatcher.java ---
    @@ -17,33 +17,30 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    -public class SqlPatternEndsWithMatcher implements SqlPatternMatcher {
    -  final String patternString;
    -  CharSequence charSequenceWrapper;
    -  final int patternLength;
    -
    -  public SqlPatternEndsWithMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.charSequenceWrapper = charSequenceWrapper;
    -    this.patternString = patternString;
    -    this.patternLength = patternString.length();
    +import io.netty.buffer.DrillBuf;
    +
    +public class SqlPatternEndsWithMatcher extends AbstractSqlPatternMatcher {
    +
    +  public SqlPatternEndsWithMatcher(String patternString) {
    +    super(patternString);
       }
     
       @Override
    -  public int match() {
    -    int txtIndex = charSequenceWrapper.length();
    -    int patternIndex = patternLength;
    -    boolean matchFound = true; // if pattern is empty string, we always match.
    +  public int match(int start, int end, DrillBuf drillBuf) {
    +
    +    if ( (end - start) < patternLength) { // No match if input string length is less than pattern length.
    +      return 0;
    +    }
     
         // simplePattern string has meta characters i.e % and _ and escape characters removed.
         // so, we can just directly compare.
    -    while (patternIndex > 0 && txtIndex > 0) {
    -      if (charSequenceWrapper.charAt(--txtIndex) != patternString.charAt(--patternIndex)) {
    -        matchFound = false;
    -        break;
    +    for (int index = 1; index <= patternLength; index++) {
    --- End diff --
    
    ```
    int txtStart = end - patternLength;
    if (txtStart < start) { return 0; }
    for (int index = 0; index < patternLength; index++) {
       ... patternByteBuffer.get(index) ... drillBuf.getByte(txtStart + index) ...
    ```



---

[GitHub] drill issue #1015: DRILL-5889: Simple pattern matchers can work with DrillBu...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    @sachouche @paul-rogers can you please review ?


---

[GitHub] drill issue #1015: DRILL-5899: Simple pattern matchers can work with DrillBu...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    @sachouche @paul-rogers Thanks for the review. I updated the PR with review comments. 
    
    I made one more change.  Previously, I was copying native memory buffer into byte array and using it. Instead, if we go to native memory  directly, performance is significantly better. In fact,  it is 3 times faster :-)
    
    Please review updated changes.



---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1015#discussion_r149552506
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/AbstractSqlPatternMatcher.java ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.expr.fn.impl;
    +
    +import com.google.common.base.Charsets;
    +import org.apache.drill.common.exceptions.UserException;
    +import java.nio.ByteBuffer;
    +import java.nio.CharBuffer;
    +import java.nio.charset.CharacterCodingException;
    +import java.nio.charset.CharsetEncoder;
    +import static org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.logger;
    +
    +// To get good performance for most commonly used pattern matches
    +// i.e. CONSTANT('ABC'), STARTSWITH('%ABC'), ENDSWITH('ABC%') and CONTAINS('%ABC%'),
    +// we have simple pattern matchers.
    +// Idea is to have our own implementation for simple pattern matchers so we can
    +// avoid heavy weight regex processing, skip UTF-8 decoding and char conversion.
    +// Instead, we encode the pattern string and do byte comparison against native memory.
    +// Overall, this approach
    +// gives us orders of magnitude performance improvement for simple pattern matches.
    +// Anything that is not simple is considered
    +// complex pattern and we use Java regex for complex pattern matches.
    +
    +public abstract class AbstractSqlPatternMatcher implements SqlPatternMatcher {
    +  final String patternString;
    --- End diff --
    
    `protected final`


---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1015#discussion_r149554195
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
    @@ -17,37 +17,48 @@
      */
     package org.apache.drill.exec.expr.fn.impl;
     
    -public class SqlPatternContainsMatcher implements SqlPatternMatcher {
    -  final String patternString;
    -  CharSequence charSequenceWrapper;
    -  final int patternLength;
    -
    -  public SqlPatternContainsMatcher(String patternString, CharSequence charSequenceWrapper) {
    -    this.patternString = patternString;
    -    this.charSequenceWrapper = charSequenceWrapper;
    -    patternLength = patternString.length();
    +import io.netty.buffer.DrillBuf;
    +
    +public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
    +
    +  public SqlPatternContainsMatcher(String patternString) {
    +    super(patternString);
       }
     
       @Override
    -  public int match() {
    -    final int txtLength = charSequenceWrapper.length();
    -    int patternIndex = 0;
    -    int txtIndex = 0;
    -
    -    // simplePattern string has meta characters i.e % and _ and escape characters removed.
    -    // so, we can just directly compare.
    -    while (patternIndex < patternLength && txtIndex < txtLength) {
    -      if (patternString.charAt(patternIndex) != charSequenceWrapper.charAt(txtIndex)) {
    -        // Go back if there is no match
    -        txtIndex = txtIndex - patternIndex;
    -        patternIndex = 0;
    -      } else {
    -        patternIndex++;
    +  public int match(int start, int end, DrillBuf drillBuf) {
    +
    +    if (patternLength == 0) { // Everything should match for null pattern string
    +      return 1;
    +    }
    +
    +    final int txtLength = end - start;
    +
    +    // no match if input string length is less than pattern length
    +    if (txtLength < patternLength) {
    +      return 0;
    +    }
    +
    +    outer:
    +    for (int txtIndex = 0; txtIndex < txtLength; txtIndex++) {
    +
    +      // boundary check
    +      if (txtIndex + patternLength > txtLength) {
    --- End diff --
    
    Better:
    ```
    int end = txtLength - patternLength;
    for (int txtIndex = 0; txtIndex < end; txtIndex++) {
    ```
    And omit the boundary check on every iteration. That is, no reason to iterate past the last possible match, then use an if-statement to shorten the loop. Just shorten the loop.


---

[GitHub] drill issue #1015: DRILL-5889: Simple pattern matchers can work with DrillBu...

Posted by sachouche <gi...@git.apache.org>.
Github user sachouche commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    Padma, DRILL-5889 is the wrong JIRA (sqlline loses RPC..); I think your JIRA is 5899
    
    
    Regards,
    
    Salim
    
    ________________________________
    From: Padma Penumarthy <no...@github.com>
    Sent: Wednesday, November 1, 2017 11:15:20 AM
    To: apache/drill
    Cc: Salim Achouche; Mention
    Subject: Re: [apache/drill] DRILL-5889: Simple pattern matchers can work with DrillBuf directly (#1015)
    
    
    @sachouche<https://github.com/sachouche> @paul-rogers<https://github.com/paul-rogers> can you please review ?
    
    —
    You are receiving this because you were mentioned.
    Reply to this email directly, view it on GitHub<https://github.com/apache/drill/pull/1015#issuecomment-341192974>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AckoK8UN4uDlGPJGg6fpypIbPxKEOqGKks5syLU4gaJpZM4QJzty>.



---

[GitHub] drill issue #1015: DRILL-5899: Simple pattern matchers can work with DrillBu...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1015
  
    @paul-rogers updated with latest review comments taken care of. Please take a look.


---

[GitHub] drill pull request #1015: DRILL-5899: Simple pattern matchers can work with ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1015


---