You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@sqoop.apache.org by JoeriHermans <gi...@git.apache.org> on 2016/04/11 15:49:37 UTC

[GitHub] sqoop pull request: Optimize toAvroIdentifier

GitHub user JoeriHermans opened a pull request:

    https://github.com/apache/sqoop/pull/18

    Optimize toAvroIdentifier

    Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method, more specifically, the use of Regex patterns. This can be directly observed from the FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization, and compared this with the original method. On our testing machine, the optimization by itself is 230% (on average) more efficient compared to the original implementation. We have yet to test how this optimization will influence the performance of user jobs.
    
    Any suggestions or remarks are welcome.
    
    
    Kind regards,
    
    Joeri

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoeriHermans/sqoop patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/sqoop/pull/18.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18
    
----
commit e8a3eaf872fe9804375c736a6e2603015e3a36a2
Author: Joeri Hermans <jo...@joerihermans.com>
Date:   2016-04-11T13:46:07Z

    Optimize toAvroIdentifier
    
    Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method, more specifically, the use of Regex patterns. This can be directly observed from the FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization, and compared this with the original method. On our testing machine, the optimization by itself is 230% (on average) more efficient compared to the original implementation. We have yet to test how this optimization will influence the performance of user jobs.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] sqoop pull request: [SQOOP-2906] Optimize toAvroIdentifier

Posted by JoeriHermans <gi...@git.apache.org>.
Github user JoeriHermans commented on the pull request:

    https://github.com/apache/sqoop/pull/18#issuecomment-208859666
  
    @stanleyxu2005 The main difference in my implementation is that I only have to do a single copy. StringBuilders are inefficient, char arrays are a lot faster since everything is pre-allocated already. Furthermore, the overhead over constantly allocating a new object, and the fact that internally the StringBuilder will do some copying as well, make this an inefficient approach.
    
    For example, I implemented your approach and instead of a 500% improvement with our method, I have a 230% increase in performance with the StringBuilder approach. So the char-array is definitely a lot faster, and this is what it is all about, since this is a very prominent function on the stack.
    
    Kind regards,
    
    Joeri


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

RE: [GitHub] sqoop pull request: Optimize toAvroIdentifier

Posted by "Xu, Qian A" <qi...@intel.com>.
Yeah. Hot function deserves a good optimization. It will be great to create an issue in JIRA and change your code to fit common Sqoop (Java) code convention. E.g. always have {} to wrap even single line code.

Stanley

-----Original Message-----
From: JoeriHermans [mailto:git@git.apache.org] 
Sent: Monday, April 11, 2016 9:50 PM
To: dev@sqoop.apache.org
Subject: [GitHub] sqoop pull request: Optimize toAvroIdentifier

GitHub user JoeriHermans opened a pull request:

    https://github.com/apache/sqoop/pull/18

    Optimize toAvroIdentifier

    Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method, more specifically, the use of Regex patterns. This can be directly observed from the FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization, and compared this with the original method. On our testing machine, the optimization by itself is 230% (on average) more efficient compared to the original implementation. We have yet to test how this optimization will influence the performance of user jobs.
    
    Any suggestions or remarks are welcome.
    
    
    Kind regards,
    
    Joeri

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoeriHermans/sqoop patch-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/sqoop/pull/18.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18
    
----
commit e8a3eaf872fe9804375c736a6e2603015e3a36a2
Author: Joeri Hermans <jo...@joerihermans.com>
Date:   2016-04-11T13:46:07Z

    Optimize toAvroIdentifier
    
    Our distributed profiler indicated some inefficiencies in the AvroUtil.toAvroIdentifier method, more specifically, the use of Regex patterns. This can be directly observed from the FlameGraph generated by this profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). We implemented an optimization, and compared this with the original method. On our testing machine, the optimization by itself is 230% (on average) more efficient compared to the original implementation. We have yet to test how this optimization will influence the performance of user jobs.

----


---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA.
---

[GitHub] sqoop pull request: Optimize toAvroIdentifier

Posted by JoeriHermans <gi...@git.apache.org>.
Github user JoeriHermans commented on the pull request:

    https://github.com/apache/sqoop/pull/18#issuecomment-208796233
  
    @stanleyxu2005 Why would you check if data has valid length? The original code also crashes if an empty string is specified, so I didn't bother with checking for validity. Furthermore, I don't think it is necessary to create a wrapper for this method.
    
    I also wanted to ask why in https://github.com/apache/sqoop/commit/1dd50cfb2ae327b0df8393dd96d1adb86bb2f65f candidate.replaceAll("\\W+", "") has been replaced with candidate.replaceAll("\\W+", "_")? This change makes our code a bit slower, and does not add any value from my point of view. Of course, if this is done with a particular reason, I will commit the change.
    
    
    Kind regards,
    
    Joeri


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] sqoop pull request: Optimize toAvroIdentifier

Posted by stanleyxu2005 <gi...@git.apache.org>.
Github user stanleyxu2005 commented on a diff in the pull request:

    https://github.com/apache/sqoop/pull/18#discussion_r59337930
  
    --- Diff: src/java/org/apache/sqoop/avro/AvroUtil.java ---
    @@ -114,11 +114,20 @@ public static String toAvroColumn(String column) {
        * Format candidate to avro specifics
        */
       public static String toAvroIdentifier(String candidate) {
    -    String formattedCandidate = candidate.replaceAll("\\W+", "_");
    -    if (formattedCandidate.substring(0,1).matches("[a-zA-Z_]")) {
    -      return formattedCandidate;
    +    char[] data = candidate.toCharArray();
    --- End diff --
    
    I'd suggest you create a private function `replaceWhitespace` and move your code logic there. So it will be
    ```
      String formattedCandidate = replaceWhitespace(candidate);
    ```
    Second how can you make sure the string data has a valid length, so that `data[0]` is valid? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] sqoop pull request: [SQOOP-2906] Optimize toAvroIdentifier

Posted by stanleyxu2005 <gi...@git.apache.org>.
Github user stanleyxu2005 commented on a diff in the pull request:

    https://github.com/apache/sqoop/pull/18#discussion_r59360059
  
    --- Diff: src/java/org/apache/sqoop/avro/AvroUtil.java ---
    @@ -114,11 +114,20 @@ public static String toAvroColumn(String column) {
        * Format candidate to avro specifics
        */
       public static String toAvroIdentifier(String candidate) {
    -    String formattedCandidate = candidate.replaceAll("\\W+", "_");
    -    if (formattedCandidate.substring(0,1).matches("[a-zA-Z_]")) {
    -      return formattedCandidate;
    +    char[] data = candidate.toCharArray();
    +    int stringIndex = 0;
    +
    +    for (char c:data) {
    +      if (Character.isLetterOrDigit(c) || c == '_') {
    +        data[stringIndex++] = c;
    +      }
    +    }
    +
    +    char initial = data[0];
    +    if (Character.isLetter(initial) || initial == '_') {
    +      return new String(data, 0, stringIndex);
    --- End diff --
    
    Your code will first create a char array and then eventually update char in the array. As result you will create another copy as a new String. Have you thought about using a `StringBuilder` directly?
    ```
      final StringBuilder sb = new StringBuilder();
      for (char c : candidate) {
        if (Character.isLetterOrDigit(c) || c == '_') {
          sb.append(c);
        }
      }
      ...
        return sb.toString();
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---