Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/10/14 05:59:00 UTC

[jira] [Commented] (HADOOP-18395) Performance improvement in org.apache.hadoop.io.Text#find

    [ https://issues.apache.org/jira/browse/HADOOP-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617488#comment-17617488 ] 

ASF GitHub Bot commented on HADOOP-18395:
-----------------------------------------

huxinqiu commented on PR #4714:
URL: https://github.com/apache/hadoop/pull/4714#issuecomment-1278521721

   @ZanderXu Thanks for helping to review the code. Can you help merge this PR into the trunk branch?




> Performance improvement in org.apache.hadoop.io.Text#find
> ---------------------------------------------------------
>
>                 Key: HADOOP-18395
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18395
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: xinqiu.hu
>            Priority: Trivial
>              Labels: pull-request-available
>         Attachments: 0001-add-UT-with-timeout-for-Text-find-and-fix-comments.patch
>
>
> The current implementation resets src and tgt to their marks and continues searching when src is exhausted while tgt still has bytes remaining, which is probably unnecessary.
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             tgt.reset();
>             src.reset();
>             found = false;
>             break;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> } {code}
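> For example, in the test below, when the search reaches the final q of the target, src has no bytes remaining, so src is reset to the d after the matched c and the search continues. But from that point on, the remaining length of src is always smaller than the length of tgt, so no later match is possible and we can return -1 directly.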
> {code:java}
> @Test
> public void testFind() throws Exception {
>   Text text = new Text("abcd\u20acbdcd\u20ac");
>   assertThat(text.find("cd\u20acq")).isEqualTo(-1);
> } {code}
> Perhaps it could be:
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             return -1;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> }{code}
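
To illustrate the early-return argument outside of Hadoop, here is a minimal, self-contained sketch. The class name FindSketch and its static find(byte[], String, int) signature are hypothetical stand-ins for Text#find, not part of the Hadoop API; the sketch encodes the target with StandardCharsets.UTF_8 instead of Text's encode helper, and it assumes a non-empty search string. Once src runs out while tgt still has bytes left, every later starting position has even fewer source bytes available, so the sketch returns -1 immediately.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Text#find, working on raw UTF-8 bytes.
final class FindSketch {

  // Assumption: `what` is non-empty; an empty target would underflow on tgt.get().
  static int find(byte[] bytes, String what, int start) {
    ByteBuffer src = ByteBuffer.wrap(bytes);
    ByteBuffer tgt = ByteBuffer.wrap(what.getBytes(StandardCharsets.UTF_8));
    byte first = tgt.get();
    src.position(start);
    while (src.hasRemaining()) {
      if (first == src.get()) {          // matching first byte
        src.mark();                      // resume point in the source
        tgt.mark();                      // second byte of the target
        int pos = src.position() - 1;    // candidate match offset
        boolean found = true;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) {
            // src expired first: any later start leaves even fewer bytes
            // than tgt needs, so no match is possible.
            return -1;
          }
          if (tgt.get() != src.get()) {  // mismatch: rewind both buffers
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1; // not found
  }

  public static void main(String[] args) {
    byte[] text = "abcd\u20acbdcd\u20ac".getBytes(StandardCharsets.UTF_8);
    System.out.println(find(text, "cd\u20acq", 0)); // prints -1
    System.out.println(find(text, "cd\u20ac", 0));  // prints 2
    System.out.println(find(text, "cd\u20ac", 3));  // prints 9
  }
}
```

The returned offsets are byte offsets into the UTF-8 encoding, which is why the second occurrence of "cd\u20ac" is reported at 9 rather than at its character index.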



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
