Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/10/14 05:59:00 UTC
[jira] [Commented] (HADOOP-18395) Performance improvement in org.apache.hadoop.io.Text#find
[ https://issues.apache.org/jira/browse/HADOOP-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617488#comment-17617488 ]
ASF GitHub Bot commented on HADOOP-18395:
-----------------------------------------
huxinqiu commented on PR #4714:
URL: https://github.com/apache/hadoop/pull/4714#issuecomment-1278521721
@ZanderXu Thanks for helping to review the code. Could you help merge this PR into the trunk branch?
> Performance improvement in org.apache.hadoop.io.Text#find
> ---------------------------------------------------------
>
> Key: HADOOP-18395
> URL: https://issues.apache.org/jira/browse/HADOOP-18395
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Reporter: xinqiu.hu
> Priority: Trivial
> Labels: pull-request-available
> Attachments: 0001-add-UT-with-timeout-for-Text-find-and-fix-comments.patch
>
>
> The current implementation resets src and tgt to their marks and continues searching when tgt still has bytes remaining but src has been exhausted first, which is probably unnecessary.
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             tgt.reset();
>             src.reset();
>             found = false;
>             break;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> } {code}
> For example, in the test below, when the search reaches q, src has no bytes remaining, so src is reset to d and the search continues. But from that point on, the remaining length of src is always smaller than the length of tgt, so we can return -1 directly.
> {code:java}
> @Test
> public void testFind() throws Exception {
>   Text text = new Text("abcd\u20acbdcd\u20ac");
>   assertThat(text.find("cd\u20acq")).isEqualTo(-1);
> } {code}
> Perhaps it could be:
> {code:java}
> public int find(String what, int start) {
>   try {
>     ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
>     ByteBuffer tgt = encode(what);
>     byte b = tgt.get();
>     src.position(start);
>     while (src.hasRemaining()) {
>       if (b == src.get()) { // matching first byte
>         src.mark(); // save position in loop
>         tgt.mark(); // save position in target
>         boolean found = true;
>         int pos = src.position()-1;
>         while (tgt.hasRemaining()) {
>           if (!src.hasRemaining()) { // src expired first
>             return -1;
>           }
>           if (!(tgt.get() == src.get())) {
>             tgt.reset();
>             src.reset();
>             found = false;
>             break; // no match
>           }
>         }
>         if (found) return pos;
>       }
>     }
>     return -1; // not found
>   } catch (CharacterCodingException e) {
>     throw new RuntimeException("Should not have happened", e);
>   }
> }{code}
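
To check that the early return does not change results, the two loops in the issue description can be compared side by side outside of Hadoop. The following is only a minimal sketch under stated assumptions: the class name TextFindSketch and its methods are hypothetical, the search runs over a plain byte[] instead of a Text instance, String.getBytes(UTF_8) stands in for Text's encode(what), and the search string is assumed to be non-empty.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone harness for illustration; not part of Hadoop. */
final class TextFindSketch {

  /** Mirrors the current loop: when src runs out mid-match, reset and keep scanning. */
  static int findWithReset(byte[] bytes, String what, int start) {
    return find(bytes, what, start, false);
  }

  /** Mirrors the proposed loop: when src runs out mid-match, return -1 at once. */
  static int findEarlyReturn(byte[] bytes, String what, int start) {
    return find(bytes, what, start, true);
  }

  private static int find(byte[] bytes, String what, int start, boolean earlyReturn) {
    ByteBuffer src = ByteBuffer.wrap(bytes);
    // Assumption: plain UTF-8 encoding replaces Text's encode(what).
    ByteBuffer tgt = ByteBuffer.wrap(what.getBytes(StandardCharsets.UTF_8));
    byte b = tgt.get(); // first byte of the target; requires a non-empty target
    src.position(start);
    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // position just after the candidate's first byte
        tgt.mark(); // position just after the target's first byte
        boolean found = true;
        int pos = src.position() - 1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src exhausted before tgt
            if (earlyReturn) {
              // Any later candidate has even fewer bytes left than this one.
              return -1;
            }
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (tgt.get() != src.get()) { // mismatch: resume after the candidate
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1; // not found
  }
}
```

For the string from the test above, the UTF-8 bytes are a b c d (offsets 0-3), the three bytes of \u20ac (4-6), b d c d (7-10), and \u20ac again (11-13). Searching for "cd\u20acq" first fails at offset 2 on a mismatch, then at offset 9 because src runs out before q is compared. The reset variant goes back to offset 10 and scans the last four bytes without any chance of fitting the six-byte target, while the early-return variant stops immediately. Both report -1.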
--
This message was sent by Atlassian Jira
(v8.20.10#820010)