You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zach Amsden (Code Review)" <ge...@cloudera.org> on 2017/01/24 03:07:46 UTC

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Zach Amsden has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5776

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
4 files changed, 82 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 210:   if (pattern.len == 0) return str;
> Yes, unclear what the proper behavior is.  I elected to return the original
We don't guarantee anything about the value of len when null = true. Sometimes everything is zero-initialised but it's not contractual.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS2, Line 221: // First pass - find matches to compute output size
A high level question: for very long string, is a two pass algorithm better than a one pass approach in which one just keep appending to the result string as we find matches in the haystack string.


PS2, Line 237: int rlen = num_matches * (replace.len - needle.len) + str.len;
Can this overflow with a malicious replacement string ?


PS2, Line 249: DCHECK(match_pos_in_substring >= 0);
DCHECK_GE()


http://gerrit.cloudera.org:8080/#/c/5776/2/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

PS2, Line 2955: ident_or_restricted:ident
Not sure if dotted_path is the right place to modify as it's used at other places too. Ideally we should limit "replace" to be accepted as  function_names only but not for say database / table / column names.


PS2, Line 3050: ident_or_restricted 
May want to hide this as function_name's definition only to avoid others from using it if possible.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5776/15/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS15, Line 256: (context->GetFunctionState(FunctionContext::FRAGMENT_LOCAL));
> nit: indent 4.
Done


PS15, Line 362: result.len = ptr - result.ptr + bytes_remaining;
> Feel free to ignore:
Done


http://gerrit.cloudera.org:8080/#/c/5776/15/be/src/udf/udf.h
File be/src/udf/udf.h:

PS15, Line 598:   /// Note that the memory for the buffer of the new StringVal is managed by Impala.
              :   /// Impala will handle freeing it. Callers should not call Free() on the 'ptr' of
              :   /// the StringVal returned.
> Please add a similar comment to Resize().
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 19:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/247/

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#18).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 396 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/18
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 20: Verified-1

Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/253/

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#10).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
12 files changed, 337 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 232: if (UNLIKELY(out_max > 1ULL << 18)) {
> This is why I was asking earlier about what is considered a small / large m
Sounds good.


PS5, Line 236: haystack.len + replace.len - pattern.len
> max string size is 1 << 30
Oops. Bad example. I suppose line 244 below will catch the case in which the buffer space exceeds StringVal::MAX_LENGTH.


PS5, Line 266: UNLIKELY(check_space))
Not sure if UNLIKELY() is really unlikely here. Isn't it dependent on the inputs ?


PS5, Line 272: buffer_space <<= 1;
Please see comments in StringVal::Resize(). I think this can potentially go to 1 << 31 but it should be fine as long as StringVal::Resize() catches it. May want to DCHECK_LE(buffer_space, 1ULL << 30);


PS5, Line 273: DCHECK(
DCHECK_EQ().


http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/udf/udf.cc
File be/src/udf/udf.cc:

PS5, Line 519:  auto* new_ptr = ctx->impl()->ReallocateLocal(ptr, new_len);
This should return StringVal::null() if new_len exceeds StringVal::MAX_LENGTH.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#14).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 376 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/14
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/16/testdata/workloads/functional-query/queries/QueryTest/exprs.test
File testdata/workloads/functional-query/queries/QueryTest/exprs.test:

Line 1948: ---- QUERY
> is the case of both 'search' and 'replace' non-constants interesting?
not really, but non-constant search is important.  Added another test for multiple calls to replace since that actually matters now


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5776/14/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS14, Line 238: (context->GetFunctionState(FunctionContext::FRAGMENT_LOCAL));
> nit: indent 4.
Done


PS14, Line 289: ==
> Should this be <= ?
<= is nicer.  This actually depends on operator precedence to not overflow itself o.O


PS14, Line 352: UNLIKELY(resized == false)
> nit: !resized.
Done


http://gerrit.cloudera.org:8080/#/c/5776/14/testdata/workloads/functional-query/queries/QueryTest/exprs.test
File testdata/workloads/functional-query/queries/QueryTest/exprs.test:

PS14, Line 1942: select sum(length(replace(y, x, 'bbbbbbbbbb'))) from (select cast(round(
               : float_col) AS STRING) as x, string_col as y from functional.alltypes) v;
> This seems rather similar to the previous query ? How about we test the cas
will try that


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#19).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 396 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/19
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#16).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 381 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/16
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5776/10//COMMIT_MSG
Commit Message:

Line 13: Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Can you clarify whether replace() is faster than regexp_replace() in simple cases? Otherwise, there's no point in a custom implementation versus just internally calling regexp_replace().


http://gerrit.cloudera.org:8080/#/c/5776/6/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 2619:   /* Since "IF", "TRUNCATE" are keywords, need to special case these functions */
update comment


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/18/be/src/udf/udf.h
File be/src/udf/udf.h:

PS18, Line 582: does
> spelling
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 4:

[localhost:21000] > select count(regexp_replace(l_comment, ' ', '')) from lineitem;
Query: select count(regexp_replace(l_comment, ' ', '')) from lineitem
Query submitted at: 2017-01-27 23:30:47 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=4a49a82b8255ca4d:76e7876300000000
+-------------------------------------------+
| count(regexp_replace(l_comment, ' ', '')) |
+-------------------------------------------+
| 6001215                                   |
+-------------------------------------------+
Fetched 1 row(s) in 2.74s
[localhost:21000] > select count(replace(l_comment, ' ', '')) from lineitem;
Query: select count(replace(l_comment, ' ', '')) from lineitem
Query submitted at: 2017-01-27 23:30:58 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=bc40c2b4f4291d90:6f70ae2500000000
+------------------------------------+
| count(replace(l_comment, ' ', '')) |
+------------------------------------+
| 6001215                            |
+------------------------------------+
Fetched 1 row(s) in 0.94s

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/2//COMMIT_MSG
Commit Message:

Line 11: since function names act as bare identifiers.
> Can you also run a basic perf test to compare the speed of regex_replace() 
I don't think std::replace would work here - the input isn't exactly easily modeled as an iterator since we are replacing multiple characters and one can imagine nasty side effects if this isn't explicitly coded, with say:

replace('strstrstr', 'str', 'strstr')


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 16:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS16, Line 215: _
for structs we don't usually use the underscore (but we do for classes), so let's remove that (even though it doesn't need to be accessed).


http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/udf/udf.cc
File be/src/udf/udf.cc:

PS16, Line 453: std::find(std::begin(local_allocations_), std::end(local_allocations_), ptr);
is this going to become a perf issue?


Line 520:     std::cout << "MAX_LENGTH, Trying to allocate " << len << std::endl;
delete


http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/udf/udf.h
File be/src/udf/udf.h:

Line 583:   /// buffers, and so don't need to be explicitly freed.
this isn't true in general, but it is true for StringVal's created via this constructor.  Let's reword to be clearer about that:

The memory backing this StringVal is a local allocation (i.e. managed by Impala), and so doesn't need to be explicitly freed.

Or something like that. See also the "Memory Management" for UDF comment above.


PS16, Line 586: StringVal
... StringVal that is backed by a local allocation...

since not all StringVal's are like that.


PS16, Line 591: allocated from the
              :   /// function context.
a local allocation.

FunctionContext has interfaces for non-local allocations as well. Note that the "local" terminology is unfortunate, but it's what we currently have.


http://gerrit.cloudera.org:8080/#/c/5776/16/testdata/workloads/functional-query/queries/QueryTest/exprs.test
File testdata/workloads/functional-query/queries/QueryTest/exprs.test:

Line 1935: float_col) AS STRING) as x, string_col as y from functional.alltypes) v;
this could probably be formatted better


Line 1942: select sum(length(replace(y, '0', x))) from (select cast(round(float_col) AS STRING) as x, string_col as y from functional.alltypes) v;
can you add line breaks to stay within 90 columns?


Line 1948: ---- QUERY
is the case of both 'search' and 'replace' non-constants interesting?


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#11).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
12 files changed, 343 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/11
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 211: pattern.is_null
I believe pattern.is_null should also return StringVal::null(). pattern.is_null == true means pattern is "undefined".


PS5, Line 231: int64_t out_max = match_pos + replace.len * ((haystack.len - match_pos) / pattern.len);
This may potentially result in a huge overestimation if there are few matches and the replacement string is a lot longer than pattern string.


PS5, Line 232: if (UNLIKELY(out_max > 1ULL << 18)) {
It seems a bit arbitrary to choose 256K. Why not just start with a buffer of about the same length as the original string and keep appending to it as we find matches ?


PS5, Line 236: haystack.len + replace.len - pattern.len
I believe this can still overflow a 32-bit value, right ? Btw, please keep the line to be within 90 characters.


PS5, Line 240: buffer_space = out_max;
May help to have a test case in which the result string is longer than StringVal::MAX_LENGTH and verifies that we actually get StringVal::null() back.


PS5, Line 249:  while (match_pos + needle.len <= haystack.len) {
             :       // Copy in original string
             :       const int bytes = match_pos - consumed;
             :       memcpy(ptr, &haystack.ptr[consumed], bytes);
             :       DCHECK_LE(ptr - result.ptr + bytes, buffer_space);
             :       ptr += bytes;
             : 
             :       // Copy in replacement - always safe since we always leave room for one more replace
             :       DCHECK_LE(ptr - result.ptr + replace.len, buffer_space);
             :       memcpy(ptr, replace.ptr, replace.len);
             :       ptr += replace.len;
             : 
             :       // Don't want to re-match within already replaced pattern
             :       match_pos += needle.len;
             :       consumed = match_pos;
             : 
             :       // If we had an enlarging pattern, we may need more space
             :       if (UNLIKELY(check_space)) {
             :         const int bytes_produced = ptr - result.ptr;
             :         const int bytes_remaining = haystack.len - consumed;
             :         if (UNLIKELY(bytes_produced + bytes_remaining + replace.len > buffer_space)) {
             :           // Ran out of space, double the size.  See the note above regarding the choice
             :           // of a power of two sized buffer.
             :           buffer_space <<= 1;
             :           DCHECK((buffer_space & (buffer_space - 1)) == 0);
             :           const auto ofs = ptr - result.ptr;
             :           if (UNLIKELY(!result.Resize(context, buffer_space))) {
             :             return StringVal::null();
             :           }
             :           ptr = result.ptr + ofs;
             :         }
             :       }
             : 
             :       StringValue haystack_substring = haystack.Substring(match_pos);
             :       int match_pos_in_substring = search.Search(&haystack_substring);
             :       if (match_pos_in_substring < 0) break;
             :       match_pos += match_pos_in_substring;
             :   }
Is this more performant than simply using string::append() or string += operator ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 266: UNLIKELY(check_space))
> Not sure if UNLIKELY() is really unlikely here. Isn't it dependent on the i
Well it was unlikely when I over-allocated the space.  Now it is still probably unlikely, since many replaces either will be stripping characters or replacing equal size strings.  Doesn't really matter much I suppose.  Should I remove it?


PS5, Line 273: DCHECK(
> DCHECK_EQ().
will fix.  originally was using math on the ptr val and DCHECK_EQ didn't like the types involved.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/4/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 3047: // REPLACE() function must be included as a valid function name
Did you read my comment on an earlier patchset? We already have a way of handling cases like this.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 14:

(8 comments)

Looking good. Just some more minor comments.

http://gerrit.cloudera.org:8080/#/c/5776/13/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS13, Line 238: (context->GetFunctionState(FunctionContext::FRAGMENT_LOCAL));
nit: line continuation uses indent 4.


http://gerrit.cloudera.org:8080/#/c/5776/14/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS14, Line 238: (context->GetFunctionState(FunctionContext::FRAGMENT_LOCAL));
nit: indent 4.


PS14, Line 278: was
nit: is


PS14, Line 289: ==
Should this be <= ?

It won't overflow if it's StringVal::MAX_LENGTH - 1 + StringVal::MAX_LENGTH <= std::numeric_limits<decltype(buffer_space)>::max(), right ?


PS14, Line 352: UNLIKELY(resized == false)
nit: !resized.

May be easier to just do:

if (!result.Resize(...)) return StringVal::null();


http://gerrit.cloudera.org:8080/#/c/5776/14/be/src/udf/udf.h
File be/src/udf/udf.h:

Line 583:   /// Reallocate a StringVal so that it as at least as large as len.  May shrink or
May help to clarify that the string being resized needs to be allocated from StringVal(FunctionContext* context, int len);

Memory allocated from that interface is considered memory managed by Impala instead of the UDF/UDA (aka local allocations). In other words, one shouldn't call this function on a  string allocated via FunctionContext::Allocate() at the init() of a UDA. This may result in double free. Sorry, our UDF memory management story is rather messy.


Line 587:   /// If the resize fails, the original StringVal remains in place.
and returns false. Returns true otherwise.


http://gerrit.cloudera.org:8080/#/c/5776/14/testdata/workloads/functional-query/queries/QueryTest/exprs.test
File testdata/workloads/functional-query/queries/QueryTest/exprs.test:

PS14, Line 1942: select sum(length(replace(y, x, 'bbbbbbbbbb'))) from (select cast(round(
               : float_col) AS STRING) as x, string_col as y from functional.alltypes) v;
This seems rather similar to the previous query ? How about we test the case in which the replace string is a the string_col instead ?

select sum(length(replace(y, '0', x))) from (select cast(round(
float_col) AS STRING) as x, string_col as y from functional.alltypes) v;


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#2).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
5 files changed, 93 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2//COMMIT_MSG
Commit Message:

Line 11: since function names act as bare identifiers.
> I don't think std::replace would work here - the input isn't exactly easily
Makes sense. Still a basic perf comparison against regex_replace() would be good.


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 210:   if (pattern.len == 0) return str;
> Yes, unclear what the proper behavior is.  I elected to return the original
When in doubt check Postgres (e.g. via sqlfiddle.com)


Line 218:   const int kMemoizedMatches = 16;
> Will fix style - still learning this.
I have no evidence to back up the need for more than 16. Intuitively 16 seems to be sufficient for most cases.

My comment is more about documenting in the code the motivation behind this memo and the reason for its specific size.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 266: UNLIKELY(check_space))
> Well it was unlikely when I over-allocated the space.  Now it is still prob
Yes. I usually leave UNLIKELY() to error cases only. The general consensus (from my past experience) is that we use LIKELY/UNLIKELY on critical paths only.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 249:  while (match_pos + needle.len <= haystack.len) {
             :       // Copy in original string
             :       const int bytes = match_pos - consumed;
             :       memcpy(ptr, &haystack.ptr[consumed], bytes);
             :       DCHECK_LE(ptr - result.ptr + bytes, buffer_space);
             :       ptr += bytes;
             : 
             :       // Copy in replacement - always safe since we always leave room for one more replace
             :       DCHECK_LE(ptr - result.ptr + replace.len, buffer_space);
             :       memcpy(ptr, replace.ptr, replace.len);
             :       ptr += replace.len;
             : 
             :       // Don't want to re-match within already replaced pattern
             :       match_pos += needle.len;
             :       consumed = match_pos;
             : 
             :       // If we had an enlarging pattern, we may need more space
             :       if (UNLIKELY(check_space)) {
             :         const int bytes_produced = ptr - result.ptr;
             :         const int bytes_remaining = haystack.len - consumed;
             :         if (UNLIKELY(bytes_produced + bytes_remaining + replace.len > buffer_space)) {
             :           // Ran out of space, double the size.  See the note above regarding the choice
             :           // of a power of two sized buffer.
             :           buffer_space <<= 1;
             :           DCHECK((buffer_space & (buffer_space - 1)) == 0);
             :           const auto ofs = ptr - result.ptr;
             :           if (UNLIKELY(!result.Resize(context, buffer_space))) {
             :             return StringVal::null();
             :           }
             :           ptr = result.ptr + ofs;
             :         }
             :       }
             : 
             :       StringValue haystack_substring = haystack.Substring(match_pos);
             :       int match_pos_in_substring = search.Search(&haystack_substring);
             :       if (match_pos_in_substring < 0) break;
             :       match_pos += match_pos_in_substring;
             :   }
> Is this more performant than simply using string::append() or string += ope
Please ignore my previous comment. We need to keep the memory allocations in a freepool.  Ideally, we should use StringBuffer if possible but that interface works with MemPool only. May be we can consider adapting its interface to take a Resize() function pointer instead so it can be shared by both FreePool and MemPool.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#4).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
10 files changed, 204 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#15).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 378 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/15
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#17).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 398 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/17
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 18: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/18/be/src/udf/udf.h
File be/src/udf/udf.h:

PS18, Line 582: does
spelling


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 11:

(15 comments)

Please also add exprs.test as discussed offline to cover cases in which pattern or replace are non-constant.

http://gerrit.cloudera.org:8080/#/c/5776/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS9, Line 290: 
nit: indent 4.


PS9, Line 291: t, buffer_space)
Please see comments from previous patch. Isn't this bytes_produced computed above ?


PS9, Line 305: _LE(ptr - resul
Same as bytes_remaining above ?


http://gerrit.cloudera.org:8080/#/c/5776/11/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS11, Line 227: replace = new ReplaceContext(pattern);
Sorry, I may have misunderstood your question about this. We need to call context->Allocate() here in order to track the memory consumption. The general guideline is to avoid untracked memory usage as much as possible.


PS11, Line 236: delete rptr;
Please use context->Free().


PS11, Line 274: (delta > 0 && delta < 128)
parenthesis seems unnecessary here.


PS11, Line 274: 128
Mind documenting how the number 128 is derived ? In other words, why not 256, 512, or 64 ?


PS11, Line 282: (replace.len - pattern.len)
Just use 'delta' for clarity.


PS11, Line 288: haystack.len - pattern.len + replace.len
For clarity, can you please use haystack.len + delta ?


PS11, Line 322:        const int bytes_remaining = haystack.len - consumed;
This seems to overlap with the remaining_bytes in line 352. How about we hoist it out of the loop like the following ?

// number of bytes to match in the original string
int bytes_remaining = haystack.len - consumed;
while (bytes_remaining >= pattern.len) {
   ...
   ...

   consumed = match_pos;
   bytes_remaining = haystack.len - consumed;
   ....
   .....

   if (delta > 0) {
       ....
   }

}


PS11, Line 335: it's
nit: its


PS11, Line 337:           static_assert(BitUtil::IsPowerOf2(StringVal::MAX_LENGTH),
              :               "buffer_space to not exceed MAX_LENGTH requires it to be a power of 2");
I am not sure I understand the purpose of this assert completely. Do we rely on it for line 331 above to be effective ?

Why don't we move line 331 to to the point after line 343 ? This seems to be easier to follow and I am not sure just assuming the resize to succeed is a good thing. We can fail to resize for other reasons (e.g. exceeding memory limit set on a query).


PS11, Line 342: const auto ofs = ptr - result.ptr;
Isn't this the same as bytes_produced above ?


PS11, Line 344: DCHECK_EQ(resized, true);
As mentioned above, this may not hold all the time as we can exceed memory limit even if buffer_space <= StringVal::MAX_LENGTH;


PS11, Line 352: const int remaining_bytes = haystack.len - consumed;
If you take the suggestion above to hoist bytes_remaining out of the while loop, you don't need this line.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 24: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 24
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#12).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 371 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/12
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 232: if (UNLIKELY(out_max > 1ULL << 18)) {
> Did you measure the performance gain for that ? Any hardcoded bound still l
This is why I was asking earlier about what is considered a small / large memory size in Impala.  How about we just start with the next power of two above something that can fit one replacement?


PS5, Line 236: haystack.len + replace.len - pattern.len
> Wouldn't this be negative if haystack.len == (1 << 31) - 4 and replace.len 
max string size is 1 << 30


PS5, Line 240: buffer_space = out_max;
> an excellent idea
But makes the test sssslllloooowwww


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

Line 1954:   TestStringValue("replace('abc', 'abcd', 'a')", "abc");
> suggest these additional tests:
It may help to have non-constant input arguments. For instance: select replace(string_col, '0', '1') from functional.alltypestiny. And similar inputs for the pattern and the replacement string.


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS2, Line 221: // First pass - find matches to compute output size
> A high level question: for very long string, is a two pass algorithm better
Thinking it a bit more, the answer could be it depends. In general, append() may incur extra copying when expanding the buffer (usually double every time). This may be an issue if the replacement string has the same length or longer than the pattern string. On the other hand, the one pass approach may be faster if the result string is a lot shorter than the original string (e.g. the replacement string is an empty string '') as it avoids the cache footprint of going over a potentially long haystack string again.

May be it helps to have a microbenchmark.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 24: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 24
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#3).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
7 files changed, 167 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 11:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5776/11/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS11, Line 227: replace = new ReplaceContext(pattern);
> Sorry, I may have misunderstood your question about this. We need to call c
Done


PS11, Line 236: delete rptr;
> Please use context->Free().
Done


PS11, Line 274: 128
> Mind documenting how the number 128 is derived ? In other words, why not 25
In order to not burn too many cache lines.  This makes short string data go really fast.  I'll add a comment.


PS11, Line 282: (replace.len - pattern.len)
> Just use 'delta' for clarity.
Done


PS11, Line 288: haystack.len - pattern.len + replace.len
> For clarity, can you please use haystack.len + delta ?
Done


PS11, Line 322:        const int bytes_remaining = haystack.len - consumed;
> This seems to overlap with the remaining_bytes in line 352. How about we ho
Done


PS11, Line 335: it's
> nit: its
Done


PS11, Line 342: const auto ofs = ptr - result.ptr;
> Isn't this the same as bytes_produced above ?
yes


PS11, Line 352: const int remaining_bytes = haystack.len - consumed;
> If you take the suggestion above to hoist bytes_remaining out of the while 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#6).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
9 files changed, 171 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#20).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
14 files changed, 400 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/20
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 20: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 19: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 8:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5776/8/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS8, Line 236: (UNLIKELY(result.is_null))
May help to add a comment stating that result can be null if buffer_space exceeds StringVal::MAX_LENGTH at this point.


PS8, Line 242: bytes
may be this can use a more meaningful name such as unmatched_len.


PS8, Line 257: check_space)
We want to skip this doubling behavior if bytes_remaining == 0.

The code could be slightly simpler if we avoid having check_space and directly check the size of the buffer_space (like line 260). The branch should be pretty predictable for the case in which the replacement string is shorter than the pattern string.


PS8, Line 260: replace.len
nit: replace.len - pattern.len ?


PS8, Line 265:  std::numeric_limits<decltype(buffer_space)>::max() / 2);
nit: indent 4.


PS8, Line 267:  ptr - result.ptr
Isn't this bytes_produced ?


PS8, Line 268: if (UNLIKELY(!result.Resize(context, buffer_space))) {
A one line comment about this already handles case in which the buffer_space exceeds StringVal::MAX_LENGTH ?


PS8, Line 282: remaining_bytes
'bytes_remaining' for consistency with the code above.


PS8, Line 289: context->Reallocate
As discussed in person, this will not shrink the buffer given the way FreePool::Reallocate() is implemented.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 24:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/271/

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 24
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/4/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

Line 1963:   TestIsNull("replace('zomg', 'foo', NULL)", TYPE_STRING);
as michael mentioned previously, we should also test this with non-constant args using end-to-end test. while you aren't exploiting IsArgConstant(), it's common to do so (grep for that in the string-functions-ir.cc file for examples). see exprs.test and test_exprs.py for examples of end-to-end functional cases.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/8/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS8, Line 260: replace.len
> nit: replace.len - pattern.len ?
Tempting, but a bug - this assumes a match.  If replace.len is less than pattern.len, this undersizes the buffer.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 218:   const int kMemoizedMatches = 16;
> Will fix style - still learning this.
Why not a vector ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 21:

Is that just a rebase? Any fixes needed for the last failing run?

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 21
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 16:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/udf/udf.cc
File be/src/udf/udf.cc:

Line 520:     std::cout << "MAX_LENGTH, Trying to allocate " << len << std::endl;
> kill the one above too?
hmm, didn't notice we already did that. yeah, I think we should get rid of it. looking at git history, it looks like it snuck in by mistake.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#7).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
9 files changed, 180 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/10/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 338:           buffer_space = space_needed;
also left over from earlier code and unnecessary


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 211: pattern.is_null
> I believe pattern.is_null should also return StringVal::null(). pattern.is_
Will change this semantic.


PS5, Line 232: if (UNLIKELY(out_max > 1ULL << 18)) {
> It seems a bit arbitrary to choose 256K. Why not just start with a buffer o
I dropped this down to 64k.  Seems like the benefit for not re-allocating is pretty high and we only want to pay costs on very large replaces, which is what we'll get.


PS5, Line 236: haystack.len + replace.len - pattern.len
> I believe this can still overflow a 32-bit value, right ? Btw, please keep 
should not be able to overflow.  preferred indent is
   in 4 
   and continuing, correct?


PS5, Line 240: buffer_space = out_max;
> May help to have a test case in which the result string is longer than Stri
an excellent idea


PS5, Line 249:  while (match_pos + needle.len <= haystack.len) {
             :       // Copy in original string
             :       const int bytes = match_pos - consumed;
             :       memcpy(ptr, &haystack.ptr[consumed], bytes);
             :       DCHECK_LE(ptr - result.ptr + bytes, buffer_space);
             :       ptr += bytes;
             : 
             :       // Copy in replacement - always safe since we always leave room for one more replace
             :       DCHECK_LE(ptr - result.ptr + replace.len, buffer_space);
             :       memcpy(ptr, replace.ptr, replace.len);
             :       ptr += replace.len;
             : 
             :       // Don't want to re-match within already replaced pattern
             :       match_pos += needle.len;
             :       consumed = match_pos;
             : 
             :       // If we had an enlarging pattern, we may need more space
             :       if (UNLIKELY(check_space)) {
             :         const int bytes_produced = ptr - result.ptr;
             :         const int bytes_remaining = haystack.len - consumed;
             :         if (UNLIKELY(bytes_produced + bytes_remaining + replace.len > buffer_space)) {
             :           // Ran out of space, double the size.  See the note above regarding the choice
             :           // of a power of two sized buffer.
             :           buffer_space <<= 1;
             :           DCHECK((buffer_space & (buffer_space - 1)) == 0);
             :           const auto ofs = ptr - result.ptr;
             :           if (UNLIKELY(!result.Resize(context, buffer_space))) {
             :             return StringVal::null();
             :           }
             :           ptr = result.ptr + ofs;
             :         }
             :       }
             : 
             :       StringValue haystack_substring = haystack.Substring(match_pos);
             :       int match_pos_in_substring = search.Search(&haystack_substring);
             :       if (match_pos_in_substring < 0) break;
             :       match_pos += match_pos_in_substring;
             :   }
> Please ignore my previous comment. We need to keep the memory allocations i
I looked at StringBuffer but it uses MemPool directly.  I'd rather keep this change locally confined to using the current allocator rather than adapting a new interface.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2//COMMIT_MSG
Commit Message:

Line 11: since function names act as bare identifiers.
Can you also run a basic perf test to compare the speed of regex_replace() vs. replace()? The motivation for replace() is to be significantly faster in simple cases than the next best alternative. You could try something along the lines of:

select count(replace(l_comment, " ", "")) from tpch_parquet.lineitem;

but probably with a larger lineitem dataset. Please reach out if you need help creating a larger dataset.

Might also be interesting to see how a stupid version with std::replace() would compare.


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

Line 1954:   TestStringValue("replace('abc', 'abcd', 'a')", "abc");
suggest these additional tests:
1. nonsense replace
replace("abcdefg", "z", "abcdefg");
2. empty input string
replace("", "z", "");
3. input string is NULL, rest non-NULL
TestStringValue("replace(NULL, 'abc', 'a')", "abc");
4. output string is bigger than the input string
TestStringValue("replace('aaa', 'a', 'aa')", "aaaaaa");
5. add test with exactly 16 matches and some trailing chars in the input that are non-matches


http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 210:   if (pattern.len == 0) return str;
pattern can be null as well


Line 218:   const int kMemoizedMatches = 16;
style: k_memoized_matches or max_memoized_matches

add comment: why this memoization? why only 16?


Line 225:       if (match_pos_in_substring < 0)
single line


Line 237:   const int rlen = num_matches * (replace.len - needle.len) + str.len;
This could overflow and lead to writing past the result buffer.

Suggest using a larger result type and add a test. The SQL function "repeat" might come in handy.


Line 249:       DCHECK(match_pos_in_substring >= 0);
Must there really always be a match here?

(also DCHECK_GE)


Line 253:     int bytes = match_pos - consumed;
non_match_len?


Line 266:   memcpy(ptr, &haystack.ptr[match_pos], str.len - consumed);
// Copy trailing non-matches.


http://gerrit.cloudera.org:8080/#/c/5776/2/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 3050: ident_or_restricted ::=
Agree about the issues.

Why not handle the REPLACE case like IF and TRUNCATE? Have a look at the non_pred_expr production.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5776/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS5, Line 232: if (UNLIKELY(out_max > 1ULL << 18)) {
> I dropped this down to 64k.  Seems like the benefit for not re-allocating i
Did you measure the performance gain for that ? Any hardcoded bound still looks a bit arbitrary unless there is justification for choosing it. It'd be nice to keep the decision simple while still getting reasonable speedup over regex_replace(). Starting with the next power of 2 of the current string length will work well with how FreePool allocates memory internally.


PS5, Line 236: haystack.len + replace.len - pattern.len
> should not be able to overflow.  preferred indent is
Wouldn't this be negative if haystack.len == (1 << 31) - 4 and replace.len = 6 and pattern.len = 2 ?

Correct. Indentation is 4 for line continuation.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 9:

Added support for saving StringSearcher for complicated queries.

***********************************************************************************
[localhost:21000] > select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem;
Query: select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem
Query submitted at: 2017-02-01 03:21:19 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=a49fad015129539:afa30da00000000
+-----------------------------------------------------+
| count(regexp_replace(l_comment, 'complicated', '')) |
+-----------------------------------------------------+
| 6001215                                             |
+-----------------------------------------------------+
Fetched 1 row(s) in 7.55s
[localhost:21000] > select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem;
Query: select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem
Query submitted at: 2017-02-01 03:21:31 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=3946a55a95c4529f:6739435000000000
+-----------------------------------------------------+
| count(regexp_replace(l_comment, 'complicated', '')) |
+-----------------------------------------------------+
| 6001215                                             |
+-----------------------------------------------------+
Fetched 1 row(s) in 1.54s
[localhost:21000] > select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem;
Query: select count(regexp_replace(l_comment, 'complicated', '')) from tpch.lineitem
Query submitted at: 2017-02-01 03:21:35 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=3f41480452d08ce1:21ebac3500000000
+-----------------------------------------------------+
| count(regexp_replace(l_comment, 'complicated', '')) |
+-----------------------------------------------------+
| 6001215                                             |
+-----------------------------------------------------+
Fetched 1 row(s) in 1.53s
[localhost:21000] > select count(replace(l_comment, 'complicated', '')) from tpch.lineitem;
Query: select count(replace(l_comment, 'complicated', '')) from tpch.lineitem
Query submitted at: 2017-02-01 03:21:44 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=d24d13228c8c8833:110700e300000000
+----------------------------------------------+
| count(replace(l_comment, 'complicated', '')) |
+----------------------------------------------+
| 6001215                                      |
+----------------------------------------------+
Fetched 1 row(s) in 1.33s
[localhost:21000] > select count(replace(l_comment, 'complicated', '')) from tpch.lineitem;
Query: select count(replace(l_comment, 'complicated', '')) from tpch.lineitem
Query submitted at: 2017-02-01 03:21:48 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=ea4d6cd631e38ce2:75984bb600000000

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

Re: [Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by Alex Behm <al...@cloudera.com>.
Patch has some test failures, please fix.

On Tue, Feb 7, 2017 at 9:23 PM, Impala Public Jenkins (Code Review) <
gerrit@cloudera.org> wrote:

> Impala Public Jenkins has posted comments on this change.
>
> Change subject: IMPALA-4729: Implement REPLACE()
> ......................................................................
>
>
> Patch Set 19: Verified-1
>
> Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/247/
>
> --
> To view, visit http://gerrit.cloudera.org:8080/5776
> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
>
> Gerrit-MessageType: comment
> Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
> Gerrit-PatchSet: 19
> Gerrit-Project: Impala-ASF
> Gerrit-Branch: master
> Gerrit-Owner: Zach Amsden <za...@cloudera.com>
> Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
> Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
> Gerrit-Reviewer: Impala Public Jenkins
> Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
> Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
> Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
> Gerrit-HasComments: No
>
> --
> You received this message because you are subscribed to the Google Groups
> "impala-cr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to impala-cr+unsubscribe@cloudera.com.
> For more options, visit https://groups.google.com/a/cloudera.com/d/optout.
>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 19: Verified-1

Build failed: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/247/

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 15: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5776/15/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS15, Line 256: (context->GetFunctionState(FunctionContext::FRAGMENT_LOCAL));
nit: indent 4.


PS15, Line 362: result.len = ptr - result.ptr + bytes_remaining;
Feel free to ignore:

Move this line up to line 360 before the DCHECK and modify the DCHECK to DCHECK_LE(result.len, buffer_space);


http://gerrit.cloudera.org:8080/#/c/5776/15/be/src/udf/udf.h
File be/src/udf/udf.h:

PS15, Line 598:   /// Note that the memory for the buffer of the new StringVal is managed by Impala.
              :   /// Impala will handle freeing it. Callers should not call Free() on the 'ptr' of
              :   /// the StringVal returned.
Please add a similar comment to Resize().


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 20:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/253/

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 218:   const int kMemoizedMatches = 16;
> style: k_memoized_matches or max_memoized_matches
Will fix style - still learning this.

As far as 16, don't want to blow the stack.  16 seems safe.  If this code can nest, wouldn't go large, but if stack is not a concern, no reason not to spend a lot here.  Do you think I should increase it?


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#5).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
10 files changed, 192 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 1:

Hi Zach, please add some regression tests for this.  Take a look at exprs.test & test_exprs.py for examples of similar existing test cases.

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#8).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
9 files changed, 181 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/11/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS11, Line 288: haystack.len - pattern.len + replace.len
> Actually I can't.  The subtraction has to happen first so the we don't over
disregard - now that I have delta this doesn't actually matter.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, Michael Ho, Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#21).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
14 files changed, 401 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/21
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 21
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 4:

This is good to go and fully tested now.  I had to replumb a few things to get a StringVal.Resize() implementation, but now large strings have test coverage as well.

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, Michael Ho, Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5776

to look at the new patch set (#23).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
14 files changed, 400 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/23
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 23
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#13).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
13 files changed, 373 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/13
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5776/2/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

Line 210:   if (pattern.len == 0) return str;
> pattern can be null as well
Yes, unclear what the proper behavior is.  I elected to return the original in that case as well (null also has len == 0).


PS2, Line 221: // First pass - find matches to compute output size
> Thinking it a bit more, the answer could be it depends. In general, append(
This depends on how expensive the allocation and potential over-allocation is for large strings.  We could greedily grab a large enough buffer for mid-size strings (1Mb-ish) and probably want a different strategy with dynamic allocation for very large strings or worst case expansions.


Line 225:       if (match_pos_in_substring < 0)
> single line
Done


PS2, Line 237: int rlen = num_matches * (replace.len - needle.len) + str.len;
> Can this overflow with a malicious replacement string ?
Yes, overflow is a real possibility.  Underflow is not.


PS2, Line 249: DCHECK(match_pos_in_substring >= 0);
> DCHECK_GE()
If we know the number of matches already, yes - i < num_matches.  If we change this to a one-pass algorithm, the answer is more complicated.


http://gerrit.cloudera.org:8080/#/c/5776/2/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

PS2, Line 2955: ident_or_restricted:ident
> Not sure if dotted_path is the right place to modify as it's used at other 
dotted_path is overloaded to be the primary term for function_name - not sure this was a perfect choice but it looks hard to change this decision now.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 9:

Michael, yeah, I will fix those.  I still need to plug in the additional tests that will be required now that we do constant based stuff.

Dan - replace ' ' -> '' was the initial test.  This was much faster, 2x so.  Regexp compilation starts to be more competitive on longer expressions because it has more advanced state.  Still it can't beat a cached StringSearch - 'complicated' actually isn't that complicated, but it (due to the repeated 'c') allows the search to advance faster, better illustrating the difference on longer strings.

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/10/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

Line 2038:   (*short_buf)[4096] = 'Z';
oops, this should be 4095 - test bug


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 20:

Trivial test fix +2

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 16:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS16, Line 215: _
> for structs we don't usually use the underscore (but we do for classes), so
Done


http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/udf/udf.cc
File be/src/udf/udf.cc:

PS16, Line 453: std::find(std::begin(local_allocations_), std::end(local_allocations_), ptr);
> is this going to become a perf issue?
let's just fix it


Line 520:     std::cout << "MAX_LENGTH, Trying to allocate " << len << std::endl;
> delete
kill the one above too?


http://gerrit.cloudera.org:8080/#/c/5776/16/be/src/udf/udf.h
File be/src/udf/udf.h:

Line 583:   /// buffers, and so don't need to be explicitly freed.
> this isn't true in general, but it is true for StringVal's created via this
will fix


PS16, Line 586: StringVal
> ... StringVal that is backed by a local allocation...
will fix


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has uploaded a new patch set (#9).

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................

IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
9 files changed, 201 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/76/5776/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


IMPALA-4729: Implement REPLACE()

This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.

It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart.  Let's leave
optimizing that for later.

Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify.  Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.

Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Reviewed-on: http://gerrit.cloudera.org:8080/5776
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/gutil/bits.h
M be/src/udf/udf-internal.h
M be/src/udf/udf-test-harness.cc
M be/src/udf/udf-test-harness.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M common/function-registry/impala_functions.py
M fe/src/main/cup/sql-parser.cup
M fe/src/test/java/org/apache/impala/analysis/ParserTest.java
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M testdata/workloads/functional-query/queries/QueryTest/large_strings.test
14 files changed, 400 insertions(+), 12 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 25
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5776/11/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

PS11, Line 288: haystack.len - pattern.len + replace.len
> For clarity, can you please use haystack.len + delta ?
Actually I can't.  The subtraction has to happen first so the we don't overflow.  The results of the computation after overflow in signed arithmetic become undefined behavior.


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 9:

What do you mean by "complicated" queries?
And how much did computing the StringSearch only once help?  From your numbers below, REPLACE is not much faster than REGEX_REPLACE, but I thought earlier you were seeing a 2x speedup.

-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4729: Implement REPLACE()

Posted by "Zach Amsden (Code Review)" <ge...@cloudera.org>.
Zach Amsden has posted comments on this change.

Change subject: IMPALA-4729: Implement REPLACE()
......................................................................


Patch Set 6:

(1 comment)

After pre-loading the data (lost the first few lines, can't figure out how to get more scrollback in bash on Windows yet ;)

But we can see replace() now wins on all simple replacements.  The first version lost horribly on replace of a single space with 17 spaces.  The fancier buffer sizing on expanding patterns was actually required.

+------------------------------------------------------------------+
| sum(length(regexp_replace(l_comment, ' ', '                 '))) |
+------------------------------------------------------------------+
| 496964585                                                        |
+------------------------------------------------------------------+
Fetched 1 row(s) in 3.63s
[localhost:21000] > select sum(length(replace(l_comment, ' ', '                 '))) from tpch.lineitem;
Query: select sum(length(replace(l_comment, ' ', '                 '))) from tpch.lineitem
Query submitted at: 2017-02-03 19:21:42 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=504aabb65f306ab1:7601c2c800000000
+-----------------------------------------------------------+
| sum(length(replace(l_comment, ' ', '                 '))) |
+-----------------------------------------------------------+
| 496964585                                                 |
+-----------------------------------------------------------+
Fetched 1 row(s) in 1.63s
[localhost:21000] > select sum(length(regexp_replace(l_comment, ' ', ''))) from tpch.lineitem;
Query: select sum(length(regexp_replace(l_comment, ' ', ''))) from tpch.lineitem
Query submitted at: 2017-02-03 19:21:58 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=9440a311864f0940:f89d1f3e00000000
+-------------------------------------------------+
| sum(length(regexp_replace(l_comment, ' ', ''))) |
+-------------------------------------------------+
| 137874248                                       |
+-------------------------------------------------+
Fetched 1 row(s) in 3.04s
[localhost:21000] > select sum(length(replace(l_comment, ' ', ''))) from tpch.lineitem;
Query: select sum(length(replace(l_comment, ' ', ''))) from tpch.lineitem
Query submitted at: 2017-02-03 19:22:09 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=c345d0f14967fd99:4b23ef8c00000000
+------------------------------------------+
| sum(length(replace(l_comment, ' ', ''))) |
+------------------------------------------+
| 137874248                                |
+------------------------------------------+
Fetched 1 row(s) in 1.54s
[localhost:21000] > select sum(length(regexp_replace(l_comment, 'e', 'I'))) from tpch.lineitem;
Query: select sum(length(regexp_replace(l_comment, 'e', 'I'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:22:47 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=d8405ff45581ef67:7df386ab00000000
+--------------------------------------------------+
| sum(length(regexp_replace(l_comment, 'e', 'i'))) |
+--------------------------------------------------+
| 158997209                                        |
+--------------------------------------------------+
Fetched 1 row(s) in 2.84s
[localhost:21000] > select sum(length(replace(l_comment, 'e', 'I'))) from tpch.lineitem;
Query: select sum(length(replace(l_comment, 'e', 'I'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:22:58 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=5f42bba668201666:623a45f200000000
+-------------------------------------------+
| sum(length(replace(l_comment, 'e', 'i'))) |
+-------------------------------------------+
| 158997209                                 |
+-------------------------------------------+
Fetched 1 row(s) in 1.63s
[localhost:21000] > select sum(length(regex_replace(l_comment, 'he', 'HE'))) from tpch.lineitem;
Query: select sum(length(regex_replace(l_comment, 'he', 'HE'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:23:30 (Coordinator: http://impala-dev:25000)
ERROR: AnalysisException: default.regex_replace() unknown

[localhost:21000] > select sum(length(regexp_replace(l_comment, 'he', 'HE'))) from tpch.lineitem;
Query: select sum(length(regexp_replace(l_comment, 'he', 'HE'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:23:37 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=134a4fc63cf86279:372eeb6f00000000
+----------------------------------------------------+
| sum(length(regexp_replace(l_comment, 'he', 'he'))) |
+----------------------------------------------------+
| 158997209                                          |
+----------------------------------------------------+
Fetched 1 row(s) in 1.73s
[localhost:21000] > select sum(length(replace(l_comment, 'he', 'HE'))) from tpch.lineitem;
Query: select sum(length(replace(l_comment, 'he', 'HE'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:23:45 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=1541a32bc801a543:8efcb85300000000
+---------------------------------------------+
| sum(length(replace(l_comment, 'he', 'he'))) |
+---------------------------------------------+
| 158997209                                   |
+---------------------------------------------+
Fetched 1 row(s) in 1.53s
[localhost:21000] > select sum(length(regexp_replace(l_comment, 'comment', '//'))) from tpch.lineitem;
Query: select sum(length(regexp_replace(l_comment, 'comment', '//'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:24:22 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=5e453cd4807ee411:987abebe00000000
+---------------------------------------------------------+
| sum(length(regexp_replace(l_comment, 'comment', '//'))) |
+---------------------------------------------------------+
| 158997209                                               |
+---------------------------------------------------------+
Fetched 1 row(s) in 1.74s
[localhost:21000] > select sum(length(replace(l_comment, 'comment', '//'))) from tpch.lineitem;
Query: select sum(length(replace(l_comment, 'comment', '//'))) from tpch.lineitem
Query submitted at: 2017-02-03 19:24:30 (Coordinator: http://impala-dev:25000)
Query progress can be monitored at: http://impala-dev:25000/query_plan?query_id=854f49d5c5c3951f:271f7f2200000000
+--------------------------------------------------+
| sum(length(replace(l_comment, 'comment', '//'))) |
+--------------------------------------------------+
| 158997209                                        |
+--------------------------------------------------+
Fetched 1 row(s) in 1.33s
[localhost:21000] >

http://gerrit.cloudera.org:8080/#/c/5776/6/fe/src/main/cup/sql-parser.cup
File fe/src/main/cup/sql-parser.cup:

Line 2619:   /* Since "IF", "TRUNCATE" are keywords, need to special case these functions */
> update comment
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <za...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zach Amsden <za...@cloudera.com>
Gerrit-HasComments: Yes