You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Igor Kuzmenko <f1...@gmail.com> on 2016/08/04 14:14:56 UTC
Hive LIKE predicate. '_' wildcard decrease perfomance
I've got Hive Transactional table 'data_http' in ORC format, containing
around 100.000.000 rows.
When I execute query:
select * from data_http
where res_url like '%mts.ru%'
it completes in 10 seconds.
But executing query
select * from data_http
where res_url like '%mts_ru%'
takes more than 30 minutes.
Why '_' wildcard decrease perfomance?
Re: Hive LIKE predicate. '_' wildcard decrease perfomance
Posted by Igor Kuzmenko <f1...@gmail.com>.
Thanks for reply, Gopal. Very helpful.
On Thu, Aug 4, 2016 at 10:15 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:
> > where res_url like '%mts.ru%'
> ...
> > where res_url like '%mts_ru%'
> ...
> > Why '_' wildcard decrease perfomance?
>
> Because it misses the fast path by just one "_".
>
> ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
> suffix and middle.
>
> That means "https://%", "%.html", "%mts.ru%" will hit the fast path -
> which uses StringExpr::equal() which JITs into the following.
>
> https://issues.apache.org/jira/secure/attachment/
> 12748720/string-intrinsic-
> sse.png
>
>
> In Hive-2.0, you can mix these up too to get "https:%mts%.html" in a
> ChainedChecker.
>
>
> Anything other than these 3 cases becomes a Regex and takes the slow path.
>
> The pattern you mentioned gets rewritten into ".*mts.ru.*" and the inner
> loop has a new String() as the input to the matcher + matcher.matches() in
> it.
>
> I've put in some patches recently which rewrite it Lazy regexes like
> ".?*mts.ru.?*", so the regex DFA will be smaller (HIVE-13196).
>
> That improves the case where the pattern is found, but does nothing to
> improve the performance of the new String() GC garbage.
>
> Cheers,
> Gopal
>
>
>
Re: Hive LIKE predicate. '_' wildcard decrease perfomance
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> where res_url like '%mts.ru%'
...
> where res_url like '%mts_ru%'
...
> Why '_' wildcard decrease perfomance?
Because it misses the fast path by just one "_".
ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
suffix and middle.
That means "https://%", "%.html", "%mts.ru%" will hit the fast path -
which uses StringExpr::equal() which JITs into the following.
https://issues.apache.org/jira/secure/attachment/12748720/string-intrinsic-
sse.png
In Hive-2.0, you can mix these up too to get "https:%mts%.html" in a
ChainedChecker.
Anything other than these 3 cases becomes a Regex and takes the slow path.
The pattern you mentioned gets rewritten into ".*mts.ru.*" and the inner
loop has a new String() as the input to the matcher + matcher.matches() in
it.
I've put in some patches recently which rewrite it Lazy regexes like
".?*mts.ru.?*", so the regex DFA will be smaller (HIVE-13196).
That improves the case where the pattern is found, but does nothing to
improve the performance of the new String() GC garbage.
Cheers,
Gopal