You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Viral Parikh <vi...@gmail.com> on 2015/02/03 19:46:28 UTC

Hive - regexp_replace function for multiple strings

Hi Everyone,

I am using hive 0.13! I want to find multiple tokens like "hip hop" and
"rock music" in my data and replace them with "hiphop" and "rockmusic" -
basically replace them without white space. I have used the regexp_replace
function in hive. Below is my query and it works great for above 2 examples.

drop table vp_hiphop;
create table vp_hiphop asselect userid, ntext,
       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'),
'rock music', 'rockmusic') as ntext1from  vp_nlp_protext_males;

But I have 100 such bigrams/ngrams and want to be able to do replace
efficiently where I just remove the whitespace. I can pattern match the
phrase - hip hop and rock music but in the replace I want to simply trim
the white spaces. Below is what I tried. I also tried using trim with
regexp_replace but it wants the third argument in the regexp_replace
function.

drop table vp_hiphop;
create table vp_hiphop asselect  userid, ntext,
        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1from
vp_nlp_protext_males;

RE: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings

Posted by Matthew Dixon <ma...@jagex.com>.

I suspect that your ultimate goal with this is likely solved by extracting just the bits of the original string that you want, ie the underlined section below, and then doing some regex over the result:



                I like listening to rock music



So:



SET PATTERNS=(rock music)|(hip hop)|(dub step);



from (

from (

  from dual

  select 'I like listening to rock music' as str

) a

select

  a.str

  ,regexp_replace(regexp_extract(str,'${hiveconf:PATTERNS}'),' ','') as genre

  ,regexp_replace(str,'${hiveconf:PATTERNS}','#genre') as str_minus_genre

) b

select

 b.str as original

,b.str_minus_genre

,b.genre as extracted_genre

,regexp_replace(b.str_minus_genre,'#genre',b.genre) as cleaned_str;











-----Original Message-----
From: Viral Parikh [mailto:Viral.Parikh@match.com]
Sent: 12 February 2015 09:32
To: user@hive.apache.org
Cc: Viral Parikh
Subject: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings



Thank you Matthew!



In Solution 2, is the highlighted portion your typo - so 'hip music' would remain 'hip music'. Did you hipmusic (as unigram?)



However your second comment is correct - "however it presumes that for each ngram you don't want to keep any spaces", but I don't want that to happen.



So for example if I have something like -



I like listening to rock music.



I don't want this to be - Ilikelisteningtorockmusic since with the IN condition it selected this statement and its replacing all spaces with no space as per the regexp_replace function.



Correct me if I am understanding your solution 2 wrong?





Thanks,

Viral

From: Matthew Dixon <ma...@jagex.com>>>

Reply-To: "user@hive.apache.org<ma...@hive.apache.org><mailto:user@hive.apache.org%3cmailto:user@hive.apache.org%3e>" <us...@hive.apache.org>>>

Date: Friday, February 6, 2015 at 3:18 AM

To: "user@hive.apache.org<ma...@hive.apache.org><mailto:user@hive.apache.org%3cmailto:user@hive.apache.org%3e>" <us...@hive.apache.org>>>

Cc: Viral Parikh <vi...@gmail.com>>>

Subject: RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



Below 2 solutions.



Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also doesn't enforce the pairs you're asking for, so for instance hip music would become hipmusic.



Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams and enforces the actual patterns you want to change (so 'hip music' would remain 'hip music', however it presumes that for each ngram you don't want to keep any spaces.



from (

from dual

select explode(array('hip hop','rock music')) as txt

) sub

select

txt

,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1 ,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2





Using a text editor like sublime or n++ you could construct the monster regex (for solution1) or the IN set (for solution2) to capture all the patterns you need.





From: Pradeep Gollakota [mailto:pradeepg26@gmail.com]

Sent: 03 February 2015 21:56

To: user@hive.apache.org<ma...@hive.apache.org>>

Cc: Viral Parikh

Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



I don't think this is doable using the out of the box regexp_replace() UDF. That way I would do it, is using a file to create a mapping between a regexp and it's replacement and write a custom UDF that loads this file and applies all regular expressions on the input.



Hope this helps.



On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>>> wrote:



Hi Everyone,



I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.



drop table vp_hiphop;



create table vp_hiphop as



select userid, ntext,



       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1



from  vp_nlp_protext_males



;



But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.



drop table vp_hiphop;



create table vp_hiphop as



select  userid, ntext,



        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1



from  vp_nlp_protext_males



;









--



IMPORTANT NOTICE



The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.



Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.






-- 

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.

Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.

Re: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

Posted by Viral Parikh <Vi...@match.com>.

Thank you Matthew!

In Solution 2, is the highlighted portion your typo - so 'hip music' would remain 'hip music'. Did you hipmusic (as unigram?)

However your second comment is correct - "however it presumes that for each ngram you don't want to keep any spaces", but I don't want that to happen.

So for example if I have something like -

I like listening to rock music.

I don't want this to be - Ilikelisteningtorockmusic since with the IN condition it selected this statement and its replacing all spaces with no space as per the regexp_replace function.

Correct me if I am understanding your solution 2 wrong?


Thanks,
Viral
From: Matthew Dixon <ma...@jagex.com>>
Reply-To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Date: Friday, February 6, 2015 at 3:18 AM
To: "user@hive.apache.org<ma...@hive.apache.org>" <us...@hive.apache.org>>
Cc: Viral Parikh <vi...@gmail.com>>
Subject: RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

Below 2 solutions.

Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also doesn't enforce the pairs you're asking for, so for instance hip music would become hipmusic.

Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams and enforces the actual patterns you want to change (so 'hip music' would remain 'hip music', however it presumes that for each ngram you don't want to keep any spaces.

from (
from dual
 select explode(array('hip hop','rock music')) as txt
) sub
select
 txt
,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1
,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2


Using a text editor like sublime or n++ you could construct the monster regex (for solution1) or the IN set (for solution2) to capture all the patterns you need.


From: Pradeep Gollakota [mailto:pradeepg26@gmail.com]
Sent: 03 February 2015 21:56
To: user@hive.apache.org<ma...@hive.apache.org>
Cc: Viral Parikh
Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

I don't think this is doable using the out of the box regexp_replace() UDF. That way I would do it, is using a file to create a mapping between a regexp and it's replacement and write a custom UDF that loads this file and applies all regular expressions on the input.

Hope this helps.

On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>> wrote:

Hi Everyone,

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.

drop table vp_hiphop;

create table vp_hiphop as

select userid, ntext,

       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1

from  vp_nlp_protext_males

;

But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.

drop table vp_hiphop;

create table vp_hiphop as

select  userid, ntext,

        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1

from  vp_nlp_protext_males

;




--

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.

Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.

RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

Posted by Matthew Dixon <ma...@jagex.com>.

Below 2 solutions.

Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also doesn’t enforce the pairs you’re asking for, so for instance hip music would become hipmusic.

Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams and enforces the actual patterns you want to change (so ‘hip music’ would remain ‘hip music’, however it presumes that for each ngram you don’t want to keep any spaces.

from (
from dual
 select explode(array('hip hop','rock music')) as txt
) sub
select
 txt
,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1
,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2


Using a text editor like sublime or n++ you could construct the monster regex (for solution1) or the IN set (for solution2) to capture all the patterns you need.


From: Pradeep Gollakota [mailto:pradeepg26@gmail.com]
Sent: 03 February 2015 21:56
To: user@hive.apache.org
Cc: Viral Parikh
Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings

I don't think this is doable using the out of the box regexp_replace() UDF. That way I would do it, is using a file to create a mapping between a regexp and it's replacement and write a custom UDF that loads this file and applies all regular expressions on the input.

Hope this helps.

On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>> wrote:

Hi Everyone,

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.

drop table vp_hiphop;

create table vp_hiphop as

select userid, ntext,

       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1

from  vp_nlp_protext_males

;

But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.

drop table vp_hiphop;

create table vp_hiphop as

select  userid, ntext,

        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1

from  vp_nlp_protext_males

;

-- 

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.

Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.

Re: Hive - regexp_replace function for multiple strings

Posted by Devopam Mittra <de...@gmail.com>.

hi Viral,
Unless you are strictly bound to change the text to achieve your
objectives... you may actually wish to explore ngrams and context_ngram
combinations to uniquely identify the patterns you want to explore and move
them to a new table for further processinng

Better do it at file level on Unix for faster and cleaner results , if it
has to be done the replacing way only.

regards
Devopam


On Wed, Feb 4, 2015 at 3:25 AM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> I don't think this is doable using the out of the box regexp_replace()
> UDF. That way I would do it, is using a file to create a mapping between a
> regexp and it's replacement and write a custom UDF that loads this file and
> applies all regular expressions on the input.
>
> Hope this helps.
>
> On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>
> wrote:
>
>> Hi Everyone,
>>
>> I am using hive 0.13! I want to find multiple tokens like "hip hop" and
>> "rock music" in my data and replace them with "hiphop" and "rockmusic" -
>> basically replace them without white space. I have used the regexp_replace
>> function in hive. Below is my query and it works great for above 2 examples.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop asselect userid, ntext,
>>        regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1from  vp_nlp_protext_males;
>>
>> But I have 100 such bigrams/ngrams and want to be able to do replace
>> efficiently where I just remove the whitespace. I can pattern match the
>> phrase - hip hop and rock music but in the replace I want to simply trim
>> the white spaces. Below is what I tried. I also tried using trim with
>> regexp_replace but it wants the third argument in the regexp_replace
>> function.
>>
>> drop table vp_hiphop;
>> create table vp_hiphop asselect  userid, ntext,
>>         regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1from  vp_nlp_protext_males;
>>
>>
>


-- 
Devopam Mittra
Life and Relations are not binary

Re: Hive - regexp_replace function for multiple strings

Posted by Pradeep Gollakota <pr...@gmail.com>.

I don't think this is doable using the out of the box regexp_replace() UDF.
That way I would do it, is using a file to create a mapping between a
regexp and it's replacement and write a custom UDF that loads this file and
applies all regular expressions on the input.

Hope this helps.

On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>
wrote:

> Hi Everyone,
>
> I am using hive 0.13! I want to find multiple tokens like "hip hop" and
> "rock music" in my data and replace them with "hiphop" and "rockmusic" -
> basically replace them without white space. I have used the regexp_replace
> function in hive. Below is my query and it works great for above 2 examples.
>
> drop table vp_hiphop;
> create table vp_hiphop asselect userid, ntext,
>        regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1from  vp_nlp_protext_males;
>
> But I have 100 such bigrams/ngrams and want to be able to do replace
> efficiently where I just remove the whitespace. I can pattern match the
> phrase - hip hop and rock music but in the replace I want to simply trim
> the white spaces. Below is what I tried. I also tried using trim with
> regexp_replace but it wants the third argument in the regexp_replace
> function.
>
> drop table vp_hiphop;
> create table vp_hiphop asselect  userid, ntext,
>         regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1from  vp_nlp_protext_males;
>
>