You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Matthew Dixon <ma...@jagex.com> on 2015/02/17 14:16:18 UTC
RE: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings

I suspect that your ultimate goal with this is likely solved by extracting just the bits of the original string that you want, ie the underlined section below, and then doing some regex over the result:



                I like listening to rock music



So:



SET PATTERNS=(rock music)|(hip hop)|(dub step);



from (

from (

  from dual

  select 'I like listening to rock music' as str

) a

select

  a.str

  ,regexp_replace(regexp_extract(str,'${hiveconf:PATTERNS}'),' ','') as genre

  ,regexp_replace(str,'${hiveconf:PATTERNS}','#genre') as str_minus_genre

) b

select

 b.str as original

,b.str_minus_genre

,b.genre as extracted_genre

,regexp_replace(b.str_minus_genre,'#genre',b.genre) as cleaned_str;











-----Original Message-----
From: Viral Parikh [mailto:Viral.Parikh@match.com]
Sent: 12 February 2015 09:32
To: user@hive.apache.org
Cc: Viral Parikh
Subject: COMMERCIAL:Re: COMMERCIAL:Re: Hive - regexp_replace function for multiplestrings



Thank you Matthew!



In Solution 2, is the highlighted portion your typo - so 'hip music' would remain 'hip music'. Did you hipmusic (as unigram?)



However your second comment is correct - "however it presumes that for each ngram you don't want to keep any spaces", but I don't want that to happen.



So for example if I have something like -



I like listening to rock music.



I don't want this to be - Ilikelisteningtorockmusic since with the IN condition it selected this statement and its replacing all spaces with no space as per the regexp_replace function.



Correct me if I am understanding your solution 2 wrong?





Thanks,

Viral

From: Matthew Dixon <ma...@jagex.com>>>

Reply-To: "user@hive.apache.org<ma...@hive.apache.org><mailto:user@hive.apache.org%3cmailto:user@hive.apache.org%3e>" <us...@hive.apache.org>>>

Date: Friday, February 6, 2015 at 3:18 AM

To: "user@hive.apache.org<ma...@hive.apache.org><mailto:user@hive.apache.org%3cmailto:user@hive.apache.org%3e>" <us...@hive.apache.org>>>

Cc: Viral Parikh <vi...@gmail.com>>>

Subject: RE: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



Below 2 solutions.



Solution1 uses lookahead and lookbehind but works with bi-grams only.  It also doesn't enforce the pairs you're asking for, so for instance hip music would become hipmusic.



Solution2 uses simple IN syntax with if(), works with n-grams beyond bi-grams and enforces the actual patterns you want to change (so 'hip music' would remain 'hip music', however it presumes that for each ngram you don't want to keep any spaces.



from (

from dual

select explode(array('hip hop','rock music')) as txt

) sub

select

txt

,regexp_replace(txt,'(?<=(hip)|(rock)) (?=(hop)|(music))','') as solution1 ,if(txt in ('hip hop','rock music'),regexp_replace(txt,' ',''),txt) as solution2





Using a text editor like sublime or n++ you could construct the monster regex (for solution1) or the IN set (for solution2) to capture all the patterns you need.





From: Pradeep Gollakota [mailto:pradeepg26@gmail.com]

Sent: 03 February 2015 21:56

To: user@hive.apache.org<ma...@hive.apache.org>>

Cc: Viral Parikh

Subject: COMMERCIAL:Re: Hive - regexp_replace function for multiple strings



I don't think this is doable using the out of the box regexp_replace() UDF. That way I would do it, is using a file to create a mapping between a regexp and it's replacement and write a custom UDF that loads this file and applies all regular expressions on the input.



Hope this helps.



On Tue, Feb 3, 2015 at 10:46 AM, Viral Parikh <vi...@gmail.com>>> wrote:



Hi Everyone,



I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.



drop table vp_hiphop;



create table vp_hiphop as



select userid, ntext,



       regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1



from  vp_nlp_protext_males



;



But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.



drop table vp_hiphop;



create table vp_hiphop as



select  userid, ntext,



        regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1



from  vp_nlp_protext_males



;









--



IMPORTANT NOTICE



The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.



Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.






-- 

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed.

Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.