Posted to dev@pig.apache.org by "Michael Howard (JIRA)" <ji...@apache.org> on 2015/04/15 19:13:59 UTC

[jira] [Commented] (PIG-4507) Problem with REGEX which just match for the first word

    [ https://issues.apache.org/jira/browse/PIG-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496536#comment-14496536 ] 

Michael Howard commented on PIG-4507:
-------------------------------------

My understanding is as follows:

You would like to use a regex to map sequences of non-word characters to a single space, so that what you are left with is a string of "words" separated by a single space char. 
You want to map a string to a string. 

The REGEX_EXTRACT_ALL function is designed to map a structured string to a tuple: it extracts the fields of the string and returns them as a tuple/record. (REGEX_EXTRACT does the same thing, but for a single field.) Part of that structure is that the string contains a fixed number of fields, one per capturing group in the regex. To the best of my knowledge, there isn't any way to specify a variable number of groups in a regex.
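To illustrate the fixed-number-of-groups point, here is a minimal sketch. It uses Python's re module purely as a stand-in for Pig's regex engine (an assumption; Pig itself is not involved), and the variable names are made up for the example. One capturing group gives one field no matter how many words the input holds, and getting four fields requires spelling out four groups in advance.

```python
import re

# Sample input taken from the issue description.
text = 'toto,  likes ... to play '

# One capturing group: only one field is captured, however many words exist.
one_group = re.search(r'(\w+)', text)
fields = one_group.groups()

# Four capturing groups: exactly four fields, and the pattern only works
# for inputs that really contain four words.
four_groups = re.search(r'(\w+)\W+(\w+)\W+(\w+)\W+(\w+)', text)
four_fields = four_groups.groups()

print(fields)
print(four_fields)
```

The second pattern only helps when the word count is known up front, which is not the case for free text.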

I don't think REGEX_EXTRACT_ALL is what you want to use. 

I suggest using Pig's REPLACE function instead of REGEX_EXTRACT_ALL. It will let you replace each sequence of non-word chars with a single space. I think it should be more or less like:

  REPLACE(dirty_string, '\\W+', ' ') AS clean_string
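As a rough check of what that substitution does to the sample string from the issue, here is a small sketch. Python's re.sub stands in for Pig's REPLACE, which is an assumption about equivalent regex behaviour and not a Pig script; the helper name clean is hypothetical. The final strip is an extra step not in the suggestion above: it drops the single space that the replacement leaves where the input ended in non-word characters (Pig's built-in TRIM would be the analogous call).

```python
import re

def clean(dirty_string):
    """Collapse every run of non-word characters into one space."""
    collapsed = re.sub(r'\W+', ' ', dirty_string)
    # Extra step (not part of the suggested REPLACE call): remove the
    # leading/trailing space left behind by punctuation at either end.
    return collapsed.strip()

sample = 'toto,  likes ... to play '
replaced_only = re.sub(r'\W+', ' ', sample)
print(repr(replaced_only))
print(repr(clean(sample)))
```

Note that \W treats the underscore as a word character, so underscores survive the replacement.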

Good luck.


> Problem with REGEX which just match for the first word
> ------------------------------------------------------
>
>                 Key: PIG-4507
>                 URL: https://issues.apache.org/jira/browse/PIG-4507
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>         Environment: IBM Infosphere BigInsights v3.0.0.1
>            Reporter: Adrien Bidault
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> I am trying to eliminate punctuation and special symbols from a string using REGEX of a type "(\\w+)". The problem is that this REGEX treatment is applied to the first word of the string only.
> Example:
> clean3 = FOREACH clean1 GENERATE id, REGEX_EXTRACT_ALL('toto,  likes ... to play ', '(\\w+)');
> It just returns "toto" instead of "toto likes to play".
> Would you guys have any ideas?


