Posted to dev@pig.apache.org by "Michael Howard (JIRA)" <ji...@apache.org> on 2015/04/15 19:13:59 UTC
[jira] [Commented] (PIG-4507) Problem with REGEX which just match for the first word
[ https://issues.apache.org/jira/browse/PIG-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496536#comment-14496536 ]
Michael Howard commented on PIG-4507:
-------------------------------------
My understanding is as follows:
You would like to use a regex to map sequences of non-word characters to a single space, so that what you are left with is a string of "words" separated by a single space char.
You want to map a string to a string.
The REGEX_EXTRACT_ALL function is designed to map a structured string into a tuple: it extracts the fields of the string and returns them as a tuple/record. (REGEX_EXTRACT does the same thing, only for a single field.) Part of the structure you provide is that the string contains a fixed number of fields, one per capture group. To the best of my knowledge, there isn't any way to specify a variable number of groups in a regex.
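To illustrate the fixed-number-of-groups point outside Pig, here is a rough sketch in Python. Python's re module is only a stand-in for the regex engine Pig uses, and the three-field record is a made-up example, not something from the issue. It is not a reproduction of REGEX_EXTRACT_ALL itself.

```python
import re

# Hypothetical structured input with exactly three comma-separated fields.
record = "toto,42,paris"

# Three capture groups in the pattern, so the result always has three fields.
fields = re.fullmatch(r"(\w+),(\w+),(\w+)", record).groups()

# A single capture group can only ever yield a single field,
# no matter how many words the input contains.
first = re.match(r"(\w+)", "toto, likes ... to play ").groups()

print(fields)
print(first)
```

The number of fields you get back is decided by the pattern, not by the input, which is why a one-group pattern cannot return every word.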
I don't think REGEX_EXTRACT_ALL is what you want to use.
I suggest that you want to use the pig REPLACE function instead of REGEX_EXTRACT_ALL. This will allow you to replace sequences of non-word chars with a single space. I think it should be more-or-less like:
REPLACE(dirty_string, '\\W+', ' ') AS clean_string
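As a quick sanity check of the pattern itself, here is a small Python sketch that applies the same regex and replacement to the string from the issue. Python's re.sub is an assumption standing in for Pig's REPLACE; the function name clean is just for illustration.

```python
import re

def clean(dirty_string):
    # Replace every run of non-word characters with a single space,
    # using the same pattern and replacement as the REPLACE call above.
    return re.sub(r"\W+", " ", dirty_string)

clean_string = clean("toto, likes ... to play ")
print(repr(clean_string))
```

Note that a trailing run of non-word characters also becomes a single space, so the result here ends with a space. If that matters, trimming the result afterwards (for example with Pig's TRIM) should take care of it.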
Good luck.
> Problem with REGEX which just match for the first word
> ------------------------------------------------------
>
> Key: PIG-4507
> URL: https://issues.apache.org/jira/browse/PIG-4507
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.12.0
> Environment: IBM Infosphere BigInsights v3.0.0.1
> Reporter: Adrien Bidault
> Original Estimate: 6h
> Remaining Estimate: 6h
>
> I am trying to eliminate punctuation and special symbols from a string using REGEX of a type "(\\w+)". The problem is that this REGEX treatment is applied to the first word of the string only.
> Example:
> clean3 = FOREACH clean1 GENERATE id, REGEX_EXTRACT_ALL('toto, likes ... to play ', '(\\w+)');
> It just returns "toto" instead of "toto likes to play"
> Would you guys have any ideas?