
Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/08/04 19:56:20 UTC

[jira] [Updated] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

     [ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-16203:
------------------------------
    Component/s:     (was: PySpark)

> regexp_extract to return an ArrayType(StringType())
> ---------------------------------------------------
>
>                 Key: SPARK-16203
>                 URL: https://issues.apache.org/jira/browse/SPARK-16203
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case, e.g., in web log parsing) we need to parse the entire line and get all the groups, we'll need to call it as many times as there are groups.
> It's only a minor annoyance syntactically.
> But unless I misunderstand something, it would be very inefficient. (How would Spark know not to run multiple pattern-matching operations when only one is needed? Or does the optimizer actually check whether the patterns are identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array of all the groups when the index is not specified (with the index defaulting to None)?
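
For illustration only, the difference the reporter describes can be sketched in plain Python with the standard `re` module. This is not Spark code and says nothing about what Spark's optimizer actually does. The log line, the pattern, and the helper names `extract_group` and `extract_all_groups` are hypothetical, chosen to mirror one-group-per-call extraction versus a single match that returns every group.

```python
import re

# Hypothetical access-log pattern with five capture groups (illustrative only).
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})'
)


def extract_group(line, pattern, idx):
    """One-group-per-call style: every call runs the regex match again."""
    m = pattern.search(line)
    # Returning "" on no match is an assumption of this sketch.
    return m.group(idx) if m else ""


def extract_all_groups(line, pattern):
    """Style requested in the issue: match once, return all groups as a list."""
    m = pattern.search(line)
    return list(m.groups()) if m else []


line = '127.0.0.1 - - [04/Aug/2016:19:56:20 +0000] "GET /index.html HTTP/1.1" 200'

# Five separate calls, hence five regex matches over the same line.
per_call = [extract_group(line, LOG_PATTERN, i) for i in range(1, 6)]

# One call, one regex match, same values.
all_at_once = extract_all_groups(line, LOG_PATTERN)
```

Both approaches yield the same five values; the second simply avoids repeating the match, which is the saving the issue asks about.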



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
