Posted to issues@spark.apache.org by "Max Moroz (JIRA)" <ji...@apache.org> on 2016/06/25 06:20:16 UTC

[jira] [Created] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

Max Moroz created SPARK-16203:
---------------------------------

             Summary: regexp_extract to return an ArrayType(StringType())
                 Key: SPARK-16203
                 URL: https://issues.apache.org/jira/browse/SPARK-16203
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Max Moroz
            Priority: Minor


regexp_extract only returns a single matched group. If (as is often the case - e.g., web log parsing) we need to parse the entire line and get all the groups, we'll need to call it as many times as there are groups.

It's only a minor annoyance syntactically.

But unless I misunderstand something, it would be very inefficient. (How would Spark know not to do multiple pattern matching operations when only one is needed? Or does the optimizer actually check whether the patterns are identical and, if they are, avoid the repeated regex matching operations?)
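
To make the cost concern concrete, here is a minimal plain-Python sketch (not Spark code). The helper extract_group is a hypothetical stand-in that mimics the one-group-per-call shape of regexp_extract(str, pattern, idx); the log line and pattern are made up for illustration. It says nothing about what Spark's optimizer actually does, only what happens if each call is evaluated independently.

```python
import re

# Hypothetical sample web log line and pattern, for illustration only.
LINE = '127.0.0.1 - - [25/Jun/2016:06:20:16 +0000] "GET /index.html HTTP/1.1" 200 1024'
PATTERN = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)$'

match_calls = 0  # counts how many times the regex engine is run by extract_group


def extract_group(s, pattern, idx):
    """Stand-in for a one-group-per-call extractor: returns group idx, or '' on no match."""
    global match_calls
    match_calls += 1
    m = re.search(pattern, s)
    return m.group(idx) if m else ''


# One call per group: the same line is matched once for each of the seven groups.
fields_repeated = [extract_group(LINE, PATTERN, i) for i in range(1, 8)]
calls_repeated = match_calls

# Single match: all groups come out of one pass over the line.
m = re.search(PATTERN, LINE)
fields_single = list(m.groups()) if m else []
```

Both approaches yield the same fields, but the first runs the pattern once per group unless something deduplicates the work.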

Would it be possible to have it return an array when the index is not specified (defaulting to None)?
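
A rough sketch of the requested behaviour, in plain Python rather than PySpark. The name regexp_extract_sketch is hypothetical, and the no-match results (an empty list for the array form, an empty string for the single-group form) are assumptions made here, not something this issue specifies.

```python
import re


def regexp_extract_sketch(s, pattern, idx=None):
    """Hypothetical semantics: idx=None returns every group as a list;
    an integer idx returns that single group as a string."""
    m = re.search(pattern, s)
    if idx is None:
        # Assumption: no match yields an empty list; unmatched optional groups become ''.
        return [g if g is not None else '' for g in m.groups()] if m else []
    # Assumption: no match (or an unmatched group) yields an empty string.
    if m is None or m.group(idx) is None:
        return ''
    return m.group(idx)


all_groups = regexp_extract_sketch('GET /index.html 200', r'(\S+) (\S+) (\d+)')
one_group = regexp_extract_sketch('GET /index.html 200', r'(\S+) (\S+) (\d+)', 2)
```

With the index omitted, the caller gets every group from a single match; passing an index keeps the existing one-group call shape.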



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
