You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by orhankislal <gi...@git.apache.org> on 2017/01/23 22:32:27 UTC

[GitHub] incubator-madlib pull request #89: K-means: support for array input

GitHub user orhankislal opened a pull request:

    https://github.com/apache/incubator-madlib/pull/89

    K-means: support for array input

    JIRA: MADLIB-1018
    
    Adds support for array input as data points. The function collates
    the columns into a column in a temp table.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/orhankislal/incubator-madlib feature/kmeans_arr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/89.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #89
    
----
commit d1a55cad17b29c29b2c8bf1ac61fca9d79dd9508
Author: Orhan Kislal <ok...@pivotal.io>
Date:   2017-01-23T21:13:16Z

    K-means: support for array input
    
    JIRA: MADLIB-1018
    
    Adds support for array input as data points. The function collates
    the columns into a column in a temp table.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97463879
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    if not columns_exist_in_table(rel_source, [expr_point]):
    +
    +        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
    +
    +        if p.match(expr_point.strip()):
    +            view_name = unique_string('km_view')
    +
    +            plpy.execute(""" CREATE VIEW {view_name} AS
    +                SELECT {expr_point} AS expr FROM {rel_source}
    +                """.format(**locals()))
    +            return view_name,True
    +        else:
    +            plpy.error(
    --- End diff --
    
    How does failure of the regex mean something doesn't exist in the database?
    
    Maybe what you want is a `try:` block around the execute that returns this error, and a separate error if the regex doesn't match? If that's the case, I suggest changing the regex to `if not p.match(...): plpy.error()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97463583
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    if not columns_exist_in_table(rel_source, [expr_point]):
    +
    +        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
    +
    +        if p.match(expr_point.strip()):
    +            view_name = unique_string('km_view')
    +
    +            plpy.execute(""" CREATE VIEW {view_name} AS
    --- End diff --
    
    Would this be better as a TEMP view? Or do all the callers create permanent tables?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r98128463
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +38,34 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    """
    +    Validation function for the expr_point parameter
    +    expr_point accepts 2 formats:
    +        - A single column name
    +        - An array expression
    +    If the array expression is used, we use the is_var_valid function to
    +    verify. This allows the regular array notation as well as any function
    +    that returns an array.
    +    """
    +
    +    if columns_exist_in_table(rel_source, [expr_point]):
    +        return rel_source, False
    +
    +    if is_var_valid(rel_source, expr_point):
    --- End diff --
    
    I think this is a better solution but we still allow non numeric arrays.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97462821
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    --- End diff --
    
    This function could really use some comments or a docstring about what it's supposed to be doing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r98118503
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +38,34 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    """
    +    Validation function for the expr_point parameter
    +    expr_point accepts 2 formats:
    +        - A single column name
    +        - An array expression
    +    If the array expression is used, we use the is_var_valid function to
    +    verify. This allows the regular array notation as well as any function
    +    that returns an array.
    +    """
    +
    +    if columns_exist_in_table(rel_source, [expr_point]):
    +        return rel_source, False
    +
    +    if is_var_valid(rel_source, expr_point):
    --- End diff --
    
    An alternative to this is to use `pg_typeof` to get the type of the expression. See `get_expr_type` in `validate_args.py_in` and check for '[]' in the result. 
    This allows rejecting non-array expressions. 
    @decibel any concerns with using `pg_typeof` here? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97462912
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    if not columns_exist_in_table(rel_source, [expr_point]):
    +
    +        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
    --- End diff --
    
    A comment on what the regex is doing would be very helpful. The array part is pretty obvious (though, couldn't that just be done with an insensitive regex??), but the rest is more difficult to follow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97463007
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    if not columns_exist_in_table(rel_source, [expr_point]):
    --- End diff --
    
    I think it'd be much clearer to make this if columns_exist... and return immediately if true.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib issue #89: K-means: support for array input

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal commented on the issue:

    https://github.com/apache/incubator-madlib/pull/89
  
    Hi @decibel, thanks for your comments. I removed the regex entirely and replaced it with a built-in madlib function. I addressed your other comments as well. Please let me know what you think. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by orhankislal <gi...@git.apache.org>.
Github user orhankislal closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/89


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #89: K-means: support for array input

Posted by decibel <gi...@git.apache.org>.
Github user decibel commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/89#discussion_r97463365
  
    --- Diff: src/ports/postgres/modules/kmeans/kmeans.py_in ---
    @@ -34,6 +37,25 @@ def kmeans_validate_src(schema_madlib, rel_source, **kwargs):
     
     # ----------------------------------------------------------------------
     
    +def kmeans_validate_expr(schema_madlib, rel_source, expr_point, **kwargs):
    +    if not columns_exist_in_table(rel_source, [expr_point]):
    +
    +        p = re.compile('[Aa][Rr][Rr][Aa][Yy]\s*\[\s*[\s*\w|,\s*]+\s*\]')
    +
    +        if p.match(expr_point.strip()):
    +            view_name = unique_string('km_view')
    +
    +            plpy.execute(""" CREATE VIEW {view_name} AS
    +                SELECT {expr_point} AS expr FROM {rel_source}
    +                """.format(**locals()))
    +            return view_name,True
    +        else:
    +            plpy.error(
    +                """kmeans error: {expr_point} does not exist in {rel_source}!
    +                """.format(**locals()))
    +    return rel_source, False
    +
    --- End diff --
    
    It would be kinda nice to pre-compile the re by sticking a `kmeans_validate_expr.p = re.compile(...)` down here, as per http://stackoverflow.com/questions/279561/what-is-the-python-equivalent-of-static-variables-inside-a-function. Others might well have different ideas about that, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---