You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by jingyimei <gi...@git.apache.org> on 2018/05/31 22:53:24 UTC

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

GitHub user jingyimei opened a pull request:

    https://github.com/apache/madlib/pull/274

    Handling special characters in MLP and Encode Categorical Variables

    This PR handles special characters and unicode in column name and column values in MLP and Encode Categorical Variables modules. Also updated install check test cases to cover it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib one_hot_encoding_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #274
    
----
commit 1f97cd5a9c9d118e9024049c466e0e6cf44dcdd2
Author: Arvind Sridhar <as...@...>
Date:   2018-05-24T00:02:43Z

    Encode categorical variables: handling special characters
    
    JIRA: MADLIB-1238
    JIRA: MADLIB-1243
    
    This commit deals with special characters in column name and column
    values. Also adds install check test cases to cover these scenarios.

commit 7d70ac24fbd679c0e5d58ac09bd536e6cc887790
Author: Jingyi Mei <jm...@...>
Date:   2018-05-30T21:16:13Z

    MLP: handling special characters
    
    JIRA: MADLIB-1238
    
    This commit deals with special characters in column names and column
    values within tables passed into the multilayer perceptron (MLP). We
    also added install check cases to cover these scenarios.

----


---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/274#discussion_r193490031
  
    --- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
    @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None):
         if not array:
             return "'{{ }}'::{0}".format(array_type)
         else:
    -        array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
    -        return (array_str + "::{1}").format(','.join(map(str, array)), array_type)
    +        # For non-character array types:
    +        if long_format:
    +            array_str = "ARRAY[ {0} ]";
    +            return (array_str + "::{1}").format(
    +                ','.join(map(str, array)), array_type)
    +        else:
    +        # For character array types, we have to deal with special characters in
    +        # elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
    +        # We firstly use ",,," to join elements in python list and then call
    +        # postgres string_to_array with a delimiter ",,," to form the final
    +        # psql array, because this sequence of characters will be very
    +        # unlikely to show up in a real world use case and some special
    +        # case (such as "M,M") will be handled.
    +            array_str = "string_to_array($${0}$$, ',,,')"
    +            return (array_str + "::{1}").format(
    +                ',,,'.join(map(str, array)), array_type)
    --- End diff --
    
    Can we have a more obscure delimiter? As mentioned in a comment above, any string that ends with a `,` will break this code.
    Just a thought, but would something like `,:;^` be an acceptable delimiter?


---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/274#discussion_r193503140
  
    --- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
    @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None):
         if not array:
             return "'{{ }}'::{0}".format(array_type)
         else:
    -        array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
    -        return (array_str + "::{1}").format(','.join(map(str, array)), array_type)
    +        # For non-character array types:
    +        if long_format:
    +            array_str = "ARRAY[ {0} ]";
    +            return (array_str + "::{1}").format(
    +                ','.join(map(str, array)), array_type)
    +        else:
    +        # For character array types, we have to deal with special characters in
    +        # elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
    +        # We firstly use ",,," to join elements in python list and then call
    +        # postgres string_to_array with a delimiter ",,," to form the final
    +        # psql array, because this sequence of characters will be very
    +        # unlikely to show up in a real world use case and some special
    +        # case (such as "M,M") will be handled.
    +            array_str = "string_to_array($${0}$$, ',,,')"
    +            return (array_str + "::{1}").format(
    +                ',,,'.join(map(str, array)), array_type)
    --- End diff --
    
    This sounds like a more obscure, but there is always some corner cases we can't cover given this approach.


---

[GitHub] madlib issue #274: Handling special characters in MLP and Encode Categorical...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/274
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/495/



---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/274#discussion_r193489504
  
    --- Diff: src/ports/postgres/modules/convex/test/mlp.sql_in ---
    @@ -1096,3 +1096,58 @@ FROM
     JOIN pg_catalog.pg_namespace n
     ON n.oid=c.relnamespace
     WHERE c.relname = 'lin_housing_wi_batch_standardization' AND c.relkind='r' AND nspname=current_schema();
    +
    +-- Test special characters both in column name and column values
    +DROP TABLE IF EXISTS iris_data_special_char;
    +CREATE TABLE iris_data_special_char(
    +    id serial,
    +    "rinŠ–!#'gs" numeric[],
    +    "se''x" varchar,
    +    class integer,
    +    state varchar
    +);
    +INSERT INTO iris_data_special_char VALUES
    +(1,ARRAY[5.0,3.2,1.2,0.2],'M''M',1,'Alaska'),
    --- End diff --
    
    If you change `'M''M'` -> `'M''M,'`, the test will fail. This is because the delimiter used internally in the code is `,,,`, and any string that ends with a `,` will break it.


---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by njayaram2 <gi...@git.apache.org>.
Github user njayaram2 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/274#discussion_r193525463
  
    --- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
    @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None):
         if not array:
             return "'{{ }}'::{0}".format(array_type)
         else:
    -        array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
    -        return (array_str + "::{1}").format(','.join(map(str, array)), array_type)
    +        # For non-character array types:
    +        if long_format:
    +            array_str = "ARRAY[ {0} ]";
    +            return (array_str + "::{1}").format(
    +                ','.join(map(str, array)), array_type)
    +        else:
    +        # For character array types, we have to deal with special characters in
    +        # elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
    +        # We firstly use ",,," to join elements in python list and then call
    +        # postgres string_to_array with a delimiter ",,," to form the final
    +        # psql array, because this sequence of characters will be very
    +        # unlikely to show up in a real world use case and some special
    +        # case (such as "M,M") will be handled.
    +            array_str = "string_to_array($${0}$$, ',,,')"
    +            return (array_str + "::{1}").format(
    +                ',,,'.join(map(str, array)), array_type)
    --- End diff --
    
    True, but I feel having a single `,` at the end of a string is something that can occur easily.  In other words, having a single `,` at the end of a string is probably more likely to happen than the test cases we cover in the corresponding install check files in this PR.
    
    We don't need a super obscure delimiter, but `,,,` seems too simple.


---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei closed the pull request at:

    https://github.com/apache/madlib/pull/274


---

[GitHub] madlib issue #274: Handling special characters in MLP and Encode Categorical...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on the issue:

    https://github.com/apache/madlib/pull/274
  
    Will handle those in another PR, closing this one.


---

[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/274#discussion_r193525716
  
    --- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
    @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None):
         if not array:
             return "'{{ }}'::{0}".format(array_type)
         else:
    -        array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
    -        return (array_str + "::{1}").format(','.join(map(str, array)), array_type)
    +        # For non-character array types:
    +        if long_format:
    +            array_str = "ARRAY[ {0} ]";
    +            return (array_str + "::{1}").format(
    +                ','.join(map(str, array)), array_type)
    +        else:
    +        # For character array types, we have to deal with special characters in
    +        # elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
    +        # We firstly use ",,," to join elements in python list and then call
    +        # postgres string_to_array with a delimiter ",,," to form the final
    +        # psql array, because this sequence of characters will be very
    +        # unlikely to show up in a real world use case and some special
    +        # case (such as "M,M") will be handled.
    +            array_str = "string_to_array($${0}$$, ',,,')"
    +            return (array_str + "::{1}").format(
    +                ',,,'.join(map(str, array)), array_type)
    --- End diff --
    
    Agree


---