You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@madlib.apache.org by jingyimei <gi...@git.apache.org> on 2018/02/14 00:24:42 UTC

[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

GitHub user jingyimei opened a pull request:

    https://github.com/apache/madlib/pull/234

    Create lower case column name in encode_categorical_variables()

    JIRA:MADLIB-1202
    The previous madlib.encode_categorical_variables() function generates
    column name with some capital characters, including:
    1. when you specify top_values, there will be a column name with suffix __MISC__
    2. when you set encode_nulls as True, there will be a column name with suffix
    __NULL
    3. when the original column is boolean type, there will be column names
    with suffix _True and _False
    
    In the above cases, users have to use double quoting to query, which is
    not conveninet.
    
    This commit adresses this, and all of the three scenarios will generate
    coloumn name with lower cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jingyimei/madlib encode_categorial_column_name_change

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #234
    
----
commit 4d78f9425ffb76089d30bbe85cb3e07f1050268a
Author: Jingyi Mei <jm...@...>
Date:   2018-02-14T00:11:39Z

    Create lower case column name in encode_categorical_variables()
    
    JIRA:MADLIB-1202
    The previous madlib.encode_categorical_variables() function generates
    column name with some capital characters, including:
    1. when you specify top_values, there will be a column name with suffix __MISC__
    2. when you set encode_nulls as True, there will be a column name with suffix
    __NULL
    3. when the original column is boolean type, there will be column names
    with suffix _True and _False
    
    In the above cases, users have to use double quoting to query, which is
    not conveninet.
    
    This commit adresses this, and all of the three scenarios will generate
    coloumn name with lower cases.

----


---

[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/234#discussion_r168525141
  
    --- Diff: src/ports/postgres/modules/utilities/encode_categorical.py_in ---
    @@ -317,7 +317,19 @@ class CategoricalEncoder(object):
     
                 if self.output_type not in ('array', 'svec'):
                     if not self._output_dictionary:
    -                    value_names = {None: 'NULL',
    +                    ## MADLIB-1202
    +                    ## In postgres, boolean variables are always saved
    +                    ## as 'True', 'False' with the first letter as capital,
    +                    ## which will cause the generated column name as
    +                    ## <boolean column name>_True/False that needs double
    +                    ## quoting to query. To make it more convnient to user,
    +                    ## we cast them to lower case true/false so that the
    +                    ## generated column name is <boolean column name>_true/false
    +                    ## The same logic applied to _null and _misc strs
    +                    if v in ('True', 'False'):
    --- End diff --
    
    Wouldn't this be better off as `if isinstance(v, bool)`? 


---

[GitHub] madlib pull request #234: Create lower case column name in encode_categorica...

Posted by jingyimei <gi...@git.apache.org>.
Github user jingyimei commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/234#discussion_r168891033
  
    --- Diff: src/ports/postgres/modules/utilities/encode_categorical.py_in ---
    @@ -317,7 +317,19 @@ class CategoricalEncoder(object):
     
                 if self.output_type not in ('array', 'svec'):
                     if not self._output_dictionary:
    -                    value_names = {None: 'NULL',
    +                    ## MADLIB-1202
    +                    ## In postgres, boolean variables are always saved
    +                    ## as 'True', 'False' with the first letter as capital,
    +                    ## which will cause the generated column name as
    +                    ## <boolean column name>_True/False that needs double
    +                    ## quoting to query. To make it more convnient to user,
    +                    ## we cast them to lower case true/false so that the
    +                    ## generated column name is <boolean column name>_true/false
    +                    ## The same logic applied to _null and _misc strs
    +                    if v in ('True', 'False'):
    --- End diff --
    
    We can't do isinstance(v, bool) here since v is either None or a string.  In method build_output_table(), distinct_values(which is a list of string and None) got passed to self._build_encoding_str() method, and self._build_encoding_str() will take the string value. 
    
    We decided to take another approach here. In self._get_distinct_values() and self._get_top_values(),we cast the column value to text in query and postgres will cast boolean column value 'True/False' to lowercase 'true/false' string while keeping the same format for text column. In this case, distinct_value will get lowercase true/false for boolean column and keep original format of other text column.
    
    Another thing to mention here is, since the output dictionary table also relies on whatever value from list distinct_values, if a user specifies creating dictionary tables, he will get lowercase 'true/false' for the boolean column in output dictionary table after this commit. 


---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    I've pushed a commit (912a4d629) to `madlib/madlib` repo, branch `feature/encode_categorial_column_name_change` to address above issues. That commit can be cherry-picked here to continue with this PR. 


---

[GitHub] madlib issue #234: Encode categorical variables: create lower case column na...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/359/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/351/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    Similarly
    ```
    DROP TABLE IF EXISTS abalone_out, abalone_out_dictionary;
    SELECT madlib.encode_categorical_variables (
            'abalone',                   -- Source table
            'abalone_out',               -- Output table
            'height::TEXT'               -- Categorical columns
            );
    SELECT * FROM abalone_out ORDER BY id;
    ```
    produces
    ```
    -[ RECORD 1 ]------+------
    id                 | 1
    sex                | M
    length             | 0.455
    diameter           | 0.365
    height             | 0.095
    rings              | 15
    height::TEXT_0.08  | 0
    height::TEXT_0.085 | 0
    height::TEXT_0.09  | 0
    height::TEXT_0.095 | 1
    height::TEXT_0.1   | 0
    height::TEXT_0.11  | 0
    height::TEXT_0.125 | 0
    height::TEXT_0.13  | 0
    height::TEXT_0.135 | 0
    height::TEXT_0.14  | 0
    height::TEXT_0.145 | 0
    height::TEXT_0.15  | 0
    -[ RECORD 2 ]------+------
    id                 | 2
    sex                | M
    length             | 0.35
    diameter           | 0.265
    height             | 0.09
    rings              | 7
    height::TEXT_0.08  | 0
    height::TEXT_0.085 | 0
    height::TEXT_0.09  | 1
    height::TEXT_0.095 | 0
    height::TEXT_0.1   | 0
    height::TEXT_0.11  | 0
    height::TEXT_0.125 | 0
    height::TEXT_0.13  | 0
    height::TEXT_0.135 | 0
    height::TEXT_0.14  | 0
    height::TEXT_0.145 | 0
    height::TEXT_0.15  | 0
    etc
    ```
    which does not seem to be PostgreSQL standard



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/343/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/349/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    LGTM


---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/354/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    ```
    DROP TABLE IF EXISTS abalone_out, abalone_out_dictionary;
    SELECT madlib.encode_categorical_variables (
            'abalone',                   -- Source table
            'abalone_out',               -- Output table
            'height>.10'               -- Categorical columns
            );
    ```
    is failing now:
    ```
    Done.
    (psycopg2.ProgrammingError) plpy.SPIError: operator does not exist: double precision > text
    LINE 1: SELECT array_agg(DISTINCT height>.10::TEXT) AS "height>.10" ...
                                            ^
    HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
    QUERY:  SELECT array_agg(DISTINCT height>.10::TEXT) AS "height>.10"  FROM abalone
    CONTEXT:  Traceback (most recent call last):
      PL/Python function "encode_categorical_variables", line 23, in <module>
        return encode_categorical.encode_categorical_variables(**globals())
      PL/Python function "encode_categorical_variables", line 608, in encode_categorical_variables
      PL/Python function "encode_categorical_variables", line 103, in build_output_table
      PL/Python function "encode_categorical_variables", line 542, in _get_distinct_values
    PL/Python function "encode_categorical_variables"
     [SQL: "SELECT madlib.encode_categorical_variables (\n        'abalone',                   -- Source table\n        'abalone_out',               -- Output table\n        'height>.10'               -- Categorical columns\n        );"]
    ```
    whereas this used to work before in 1.13, i.e., it used to produce 2 columns called
    ```
    height>.10_False 
    ```
    and
    ```
    height>.10_True
    ```
    In this PR I expected to see
    ```
    height>.10_false 
    ```
    and
    ```
    height>.10_true
    ```



---

[GitHub] madlib issue #234: Encode categorical variables: create lower case column na...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/360/



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    ```
    DROP TABLE IF EXISTS abalone_out, abalone_out_dictionary;
    SELECT madlib.encode_categorical_variables (
            'abalone',                   -- Source table
            'abalone_out',               -- Output table
            'height>.10'               -- Categorical columns
            );
    SELECT * FROM abalone_out ORDER BY id;
    ```
    produces
    ```
    id | sex | length | diameter | height | rings | height>.10_false | height>.10_true 
    ----+-----+--------+----------+--------+-------+------------------+-----------------
      1 | M   |  0.455 |    0.365 |  0.095 |    15 |                1 |               0
      2 | M   |   0.35 |    0.265 |   0.09 |     7 |                1 |               0
      3 | F   |   0.53 |     0.42 |  0.135 |     9 |                0 |               1
      4 | M   |   0.44 |    0.365 |  0.125 |    10 |                0 |               1
      5 | I   |   0.33 |    0.255 |   0.08 |     7 |                1 |               0
      6 | I   |  0.425 |      0.3 |  0.095 |     8 |                1 |               0
      7 | F   |   0.53 |    0.415 |   0.15 |    20 |                0 |               1
      8 | F   |  0.545 |    0.425 |  0.125 |    16 |                0 |               1
      9 | M   |  0.475 |     0.37 |  0.125 |     9 |                0 |               1
     10 |     |   0.55 |     0.44 |   0.15 |    19 |                0 |               1
     11 | F   |  0.525 |     0.38 |   0.14 |    14 |                0 |               1
     12 | M   |   0.43 |     0.35 |   0.11 |    10 |                0 |               1
     13 | M   |   0.49 |     0.38 |  0.135 |    11 |                0 |               1
     14 | F   |  0.535 |    0.405 |  0.145 |    10 |                0 |               1
     15 | F   |   0.47 |    0.355 |    0.1 |    10 |                1 |               0
     16 | M   |    0.5 |      0.4 |   0.13 |    12 |                0 |               1
     17 | I   |  0.355 |     0.28 |  0.085 |     7 |                1 |               0
     18 | F   |   0.44 |     0.34 |    0.1 |    10 |                1 |               0
     19 | M   |  0.365 |    0.295 |   0.08 |     7 |                1 |               0
     20 |     |   0.45 |     0.32 |    0.1 |     9 |                1 |               0
    (20 rows)
    ```
    which looks fine.
    



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by fmcquillan99 <gi...@git.apache.org>.
Github user fmcquillan99 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    was just testing 1.13 on postgres 9.6 and found this error
    
    ```
    DROP TABLE IF EXISTS abalone_out, abalone_out_dictionary;
    SELECT madlib.encode_categorical_variables (
            'abalone',                   -- Source table
            'abalone_out',               -- Output table
            'height>.10'               -- Categorical columns
            );
    ```
    produces
    ```
    Done.
    (psycopg2.ProgrammingError) spiexceptions.SyntaxError: syntax error at or near "="
    LINE 5:                     (CASE WHEN (height>.10 = 'False') THEN 1...
                                                       ^
    QUERY:  
                CREATE TABLE abalone_out AS (
                    SELECT
                        id, sex, length, diameter, height, rings,
                        (CASE WHEN (height>.10 = 'False') THEN 1 ELSE 0 END)::INTEGER AS "height>.10_False",(CASE WHEN (height>.10 = 'True') THEN 1 ELSE 0 END)::INTEGER AS "height>.10_True"
                    FROM
                        abalone
                    )
                
                
    CONTEXT:  Traceback (most recent call last):
      PL/Python function "encode_categorical_variables", line 23, in <module>
        return encode_categorical.encode_categorical_variables(**globals())
      PL/Python function "encode_categorical_variables", line 598, in encode_categorical_variables
      PL/Python function "encode_categorical_variables", line 147, in build_output_table
    PL/Python function "encode_categorical_variables"
     [SQL: "SELECT madlib.encode_categorical_variables (\n        'abalone',                   -- Source table\n        'abalone_out',               -- Output table\n        'height>.10'               -- Categorical columns\n        );"]
    ```



---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by iyerr3 <gi...@git.apache.org>.
Github user iyerr3 commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    Please note in the above example `height>.10_false` and `height>.10_true` will have to be double quoted when referred to. The lower case `false` and `true` does not eliminate the quotes in this example. 


---

[GitHub] madlib pull request #234: Encode categorical variables: create lower case co...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/madlib/pull/234


---

[GitHub] madlib issue #234: Create lower case column name in encode_categorical_varia...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit commented on the issue:

    https://github.com/apache/madlib/pull/234
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/madlib-pr-build/350/



---