You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@madlib.apache.org by "Jingyi Mei (JIRA)" <ji...@apache.org> on 2018/05/30 00:46:00 UTC

[jira] [Commented] (MADLIB-1243) Encode_categorical_variables doesn't work with column name with special characters when specifying top

    [ https://issues.apache.org/jira/browse/MADLIB-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16494536#comment-16494536 ] 

Jingyi Mei commented on MADLIB-1243:
------------------------------------

Cause of this issue:

[https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/utilities/encode_categorical.py_in#L464]

 

here we use single quote to query column name and when there is special character it may interpret the string in wrong way. In the above example, the column name is "se' ' x" with actual 2 single quote inside the string. However, that query will interpret the first single quote as escape instead of a real character, and as a result, _get_top_values will return a dictionary with wrong key se\' x instead of se\'\'x. Later, when calling _build_encoding_str [https://github.com/apache/madlib/blob/master/src/ports/postgres/modules/utilities/encode_categorical.py_in#L343] , col_to_values[col] will try to get col_to_values[se'x] instead of col_to_values[se' 'x] and the key doesn't exist and throw the error.

 

Solved this by using $$ to quote column name instead of single quote. 

> Encode_categorical_variables doesn't work with column name with special characters when specifying top
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1243
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1243
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Jingyi Mei
>            Assignee: Jingyi Mei
>            Priority: Major
>             Fix For: v1.15
>
>
> Encode_categorical_variables doesn't work with column name with special characters when specifying 'top' value as input parameter. Here is the repro:
> 1. Create table with special character in column name
> {code:java}
> DROP TABLE IF EXISTS abalone_special_char;
> CREATE TABLE abalone_special_char (
>     id serial,
>     "se''x" character varying,
>     "len'%*()gth" double precision,
>     diameter double precision,
>     height double precision,
>     "ClaЖss" integer
> );
> COPY abalone_special_char ("se''x", "len'%*()gth", diameter, height, "ClaЖss") FROM stdin WITH DELIMITER '|' NULL as '@';
> F"F|0.475|0.37|0.125|2
> F'F|0.475|0.37|0.125|2
> F$F|0.475|0.37|0.125|2
> MЖM|0.475|0.37|0.125|2
> M@[}(:*;M|0.475|0.37|0.125|2
> M,M|0.475|0.37|0.125|2
> \.{code}
> 2. call encode_categorical_variables with "se''x" as categorical column name and specify 3 as top value:
> {code:java}
> select encode_categorical_variables('abalone_special_char', 'abalone_special_char_out2',
> '"se''''x"', '',
> NULL, '3'
> );{code}
> Here is the error msg:
> {code:java}
> ERROR: KeyError: '"se\'\'x"' (plpython.c:4960)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "encode_categorical_variables", line 23, in <module>
> return encode_categorical.encode_categorical_variables(**globals())
> PL/Python function "encode_categorical_variables", line 611, in encode_categorical_variables
> PL/Python function "encode_categorical_variables", line 104, in build_output_table
> PL/Python function "encode_categorical_variables", line 342, in _build_encoding_str
> PL/Python function "encode_categorical_variables"{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)