You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2008/09/05 19:42:46 UTC

[jira] Created: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

internationalization support and sort order (ascedning/descending) support in create table
------------------------------------------------------------------------------------------

                 Key: HADOOP-4085
                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/hive
            Reporter: Namit Jain


User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 

select _utf8 'string' from <TableName>
select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>


To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.

The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.

Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
Create Table syntax should be enhanced to support that.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4085:
-------------------------------

    Attachment: patch1

The change fixes the mentioned problems. Tests have been added.
Can someone from Hive please comment on the changes ?

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HADOOP-4085:
----------------------------------

    Assignee: Namit Jain

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631778#action_12631778 ] 

Ashish Thusoo commented on HADOOP-4085:
---------------------------------------

Lets cleanup the System.outs that we have (I presume that this debugging code that did not get removed when you submitted the patch).

Also can you comment on the second one - is that code ok?

Inline Comments:
ql/src/java/org/apache/hadoop/hive/ql/exec/CompositeHiveObject.java:92: We should avoid System.outs - use logging instead.
ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java:143: Don't we have to look at all the sort columns here?
ql/src/java/org/apache/hadoop/hive/ql/exec/ExtractOperator.java:40: No System.out !!
ql/src/java/org/apache/hadoop/hive/ql/exec/LabeledCompositeHiveObject.java:40: No System.out !!

Thanks,
Ashish

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1, patch2
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628843#action_12628843 ] 

Ashish Thusoo commented on HADOOP-4085:
---------------------------------------

Comments are below. The most major one is about how we are treating character set name in the grammar. Ideally we would want this to an identifier instead of token (similar to table name identifiers). With that approach we would be able to support any kinds of character sets very easily.

Inline Comments:
cli/src/java/org/apache/hadoop/hive/cli/CliDriver.java:85: nitpick - Can we follow the convention of having the opening brace on the same line as the code.
ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g:781: Instead of having fixed tokens per character set in the grammar, we should define a character-set identifier and pass that across to the java calls. That is much more scalable and would get us to seamlessly be able to support any character sets supported by the java run time.

 http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html 

has information on what can be grammar rules to determine the character set name and how new charactersets can be added to the JVM by CharactersetProvider. 
So the rule for the character set could look something like

 charSetStringLiteral : charSetIdentifier StringLiteral charSetIdentifier can be defined in terms of the rules mentioned in the link above.

ql/src/test/queries/clientpositive/inputddl4.q:0: Lets put a brief comment in this describing what this actually tests.
ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java:157: nitpick - maybe we should call this PREFIX and not SAME
ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java:143: Should this not check across all sort columns instead of bucket columns? Is this a bug?
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:384: This function hardcodes the terminating character and the field delimiters while in the current code these are parameterized which is better as later we want to drive them through session level properties.

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4085:
-------------------------------

    Attachment: patch2

comments incorporated

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1, patch2
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4085:
-------------------------------

    Attachment: patch3.txt

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1, patch2, patch3.txt
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628928#action_12628928 ] 

Namit Jain commented on HADOOP-4085:
------------------------------------

I will take care of the above - but I was not sure of the charset names. Shouldn't we initially support only the 7-8 charsets 
specified as must in CharSet and not let the user add any charset. Because, we will be using that encoding to convert
into a valid String later on. I agree that instead of hardcoding UTF-8, it should be something like 

charSetName:  can be a list of UTF-8, UNICODE etc..


but we should not allow any identifier out there, because we have no way of knowing whether it is a valid encoding or not, and
we will be forced to throw errors at the Semantic Analyzer . Are you proposing that - instead of catching bad character set names 
at the parser, throw an error in the Semnatic Analyzer if is not valid. 

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628944#action_12628944 ] 

Ashish Thusoo commented on HADOOP-4085:
---------------------------------------

Yes. we should be checking for valid character sets at SemanitcAnalysis and not at parse. By encoding charSetName as a list of char sets we are gauranteeing that in future we will have to change the parse code as we add more characterset support. That does not scale. In my opinion, parse should just encode rules on what can be construed as a valid characterset name (valid by construction). Checking on the actual list of names that we support should be done at semantic analysis time.


> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4085) internationalization support and sort order (ascedning/descending) support in create table

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631969#action_12631969 ] 

Namit Jain commented on HADOOP-4085:
------------------------------------

some old code for debugging was left - removed that.

the ddltask change is intentional. we are only comparing the bucketcols size, since we are looking for a superset (sortCols >= bucketCols) 
adding a new patch

> internationalization support and sort order (ascedning/descending) support in create table
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4085
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: patch1, patch2
>
>
> User cannot specify utf8 strings in the query, both for selection and filtering. Mysql syntax should be followed: 
> select _utf8 'string' from <TableName>
> select <selectExpr> from <TableName> where col = _utf8 0x<HexValue>
> To start with, utf8 strings should be supported. Support for other character sets can be added in the future on demand.
> The identifiers (table name/column name etc.) cannot be utf8 strings, it is only for the data values.
> Although, in create table, the user has the option of specifying sorted columns, he does not have the option of specifying whether they are ascending or descending.
> Create Table syntax should be enhanced to support that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.