You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2013/10/12 00:54:42 UTC

[jira] [Commented] (HIVE-5304) Hive results can depend on metastore's underlying datastore

    [ https://issues.apache.org/jira/browse/HIVE-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793112#comment-13793112 ] 

Sergey Shelukhin commented on HIVE-5304:
----------------------------------------

[~ashutoshc] I updated the JIRA description with detailed investigation results... just fyi

> Hive results can depend on metastore's underlying datastore
> -----------------------------------------------------------
>
>                 Key: HIVE-5304
>                 URL: https://issues.apache.org/jira/browse/HIVE-5304
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>            Reporter: Sergey Shelukhin
>
> [removed old description]
> Hive JDOQL filter pushdown and direct SQL may end up pushing StringCol op 'SomeString' to underlying SQL datastore. However, the datastore may handle these differently based on the encoding and collation used for the columns of the database.
> So, query results can change depending on the underlying store for the metastore and its version.
> drop_partitions_filter.q test illustrates this problem. In byte order collation (proper way) USA is sorted before Uganda, but some collations may do it the other way, causing the test to fail.
> I am assuming that byte-order sort if the correct way to order things.
> Our MySQL script specifies _bin collation, which is byte-order; Postgres 9.1 and after, as far as I see, defaults to "C" collation, which is also byte-order.
> Derby seems to use byte-order by default, I didn't spend a lot of time on Derby.
> However, Postgres before 9.1 seems to default to "en_US.UTF8" and there's no way to change column collation in our script if database is already created.
> MySQL by default doesn't use _bin collation (on my machine), so if database is auto-created, the order of things is going to change. 
> I didn't investigate MSSQL or Oracle.
> For now it seems that:
> 1) Auto-create shouldn't be used.
> 2) If old version of postgres (<9.1) is used, the collation should be set properly by whoever issues "create database" (that is not our script).
> 3) We might want to add 'collate "C"' to varchar columns in the postgres script to ensure the correct collation; however, this will break the script for postgres <9.1.
> 4) MSSQL and Oracle might warrant investigation.



--
This message was sent by Atlassian JIRA
(v6.1#6144)