Posted to dev@hive.apache.org by Cristian Giha <Cr...@equifax.com> on 2014/08/01 22:42:24 UTC

Hive Sort By clause on Table creation

Hi,

I am trying to test some of the optimizations that partitioning and clustering tables can provide, but I have a question about how the SORTED BY clause works on a table.
The case is the following.
I create a simple bucketed table as:

CREATE TABLE USERS(ID INT,NAME STRING, OTHER INT)
CLUSTERED BY (ID) SORTED BY(ID) into 4 buckets;

I am setting these configuration parameters:

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
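
As far as I understand, these two settings only take effect when rows are written through an INSERT, so the load would look roughly like the sketch below. USERS_STAGING is a hypothetical, unbucketed staging table with the same columns; it is only an assumption for illustration.

-- USERS_STAGING is a hypothetical source table (ID INT, NAME STRING, OTHER INT)
INSERT OVERWRITE TABLE USERS
SELECT ID, NAME, OTHER
FROM USERS_STAGING;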

Suppose that I have 1 million rows of sample data for this table, and that the data is automatically stored ordered by the ID column.
Now I am trying to check whether queries are optimized on the sorted data. I inserted the IDs in non-sorted order, with some IDs duplicated. Since I told Hive that the data is sorted by ID, I expected a search by ID to stop as soon as it finds the first match and there are no more consecutive rows with the same ID, and to return only that first row. In practice this does not happen, and the query returns all the rows with the same ID.
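
To make this concrete, the kind of lookup I mean is sketched below. The value 42 is only a placeholder for an ID that appears more than once in the data.

-- 42 is a placeholder for a duplicated ID
SELECT * FROM USERS WHERE ID = 42;

I expected one row back; I get every row whose ID is 42.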

Is my understanding of how the SORTED BY clause helps queries wrong, or is something else happening?

Sorry for my bad English.
I hope you can help.
Regards