Posted to user@hive.apache.org by Puneet Khatod <pu...@tavant.com> on 2013/08/26 15:23:25 UTC

How to validate data type in Hive

Hi,

I need to validate the data types of the values in my flat file (the source for my Hive table), but I cannot find any Hive feature or function that does this.
Is there any way to validate the data types of the values in the underlying file? Something like BCP (bulk copy program), as used with SQL Server.

Please reply; my whole project is stuck due to this issue.

Thanks,
Puneet

From: Yin Huai [mailto:huaiyin.thu@gmail.com]
Sent: Monday, August 26, 2013 5:10 PM
To: user@hive.apache.org
Cc: dev; Eric Chu
Subject: Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases

Forgot to add this in my last reply: to generate correct results, you can set hive.optimize.reducededuplication to false, which turns off ReduceSinkDeDuplication.

On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <hu...@gmail.com> wrote:
Created a jira https://issues.apache.org/jira/browse/HIVE-5149

On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <hu...@gmail.com> wrote:
Seems ReduceSinkDeDuplication picked the wrong partitioning columns.

On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <sk...@rocketfuel.com> wrote:
I think the problem lies within the GROUP BY operation. For this optimization to work, the GROUP BY's partitioning should be on column 1 only.

That won't affect the correctness of the GROUP BY. It can make the GROUP BY itself slower, but in this case it will speed up the overall query.

On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <mc...@rocketfuelinc.com> wrote:
I have attached the Hive 0.10 and 0.11 query plans for the sample query below, for illustration.

On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <mc...@rocketfuelinc.com> wrote:
Hi,

We are using DISTRIBUTE BY with custom reducer scripts in our query workload.

After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT BY and custom reducer scripts produce incorrect results. In particular, rows with the same value in the DISTRIBUTE BY column end up in multiple reducers and thus produce multiple rows in the final result, where we expect only one.
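
The effect described above can be reproduced outside Hive with a small simulation. This is only an illustrative sketch, not Hive's actual partitioner: the CRC32 hash, the reducer count, the sample rows, and the sum used as the per-group reduction are all assumptions made for the example. It shows that partitioning on grp alone yields one output row per grp, while partitioning on (grp, val2) can spread one grp across several reducers.

```python
import zlib
from collections import defaultdict


def partition(rows, key_columns, num_reducers):
    """Assign each row to a reducer by hashing the chosen key columns."""
    reducers = defaultdict(list)
    for row in rows:
        key = "\x01".join(str(row[c]) for c in key_columns)
        # CRC32 is a stand-in hash (assumption); Hive uses its own hashing.
        reducers[zlib.crc32(key.encode("utf-8")) % num_reducers].append(row)
    return reducers


def reduce_per_group(reducers):
    """Emit one (grp, sum of val2) row per distinct grp seen by each reducer."""
    output = []
    for rows in reducers.values():
        totals = defaultdict(int)
        for row in rows:
            totals[row["grp"]] += row["val2"]
        output.extend(totals.items())
    return output


# Hypothetical sample data: three groups with five values each.
rows = [{"grp": g, "val2": v} for g in ("a", "b", "c") for v in range(1, 6)]

# Intended behaviour: DISTRIBUTE BY grp.
by_grp = reduce_per_group(partition(rows, ["grp"], num_reducers=4))

# Faulty behaviour: partitioning on the GROUP BY columns (grp, val2).
by_grp_val2 = reduce_per_group(partition(rows, ["grp", "val2"], num_reducers=4))

print(len(by_grp))       # 3, one row per grp
print(len(by_grp_val2))  # at least 3; larger whenever a grp is split
```

With partitioning on grp, every row of a group reaches the same reducer, so the row count always equals the number of distinct groups. With the wider key, the count depends on how the hash happens to spread the rows.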

I investigated a little bit and discovered the following behavior for Hive 0.11:

- For the queries with incorrect results, Hive 0.11 produces a different plan: the extra stage for the DISTRIBUTE BY + Transform is missing, and the Transform operator for the custom reducer script is pushed into the reduce operator tree that contains the GROUP BY itself.

- However, if the SORT BY in the query has a DESC order in it, the right plan is produced, and the results look correct too.

Hive 0.10 produces the expected plan with the right results in all cases.


To illustrate, here is a simplified repro setup:

Table:

CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3 STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

Query:

ADD FILE reducer.py;

FROM(
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2  -- add DESC here to get correct results
) a

REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue


If I understand correctly, this is a bug. Is this a known issue? Any other insights? We have reverted to Hive 0.10 to avoid the incorrect results while we investigate this.

I have the repro sample, with test data and scripts, if anybody is interested.



Thanks,
pala





