Posted to user@hive.apache.org by Eva Tse <et...@netflix.com> on 2009/07/10 02:38:30 UTC

IS [NOT] NULL operator issues

When we load the output generated by the reducer into Hive, we run into some
issues with the 'is NULL' and 'is not NULL' operators. Both return zero when
we issue queries like select count(1) from tablename where column_name is
NULL or select count(1) from tablename where column_name is NOT NULL, which
shouldn't be possible. column_name is string type.

We tried using both SequenceFileOutput and TextFileOutput formats and we get
similar results. We are currently using Hadoop 0.20 unpatched and Hive trunk
(r786648) w/ HIVE-487 patch.

We suspect it may be caused by the file format we are loading, but this used
to work with TextOutputFormat on Hive 0.3. We have attached the test_log
table definition as well as how we generate the output files to be loaded
into this table.
 
Please let us know if anyone sees anything wrong or hits the same issue w/ a
workaround, etc. 

Thanks in advance,
Eva.

Describe extended test_log;
esn     string  
server_utc_ms   bigint
devtype_id      int
nccphn  string  
server_msg      string
other_properties        map<string,string>
dateint int
hour    int
                 
Detailed Table Information
Table(tableName:test_log, dbName:default, owner:dataeng, createTime:1247184142,
lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:esn,type:string,comment:null),
FieldSchema(name:server_utc_ms,type:bigint,comment:null),
FieldSchema(name:devtype_id,type:int,comment:null),
FieldSchema(name:nccphn,type:string,comment:null),
FieldSchema(name:server_msg,type:string,comment:null),
FieldSchema(name:other_properties,type:map<string,string>,comment:null)],
location:hdfs://ip-xxxxx.ec2.internal:9000/user/hive/warehouse/test_log,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1,
serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{colelction.delim=, mapkey.delim=, serialization.format=1,
line.delim=, field.delim=}),
bucketCols:[], sortCols:[], parameters:{}),
partitionKeys:[FieldSchema(name:dateint,type:int,comment:null),
FieldSchema(name:hour,type:int,comment:null)], parameters:{})


Output Format from the reducer:
SequenceFileOutputFormat
Key = NullWritable
Value = Text with: DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS
TERMINATED BY '\004' MAP KEYS TERMINATED BY '\002'

OR 

TextFileOutputFormat
Key = NullWritable
Value = Text with: DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS
TERMINATED BY '\004' MAP KEYS TERMINATED BY '\002' LINES TERMINATED BY '\n'
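For reference, here is a minimal Python sketch of what one value line should
look like with these delimiters. The column values and the helper function are
invented for illustration, not our actual reducer code; the one Hive-specific
detail is that LazySimpleSerDe by default expects the literal two-character
text '\N' (serialization.null.format) for a SQL NULL, so writing an empty
field or the word "null" instead is a common way to confuse IS [NOT] NULL.

```python
# Sketch of one reducer value line using the delimiters from the DDL above.
# Column values are invented; only the encoding scheme matches the table.
FIELD = "\x01"     # FIELDS TERMINATED BY '\001'
MAPKEY = "\x02"    # MAP KEYS TERMINATED BY '\002'
COLL = "\x04"      # COLLECTION ITEMS TERMINATED BY '\004'
HIVE_NULL = "\\N"  # LazySimpleSerDe's default text for SQL NULL

def to_field(value):
    """Render one column; None must become the '\\N' sentinel, not ''."""
    if value is None:
        return HIVE_NULL
    if isinstance(value, dict):  # the map<string,string> column
        return COLL.join(k + MAPKEY + v for k, v in value.items())
    return str(value)

row = ["ESN123", 1247184142000, 7, None, "msg", {"k1": "v1", "k2": "v2"}]
line = FIELD.join(to_field(v) for v in row)

# Make the control characters visible for printing.
print(line.replace(FIELD, "^A").replace(MAPKEY, "^B").replace(COLL, "^D"))
# → ESN123^A1247184142000^A7^A\N^Amsg^Ak1^Bv1^Dk2^Bv2
```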




Re: IS [NOT] NULL operator issues

Posted by Zheng Shao <zs...@gmail.com>.
Hi Eva,

Can you give us two lines of the data so that we can debug?
Also, what does "select count(1) from tablename" return?
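One low-level way to grab those lines is to dump the raw bytes with od -c,
which makes the '\001' delimiters and any NULL sentinel visible. A sketch
against a local sample file (the file contents are made up; on the cluster
the same pipe would hang off `hadoop fs -cat <warehouse file> | head -2`):

```shell
# Write one fake ^A-delimited line locally, then inspect its bytes.
# Cluster equivalent:
#   hadoop fs -cat /user/hive/warehouse/test_log/<file> | head -2 | od -c
printf 'esn1\001\\N\001msg\n' > sample_line.txt
od -c sample_line.txt
```

In the od output a real NULL should show up as the two characters \ N between
001 delimiters, not as an empty field.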


Zheng

On Thu, Jul 9, 2009 at 5:38 PM, Eva Tse<et...@netflix.com> wrote:


-- 
Yours,
Zheng