Posted to dev@hive.apache.org by xubo245 <60...@qq.com> on 2019/04/19 07:02:45 UTC

Why can't Hive load a normal string as binary from CSV?

Why can't Hive load a normal string as binary from CSV? https://issues.apache.org/jira/browse/HIVE-21626
Hive-1.2.2
hive> CREATE TABLE IF NOT EXISTS hivetable (
    >     id int,
    >     label boolean,
    >     name string,
    >     image binary,
    >     autoLabel boolean)
    >  row format delimited fields terminated by '|';
OK
Time taken: 0.068 seconds
hive> LOAD DATA LOCAL INPATH '/Users/xubo/Desktop/xubo/git/carbondata3/integration/spark-common-test/src/test/resources/binarystringdata2.csv' INTO TABLE hivetable;
Loading data to table default.hivetable
Table default.hivetable stats: [numFiles=1, totalSize=82]
OK
Time taken: 0.122 seconds
hive> select * from hivetable;
OK
2	false	2.png	i�	true
3	false	3.png	n*%�                             	false
1	true	1.png	^Ayard duty^B	true


The data in binarystringdata2.csv is:

    2|false|2.png|abc|true
    3|false|3.png|biology|false
    1|true|1.png|^Ayard duty^B|true


binarystringdata2.csv does not use \u0001 as a delimiter, like the over1k data file of the Hive project.

For "abc" in the CSV, reading it back from Hive after loading should return abc, so why is it "i�"? The bytes of abc are byte[] 97 98 99; org.apache.hadoop.hive.serde2.lazy.LazyBinary#decodeIfNeeded treats them as Base64 and decodes them, returning byte[] 105 -73:
  public static byte[] decodeIfNeeded(byte[] recv) {
    boolean arrayByteBase64 = Base64.isArrayByteBase64(recv);
    if (LOG.isDebugEnabled() && arrayByteBase64) {
      LOG.debug("Data only contains Base64 alphabets only so try to decode the data.");
    }
    return arrayByteBase64 ? Base64.decodeBase64(recv) : recv;
  }
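To make the effect concrete, here is a minimal sketch of the same "decode if it looks like Base64" logic, written against java.util.Base64 only. It is not Hive's code: looksLikeBase64 is a hypothetical, simplified stand-in for commons-codec's Base64.isArrayByteBase64 (assumed here to accept the Base64 alphabet, '=' padding and whitespace), and the MIME decoder is my own choice because it tolerates whitespace and missing padding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class DecodeIfNeededSketch {

    // Hypothetical, simplified stand-in for commons-codec's
    // Base64.isArrayByteBase64: true if every byte is a Base64
    // alphabet character, '=' padding, or whitespace.
    static boolean looksLikeBase64(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            boolean alphabet = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '+' || c == '/' || c == '=';
            boolean whitespace = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (!alphabet && !whitespace) {
                return false;
            }
        }
        return true;
    }

    // Same shape as the method quoted above: decode only when the
    // input passes the alphabet check, otherwise return it untouched.
    static byte[] decodeIfNeeded(byte[] recv) {
        return looksLikeBase64(recv) ? Base64.getMimeDecoder().decode(recv) : recv;
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String value : new String[] {"abc", "biology", "^Ayard duty^B"}) {
            byte[] in = value.getBytes(StandardCharsets.US_ASCII);
            System.out.println(value + " -> " + hex(decodeIfNeeded(in)));
        }
    }
}
```

Under these assumptions "abc" comes out as 69 b7 and "biology" as 6e 2a 25 a2 0c, while "^Ayard duty^B" is returned unchanged because '^' is not in the Base64 alphabet. That matches the three rows of the select output.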


When we query with SQL in Spark, it returns byte[] 69 B7 (hex); the Hive client/beeline returns the string "i�" (char array 105 65533).
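The two results are the same bytes seen in different ways. A small sketch, assuming the client renders the binary column as UTF-8 text (that rendering step is my assumption, not something confirmed from the Hive source): 0x69 is 'i', and a lone 0xB7 is not valid UTF-8, so it becomes the replacement character U+FFFD (65533).

```java
import java.nio.charset.StandardCharsets;

class RenderBytesSketch {

    // Decodes the bytes as UTF-8 and returns the resulting code points;
    // malformed input is replaced with U+FFFD by the String constructor.
    static int[] codePointsOf(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).codePoints().toArray();
    }

    public static void main(String[] args) {
        byte[] decoded = {0x69, (byte) 0xB7}; // the bytes Spark reports
        for (int cp : codePointsOf(decoded)) {
            System.out.println(cp); // prints 105, then 65533
        }
    }
}
```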

Why do the input and output data differ for Hive LOAD DATA? INSERT INTO works fine.

Is this a bug or a limitation? Does a binary column loaded from CSV only support Base64-encoded data, or strings for which the isBase64 check returns false?
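If it is a limitation, one possible workaround would be to Base64-encode the binary field before writing the CSV, so that the decode step restores the original bytes. A minimal sketch of that idea follows; toCsvField and fromCsvField are hypothetical helper names, and I have not verified that Hive round-trips such a file, it is only inferred from the decodeIfNeeded behavior above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class EncodeBeforeLoadSketch {

    // Hypothetical helper: what would be written into the CSV field.
    static String toCsvField(byte[] payload) {
        return Base64.getEncoder().encodeToString(payload);
    }

    // Hypothetical helper: what a Base64 decode on load would restore.
    static byte[] fromCsvField(String field) {
        return Base64.getDecoder().decode(field);
    }

    public static void main(String[] args) {
        byte[] original = "abc".getBytes(StandardCharsets.US_ASCII);
        String field = toCsvField(original);
        System.out.println(field); // prints YWJj
        System.out.println(new String(fromCsvField(field), StandardCharsets.US_ASCII)); // prints abc
    }
}
```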