You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Nick Dimiduk (JIRA)" <ji...@apache.org> on 2014/02/14 00:06:20 UTC
[jira] [Created] (HIVE-6428) concat_ws mangles non-ASCII characters
Nick Dimiduk created HIVE-6428:
----------------------------------
Summary: concat_ws mangles non-ASCII characters
Key: HIVE-6428
URL: https://issues.apache.org/jira/browse/HIVE-6428
Project: Hive
Issue Type: Bug
Components: UDF
Affects Versions: 0.13.0
Reporter: Nick Dimiduk
Priority: Critical
Attachments: ru.txt
Marked critical because this results in data loss from using built-in functionality. I think the issue is {{concat_ws}}, though I suppose it could be the VIEW as well.
Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file.
Grab sample data, confirm the number of total records and the number of unique combinations of the first two columns match.
{noformat}
$ mkdir /tmp/pagecounts
$ cd /tmp/pagecounts
$ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
$ gzcat pagecounts-20081001-000000.gz | cut -d\ -f1,2 | wc -l
4268675
$ gzcat pagecounts-20081001-000000.gz | cut -d\ -f1,2 | sort | uniq | wc -l
4268675
{noformat}
Create hive table over input data.
{noformat}
CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tmp/pagecounts/';
{noformat}
confirm number of unique combinations of the first two columns
{noformat}
SELECT count(DISTINCT projectcode, pagename) FROM pagecounts;
=> 4268675
{noformat}
Create a view over the raw data, concatenating first two columns. Distinct count does not match.
{noformat}
CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS
SELECT concat_ws('/', projectcode, pagename), pageviews, bytes
FROM pagecounts;
SELECT count(DISTINCT rowkey) FROM pgc_simple;
=> 4268561
{noformat}
Perform same "view" from shell. distinct count is retained.
{noformat}
$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l
4268675
$ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | wc -l
4268675
{noformat}
Look at some data.
{noformat}
$ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple GROUP BY rowkey) i WHERE i.nbr > 1;" | tail
OK
Time taken: 40.493 seconds, Fetched: 37 row(s)
ru/?�N 2
ru/� 3
zh/?�� 2
zh/Category:�� 2
zh/Wikipedia:�� 5
zh/� 2
zh/�� 11
zh/��˹�� 2
zh/���� 17
zh/������ 15
{noformat}
Choose 2nd line of output to inspect on the shell. My locale isn't able to find a character for the codepoints, but sort | uniq identify them as different.
{noformat}
$ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\ -f1,2 | sort | uniq
ru ??
ru ??
ru ??
{noformat}
Print them as C-escape codes. They are indeed distinct.
{noformat}
$ cat pagecounts-20081001-000000 | egrep '^ru \?. ' | cut -d\ -f1,2 | od -c
0000000 r u ? 240 \n r u ? 244 \n r u ?
0000020 247 \n
0000022
{noformat}
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)