Posted to dev@hive.apache.org by "Nick Dimiduk (JIRA)" <ji...@apache.org> on 2014/02/14 00:06:22 UTC

[jira] [Updated] (HIVE-6428) concat_ws mangles non-ASCII characters

     [ https://issues.apache.org/jira/browse/HIVE-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Dimiduk updated HIVE-6428:
-------------------------------

    Attachment: ru.txt

> concat_ws mangles non-ASCII characters
> --------------------------------------
>
>                 Key: HIVE-6428
>                 URL: https://issues.apache.org/jira/browse/HIVE-6428
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.13.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>         Attachments: ru.txt
>
>
> Marked critical because this results in data loss from using built-in functionality. I think the issue is {{concat_ws}}, though I suppose it could be the VIEW as well.
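> My guess at the mechanism (an assumption on my part, I have not traced the UDF code): if the raw bytes are decoded into a java.lang.String as UTF-8 somewhere along the way, every byte sequence that is not valid UTF-8 gets replaced with U+FFFD, so originally distinct values come out identical. A minimal Java sketch of that behaviour (class and variable names are just for illustration):
> {noformat}
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
>
> public class ReplacementDemo {
>     public static void main(String[] args) {
>         // Two distinct byte sequences that are not valid UTF-8
>         // (0xA0 and 0xA4 are bare continuation bytes).
>         byte[] a = {'?', (byte) 0xA0};
>         byte[] b = {'?', (byte) 0xA4};
>
>         // Decoding with replacement turns each invalid byte into U+FFFD,
>         // so the two strings compare equal.
>         String sa = new String(a, StandardCharsets.UTF_8);
>         String sb = new String(b, StandardCharsets.UTF_8);
>         System.out.println(sa.equals(sb));  // true
>
>         // Re-encoding cannot recover the original bytes.
>         System.out.println(Arrays.equals(a, sa.getBytes(StandardCharsets.UTF_8)));  // false
>     }
> }
> {noformat}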
> Hive is losing the distinction between non-ASCII characters, folding distinct values into the same value. Here are steps to reproduce, and I've attached a small sample containing 3 distinct lines from the larger input file.
> Grab the sample data and confirm that the total number of records matches the number of unique combinations of the first two columns.
> {noformat}
> $ mkdir /tmp/pagecounts
> $ cd /tmp/pagecounts
> $ wget http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-000000.gz
> $ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | wc -l
>  4268675
> $ gzcat pagecounts-20081001-000000.gz | cut -d\  -f1,2 | sort | uniq | wc -l
>  4268675
> {noformat}
> Create a Hive table over the input data.
> {noformat}
> CREATE EXTERNAL TABLE pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
> ROW FORMAT
>   DELIMITED FIELDS TERMINATED BY ' '
>   LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> LOCATION '/tmp/pagecounts/';
> {noformat}
> Confirm the number of unique combinations of the first two columns from within Hive; the count matches.
> {noformat}
> SELECT count(DISTINCT projectcode, pagename) FROM pagecounts;
> => 4268675
> {noformat}
> Create a view over the raw data, concatenating the first two columns. The distinct count no longer matches.
> {noformat}
> CREATE VIEW pgc_simple (rowkey, pageviews, bytes) AS
> SELECT concat_ws('/', projectcode, pagename), pageviews, bytes
> FROM pagecounts;
> SELECT count(DISTINCT rowkey) FROM pgc_simple;
> => 4268561
> {noformat}
> Perform the same "view" from the shell. The distinct count is retained.
> {noformat}
> $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | wc -l
>  4268675
> $ gzcat pagecounts-20081001-000000.gz | awk '{print $1 "/" $2}' | sort | uniq | wc -l                                                                                                       
>  4268675
> {noformat}
> Look at some data.
> {noformat}
> $ hive -e "SELECT i.* FROM (SELECT rowkey, count(*) AS nbr FROM pgc_simple GROUP BY rowkey) i WHERE i.nbr > 1;" | tail
> OK
> Time taken: 40.493 seconds, Fetched: 37 row(s)
> ru/?�N  2
> ru/�    3
> zh/?��  2
> zh/Category:��  2
> zh/Wikipedia:�� 5
> zh/�    2
> zh/��   11
> zh/��˹��        2
> zh/���� 17
> zh/������       15
> {noformat}
> Choose the 2nd line of output to inspect in the shell. My locale can't render a character for these codepoints, but sort | uniq still identifies them as distinct.
> {noformat}
> $ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\  -f1,2 | sort | uniq
> ru ??
> ru ??
> ru ??
> {noformat}
> Print them as C-escape codes. They are indeed distinct.
> {noformat}
> $ gzcat pagecounts-20081001-000000.gz | egrep '^ru \?. ' | cut -d\  -f1,2 | od -c
> 0000000    r   u       ? 240  \n   r   u       ? 244  \n   r   u       ?
> 0000020  247  \n
> 0000022
> {noformat}
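> For what it's worth, \240 (0xA0), \244 (0xA4) and \247 (0xA7) are all bare UTF-8 continuation bytes, invalid on their own, so a UTF-8 decode with replacement (as in the sketch above) would map every one of them to U+FFFD. That would explain why these distinct keys collapse into a single value in the view output, and why the original bytes can't be recovered afterwards.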


