You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Charles Robertson <ch...@gmail.com> on 2014/09/17 10:33:46 UTC
Remove non-ascii characters
Hi all,
What is the best way to remove non-ascii characters in Hive? I found this
thread from over a year ago:
http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
which sort of implies there isn't a pre-defined way of doing it, and the
JIRA raised hasn't been implemented yet.
Does anyone have any suggestions?
Thanks,
Charles
Re: Remove non-ascii characters
Posted by Charles Robertson <ch...@gmail.com>.
Have answered my own question - dropped the serde properties from the
tweets_raw table and used the get_json_object function in the select
statement.
Regards,
Charles
On 17 September 2014 21:12, Charles Robertson <ch...@gmail.com>
wrote:
> I've written a script in python to strip out the ascii characters, and am
> using the streaming mapreduce functionality to scrub the text of non-ascii
> characters.
>
> However, this has now broken my hive queries because whereas I previously
> had one JSON object per line, I now have a JSON object, a tab and blank, so
> using LOAD DATA is no longer working. I've tried loading the data into a
> two-column table and then doing an INSERT SELECT on the field containing
> the JSON, but it doesn't like this because I'm selecting one column and the
> receiving table has five - this is the table definition:
>
> CREATE EXTERNAL TABLE tweets_raw (
> id BIGINT,
> created_at STRING,
> text STRING,
> screen_name STRING,
> name STRING
> )
> ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
> WITH SERDEPROPERTIES (
> 'paths'='id, created_at, text, user.screen_name, user.name'
> );
>
> Can anyone suggest how to get around this?
>
> Thanks,
> Charles
>
> On 17 September 2014 09:33, Charles Robertson <charles.robertson@gmail.com
> > wrote:
>
>> Hi all,
>>
>> What is the best way to remove non-ascii characters in Hive? I found this
>> thread from over a year ago:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
>> which sort of implies there isn't a pre-defined way of doing it, and the
>> JIRA raised hasn't been implemented yet.
>>
>> Does anyone have any suggestions?
>>
>> Thanks,
>> Charles
>>
>
>
Re: Remove non-ascii characters
Posted by Charles Robertson <ch...@gmail.com>.
I've written a script in python to strip out the ascii characters, and am
using the streaming mapreduce functionality to scrub the text of non-ascii
characters.
However, this has now broken my hive queries because whereas I previously
had one JSON object per line, I now have a JSON object, a tab and blank, so
using LOAD DATA is no longer working. I've tried loading the data into a
two-column table and then doing an INSERT SELECT on the field containing
the JSON, but it doesn't like this because I'm selecting one column and the
receiving table has five - this is the table definition:
CREATE EXTERNAL TABLE tweets_raw (
id BIGINT,
created_at STRING,
text STRING,
screen_name STRING,
name STRING
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
'paths'='id, created_at, text, user.screen_name, user.name'
);
Can anyone suggest how to get around this?
Thanks,
Charles
On 17 September 2014 09:33, Charles Robertson <ch...@gmail.com>
wrote:
> Hi all,
>
> What is the best way to remove non-ascii characters in Hive? I found this
> thread from over a year ago:
> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
> which sort of implies there isn't a pre-defined way of doing it, and the
> JIRA raised hasn't been implemented yet.
>
> Does anyone have any suggestions?
>
> Thanks,
> Charles
>