You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Charles Robertson <ch...@gmail.com> on 2014/09/17 10:33:46 UTC

Remove non-ascii characters

Hi all,

What is the best way to remove non-ascii characters in Hive? I found this
thread from over a year ago:
http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
which sort of implies there isn't a pre-defined way of doing it, and the
JIRA raised hasn't been implemented yet.

Does anyone have any suggestions?

Thanks,
Charles

Re: Remove non-ascii characters

Posted by Charles Robertson <ch...@gmail.com>.
Have answered my own question - dropped the serde properties from the
tweets_raw table and used the get_json_object function in the select
statement.

Regards,
Charles

On 17 September 2014 21:12, Charles Robertson <ch...@gmail.com>
wrote:

> I've written a script in python to strip out the ascii characters, and am
> using the streaming mapreduce functionality to scrub the text of non-ascii
> characters.
>
> However, this has now broken my hive queries because whereas I previously
> had one JSON object per line, I now have a JSON object, a tab and blank, so
> using LOAD DATA is no longer working. I've tried loading the data into a
> two-column table and then doing an INSERT SELECT on the field containing
> the JSON, but it doesn't like this because I'm selecting one column and the
> receiving table has five - this is the table definition:
>
> CREATE EXTERNAL TABLE tweets_raw (
>    id BIGINT,
>    created_at STRING,
>    text STRING,
>    screen_name STRING,
>    name STRING
> )
> ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
> WITH SERDEPROPERTIES (
>       'paths'='id, created_at, text, user.screen_name, user.name'
>       );
>
> Can anyone suggest how to get around this?
>
> Thanks,
> Charles
>
> On 17 September 2014 09:33, Charles Robertson <charles.robertson@gmail.com
> > wrote:
>
>> Hi all,
>>
>> What is the best way to remove non-ascii characters in Hive? I found this
>> thread from over a year ago:
>> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
>> which sort of implies there isn't a pre-defined way of doing it, and the
>> JIRA raised hasn't been implemented yet.
>>
>> Does anyone have any suggestions?
>>
>> Thanks,
>> Charles
>>
>
>

Re: Remove non-ascii characters

Posted by Charles Robertson <ch...@gmail.com>.
I've written a script in python to strip out the ascii characters, and am
using the streaming mapreduce functionality to scrub the text of non-ascii
characters.

However, this has now broken my hive queries because whereas I previously
had one JSON object per line, I now have a JSON object, a tab and blank, so
using LOAD DATA is no longer working. I've tried loading the data into a
two-column table and then doing an INSERT SELECT on the field containing
the JSON, but it doesn't like this because I'm selecting one column and the
receiving table has five - this is the table definition:

CREATE EXTERNAL TABLE tweets_raw (
   id BIGINT,
   created_at STRING,
   text STRING,
   screen_name STRING,
   name STRING
)
ROW FORMAT SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES (
      'paths'='id, created_at, text, user.screen_name, user.name'
      );

Can anyone suggest how to get around this?

Thanks,
Charles

On 17 September 2014 09:33, Charles Robertson <ch...@gmail.com>
wrote:

> Hi all,
>
> What is the best way to remove non-ascii characters in Hive? I found this
> thread from over a year ago:
> http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3CCAKm=R7VE+VrQCCOhK69aPEEqNB=9zO8UAJ01ys8AuCAVvd9vag@mail.gmail.com%3E
> which sort of implies there isn't a pre-defined way of doing it, and the
> JIRA raised hasn't been implemented yet.
>
> Does anyone have any suggestions?
>
> Thanks,
> Charles
>