Posted to user@pig.apache.org by Kat Huang <mh...@columbia.edu> on 2011/11/11 22:12:52 UTC

filter (regex) unicode from json files in pig?

I have parsed a json file structured as:
{"id":"xyz", "name":"John", "tags":"apples and oranges"}
{"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc

and I'd like to filter out the entries that contain non-ASCII (Unicode)
characters, like the second entry.
I've tried using:

rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
logs = FOREACH rawdata generate json#'tags' as thingtag;
result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
dump result;

This does not filter out the second entry. What's more, when I just look
at the tags being loaded, it looks like the unicode escapes have
been converted (i.e. I see weird graphics)

running:
rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
logs = FOREACH rawdata generate json#'tags' as thingtag;
dump logs;
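
A likely reason the first attempt matches nothing is sketched below in Java, whose regex engine Pig's matches operator uses. The sketch assumes that the JSON loader decodes each \uXXXX escape into the actual character (which would explain the "weird graphics"), and that Pig unescapes the quoted pattern the way Java does. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeCheck {
    // The pattern from the post once the string literal is unescaped:
    // a literal backslash followed by a lowercase letter, anywhere in the value.
    static final Pattern BACKSLASH_LETTER = Pattern.compile(".*\\\\[a-z].*");

    static boolean looksEscaped(String s) {
        return BACKSLASH_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Assumption: this is the escape text as it sits in the file (12 chars).
        String raw = "\\uac38\\uc6b0";
        // Assumption: this is what the loader hands to Pig after decoding (2 chars).
        String decoded = "\uac38\uc6b0";

        System.out.println(looksEscaped(raw));     // true
        System.out.println(looksEscaped(decoded)); // false
    }
}
```

If the loader does decode the escapes, there is no backslash left in the field for the pattern to find, so the test has to be about the characters themselves rather than about escape text.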

Any help would be appreciated.

Re: filter (regex) unicode from json files in pig?

Posted by mh...@columbia.edu.
Thank you so much, that did the trick.

Quoting Jonathan Coveney <jc...@gmail.com>:

> Dmitriy's solution is definitely more elegant than writing a UDF, and in a
> quick test, worked equally well.
>
> c = filter a by x matches '\\p{ASCII}*';
>
> This would work if you wanted to ensure that all characters are ASCII.

Re: filter (regex) unicode from json files in pig?

Posted by Jonathan Coveney <jc...@gmail.com>.
Dmitriy's solution is definitely more elegant than writing a UDF, and in a
quick test, worked equally well.

c = filter a by x matches '\\p{ASCII}*';

This would work if you wanted to ensure that all characters are ASCII.
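
For anyone who wants to check the pattern outside Pig, here is a minimal Java sketch. It relies on Pig's matches using Java regular expressions and, as far as I understand the Pig documentation, requiring the whole value to match, which is also what String.matches does. The class and method names are hypothetical.

```java
class AsciiOnly {
    // True only when every character of s is ASCII (code points 0x00-0x7F).
    // String.matches anchors the pattern to the whole string.
    static boolean isAllAscii(String s) {
        return s.matches("\\p{ASCII}*");
    }

    public static void main(String[] args) {
        System.out.println(isAllAscii("apples and oranges")); // true
        System.out.println(isAllAscii("\uac38\uc6b0"));       // false
        System.out.println(isAllAscii("caf\u00e9"));          // false
        System.out.println(isAllAscii(""));                   // true
    }
}
```

Note that the trailing * also lets an empty string through, so empty tags would be kept by this filter.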

Re: filter (regex) unicode from json files in pig?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think you can just filter by "not foo matches '.*\\P{ASCII}.*'"
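
The case of the p matters in a pattern of this shape: in Java regex, \p{ASCII} matches an ASCII character and \P{ASCII} matches a non-ASCII one. The sketch below compares the two "contains" patterns on sample values; the class and method names are hypothetical, and it assumes Pig hands the pattern to Java's regex engine unchanged.

```java
class ContainsCheck {
    // True when s contains at least one ASCII character.
    static boolean containsAscii(String s) {
        return s.matches(".*\\p{ASCII}.*");
    }

    // True when s contains at least one non-ASCII character.
    static boolean containsNonAscii(String s) {
        return s.matches(".*\\P{ASCII}.*");
    }

    public static void main(String[] args) {
        String plain = "apples and oranges";
        String hangul = "\uac38\uc6b0";

        System.out.println(containsAscii(plain));     // true
        System.out.println(containsNonAscii(plain));  // false
        System.out.println(containsAscii(hangul));    // false
        System.out.println(containsNonAscii(hangul)); // true
    }
}
```

So negating the upper-case form keeps only the plain entry, while negating the lower-case form would keep only values with no ASCII characters at all. One caveat: . does not match line terminators by default, so values with embedded newlines may behave differently than expected with either form.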
