Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/11/11 23:48:21 UTC
Re: filtering out unicode entries loaded from json (tried using
FILTER and regex)
I'm not quite sure what you're going for, but here is a somewhat hacky way
to do this. Note that I barely tested it, and only on a tiny subset of
UTF-8, so if it's important I'd definitely add some unit tests.
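package rando;

import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// Despite the name, exec() returns true when the first field is pure ASCII,
// so FILTER ... BY rando.NotUTF8(field) keeps the ASCII-only rows.
public class NotUTF8 extends FilterFunc {
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return false;
        String in = (String) input.get(0);
        if (in == null)
            return false;
        try {
            // Encode with an explicit charset rather than the platform default,
            // then decode as US-ASCII: any non-ASCII character fails to round-trip.
            return in.equals(new String(in.getBytes("UTF-8"), "US-ASCII"));
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }
}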
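If you want to sanity-check the idea outside of Pig first, here is a minimal standalone sketch of the same "is this pure ASCII?" test. It swaps the byte round-trip for the JDK's CharsetEncoder.canEncode, which I believe behaves the same way for this purpose. AsciiCheck and isAscii are made-up names for illustration, not part of Pig.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Pig.
public class AsciiCheck {

    // Returns true only when every character of s can be encoded as US-ASCII.
    // A null input is treated as "not ASCII", mirroring the UDF's false result.
    public static boolean isAscii(String s) {
        if (s == null) {
            return false;
        }
        // Encoders are stateful and not thread-safe, so create one per call.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(s);
    }

    public static void main(String[] args) {
        // The two "tags" values from the sample JSON in the question.
        System.out.println(isAscii("apples and oranges")); // prints "true"
        System.out.println(isAscii("\uac38\uc6b0"));       // prints "false"
    }
}
```

Something like this could stand in for the body of exec() if the round-trip feels too obscure, but treat it as a sketch and test it against your real data.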
Let me know if that works.
2011/11/11 <mh...@columbia.edu>
> I have parsed a json file structured as:
> {"id":"xyz", "name":"John", "tags":"apples and oranges"}
> {"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc
>
> and I'd like to filter out the entries that contain unicode, like the
> second entry.
> I've tried using:
>
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
> dump result;
>
> This does not filter the second entry. What's more, when I just look
> at the tags being loaded, it looks like the unicode characters have
> been converted (i.e. I see weird graphics)
>
> running:
> rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
> logs = FOREACH rawdata generate json#name as thingtag;
> dump logs;
>
> Any help would be appreciated.
>