Posted to user@pig.apache.org by mh...@columbia.edu on 2011/11/11 22:17:47 UTC

filtering out unicode entries loaded from json (tried using FILTER and regex)

I have parsed a JSON file structured as:
{"id":"xyz", "name":"John", "tags":"apples and oranges"}
{"id":"xyz", "name":"John", "tags":"\uac38\uc6b0"}...etc

and I'd like to filter out the entries that contain Unicode, like the
second entry.
I've tried using:

rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
logs = FOREACH rawdata generate json#name as thingtag;
result = FILTER logs by thingtag matches '.*\\\\[a-z].*';
dump result;

This does not filter out the second entry. What's more, when I just look
at the tags being loaded, it looks like the Unicode characters have
already been converted (i.e., I see weird glyphs).

I see this when running:
rawdata = LOAD 'data' using PigJasonLoader() as (json:map[]);
logs = FOREACH rawdata generate json#name as thingtag;
dump logs;

Any help would be appreciated.

Re: filtering out unicode entries loaded from json (tried using FILTER and regex)

Posted by Jonathan Coveney <jc...@gmail.com>.
I'm not quite sure what you're going for, but here is a somewhat hacky way
to do this. Note that I barely tested it, and only on a tiny subset of
UTF-8, so if this is important I'd definitely add some unit tests.

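package rando;

import org.apache.pig.data.Tuple;
import org.apache.pig.FilterFunc;
import java.io.UnsupportedEncodingException;
import java.io.IOException;

// Despite the name, this returns true when the first field is pure ASCII,
// so FILTER ... BY NotUTF8(field) keeps the ASCII-only rows.
public class NotUTF8 extends FilterFunc {
  public Boolean exec(Tuple input) throws IOException {
    // Reject empty tuples and null fields instead of throwing.
    if (input == null || input.size() == 0 || input.get(0) == null)
      return false;
    String in = (String) input.get(0);
    try {
      // Encode with an explicit charset (not the platform default), then
      // decode as US-ASCII. Non-ASCII bytes do not survive the round trip
      // in practice, so the result equals the input only for ASCII text.
      return in.equals(new String(in.getBytes("UTF-8"), "US-ASCII"));
    } catch (UnsupportedEncodingException e) {
      return false;
    }
  }
}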
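
If it helps to try the check outside Pig first, here is a minimal standalone sketch of the same idea. It asks a US-ASCII encoder whether it can represent the string, instead of round-tripping through bytes. The class and method names (AsciiCheck, isAscii) are made up for illustration and are not part of Pig.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of Pig: reports whether a string is pure ASCII.
final class AsciiCheck {
  static boolean isAscii(String s) {
    if (s == null) {
      return false; // treat a missing value as "do not keep"
    }
    // Create a fresh encoder per call, since CharsetEncoder is not thread-safe.
    CharsetEncoder encoder = Charset.forName("US-ASCII").newEncoder();
    return encoder.canEncode(s);
  }

  public static void main(String[] args) {
    System.out.println(isAscii("apples and oranges")); // true
    System.out.println(isAscii("\uac38\uc6b0"));       // false
  }
}
```

The body of isAscii could replace the try/catch in exec if you prefer it, but I have not run it inside a Pig job.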

Let me know if that works.
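
As for why the original regex never matched: my guess (an assumption, I haven't looked at the loader) is that the JSON parser has already decoded the \uXXXX escapes into real characters by the time Pig sees the value, which would also explain the odd glyphs in the dump. If so, there is no backslash left to match, and the pattern has to test for non-ASCII characters instead. Pig's matches uses Java regular expressions and has to match the whole string, so the idea can be sketched in plain Java (NonAsciiRegexDemo and isAsciiOnly are hypothetical names):

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the backslash pattern with an ASCII-only pattern.
final class NonAsciiRegexDemo {
  // Matches only strings made up entirely of ASCII characters (0x00 to 0x7F).
  static final Pattern ASCII_ONLY = Pattern.compile("\\p{ASCII}*");

  static boolean isAsciiOnly(String s) {
    return s != null && ASCII_ONLY.matcher(s).matches();
  }

  public static void main(String[] args) {
    // Assumed result of JSON decoding: two Hangul characters, no backslash.
    String decoded = "\uac38\uc6b0";

    // A pattern that looks for a literal backslash followed by a lowercase letter.
    System.out.println(decoded.matches(".*\\\\[a-z].*"));  // false

    System.out.println(isAsciiOnly(decoded));              // false
    System.out.println(isAsciiOnly("apples and oranges")); // true
  }
}
```

In Pig the equivalent would presumably be a FILTER ... BY thingtag MATCHES with the same pattern, though I haven't checked how the backslash needs to be escaped inside a Pig string literal.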
