Posted to user@pig.apache.org by Alex McLintock <al...@gmail.com> on 2011/01/29 13:12:59 UTC

UDF discussion? Here or on the dev list? / Json Loading

I wonder if discussion of the Piggybank and other User Defined Functions is
best done here (since it is *using* Pig) or on the Development list (because
it is enhancing Pig).

I'm trying to load some JSON into Pig using the PigJsonLoader.java UDF which
Kim Vogt posted about back in September. (It isn't in Piggybank, AFAICS.)
https://gist.github.com/601331


The class works for me, mostly...


This works when the JSON is just a single level:

{"field1": "value1", "field2": "value2", "field3": "value3"}

But it doesn't seem to work when the JSON is nested:

{"field1": "value1", "field2": "value2", "field7": {"field4": "value4",
"field5": "value5", "field6": "value6"}, "field3": "value3"}

Has anyone got this working? I can't see how the existing code deals with
this: parseStringToTuple only creates a single Map, and there is no
recursion that I can see.



Any suggestions?

Re: UDF discussion? Here or on the dev list? / Json Loading

Posted by Jacob Perkins <ja...@gmail.com>.
Yes, you would have to distribute Ruby (though it's typically
installed by default) as well as the wukong and json libraries to all
the nodes in the cluster. Unfortunately this isn't something wukong
gives you for free at the moment, though it is planned.

As far as I know, Pig doesn't do anything more complex than launch a
Hadoop streaming job and use the output in the subsequent steps.

BTW, I write 90% of my MR jobs using either wukong or Pig. Only when
it's absolutely required do I use a language with as much overhead as
Java :)

--jacob
@thedatachef

Sent from my iPhone

On Jan 30, 2011, at 2:09 PM, Alex McLintock <al...@gmail.com>  
wrote:

> On 29 January 2011 13:43, Jacob Perkins <ja...@gmail.com> wrote:
>>
>> Write a map only wukong script that parses the json as you want it. See
>> the example here:
>>
>> http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html
>>
> Hi Jacob,
>
> Thanks very much for helping me out. I haven't heard of Wukong before.
> I am a bit concerned, though, by adding Ruby into my tool stack as well as
> Pig. It seems like a step too far.
> Presumably I have to distribute Ruby and Wukong across all my job nodes in
> the same way as if I were writing Perl or C++ streaming programs.
>
> With STREAMing, the script is launched once per file, right, not once per
> record?
>
> Alex

Re: UDF discussion? Here or on the dev list? / Json Loading

Posted by Alex McLintock <al...@gmail.com>.
On 29 January 2011 13:43, Jacob Perkins <ja...@gmail.com> wrote:
>
> Write a map only wukong script that parses the json as you want it. See
> the example here:
>
>
> http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html
>
>
Hi Jacob,

Thanks very much for helping me out. I haven't heard of Wukong before.
I am a bit concerned, though, by adding Ruby into my tool stack as well as
Pig. It seems like a step too far.
Presumably I have to distribute Ruby and Wukong across all my job nodes in
the same way as if I were writing Perl or C++ streaming programs.

With STREAMing, the script is launched once per file, right, not once per
record?

Alex

Re: UDF discussion? Here or on the dev list? / Json Loading

Posted by Jacob Perkins <ja...@gmail.com>.
Alex,

It's a hack (sort of), but here's how I always do it, since parsing JSON
in Java will put you in an insane asylum:

Write a map-only wukong script that parses the JSON the way you want it. See
the example here:

http://thedatachef.blogspot.com/2011/01/processing-json-records-with-hadoop-and.html

Then use the STREAM operator to stream your raw records (load them as
chararrays first) through your wukong script. It's not perfect, but it
gets the job done.
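In Pig Latin, the shape of that hack looks roughly like this (the script name parse_json.rb, its --map flag, and the output schema are made up for illustration; SHIP is what copies the wukong script out to the task nodes):

```pig
-- load each raw JSON record as a single chararray
raw = LOAD 'input/records.json' AS (line:chararray);

-- hypothetical wukong script; SHIP distributes it to every task node
DEFINE parse_json `parse_json.rb --map` SHIP('parse_json.rb');

-- stream the raw lines through the script and re-impose a schema
parsed = STREAM raw THROUGH parse_json
         AS (field1:chararray, field2:chararray, field3:chararray);

DUMP parsed;
```

Ruby, wukong, and the json gem still have to be present on every node for the streamed script to run, as discussed above.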

--jacob
@thedatachef


On Sat, 2011-01-29 at 12:12 +0000, Alex McLintock wrote:
> I'm trying to load some Json into pig using the PigJsonLoader.java UDF which
> Kim Vogt posted about back in September. (It isn't in Piggybank AFAICS)
> https://gist.github.com/601331
> [...]
> Has anyone got this working? I can't see how the existing code deals with
> this. parseStringToTuple only creates a single Map. There is no recursion I
> can see.
>
> Any suggestions?



Re: UDF discussion? Here or on the dev list? / Json Loading

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Sat, Jan 29, 2011 at 5:42 PM, Alex McLintock
<al...@gmail.com> wrote:
> [...]
> This works when the Json is just a single level
>
> {"field1": "value1", "field2": "value2", "field3": "value3"}
>
> But it doesn't seem to work when the JSON is nested:
>
> {"field1": "value1", "field2": "value2", "field7": {"field4": "value4",
> "field5": "value5", "field6": "value6"}, "field3": "value3"}
>

The json-simple library for Java will build the entire JSON
representation as a JSONObject, which is _exactly_ what you need. This
is a Java Map-like class which would contain your structure properly.
What remains is to properly convert this to a Pig-acceptable Map
structure.

But what's happening in Vogt's code (and also in Elephant-Bird's
LzoJsonLoader, from which it was sourced) is that the Map is
down-converted to a simple key-value mapping instead of a Map
containing another Map. This was done because of a limitation in Pig
0.6.0, where the Map type could not hold complex types -- as
noted in the latter class's javadoc [1].

This limitation has gone away in 0.7.0+, I think (the Pig Map spec
now supports <String, {Atom, Tuple, Bag, Map}>), so you can feel free
to change or remove the iteration inside parseStringToTuple(...) so
that it no longer 'flattens' the Map.
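The recursion being described can be sketched in plain Java. This is only a stdlib illustration, not the actual loader code: java.util maps and lists stand in for json-simple's JSONObject/JSONArray, and Pig's Tuple/DataBag types are elided, so the class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the non-flattening conversion: instead of stringifying
// nested values (the Pig 0.6 workaround), walk the structure and
// rebuild nested maps/lists so that Pig 0.7+ can hold them as
// complex map values. Plain java.util types stand in for
// json-simple's JSONObject/JSONArray and for Pig's Map/DataBag.
public class JsonToPigMap {

    @SuppressWarnings("unchecked")
    public static Object convert(Object jsonValue) {
        if (jsonValue instanceof Map) {
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : ((Map<String, Object>) jsonValue).entrySet()) {
                out.put(e.getKey(), convert(e.getValue()));  // recurse into nested objects
            }
            return out;
        } else if (jsonValue instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object v : (List<Object>) jsonValue) {
                out.add(convert(v));                         // recurse into arrays
            }
            return out;                                      // real code would build a DataBag here
        }
        // scalars become chararray-like strings, as in the original loader
        return jsonValue == null ? null : jsonValue.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new LinkedHashMap<>();
        inner.put("field4", "value4");
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("field1", "value1");
        parsed.put("field7", inner);                         // nested object, as in Alex's example

        System.out.println(convert(parsed));                 // the nested map survives intact
    }
}
```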

Additionally, I think the json-simple dependency could perhaps be
removed in favor of the Jackson Core/Mapper libraries that Hadoop
itself now ships (eliminating an extra JAR, since Pig does not ship
the json-simple library along). But you may want to be careful about
the version of Jackson Core/Mapper in place inside your Hadoop; much
more recent releases, with benefits, are available.

Perhaps, if you feel like it, you could contribute your change back to
elephant-bird [2]. I think they're open to newer-Pig-related changes.

[1] - https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/load/LzoJsonLoader.java
[2] - https://github.com/kevinweil/elephant-bird

-- 
Harsh J
www.harshj.com