Posted to user@pig.apache.org by Daniel Eklund <do...@gmail.com> on 2011/04/19 19:08:48 UTC

JSONToTuple for pig UDF

I noticed that there is a Pig JSON Loader (which might or might not be in
piggybank).
Could anyone confirm the existence or absence of a JSONToTuple UDF (not a
loader)?

I am inspired by the UDF mentioned on Slide 23 here:
http://www.slideshare.net/danharvey/hbase-at-mendeley

  doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as DOC;

My desire is to store a raw JSON doc in a cell in HBase and run pig queries
against the tuples generated by the UDF.
I used the HBase Loader already to get the cell-data, and now I need a
JSON-deserializer.
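
As a rough illustration of the kind of deserializer being asked for, here is a minimal sketch in plain Python (not a Pig UDF, and not taken from any of the libraries mentioned in this thread). The function name `json_to_tuple` and the `fields` parameter are hypothetical; the sketch assumes the cell holds a single JSON object and that the caller knows which fields it wants, in order.

```python
import json


def json_to_tuple(raw, fields):
    """Parse a JSON object string and return its values as a tuple.

    Values are returned in the order given by `fields`. A missing field
    becomes None (roughly analogous to a Pig null). Input that cannot be
    parsed, or that is not a JSON object, yields None for the whole tuple.
    """
    try:
        doc = json.loads(raw)
    except (TypeError, ValueError):
        # TypeError: raw was None or not a string; ValueError: malformed JSON
        return None
    if not isinstance(doc, dict):
        return None
    return tuple(doc.get(name) for name in fields)


if __name__ == "__main__":
    cell = '{"title": "HBase at Mendeley", "year": 2011}'
    print(json_to_tuple(cell, ["title", "year", "doi"]))
```

Returning None instead of raising on bad input is a design choice here: one malformed cell then shows up as a null rather than failing the whole job.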

I would be willing to roll my own (and contribute), but I figured I'd see if
there was anything out there first.

thanks,
daniel

Re: JSONToTuple for pig UDF

Posted by Daniel Eklund <do...@gmail.com>.

great...

this was exactly what I was hoping for ... (although I have a bit of sadness
as I was just about ready to get my hands dirty)

On Tue, Apr 19, 2011 at 2:57 PM, Xavier Stevens <xs...@mozilla.com> wrote:

> For what it's worth I have one as well. This one uses Jackson to parse
> everything.
>
>
> https://github.com/xstevens/akela/blob/master/src/java/com/mozilla/pig/eval/json/JsonMap.java

Re: JSONToTuple for pig UDF

Posted by Xavier Stevens <xs...@mozilla.com>.

For what it's worth I have one as well. This one uses Jackson to parse
everything.

https://github.com/xstevens/akela/blob/master/src/java/com/mozilla/pig/eval/json/JsonMap.java


On 4/19/11 11:55 AM, Dmitriy Ryaboy wrote:
> YES :)
>
> On Tue, Apr 19, 2011 at 11:49 AM, John Hui <jo...@gmail.com> wrote:
>
>> I have a JSON library and pig script working.  Should I just contribute it
>> instead of reinventing the wheel?
>>
>> John

Re: JSONToTuple for pig UDF

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

YES :)

On Tue, Apr 19, 2011 at 11:49 AM, John Hui <jo...@gmail.com> wrote:

> I have a JSON library and pig script working.  Should I just contribute it
> instead of reinventing the wheel?
>
> John

Re: JSONToTuple for pig UDF

Posted by John Hui <jo...@gmail.com>.

I have a JSON library and pig script working.  Should I just contribute it
instead of reinventing the wheel?

John

On Tue, Apr 19, 2011 at 2:44 PM, Daniel Eklund <do...@gmail.com> wrote:

> Bill,  thanks...
>
>  so that is a confirmation... people have rolled their own, and it's not in
> piggybank.
> I would absolutely be willing to work with you to get a contribution going,
> but (as
> a warning) I am extremely new to Pig.
>
> I was looking at this:
> http://wiki.apache.org/pig/UDFManual
> to get my mind wrapped around the framework.  And I also discovered this
>
> https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
> ( I am assuming this was the UDF you mentioned that inspired you)...
>
> A quick question about the UDF's registered at the top of a pig script:
>
> does
> REGISTER myJar.jar
> distribute the jar across HDFS (like a Hadoop job jar) so that the
> distribution of the code to the cluster nodes is transparent?
> In other words, do we NOT have to distribute myJar.jar to each node on the
> cluster.
>
> thanks more,
> daniel

Re: JSONToTuple for pig UDF

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

FYI, there's a ticket open already, though it hasn't seen much action:

https://issues.apache.org/jira/browse/PIG-1914

Perhaps the best thing would be to discuss implementation approaches, etc,
there.

D

On Tue, Apr 19, 2011 at 12:11 PM, John Hui <jo...@gmail.com> wrote:

> Really, cool.  Let me take a look when I have some "downtime".  If that's
> the case, Xavier's parser is much better than mine.
>
> Who wants to take the lead in adding this to the piggybank, I am sure this
> makes for a very useful "storage" utility.
>
> John

Re: JSONToTuple for pig UDF

Posted by John Hui <jo...@gmail.com>.

Really, cool.  Let me take a look when I have some "downtime".  If that's
the case, Xavier's parser is much better than mine.

Who wants to take the lead in adding this to the piggybank?  I am sure this
would make for a very useful "storage" utility.

John

On Tue, Apr 19, 2011 at 3:09 PM, Xavier Stevens <xs...@mozilla.com> wrote:

> Hey John,
>
> If you take a look at mine it looks explicitly for Lists and converts
> them to DataBags. I ran into that issue with our data. That said I won't
> make any claims that it'll work for all data.
>
> Cheers,
>
> -Xavier

Re: JSONToTuple for pig UDF

Posted by Xavier Stevens <xs...@mozilla.com>.

Hey John,

If you take a look at mine, it looks explicitly for Lists and converts
them to DataBags. I ran into that issue with our data. That said, I won't
make any claims that it'll work for all data.
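
The list-to-bag idea can be modelled in a few lines of plain Python. This is only a sketch of the general technique, not the code from the JsonMap UDF linked earlier: the names `to_pig_like` and `json_map` are hypothetical, a Pig map is stood in for by a `dict`, and a DataBag is stood in for by a list of one-field tuples (since a bag holds tuples, each list element is wrapped in one).

```python
import json


def to_pig_like(value):
    """Recursively convert parsed JSON into Pig-shaped Python values.

    dict -> dict with string keys (stand-in for a Pig map)
    list -> list of one-field tuples (stand-in for a DataBag)
    anything else is passed through unchanged
    """
    if isinstance(value, dict):
        return {str(k): to_pig_like(v) for k, v in value.items()}
    if isinstance(value, list):
        return [(to_pig_like(item),) for item in value]
    return value


def json_map(raw):
    """Parse a JSON object string into a map; return None on bad input."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(parsed, dict):
        return None
    return to_pig_like(parsed)


if __name__ == "__main__":
    print(json_map('{"user": "x", "tags": ["a", "b"]}'))
```

Because the conversion recurses into list elements, a list inside a list comes out as a bag of tuples that each contain another bag.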

Cheers,

-Xavier

On 4/19/11 12:02 PM, John Hui wrote:
> I'll post my solution in a few hours =)
>
> On Tue, Apr 19, 2011 at 3:02 PM, John Hui <jo...@gmail.com> wrote:
>
>> I don't think one parser will work for all solution.  It really depends on
>> your data, since there might be a list within a list.
>>
>> But pick anyone as a starting point and customize it for your own json data
>> format.

Re: JSONToTuple for pig UDF

Posted by John Hui <jo...@gmail.com>.

I'll post my solution in a few hours =)

On Tue, Apr 19, 2011 at 3:02 PM, John Hui <jo...@gmail.com> wrote:

> I don't think one parser will work for all solution.  It really depends on
> your data, since there might be a list within a list.
>
> But pick anyone as a starting point and customize it for your own json data
> format.

Re: JSONToTuple for pig UDF

Posted by John Hui <jo...@gmail.com>.

I don't think one parser will work for all cases.  It really depends on
your data, since there might be a list within a list.

But pick any one as a starting point and customize it for your own JSON data
format.
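
One way to find out whether your data has the list-within-a-list shape before choosing a parser is to measure how deeply lists nest in a sample of documents. The helper below is a hypothetical sketch in plain Python (the name `max_list_depth` is not from any library mentioned here); it assumes the sample has already been parsed with the standard `json` module.

```python
import json


def max_list_depth(value):
    """Return the deepest level of list nesting found anywhere in `value`.

    0 means no lists at all, 1 means flat lists only, 2 or more means
    there is at least one list within a list.
    """
    if isinstance(value, list):
        # This list counts as one level, plus the deepest level inside it.
        return 1 + max((max_list_depth(v) for v in value), default=0)
    if isinstance(value, dict):
        # Objects do not add a list level; look through their values.
        return max((max_list_depth(v) for v in value.values()), default=0)
    return 0


if __name__ == "__main__":
    sample = json.loads('{"a": [[1, 2], [3]], "b": {"c": [4]}}')
    print(max_list_depth(sample))
```

A result of 2 or more on real data suggests that a parser handling only a single level of lists would need customizing.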

On Tue, Apr 19, 2011 at 3:00 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 19, 2011, at 11:44 AM, Daniel Eklund wrote:
>
>  <snip>
>>
>> A quick question about the UDF's registered at the top of a pig script:
>>
>> does
>> REGISTER myJar.jar
>> distribute the jar across HDFS (like a Hadoop job jar) so that the
>> distribution of the code to the cluster nodes is transparent?
>> In other words, do we NOT have to distribute myJar.jar to each node on the
>> cluster.
>>
>
> Pig takes care of getting myJar.jar to the task nodes; you do not have to
> worry about it.
>
> Alan.
>
>

Re: JSONToTuple for pig UDF

Posted by Alan Gates <ga...@yahoo-inc.com>.

On Apr 19, 2011, at 11:44 AM, Daniel Eklund wrote:

> <snip>
> A quick question about the UDF's registered at the top of a pig  
> script:
>
> does
> REGISTER myJar.jar
> distribute the jar across HDFS (like a Hadoop job jar) so that the
> distribution of the code to the cluster nodes is transparent?
> In other words, do we NOT have to distribute myJar.jar to each node  
> on the
> cluster.

Pig takes care of getting myJar.jar to the task nodes; you do not have to
worry about it.

Alan.

Re: JSONToTuple for pig UDF

Posted by Daniel Eklund <do...@gmail.com>.

Bill,  thanks...

So that is a confirmation: people have rolled their own, and it's not in
piggybank.
I would absolutely be willing to work with you to get a contribution going,
but (as a warning) I am extremely new to Pig.

I was looking at this:
http://wiki.apache.org/pig/UDFManual
to get my mind wrapped around the framework.  And I also discovered this
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
( I am assuming this was the UDF you mentioned that inspired you)...

A quick question about the UDFs registered at the top of a Pig script:

Does
REGISTER myJar.jar
distribute the jar across HDFS (like a Hadoop job jar) so that the
distribution of the code to the cluster nodes is transparent?
In other words, do we NOT have to distribute myJar.jar to each node on the
cluster?

thanks more,
daniel

On Tue, Apr 19, 2011 at 1:57 PM, Bill Graham <bi...@gmail.com> wrote:

> We're doing the same thing using a JsonToMap UDF followed by a
> MapToBag UDF. The former was similarly inspired by the elephant bird
> JSONLoader. I'd be glad to collaborate on a contribution if you'd
> like.
>
> Here's what our scripts look like:
>
> define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
> define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
> define concat org.apache.pig.builtin.StringConcat();
>
> raw = LOAD 'hbase://user_info'
>      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'events:*')
>      AS (events_map:map[]);
>
> -- Convert our maps to bags so we can flatten them out
> B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
>
> C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray,
> event_v:chararray);
>
> -- Convert the JSON events into maps
> D = FOREACH C GENERATE social_k, jsonToMap(event_v) AS event_map:map[];
>
> -- Example showing how to filter on a given field
> E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
> event_map#'levt.asid' IS NOT NULL);
>
> -- Example showing how to pull data out of a map
> F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
>                                             event_map#'levt.astid' AS
> astid;
>
>
> thanks,
> Bill

Re: JSONToTuple for pig UDF

Posted by Bill Graham <bi...@gmail.com>.

We're doing the same thing using a JsonToMap UDF followed by a
MapToBag UDF. The former was similarly inspired by the elephant bird
JSONLoader. I'd be glad to collaborate on a contribution if you'd
like.

Here's what our scripts look like:

define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
define concat org.apache.pig.builtin.StringConcat();

raw = LOAD 'hbase://user_info'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage( 'events:*')
      AS (events_map:map[]);

-- Convert our maps to bags so we can flatten them out
B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;

C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray, event_v:chararray);

-- Convert the JSON events into maps
-- (event_k is the key field produced by the FLATTEN above)
D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];

-- Example showing how to filter on a given field
E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
                 event_map#'levt.asid' IS NOT NULL);

-- Example showing how to pull data out of a map
F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
                       event_map#'levt.astid' AS astid;
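
For readers who want to trace the data flow of the script above without a cluster, the same steps can be modelled in plain Python. This is a sketch under assumptions: `map_to_bag` and `json_to_map` are hypothetical stand-ins for the MapToBag and JsonToMap UDFs (whose source is not shown in this thread), each HBase row is represented as a dict of column qualifier to JSON string, and `levt.asid` / `levt.astid` are treated as literal top-level keys, as the `#` lookups in the script suggest.

```python
import json


def map_to_bag(events_map):
    """Stand-in for MapToBag: turn a map into a bag of (key, value) tuples."""
    return [(k, v) for k, v in events_map.items()]


def json_to_map(raw):
    """Stand-in for JsonToMap: parse a JSON object, or None on bad input."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None
    return parsed if isinstance(parsed, dict) else None


def run_pipeline(rows):
    """Mirror relations B through F of the script for a list of rows."""
    results = []
    for events_map in rows:
        # B and C: map -> bag, then flatten into one record per event
        for event_k, event_v in map_to_bag(events_map):
            # D: parse the JSON value into a map
            event_map = json_to_map(event_v)
            if event_map is None:
                continue  # a null map fails both IS NOT NULL checks
            asid = event_map.get("levt.asid")
            astid = event_map.get("levt.astid")
            # E: keep only events where both fields are present
            if asid is not None and astid is not None:
                # F: project the two fields
                results.append((asid, astid))
    return results


if __name__ == "__main__":
    rows = [{
        "e1": '{"levt.asid": "A1", "levt.astid": "T1"}',
        "e2": '{"levt.asid": "A2"}',
    }]
    print(run_pipeline(rows))
```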


thanks,
Bill

On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <do...@gmail.com> wrote:
> I noticed that there is a Pig JSON Loader (which might or might not be in
> piggbank).
> Could anyone confirm the existence or absence of a JSONToTuple UDF?  (not a
> loader)
>
> I am inspired by the UDF mentioned on Slide 23 here:
> http://www.slideshare.net/danharvey/hbase-at-mendeley
>
>  doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as
> DOC;
>
> My desire is to store a raw JSON doc in a cell in HBase and run pig queries
> against the tuples generated by the UDF.
> I used the HBase Loader already to get the cell-data, and now I need a
> JSON-deserializer.
>
> I would be willing to roll my own, (and contribute), but I figure I'd see if
> there was anything out there first.
>
> thanks,
> daniel
>