Posted to user@pig.apache.org by Anurag Gulati <An...@aexp.com> on 2012/04/05 00:37:39 UTC

Nested JSON Strings - How to Ingest and Manipulate?

Hi Guys!!

I'm over here trying to get my feet wet with Hadoop and my first task just happens to be a complex one.
I was hoping you could help me out.

I'm trying to read nested JSON structures (data received from Facebook) into Pig; then I'd like to be able to manipulate the data (e.g., return all lines where hometown = "Phoenix, Arizona").

I have a single file with multiple lines of JSON.  Each line is a single entry.  An example of one line is below:

{"id":"10011666","name":"Test user","first_name":"Test","last_name":"user","link":"http:\/\/www.facebook.com\/test.user","username":"test.user","birthday":"09\/19\/1983","hometown":{"id":"103102203064024","name":"West Chester, Pennsylvania"},"location":{"id":"","name":null},"bio":"This is my Bio. I'm a geek that love to hack (in a good way)","quotes":"I like quotes. But I'm shortening this section cuz it was wild!","work":[{"employer":{"id":"6185812851","name":"American Eagle"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"133619273341785","name":"Counter Guy"},"start_date":"2012-01"},{"employer":{"id":"190876464341724","name":"Cardiac group"},"position":{"id":"105630109469647","name":"Executive Producer"},"description":"We create music for Artist Placement and TV\/Film.","start_date":"2002-01"},{"employer":{"id":"6185812851","name":"American Eagle"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"116439401740213","name":"Floor Guy"},"start_date":"2007-10","end_date":"2012-01"},{"employer":{"id":"110067355684846","name":"Saint Joseph Hospital"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"202489236428627","name":"Pharmacy IT Coordinator"},"start_date":"2005-10","end_date":"2007-10"},{"employer":{"id":"110067355684846","name":"Saint Joseph Hospital"},"location":{"id":"105540216147364","name":"Phoenix, Arizona"},"position":{"id":"144703015548786","name":"Pharmacy Tech"},"start_date":"2001-02","end_date":"2005-10"}],"sports":[{"id":"108606435830479","name":"Karate"}],"favorite_teams":[{"id":"87169796810","name":"Philadelphia Flyers"},{"id":"93625750491","name":"Philadelphia Phillies"},{"id":"45898408995","name":"Phoenix Suns"},{"id":"120163518021430","name":"Philadelphia Eagles"}],"favorite_athletes":[{"id":"77922840249","name":"Steve Nash"},{"id":"105590659475179","name":"Wayne Gretzky"},{"id":"62975399193","name":"Michael Jordan"}],"inspirational_people":[{"id":"106676942701904","name":"Gandhi"}],"education":[{"school":{"id":"109324275761313","name":"Corona del Sol High School"},"type":"High School"},{"school":{"id":"23680344606","name":"Arizona State University"},"type":"College"}],"gender":"male","interested_in":["female"],"relationship_status":"Single","religion":"Hinduism (One with all things)","political":"Liberal (Left of Center)","email":"app+22c90gj.9hh9d.f7304b58ac646e08b5f0f10a73547e34\u0040proxymail.facebook.com","website":"www.slashdot.org\r\nwww.gizmodo.com","timezone":-7,"locale":"en_US","languages":[{"id":"106059522759137","name":"English"},{"id":"112969428713061","name":"Hindi"}],"verified":true,"updated_time":"2012-03-22T17:24:25+0000"}

Would you be able to show me the syntax for importing this type of nested JSON into Pig?

I've been able to use ElephantBird to ingest the data ... but the deeply nested structures are converted into bags ... which I'm having a hard time working with.
Does anyone have a way to easily ingest and manipulate the data?

I've worked with PHP a lot and its JSON encode/decode functions made my life soooo easy!  Hopefully you guys can get me back to my happy place!


Regards,

-Anurag G.

American Express made the following annotations on Wed Apr 04 2012 15:37:40 

****************************************************************************** 

"This message and any attachments are solely for the intended recipient and may contain confidential or privileged information. If you are not the intended recipient, any disclosure, copying, use, or distribution of the information included in this message and any attachments is prohibited. If you have received this communication in error, please notify us by reply e-mail and immediately and permanently delete this message and any attachments. Thank you." 

****************************************************************************** 
-------------------------------------------------------------------------------


Re: Nested JSON Strings - How to Ingest and Manipulate?

Posted by Norbert Burger <no...@gmail.com>.
Anurag - once you have elephant-bird parse the JSON into maps, you can
extract the nested JSON elements just as you would with any other map,
using the '#' projection operator.  For example, the following generates
3-element tuples containing id, name, and link:

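-- The LOAD is assumed to name its single chararray field 'line'.
A = LOAD ...
-- Parse each JSON string into a Pig map.
B = FOREACH A GENERATE
    com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) AS json;
-- Project three top-level keys out of the map.
C = FOREACH B GENERATE json#'id', json#'name', json#'link';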

Norbert
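
For the filter in the original question (hometown equal to a given city), the same '#' operator can be applied one level down. The script below is a minimal, untested sketch. It assumes the jar names and the input path 'facebook_users.json' (both hypothetical), one JSON object per line, and an elephant-bird version whose JsonStringToMap returns nested objects such as hometown as JSON strings, so that they can be parsed a second time.

```pig
-- Hypothetical jar names and input path; adjust to your installation.
REGISTER elephant-bird.jar;
REGISTER json-simple.jar;

DEFINE JsonStringToMap com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

-- Read every line of the file as one chararray field.
A = LOAD 'facebook_users.json' USING TextLoader() AS (line:chararray);
B = FOREACH A GENERATE JsonStringToMap(line) AS json;

-- Assumption: the nested "hometown" object arrives as a JSON string,
-- so it is parsed again to obtain a map of its own.
C = FOREACH B GENERATE
        json#'id'   AS id,
        json#'name' AS name,
        JsonStringToMap(json#'hometown') AS hometown;

-- Keep only the records whose hometown name matches.
D = FILTER C BY hometown#'name' == 'Phoenix, Arizona';
DUMP D;
```

In the sample record from the original message, hometown#'name' is 'West Chester, Pennsylvania', so that record would not pass this filter. Array-valued fields such as work are not covered by this sketch.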


RE: Nested JSON Strings - How to Ingest and Manipulate?

Posted by Anurag Gulati <An...@aexp.com>.
One error in my original message:

I've been able to use ElephantBird to ingest the data ... but the deeply nested structures are converted into MAPS (not bags) ... which I'm having a hard time working with.


Thx!
