Posted to user@pig.apache.org by unmesha sreeveni <un...@gmail.com> on 2014/07/25 13:09:25 UTC

Json Parsing in Apache Pig

Hi

This is my code for sampling

--Sampling.pig
--pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'

--Load data
inputdata = LOAD '$input' using PigStorage('$delimiter');

--Group data
groupedByAll = group inputdata all;

--output into hdfs
sampled = SAMPLE inputdata $fraction;
store sampled into '$output' using PigStorage('$delimiter');

I pass the input parameters on the command line:
pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'

I would like to modify this so that the parameters come from a JSON file
instead.

Sample JSON:
{"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}

I need to parse the JSON above and extract the parameters the script needs.
I know that JSON can be loaded in Apache Pig, but how do I extract
individual values from it?

From this JSON I only need fraction, destination and source.

Please suggest a way.

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/

Re: Json Parsing in Apache Pig

Posted by unmesha sreeveni <un...@gmail.com>.
Thanks. I am able to parse the JSON using elephant-bird.
I can now get source, destination and fraction into separate relations.

But how can I pass these values to my Pig script?


--Load Json
loadJson =  LOAD '$inputJson' USING
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') AS
(json:map []);

--PARSING JSON
--Source
a = FOREACH loadJson GENERATE json#'source' AS ParsedInput;

--Destination
b = FOREACH loadJson GENERATE json#'destination' AS ParsedOutput;

--Delimiter
c = FOREACH loadJson GENERATE json#'delimiter' AS ParsedDelimiter;

--Reservoir fraction
d = FOREACH loadJson GENERATE json#'reservoirSize' AS ParsedFraction;



--Load data
--How do I load my source, whose path is held in relation a? When I give 'a',
--Pig looks for a file named a in my current directory.
inputdata = LOAD 'a' using PigStorage('c');
store inputdata into '/home/sreeveni/myfiles/pig/OUT/ab';
--Group data
--groupedByAll = group inputdata all;

--output into hdfs
--sampled = SAMPLE inputdata $fraction;
--store sampled into '$output' using PigStorage('$delimiter');


How can I achieve this in Apache Pig?
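
One possible way around this, offered as an assumption rather than something confirmed in this thread: Pig substitutes $parameters before the script runs, and LOAD takes a literal path, so a value held in a relation cannot be used as a LOAD path. A small wrapper can instead read the JSON outside Pig and pass the values as -param arguments to the original Sampling.pig. The sketch below is in Python; build_pig_command and the default delimiter are hypothetical names chosen for illustration.

```python
import json
import shlex


def build_pig_command(config_text, script="Sampling.pig", delimiter=","):
    """Turn a JSON job description into a `pig -param ...` argument list."""
    config = json.loads(config_text)
    params = {
        "input": config["source"],
        "output": config["destination"],
        # "fraction" is nested under "elementInfo" in the sample JSON.
        "fraction": config["elementInfo"]["fraction"],
        # The sample JSON has no delimiter key, so a default is assumed here.
        "delimiter": config.get("delimiter", delimiter),
    }
    command = ["pig", "-x", "mapreduce", "-f", script]
    for name, value in params.items():
        command += ["-param", f"{name}={value}"]
    return command


if __name__ == "__main__":
    sample = ('{"Name":"sampling","elementInfo":{"fraction":"3"},'
              '"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}')
    # Print the command instead of running it; the list could be passed to
    # subprocess.run(...) on a machine where Pig is installed.
    print(" ".join(shlex.quote(part) for part in build_pig_command(sample)))
```

Running the file prints the assembled command. Note that Pig's SAMPLE expects a fraction between 0 and 1, so a value such as the "3" in the sample JSON would need to be adjusted before it is useful.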






On Sat, Jul 26, 2014 at 5:38 AM, Ryan Prociuk <ry...@gmail.com> wrote:

> I would recommend using the elephant-bird-pig JsonLoader
>
> Have used it quite extensively to parse nested Json datasets with no issue.
>
> You can download the jar files from Maven and REGISTER them in the script.
>
>
> http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig
>
> https://github.com/kevinweil/elephant-bird/
>
> It has dependencies on the following jars
> json-simple-1.1.x.jar;
> elephant-bird-pig-4.x.jar;
> elephant-bird-hadoop-compat-4.x.jar;
> elephant-bird-core-4.x.jar;
>
> Parse the file
>
> fileA = LOAD '/hdfs-directory/' USING
> com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS
>  (json:map[]);
>
> B = FOREACH fileA GENERATE
>        json#'col1' AS col1;
>
> Ryan

Re: Json Parsing in Apache Pig

Posted by Ryan Prociuk <ry...@gmail.com>.
I would recommend using the elephant-bird-pig JsonLoader

Have used it quite extensively to parse nested Json datasets with no issue.

You can download the jar files from Maven and REGISTER them in the script.

http://mvnrepository.com/artifact/com.twitter.elephantbird/elephant-bird-pig

https://github.com/kevinweil/elephant-bird/

It has dependencies on the following jars
json-simple-1.1.x.jar;
elephant-bird-pig-4.x.jar;
elephant-bird-hadoop-compat-4.x.jar;
elephant-bird-core-4.x.jar;

Parse the file

fileA = LOAD '/hdfs-directory/' USING
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS
 (json:map[]);

B = FOREACH fileA GENERATE
       json#'col1' AS col1;

Ryan




On Fri, Jul 25, 2014 at 4:55 PM, Satish Kolli <fe...@gmail.com> wrote:

> Did you try the standard JsonLoader? I didn't personally use it but it
> looks like you can specify the schema to extract/parse from your json.
>
> http://pig.apache.org/docs/r0.13.0/func.html#jsonloadstore
>
> If not, you can also look at the following example I found googling:
>
> https://gist.github.com/kimsterv/601331
>
>
> Thanks.



-- 
Ryan Prociuk | Engineering Distributed Data

Re: Json Parsing in Apache Pig

Posted by Ryan Compton <co...@gmail.com>.
I've found Twitter's elephantbird library very useful here
(https://github.com/kevinweil/elephant-bird )

a = LOAD 'file3.json' USING
com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

This parses the JSON into a map
(http://pig.apache.org/docs/r0.11.1/basic.html#map-schema); a JSONArray
gets parsed into a DataBag of maps.

cf. https://stackoverflow.com/questions/11035105/processing-json-through-pig-scripts/16501542#16501542

On Fri, Jul 25, 2014 at 4:55 PM, Satish Kolli <fe...@gmail.com> wrote:
> Did you try the standard JsonLoader? I didn't personally use it but it
> looks like you can specify the schema to extract/parse from your json.
>
> http://pig.apache.org/docs/r0.13.0/func.html#jsonloadstore
>
> If not, you can also look at the following example I found googling:
>
> https://gist.github.com/kimsterv/601331
>
>
> Thanks.

Re: Json Parsing in Apache Pig

Posted by Satish Kolli <fe...@gmail.com>.
Did you try the standard JsonLoader? I didn't personally use it but it
looks like you can specify the schema to extract/parse from your json.

http://pig.apache.org/docs/r0.13.0/func.html#jsonloadstore

If not, you can also look at the following example I found googling:

https://gist.github.com/kimsterv/601331


Thanks.




On Fri, Jul 25, 2014 at 8:01 AM, praveenesh kumar <pr...@gmail.com>
wrote:

> One simple way is to write a UDF that acts as a JSON parser. Load your
> data and then call your UDF to parse and extract whatever you want from the
> JSON. You need to build what you want to get. Pig doesn't do that for you;
> it gives you the capability to do that. How you do it is up to you.

Re: Json Parsing in Apache Pig

Posted by praveenesh kumar <pr...@gmail.com>.
One simple way is to write a UDF that acts as a JSON parser. Load your
data and then call your UDF to parse and extract whatever you want from the
JSON. You need to build what you want to get. Pig doesn't do that for you;
it gives you the capability to do that. How you do it is up to you.
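
The parsing logic such a UDF would wrap can be sketched in Python. This is only an illustration under assumptions: extract_fields and to_tsv are hypothetical names, the default field names are taken from the sample JSON in the original question, and how the code is hooked into Pig (for example as a scripting UDF or through a streaming step) is left out because it depends on the Pig version and setup.

```python
import json


def extract_fields(json_line, fields=("source", "destination")):
    """Parse one JSON record and return the requested top-level fields."""
    record = json.loads(json_line)
    # Missing keys become None instead of raising KeyError.
    return tuple(record.get(name) for name in fields)


def to_tsv(lines, fields=("source", "destination")):
    """Yield one tab-separated row per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        values = extract_fields(line, fields)
        yield "\t".join("" if v is None else str(v) for v in values)
```

Emitting tab-separated rows keeps the output readable by PigStorage with its default delimiter.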

