Posted to user@pig.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/09/05 05:37:31 UTC

Json and split into multiple files

I have JSON that looks something like this:

{
  "user": {
    "id": 1,
    "name": "user1"
  },
  "product": {
    "id": 1,
    "name": "product1"
  }
}

I want to be able to read this file and create 2 files as follows:

user file:
key,1,user1

product file:
key,1,product1

I know I need to call exec, but the method will return a bag for each of
these dimensions. Since the result is unordered, how do I split it further
so that each dimension is written to a separate file?

Re: Json and split into multiple files

Posted by Mohit Anchlia <mo...@gmail.com>.
On Wed, Sep 12, 2012 at 7:51 PM, Alan Gates <ga...@hortonworks.com> wrote:

> I don't understand your use case or why you need to use exec or
> outputSchema.  Would it be possible to send a more complete example that
> makes clear why you need these?
>
My JSON has many fields and several parent elements. I already have a POJO
that I can parse it into and read fields from, instead of hand-typing all of
them. I also have a mapper and formatter that maps the JSON to database fields,
each of which sits at a fixed position in the file. Hand-typing all of it in
Pig would be really painful. With exec I can easily parse my JSON and then use
the mappers to write to tuples. It's faster to develop and easy to unit test.



Re: Json and split into multiple files

Posted by Alan Gates <ga...@hortonworks.com>.
I don't understand your use case or why you need to use exec or outputSchema.  Would it be possible to send a more complete example that makes clear why you need these?

Alan.

A tuple can contain a tuple, so it's certainly possible with outputSchema() to generate a schema that declares both your tuples.  But I don't think this answers your questions.
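
For illustration, here is a rough sketch of how a UDF whose output schema declares both tuples could be consumed from Pig Latin. The jar name and the class com.example.ParseJson are hypothetical placeholders (nothing in this thread names them), and the sketch assumes each input line holds one complete JSON record. The AS clause only restates the nested schema so the field names are visible in the script.

```pig
-- Hypothetical jar and UDF class; substitute your own.
REGISTER myudfs.jar;
DEFINE ParseJson com.example.ParseJson();

-- Assumption: one JSON record per line, read as a single chararray.
raw = LOAD 'inputfile' USING TextLoader() AS (line: chararray);

-- The UDF is assumed to return one tuple holding a user tuple and a product tuple.
parsed = FOREACH raw GENERATE FLATTEN(ParseJson(line))
         AS (user: tuple(id: int, name: chararray),
             product: tuple(id: int, name: chararray));

-- Each nested tuple is addressed by name, so field order does not matter.
users = FOREACH parsed GENERATE user.id, user.name;
products = FOREACH parsed GENERATE product.id, product.name;

STORE users INTO 'userfile';
STORE products INTO 'productfile';
```

Because both tuples are named in the schema, each one can be projected into its own relation and stored separately.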


Re: Json and split into multiple files

Posted by Mohit Anchlia <mo...@gmail.com>.
It looks like I can use the outputSchema(Schema input) call to do this, but the
examples I see are only for one tuple. In my case, if I am reading it right,
I need a tuple for each dimension and hence a schema for each. For instance,
there will be one user tuple and one product tuple, so I need a schema for
each.

How can I do this using outputSchema so that the result is like the one below,
where I can access each tuple and each field by name? Thanks for your help.

A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name:
chararray), product: tuple(id: int, name:chararray));


Re: Json and split into multiple files

Posted by Mohit Anchlia <mo...@gmail.com>.
My real-life JSON is much more complicated, so I will have to use the exec
method. But I was wondering: how do I reference the bag related to user, and
all its fields, when it gets returned from the exec call?
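
As a rough sketch of one way to do this, assume the UDF returns a single tuple containing one bag per dimension. Each bag can then be referenced by name and flattened into rows. The jar name, the class com.example.ParseJson, and the field names users and products are hypothetical, and the sketch assumes one JSON record per input line.

```pig
-- Hypothetical jar and UDF class; substitute your own.
REGISTER myudfs.jar;
DEFINE ParseJson com.example.ParseJson();

-- Assumption: one JSON record per line, read as a single chararray.
raw = LOAD 'inputfile' USING TextLoader() AS (line: chararray);

-- The UDF is assumed to return a tuple of two bags, one per dimension.
parsed = FOREACH raw GENERATE FLATTEN(ParseJson(line))
         AS (users: bag{u: tuple(id: int, name: chararray)},
             products: bag{p: tuple(id: int, name: chararray)});

-- Flattening a bag emits one output row per tuple in that bag.
user_rows = FOREACH parsed GENERATE FLATTEN(users);
product_rows = FOREACH parsed GENERATE FLATTEN(products);

STORE user_rows INTO 'userfile' USING PigStorage(',');
STORE product_rows INTO 'productfile' USING PigStorage(',');
```

The bags are picked out by field name rather than by position, so it does not matter that the tuples inside each bag are unordered.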


Re: Json and split into multiple files

Posted by Alan Gates <ga...@hortonworks.com>.
Loading the JSON below should give you a Pig record like:
(user: tuple(id: int, name: chararray), product: tuple(id: int, name:chararray))

In that case your Pig Latin would look like:

A = load 'inputfile' using JsonLoader() as (user: tuple(id: int, name: chararray), product: tuple(id: int, name:chararray));
B = foreach A generate user.id, user.name;
store B into 'userfile';
C = foreach A generate product.id, product.name;
store C into 'productfile';

I'm not sure what key is, so I'm not sure whether the above is what you have in mind.

Alan.
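
If key is simply a fixed leading column, the script above could be extended roughly as follows to produce comma-separated rows such as key,1,user1. This is only a sketch: the literal 'key' and the alias rec_key are placeholders, since the thread never says what the key actually is, and the load statement is taken unchanged from the script above.

```pig
A = LOAD 'inputfile' USING JsonLoader() AS (user: tuple(id: int, name: chararray), product: tuple(id: int, name: chararray));

-- 'key' is a placeholder constant; replace it with the real key expression.
B = FOREACH A GENERATE 'key' AS rec_key, user.id, user.name;
-- PigStorage(',') writes fields separated by commas instead of the default tab.
STORE B INTO 'userfile' USING PigStorage(',');

C = FOREACH A GENERATE 'key' AS rec_key, product.id, product.name;
STORE C INTO 'productfile' USING PigStorage(',');
```

Each STORE writes to its own output directory, which is how one input ends up in two separate outputs.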


Re: Json and split into multiple files

Posted by Mohit Anchlia <mo...@gmail.com>.
Any pointers would be appreciated
