Posted to user@pig.apache.org by Srinivas Surasani <pi...@gmail.com> on 2012/08/28 06:36:52 UTC
updates using Pig
Hi,
I'm trying to do updates of records in Hadoop using Pig (I know this is
not ideal, but I'm trying out a POC).
The data looks like this:
feed1:
--> here the trade key is unique for each order/record
--> this is the history file
trade-key trade-add-date trade-price
k1 05/21/2012 2000
k2 04/21/2012 3000
k3 03/21/2012 4000
k4 05/21/2012 5000
feed2: --> this is the latest/daily feed
trade-key trade-add-date trade-price
k5 06/22/2012 1000
k6 06/22/2012 2000
k1 06/21/2012 3000 ---> we can see here that the trade with key "k1"
appears again; that means the order with trade key "k1" has an update
Now I'm looking for the output below (merging both files, finding the
keys common to both feeds, and keeping the latest record for each key in
the output file):
k1 06/21/2012 3000
k2 04/21/2012 3000
k3 03/21/2012 4000
k4 05/21/2012 5000
k5 06/22/2012 1000
k6 06/22/2012 2000
Any help greatly appreciated!
Regards,
Srinivas
Re: updates using Pig
Posted by pablomar <pa...@gmail.com>.
now I can see it :-)
very beautiful place
On Wed, Aug 29, 2012 at 5:47 AM, Srini <pi...@gmail.com> wrote:
> Thank-you very much Jonathan...
Re: updates using Pig
Posted by Srini <pi...@gmail.com>.
Thank you very much, Jonathan...
On Tue, Aug 28, 2012 at 2:47 AM, Jonathan Coveney <jc...@gmail.com> wrote:
> I would do this with a cogroup. Whether or not you need a UDF depends on
> whether or not a key can appear more than once in a file.
--
Regards,
Srinivas
Srinivas@cloudwick.com
Re: updates using Pig
Posted by Jonathan Coveney <jc...@gmail.com>.
I would do this with a cogroup. Whether or not you need a UDF depends on
whether or not a key can appear more than once in a file.
trade-key trade-add-date trade-price
feed_group = cogroup feed1 by trade-key, feed2 by trade-key;
feed_proj = foreach feed_group generate FLATTEN( IsEmpty(feed2) ? feed1 : feed2 );
and there you go (you may need to tweak the flatten to make it work).
It'd be slightly more complicated if you had multiple key/date pairs.
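The selection above — per key, take the feed2 bag when it is non-empty, otherwise the feed1 bag — can be illustrated outside Pig. A Python sketch of the same logic (illustration only, assuming each feed is a list of tuples keyed by their first field):

```python
def cogroup_pick(feed1, feed2):
    """Mimic the Pig cogroup + (IsEmpty(feed2) ? feed1 : feed2) step:
    group both inputs by key, then emit the daily rows (feed2) for a
    key when any exist, falling back to the history rows (feed1)."""
    by_key1, by_key2 = {}, {}
    for row in feed1:
        by_key1.setdefault(row[0], []).append(row)
    for row in feed2:
        by_key2.setdefault(row[0], []).append(row)
    result = []
    for key in sorted(by_key1.keys() | by_key2.keys()):
        # FLATTEN: emit the chosen bag's tuples directly
        result.extend(by_key2.get(key) or by_key1[key])
    return result
```

Note that this, like the Pig snippet, always prefers the feed2 record regardless of date; if feed2 could ever carry an older record than feed1, you would compare dates instead.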
2012/8/27 Srini <pi...@gmail.com>
> Hello TianYi Zhu,
>
> Thanks !! and will get back..
Re: updates using Pig
Posted by Srini <pi...@gmail.com>.
Hello TianYi Zhu,
Thanks!! I will get back to you.
--> by the way, you can sort these 2 files by trade-key then merge them
using a small script; that's much faster than using pig.
... I'm trying out a POC on updates in Hadoop.
Thanks,
Srinivas
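The sort-then-merge idea quoted above can be sketched as follows (an illustration only, working on in-memory lists already sorted by trade-key; a real run would first sort the two files, e.g. with Unix sort):

```python
import heapq
from datetime import datetime

def merge_sorted_feeds(sorted_history, sorted_daily):
    """Single linear pass over two feeds pre-sorted by trade-key,
    keeping only the latest record for each key."""
    result = []
    for key, date, price in heapq.merge(sorted_history, sorted_daily,
                                        key=lambda r: r[0]):
        added = datetime.strptime(date, "%m/%d/%Y")
        if result and result[-1][0] == key:
            # Duplicate key: keep whichever record is newer
            if added > datetime.strptime(result[-1][1], "%m/%d/%Y"):
                result[-1] = (key, date, price)
        else:
            result.append((key, date, price))
    return result
```

Because both inputs arrive sorted, duplicates of a key are adjacent in the merged stream, so no grouping or shuffling is needed — which is why this beats a full Pig job for data that fits one machine.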
On Tue, Aug 28, 2012 at 12:55 AM, TianYi Zhu <
tianyi.zhu@facilitatedigital.com> wrote:
> Hi Srinivas,
>
> you can write a user defined function for this
--
Regards,
Srinivas
Srinivas@cloudwick.com
Re: updates using Pig
Posted by TianYi Zhu <ti...@facilitatedigital.com>.
Hi Srinivas,
You can write a user-defined function for this:
feed = union feed1, feed2;
feed_grouped = group feed by trade-key;
output = foreach feed_grouped generate
    flatten(your_user_defined_function(feed)) as (trade-key, trade-add-date, trade-price);
your_user_defined_function takes the one or more records with the same
trade-key as input, and it should output only the latest tuple of
(trade-key, trade-add-date, trade-price).
By the way, you can also sort these 2 files by trade-key and then merge
them with a small script; that's much faster than using Pig.
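What such a UDF does for each group can be shown in a few lines of Python (your_user_defined_function is a placeholder name from the reply above; an actual Pig UDF would be a Java class extending EvalFunc — this sketch is only the per-group logic):

```python
from datetime import datetime

def latest_trade(records):
    """Given every (trade-key, trade-add-date, trade-price) tuple that
    shares one trade-key, return only the most recently added one."""
    return max(records, key=lambda r: datetime.strptime(r[1], "%m/%d/%Y"))
```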