Posted to user@pig.apache.org by Srinivas Surasani <pi...@gmail.com> on 2012/08/28 06:36:52 UTC
updates using Pig
Hi,
I'm trying to do updates of records in Hadoop using Pig (I know this is
not ideal, but I'm trying out a POC).
The data looks like this:
feed1:
--> here the trade key is unique for each order/record
--> this is the history file
trade-key trade-add-date trade-price
k1 05/21/2012 2000
k2 04/21/2012 3000
k3 03/21/2012 4000
k4 05/21/2012 5000
feed2: --> this is the latest/daily feed
trade-key trade-add-date trade-price
k5 06/22/2012 1000
k6 06/22/2012 2000
k1 06/21/2012 3000 ---> we can see here that the trade with key "k1"
appears again; that means the order with trade key "k1" has an update
Now I'm looking for the output below (merging both files, finding the
keys common to both feeds, and keeping the latest record for each key in
the output file):
k1 06/21/2012 3000
k2 04/21/2012 3000
k3 03/21/2012 4000
k4 05/21/2012 5000
k5 06/22/2012 1000
k6 06/22/2012 2000
Any help greatly appreciated!
Regards,
Srinivas
Re: updates using Pig
Posted by pablomar <pa...@gmail.com>.
now I can see it :-)
very beautiful place
On Wed, Aug 29, 2012 at 5:47 AM, Srini <pi...@gmail.com> wrote:
> Thank-you very much Jonathan...
Re: updates using Pig
Posted by Srini <pi...@gmail.com>.
Thank you very much, Jonathan...
On Tue, Aug 28, 2012 at 2:47 AM, Jonathan Coveney <jc...@gmail.com> wrote:
> I would do this with a cogroup. Whether or not you need a UDF depends on
> whether or not a key can appear more than once in a file.
--
Regards,
Srinivas
Srinivas@cloudwick.com
Re: updates using Pig
Posted by Jonathan Coveney <jc...@gmail.com>.
I would do this with a cogroup. Whether or not you need a UDF depends on
whether or not a key can appear more than once in a file.
trade-key trade-add-date trade-price
feed_group = cogroup feed1 by trade-key, feed2 by trade-key;
feed_proj = foreach feed_group generate FLATTEN( IsEmpty(feed2) ? feed1 : feed2 );
and there you go (you may need to tweak the flatten to make it work).
It'd be slightly more complicated if you had multiple key/date pairs.
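The selection above — per key, take the feed2 bag when it is non-empty, otherwise the feed1 bag — can be illustrated outside Pig. A Python sketch of the same logic (illustration only, assuming each feed is a list of tuples keyed by their first field):

```python
def cogroup_pick(feed1, feed2):
    """Mimic the Pig cogroup + (IsEmpty(feed2) ? feed1 : feed2) step:
    group both inputs by key, then emit the daily rows (feed2) for a
    key when any exist, falling back to the history rows (feed1)."""
    by_key1, by_key2 = {}, {}
    for row in feed1:
        by_key1.setdefault(row[0], []).append(row)
    for row in feed2:
        by_key2.setdefault(row[0], []).append(row)
    result = []
    for key in sorted(by_key1.keys() | by_key2.keys()):
        # FLATTEN: emit the chosen bag's tuples directly
        result.extend(by_key2.get(key) or by_key1[key])
    return result
```

Note that this, like the Pig snippet, always prefers the feed2 record regardless of date; if feed2 could ever carry an older record than feed1, you would compare dates instead.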
2012/8/27 Srini <pi...@gmail.com>
> Hello TianYi Zhu,
>
> Thanks !! and will get back..
Re: updates using Pig
Posted by Srini <pi...@gmail.com>.
Hello TianYi Zhu,
Thanks!! I will get back to you.
--> by the way, you can sort these 2 files by trade-key then merge them
using a small script; that's much faster than using pig.
... I'm trying out a POC on updates in Hadoop.
Thanks,
Srinivas
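The sort-then-merge idea quoted above can be sketched as follows (an illustration only, working on in-memory lists already sorted by trade-key; a real run would first sort the two files, e.g. with Unix sort):

```python
import heapq
from datetime import datetime

def merge_sorted_feeds(sorted_history, sorted_daily):
    """Single linear pass over two feeds pre-sorted by trade-key,
    keeping only the latest record for each key."""
    result = []
    for key, date, price in heapq.merge(sorted_history, sorted_daily,
                                        key=lambda r: r[0]):
        added = datetime.strptime(date, "%m/%d/%Y")
        if result and result[-1][0] == key:
            # Duplicate key: keep whichever record is newer
            if added > datetime.strptime(result[-1][1], "%m/%d/%Y"):
                result[-1] = (key, date, price)
        else:
            result.append((key, date, price))
    return result
```

Because both inputs arrive sorted, duplicates of a key are adjacent in the merged stream, so no grouping or shuffling is needed — which is why this beats a full Pig job for data that fits one machine.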
On Tue, Aug 28, 2012 at 12:55 AM, TianYi Zhu <
tianyi.zhu@facilitatedigital.com> wrote:
> Hi Srinivas,
>
> you can write a user defined function for this
--
Regards,
Srinivas
Srinivas@cloudwick.com
Re: updates using Pig
Posted by TianYi Zhu <ti...@facilitatedigital.com>.
Hi Srinivas,
You can write a user-defined function for this:
feed = union feed1, feed2;
feed_grouped = group feed by trade-key;
output = foreach feed_grouped generate
    flatten(your_user_defined_function(feed)) as (trade-key, trade-add-date, trade-price);
your_user_defined_function takes the one or more records with the same
trade-key as input, and it should output only the latest tuple of
(trade-key, trade-add-date, trade-price).
By the way, you can also sort these 2 files by trade-key and then merge
them with a small script; that's much faster than using Pig.
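What such a UDF does for each group can be shown in a few lines of Python (your_user_defined_function is a placeholder name from the reply above; an actual Pig UDF would be a Java class extending EvalFunc — this sketch is only the per-group logic):

```python
from datetime import datetime

def latest_trade(records):
    """Given every (trade-key, trade-add-date, trade-price) tuple that
    shares one trade-key, return only the most recently added one."""
    return max(records, key=lambda r: datetime.strptime(r[1], "%m/%d/%Y"))
```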