Posted to dev@pig.apache.org by Prashant Kommireddi <pr...@gmail.com> on 2012/06/07 00:52:05 UTC

Persisting Pig Scripts

Hi All,

What do you guys think about adding a feature to persist the
script (file, or cache in the case of grunt) on HDFS or locally, based on an
admin setting (pig.properties)? This would help infrastructure/ops teams
analyze the nature of Pig scripts and make certain decisions based
on it (optimizing data storage based on access patterns, etc.). This is
actually something we want to do, but the challenge is that there is no central
place where we can track user scripts.

It could be a config param "pig.persist.script=/pig/". The script could be
stored with a configurable name -> "${mapred.job.name}+${user.name}+timestamp"
either on HDFS or locally, based on the configuration setting.

Thanks,
Prashant
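
A minimal sketch of the naming scheme proposed above, assuming the persisted copy goes to a local directory. The function name `persist_script`, its parameters, and the use of a Unix timestamp are illustrative assumptions, not existing Pig behavior; writing to HDFS would need a Hadoop client rather than plain file I/O.

```python
import os
import time


def persist_script(script_text, persist_dir, job_name, user_name, timestamp=None):
    """Write script_text to persist_dir as <job_name>+<user_name>+<timestamp>.

    Hypothetical helper: persist_dir stands in for the proposed
    "pig.persist.script" setting; nothing here is an existing Pig API.
    """
    if timestamp is None:
        timestamp = int(time.time())
    # Job names may contain path separators; replace them so the
    # result stays a single file name inside persist_dir.
    safe_job = job_name.replace(os.sep, "_")
    file_name = "{}+{}+{}".format(safe_job, user_name, timestamp)
    os.makedirs(persist_dir, exist_ok=True)
    path = os.path.join(persist_dir, file_name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(script_text)
    return path
```

Passing the timestamp explicitly keeps the file name predictable in tests; leaving it out falls back to the current time.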

Re: Persisting Pig Scripts

Posted by Jonathan Coveney <jc...@gmail.com>.
We could also just serialize the script across more than one config value and
paste the pieces back together.
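
A rough sketch of that idea, assuming the script is split into fixed-size pieces stored under numbered keys. The key prefix `pig.script.part` and both function names are made up for illustration; the thread does not say how the pieces would be named or how large each could be.

```python
def split_script(script_text, chunk_size, key_prefix="pig.script.part"):
    """Split script_text into numbered entries of at most chunk_size characters.

    key_prefix is a hypothetical naming scheme, not an existing Pig property.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    entries = {}
    for index, start in enumerate(range(0, len(script_text), chunk_size)):
        key = "{}.{}".format(key_prefix, index)
        entries[key] = script_text[start:start + chunk_size]
    return entries


def join_script(entries, key_prefix="pig.script.part"):
    """Reassemble the script by reading the numbered entries in order."""
    parts = []
    index = 0
    # Stop at the first missing index, so gaps are never silently skipped.
    while "{}.{}".format(key_prefix, index) in entries:
        parts.append(entries["{}.{}".format(key_prefix, index)])
        index += 1
    return "".join(parts)
```

Each stored piece stays under the per-value cap, and the reader only needs the prefix to put the script back together.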

2012/6/11 Bill Graham <bi...@gmail.com>

> That's expected. It's a cap on the size of how much of the script can be
> stored. I'm not sure what the exact size limit is though, but if it's
> causing issues I'm sure we could make it a configurable value.

Re: Persisting Pig Scripts

Posted by Bill Graham <bi...@gmail.com>.
That's expected. It's a cap on how much of the script can be
stored. I'm not sure what the exact size limit is, but if it's
causing issues I'm sure we could make it a configurable value.


On Mon, Jun 11, 2012 at 2:33 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Bill,
>
> Would you know if that is expected or a bug?


-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Persisting Pig Scripts

Posted by Prashant Kommireddi <pr...@gmail.com>.
Bill,

Would you know if that is expected or a bug?




On Wed, Jun 6, 2012 at 5:56 PM, Bill Graham <bi...@gmail.com> wrote:

> One thing to be aware of when accessing the pig.script option is that AFAIK
> there's a limit to how large the script can be, after which the rest would
> be truncated.

Re: Persisting Pig Scripts

Posted by Bill Graham <bi...@gmail.com>.
One thing to be aware of when accessing the pig.script option is that AFAIK
there's a limit to how large the script can be, after which the rest would
be truncated.


On Wed, Jun 6, 2012 at 5:44 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> I completely agree that's an option. But IMHO being able to do that upfront
> would be a nice feature, adding cron is just an additional process we could
> avoid if possible.




Re: Persisting Pig Scripts

Posted by Prashant Kommireddi <pr...@gmail.com>.
I completely agree that's an option. But IMHO being able to do that upfront
would be a nice feature; adding a cron job is just an additional process we
could avoid if possible.

On Wed, Jun 6, 2012 at 5:39 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> You can write a nightly cron that runs the JobHistoryLoader job and
> stores parsed scripts to hdfs...
>
> D

Re: Persisting Pig Scripts

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
You can write a nightly cron that runs the JobHistoryLoader job and
stores parsed scripts to hdfs...

D

On Wed, Jun 6, 2012 at 5:16 PM, Prashant Kommireddi <pr...@gmail.com> wrote:
> I think that would be more of a post-process vs having Pig write the same
> to a HDFS location. That would avoid having to parse it from job.xml.

Re: Persisting Pig Scripts

Posted by Prashant Kommireddi <pr...@gmail.com>.
I think that would be more of a post-process, versus having Pig write the
script to an HDFS location directly. Having Pig write it would avoid having
to parse it from job.xml.

On Wed, Jun 6, 2012 at 4:19 PM, Daniel Dai <da...@hortonworks.com> wrote:

> One existing solution is "pig.script" entry inside job.xml, it is the
> serialized Pig script. JobHistoryLoader can load job.xml files and grab
> those entries. Does that solve your problem?
>
> Daniel

Re: Persisting Pig Scripts

Posted by Daniel Dai <da...@hortonworks.com>.
One existing solution is the "pig.script" entry inside job.xml; it is the
serialized Pig script. JobHistoryLoader can load job.xml files and grab
those entries. Does that solve your problem?

Daniel
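
As a rough illustration of pulling that entry out by hand, the sketch below reads one property from job.xml content, assuming the usual Hadoop configuration layout of `<property>` elements with `<name>` and `<value>` children. The function name `read_job_property` is made up for this example, and this is not how JobHistoryLoader itself works. The thread does not say how the script is serialized, so the value is returned exactly as stored and may need decoding.

```python
import xml.etree.ElementTree as ET


def read_job_property(job_xml_text, name="pig.script"):
    """Return the stored value of the named property, or None if it is absent.

    Hypothetical helper: assumes <property><name>..</name><value>..</value>
    entries and performs no decoding of the value.
    """
    root = ET.fromstring(job_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None
```

Returning None for a missing entry lets a caller tell "not a Pig job" apart from an empty value.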

On Wed, Jun 6, 2012 at 3:52 PM, Prashant Kommireddi <pr...@gmail.com> wrote:

> Hi All,
>
> What do you guys think about adding a feature to be able to persist the
> script (file or cache in case of grunt) on HDFS or locally based on an
> admin setting (pig.properties). This will help infrastructure/ops teams
> analyze nature of Pig scripts and be able to make certain decisions based
> on it (optimizing data storage based on access patterns etc). This is
> actually something we want to do but the challenge is there is no central
> place where we can track user scripts.
>
> It could be a config param "pig.persist.script=/pig/". The script could be
> stored with a configurable name -> "${mapred.job.name}+${user.name}+timestamp"
> either on HDFS or local based on the configuration setting.
>
> Thanks,
> Prashant
>