Posted to user@pig.apache.org by Casper Rasmussen <ca...@gmail.com> on 2008/02/15 15:55:00 UTC

Storage split question, load asterisks, userdefined job names

Hi

First of all, I'm using an old version of Pig, the one that ran on Hadoop
12.1, and yes, I will upgrade soon...

Below are some requests/questions, based on my use of Pig so far:

1: If you have 1 billion files (purposely exaggerating) where approx 50% of
the files are related to one segment and 50% to another segment,
then I guess the Pig script for isolating the segments would be something
like the following:

files = LOAD 'path/to/1_billion_files' AS (segment);
segmentA = FILTER files BY (segment == 'a');
segmentB = FILTER files BY (segment == 'b');

STORE segmentA INTO 'segmentA.dat';
STORE segmentB INTO 'segmentB.dat';

So the question is, are all 1 billion files filtered and read twice? If so
(I guess they are), would it be possible to do
something like this (just to avoid the overhead of 1 billion reads):

STORE SPLIT segmentA INTO 'segmentA.dat', segmentB INTO 'segmentB.dat';

2: Would it be possible to allow the use of asterisks in the LOAD statement
of Pig?

files = LOAD 'batches/*/batch/*/segments';

3: Allow user-defined Hadoop job names when executing a script; I have a
feeling that this one is in the newest version, true?

Appreciate any comments anyone might have, thanks :-)

Br Casper

Re: Storage split question, load asterisks, userdefined job names

Posted by Casper Rasmussen <ca...@gmail.com>.
Cool, even without the 'store split', it's nice working with Pig, and my
current work is built on the fact that the storage point is the root of the
operations, so for now nothing is wasted :-)

Thanks...

On Fri, Feb 15, 2008 at 5:32 PM, Olga Natkovich <ol...@yahoo-inc.com> wrote:

> Actually, we do allow users to set the job name:
>
> set job.name 'foo'
>
> http://wiki.apache.org/pig/Grunt
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [mailto:gates@yahoo-inc.com]
> > Sent: Friday, February 15, 2008 8:05 AM
> > To: pig-user@incubator.apache.org
> > Subject: Re: Storage split question, load asterisks,
> > userdefined job names
> >
> > [quoted text trimmed]
>

RE: Storage split question, load asterisks, userdefined job names

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Actually, we do allow users to set the job name:

set job.name 'foo'

http://wiki.apache.org/pig/Grunt
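
As a usage sketch, the statement goes at the top of a script, or is typed
in Grunt before running the queries (the job name and LOAD path below are
made-up examples):

set job.name 'segment-split'
files = LOAD 'path/to/1_billion_files' AS (segment);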

Olga 

> -----Original Message-----
> From: Alan Gates [mailto:gates@yahoo-inc.com] 
> Sent: Friday, February 15, 2008 8:05 AM
> To: pig-user@incubator.apache.org
> Subject: Re: Storage split question, load asterisks, 
> userdefined job names
> 
> Casper Rasmussen wrote:
> [quoted text trimmed]
> > 3: Allow user-defined Hadoop job names when executing a
> > script; I have a feeling that this one is in the newest version, true?
> >   
> We don't yet allow users to define their job names, but we 
> certainly have had requests to do so.
> > Appreciate any comments anyone might have, thanks :-)
> >
> > Br Casper
> >
> >   
> Alan.
> 

Re: Storage split question, load asterisks, userdefined job names

Posted by Benjamin Francisoud <be...@joost.com>.
Alan Gates wrote:
>
>> 3: Allow user-defined Hadoop job names when executing a script; I have a
>> feeling that this one is in the newest version, true?
>>   
> We don't yet allow users to define their job names, but we certainly 
> have had requests to do so.

+1 (maybe in PigContext?)

Re: Storage split question, load asterisks, userdefined job names

Posted by Alan Gates <ga...@yahoo-inc.com>.

Casper Rasmussen wrote:
> Hi
>
> First of all, I'm using an old version of Pig, the one that ran on Hadoop
> 12.1, and yes, I will upgrade soon...
>
> Below are some requests/questions, based on my use of Pig so far:
>
> 1: If you have 1 billion files (purposely exaggerating) where approx 50% of
> the files are related to one segment and 50% to another segment,
> then I guess the Pig script for isolating the segments would be something
> like the following:
>
> files = LOAD 'path/to/1_billion_files' AS (segment);
> segmentA = FILTER files BY (segment == 'a');
> segmentB = FILTER files BY (segment == 'b');
>
> STORE segmentA INTO 'segmentA.dat';
> STORE segmentB INTO 'segmentB.dat';
>
> So the question is, are all 1 billion files filtered and read twice? If so
> (I guess they are), would it be possible to do
> something like this (just to avoid the overhead of 1 billion reads):
>
> STORE SPLIT segmentA INTO 'segmentA.dat', segmentB INTO 'segmentB.dat';
>
Yes, currently all 1B files are read and filtered twice.  No, your split 
suggestion won't work yet.  Right now Pig views all jobs as a tree of 
operations, with a given store (or dump) command as the root.  To do what 
you want, we need to view the commands as a graph with multiple heads, 
which it can evaluate simultaneously.  We're working in that direction, 
but it will be a while before we're there.
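
For readers on later Pig releases: multi-query execution (added well after
this thread) plans multiple STOREs that share a LOAD into a single job, and
the SPLIT operator expresses both filters in one statement. A sketch,
assuming a release with those features:

files = LOAD 'path/to/1_billion_files' AS (segment);
SPLIT files INTO segmentA IF segment == 'a', segmentB IF segment == 'b';
STORE segmentA INTO 'segmentA.dat';
STORE segmentB INTO 'segmentB.dat';

Under multi-query execution both stores compile into one job, so the input
is read once.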
> 2: Would it be possible to allow the use of asterisks in the LOAD statement
> of Pig?
>
> files = LOAD 'batches/*/batch/*/segments';
>
The latest versions of Pig use Hadoop's glob pattern matching on file 
paths, so the above command would work.
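
Hadoop's glob syntax goes beyond '*': it also accepts '?', character ranges
such as [0-5], and, in later releases, alternation such as {a,b}. For
example (the second path is a made-up illustration):

files = LOAD 'batches/*/batch/*/segments';
febBatches = LOAD 'batches/2008-02-1[0-5]/batch/*/segments';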
> 3: Allow user-defined Hadoop job names when executing a script; I have a
> feeling that this one is in the newest version, true?
>
We don't yet allow users to define their job names, but we certainly 
have had requests to do so.
> Appreciate any comments anyone might have, thanks :-)
>
> Br Casper
>
>   
Alan.