Posted to user@pig.apache.org by Mat Kelcey <ma...@gmail.com> on 2009/10/17 12:40:37 UTC

extracting values from a single value relation to use in subsequent operations

hi guys!

i have a list of numbers that i want to rescale to the range 0.0 -> 1.0

eg for (6,4,8) i want to convert to (0.5, 0.0, 1.0)

i can find the min/max...
grunt> numbers = load 'numbers' as (n:int);
grunt> dump numbers;
(6)
(4)
(8)
grunt> all_numbers = group numbers all;
grunt> min_max = foreach all_numbers { generate MIN(numbers.n) as min,
MAX(numbers.n) as max; }
grunt> dump min_max;
(4,8)

...and given the min max i can rescale the list
grunt> rescaled_numbers = foreach numbers { generate
((float)n-4F)/(8F-4F); }
grunt> dump rescaled_numbers;
(0.5F)
(0.0F)
(1.0F)

but how do i inject the values found during min_max into the
rescaled_numbers foreach clause?

perhaps i'm thinking about this totally the wrong way?

help me obi-wan kenobi, you're my only hope

mat

Re: extracting values from a single value relation to use in subsequent operations

Posted by Mat Kelcey <ma...@gmail.com>.
2009/10/18 Tamir Kamara <ta...@gmail.com>:
> Cross can be one solution, but in my experience there are situations where it
> adds a lot of overhead.
> My solution was to split the work into 2 scripts: one finds the min/max and
> writes them to a file, and the other uses them by

i think i'll stick with Tamir's idea of breaking this out into a separate script.
it's all part of a bigger combo of hadoop streaming, pig, ruby and
other bits and pieces.
thanks for the ideas on the cross joins, this helps me with something else!

mat

Re: extracting values from a single value relation to use in subsequent operations

Posted by Tamir Kamara <ta...@gmail.com>.
Hi,

Cross can be one solution, but in my experience there are situations where it
adds a lot of overhead.
My solution was to split the work into 2 scripts: one finds the min/max and
writes them to a file, and the other uses them by reading the file into
parameters with commands like this:
%declare min_value `hadoop fs -cat min-max-file | cut -f 1`
%declare max_value `hadoop fs -cat min-max-file | cut -f 2`
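
For what it's worth, here is a rough sketch of how the two scripts might fit
together. The script names (find_min_max.pig, rescale.pig) are made up for
illustration, and the part-* glob is an assumption: store writes a directory
of part files rather than a single file, so the path given to hadoop fs -cat
may need adjusting for your setup. The default PigStorage output is
tab-delimited, which is what cut -f expects.

```pig
-- find_min_max.pig (hypothetical name): compute min/max and write them out.
numbers = load 'numbers' as (n:int);
all_numbers = group numbers all;
min_max = foreach all_numbers generate MIN(numbers.n) as min, MAX(numbers.n) as max;
store min_max into 'min-max-file';

-- rescale.pig (hypothetical name): a SEPARATE file, run only after the first
-- script has finished, because the %declare commands read its output.
-- The part-* glob is an assumption about how the stored output is laid out.
%declare min_value `hadoop fs -cat min-max-file/part-* | cut -f 1`
%declare max_value `hadoop fs -cat min-max-file/part-* | cut -f 2`
numbers = load 'numbers' as (n:int);
-- $min_value and $max_value are substituted as text before the script runs.
rescaled_numbers = foreach numbers generate
    ((float)n - $min_value) / ($max_value - $min_value);
dump rescaled_numbers;
```

Since the parameters are substituted as plain text, the second script ends up
with the same shape as the hard-coded version in the original question.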

Tamir


On Sun, Oct 18, 2009 at 1:58 AM, Ashutosh Chauhan <
ashutosh.chauhan@gmail.com> wrote:

> I think fragment-replicate join may not work for this case as it can only do
> inner join. Cross should work, though 1 reducer will be a limiting factor.
>
> Ashutosh

Re: extracting values from a single value relation to use in subsequent operations

Posted by Ashutosh Chauhan <as...@gmail.com>.
I think fragment-replicate join may not work for this case as it can only do
inner join. Cross should work, though 1 reducer will be a limiting factor.

Ashutosh
On Sat, Oct 17, 2009 at 18:44, Thejas Nair <te...@yahoo-inc.com> wrote:

>
> I think the current cross product implementation uses a single reducer. So a
> more efficient thing to do might be a fragment replicate join. (I did not
> try this.)
>
> Ie, you could replace the above cross with -
> grunt> numbers_min_max = join numbers by 'dummy', min_max by 'dummy' using 'replicated';
>
> -Thejas
>
>

Re: extracting values from a single value relation to use in subsequent operations

Posted by Thejas Nair <te...@yahoo-inc.com>.
You can take the cross product of numbers and min_max, and then calculate:

grunt> numbers_min_max = cross numbers, min_max;
grunt> dump numbers_min_max;
(6,4,8)
(4,4,8)
(8,4,8)
grunt> rescaled_numbers = foreach numbers_min_max { generate
((float)n-min)/(max-min); }
grunt> dump rescaled_numbers;

(0.5F)
(0.0F)
(1.0F)


I think the current cross product implementation uses a single reducer, so a
more efficient thing to do might be a fragment replicate join. (I did not
try this.)

i.e., you could replace the cross above with:
grunt> numbers_min_max = join numbers by 'dummy', min_max by 'dummy' using 'replicated';
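
A minimal sketch of that replicated-join variant, equally untested. Rather
than joining on the literal 'dummy' directly, it assumes a constant key column
is added to both relations first. The names k, numbers_k, min_max_k and joined
are hypothetical and exist only for this sketch.

```pig
-- Same setup as in the original question.
numbers = load 'numbers' as (n:int);
all_numbers = group numbers all;
min_max = foreach all_numbers generate MIN(numbers.n) as min, MAX(numbers.n) as max;

-- Add the same constant key to every row of both relations (k is a made-up name).
numbers_k = foreach numbers generate n, 1 as k;
min_max_k = foreach min_max generate min, max, 1 as k;

-- The small relation goes last: a replicated join loads it into memory
-- on the map side instead of shuffling it to a reducer.
joined = join numbers_k by k, min_max_k by k using 'replicated';

rescaled_numbers = foreach joined generate
    ((float)numbers_k::n - min_max_k::min) / (min_max_k::max - min_max_k::min);
dump rescaled_numbers;
```

Because every row carries the same key, the join should pair each number with
the single min_max row, which is the same pairing the cross produces.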

-Thejas

