Posted to dev@pig.apache.org by Alexandru Toth <al...@gmail.com> on 2007/11/20 10:47:45 UTC

possible use of Pig for OLAP

Hi,

I am developing an Open Source OLAP application called "Cubulus". The
code is at http://sourceforge.net/projects/cubulus/ , a brief
presentation material at http://cubulus.sourceforge.net/ , and an
online demo at: http://alxtoth.webfactional.com

It would be interesting to use Pig instead of relational databases as a backend.

The question is: can Pig scripts work in such a manner that the file is
loaded only once, and subsequent web requests process the same file
over and over? This becomes relevant if the data file is large and
there is one data file (or a few) to process. In fact, is repeated
loading a problem at all :-) ?

-Alex
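For context, the per-request pattern described above might look like the following Pig Latin sketch (file name and schema are invented for illustration); every web request that runs such a script re-reads the whole input file:

```pig
-- hypothetical per-request aggregation; 'facts.txt' is re-read on every run
facts  = LOAD 'facts.txt' AS (region, product, amount);
by_dim = GROUP facts BY region;
totals = FOREACH by_dim GENERATE group, SUM(facts.amount);
DUMP totals;
```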

Re: possible use of Pig for OLAP

Posted by Ted Dunning <td...@veoh.com>.
"Store later", "store delayed", or "checkpoint" all sound good as ways of
expressing this.

I agree that the normal user shouldn't bear the cost of a feature like this.


On 11/20/07 11:58 AM, "Chris Olston" <ol...@yahoo-inc.com> wrote:

> Yes, that's an option.
> 
> For the final "commit" you'd have to associate an explicit scope --
> which STORE statement(s) do I want the system to materialize for me?
> Or is it implicitly all as-yet-unmaterialized STORE commands in the
> current session?
> 
> If this change gets made, it'd be good to ensure that the "old way"
> still works -- most users won't need this functionality and we don't
> want to complicate their lives by making them type STORE followed by
> COMMIT each time.  Maybe we add a new command "STORE LATER" or
> something, for the case where you want to register a STORE but have
> it happen later as part of a batch of stores:
> 
> A = LOAD ...
> B = LOAD ...
> C = FILTER A BY ...
> STORE LATER C INTO ...
> D = JOIN A, B ...
> STORE LATER D INTO ...
> EXECUTE STORE C, D;
> 
> or something along these lines.
> 
> -Chris
> 


Re: possible use of Pig for OLAP

Posted by Chris Olston <ol...@yahoo-inc.com>.
Yes, that's an option.

For the final "commit" you'd have to associate an explicit scope --  
which STORE statement(s) do I want the system to materialize for me?  
Or is it implicitly all as-yet-unmaterialized STORE commands in the  
current session?

If this change gets made, it'd be good to ensure that the "old way"  
still works -- most users won't need this functionality and we don't  
want to complicate their lives by making them type STORE followed by  
COMMIT each time.  Maybe we add a new command "STORE LATER" or  
something, for the case where you want to register a STORE but have  
it happen later as part of a batch of stores:

A = LOAD ...
B = LOAD ...
C = FILTER A BY ...
STORE LATER C INTO ...
D = JOIN A, B ...
STORE LATER D INTO ...
EXECUTE STORE C, D;

or something along these lines.

-Chris


On Nov 20, 2007, at 11:41 AM, Ted Dunning wrote:

>
>
> It sounds like it would be better to accept multiple STORE commands  
> in a
> single program and only trigger execution of the map-reduce steps  
> when the
> equivalent of a "commit" or "run" is given (EOF being an implied  
> commit).

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



Re: possible use of Pig for OLAP

Posted by Ted Dunning <td...@veoh.com>.

It sounds like it would be better to accept multiple STORE commands in a
single program and only trigger execution of the map-reduce steps when the
equivalent of a "commit" or "run" is given (EOF being an implied commit).



On 11/20/07 11:27 AM, "Utkarsh Srivastava" <ut...@yahoo-inc.com> wrote:

> The current implementation of SPLIT will be no more efficient than
> explicitly calling STORE.
> 
> Utkarsh
> 


Re: possible use of Pig for OLAP

Posted by Utkarsh Srivastava <ut...@yahoo-inc.com>.
The current implementation of SPLIT will be no more efficient than
explicitly calling STORE.

Utkarsh


On Nov 20, 2007, at 10:56 AM, Chris Olston wrote:

> Exactly. You can write "STORE X" for each handle X that you want a  
> result for.
>
> The only issue is that it will create a separate execution job for  
> each STORE command.
>
> If you don't want to pay for doing it in multiple jobs, you could  
> imagine adding a "side store" function to Pig, so that it can store  
> side files but keep processing the "main" program.
>
> It's possible that this can be accomplished today via the SPLIT  
> command -- anyone care to comment?
>
> -Chris
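For reference, the SPLIT formulation being compared might look like this sketch (field names and output paths are invented); as noted above, it currently compiles to no less work than issuing the two STOREs directly:

```pig
-- hypothetical: route one logical scan of the input into two outputs
raw = LOAD 'logs.txt' AS (ip, query, status);
SPLIT raw INTO ok IF status == 200, err IF status != 200;
STORE ok INTO 'ok_out';
STORE err INTO 'err_out';
```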


Re: possible use of Pig for OLAP

Posted by Chris Olston <ol...@yahoo-inc.com>.
Exactly. You can write "STORE X" for each handle X that you want a  
result for.

The only issue is that it will create a separate execution job for  
each STORE command.

If you don't want to pay for doing it in multiple jobs, you could  
imagine adding a "side store" function to Pig, so that it can store  
side files but keep processing the "main" program.

It's possible that this can be accomplished today via the SPLIT  
command -- anyone care to comment?

-Chris

On Nov 20, 2007, at 10:40 AM, Ted Dunning wrote:

>
> Can you just explicitly save those intermediate results?

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research



Re: possible use of Pig for OLAP

Posted by Ted Dunning <td...@veoh.com>.
Can you just explicitly save those intermediate results?


On 11/20/07 10:31 AM, "Andrzej Bialecki" <ab...@getopt.org> wrote:

> Would it be possible to modify Pig (and underlying local/mapreduce impl)
> so that if a specific syntax is used then an intermediate result is also
> stored into a temporary file? This way, on the first dump/store Pig
> would produce all intermediate results, then keep some of them, and
> re-use them for subsequent operators?
> 
> Example - let's say that ':=' means that the result should be kept
> around until exit (or until any of previous intermediate results changes):
> 
> -- A is not persisted
> A = load 'sample.txt' as (date, time, ip, query);
> -- B is to be persisted in a temp file
> B := group A by ip;
> -- compile & execute - creates B in a temp file
> dump B;
> C = foreach B generate group, query;
> -- this uses already existing B data from a temp file
> dump C;
> 


Re: possible use of Pig for OLAP

Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Olston wrote:
> Sounds interesting. Pig is geared toward large-scale aggregation 
> operations, in the style of OLAP.
> 
> Regarding your 3rd paragraph question, do you mean:
> 
> a) there are several interrelated aggregation expressions that you want 
> evaluated in just one pass over the data, or
> b) you do some initial aggregation, display it to the user, who can do 
> "drill-down" operations in the GUI which require you to look up more 
> data in the backend
> 
> ?
> 
> For (a), yes Pig can do that, although currently you have to encode it 
> explicitly as a single Pig program (in future versions, we might be able 
> to take multiple related Pig programs and execute them in a joint 
> fashion). For (b), we don't currently have a mechanism to do that 
> without reloading the data, although perhaps the operating system's file 
> cache would help with that, under the covers, if the file partitions fit 
> in memory and don't get evicted.

Would it be possible to modify Pig (and underlying local/mapreduce impl) 
so that if a specific syntax is used then an intermediate result is also 
stored into a temporary file? This way, on the first dump/store Pig 
would produce all intermediate results, then keep some of them, and 
re-use them for subsequent operators?

Example - let's say that ':=' means that the result should be kept 
around until exit (or until any of previous intermediate results changes):

-- A is not persisted
A = load 'sample.txt' as (date, time, ip, query);
-- B is to be persisted in a temp file
B := group A by ip;
-- compile & execute - creates B in a temp file
dump B;
C = foreach B generate group, query;
-- this uses already existing B data from a temp file
dump C;


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
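The ':=' semantics sketched above (materialize a marked result once, reuse it until an upstream input changes) can be mimicked in a few lines of Python. This is an illustrative model only, with all names invented; it is not a description of Pig's internals:

```python
class Source:
    """A versioned input; bumping the version invalidates dependents."""
    def __init__(self, data):
        self.data = data
        self.version = 0

    def update(self, data):
        self.data = data
        self.version += 1


class KeptNode:
    """Models 'B := ...': compute once, reuse until an upstream version changes."""
    def __init__(self, compute, upstream):
        self.compute = compute      # function(list of input datasets) -> result
        self.upstream = upstream    # list of Source objects
        self._cached = None
        self._seen = None           # upstream versions at last materialization

    def result(self):
        versions = tuple(s.version for s in self.upstream)
        if self._seen != versions:  # first use, or some input changed
            # in Pig this step would store the result into a temp file
            self._cached = self.compute([s.data for s in self.upstream])
            self._seen = versions
        return self._cached
```

A second `dump B;` would then hit the cached result instead of recomputing, and updating the source would trigger exactly one recomputation.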


RE: possible use of Pig for OLAP

Posted by Ted Dunning <td...@veoh.com>.
I would see Pig doing large-scale analytics and filling HBase for fast querying and reporting of the aggregates as an interesting option.

In the future.

The pieces of this vision are definitely not there yet.


-----Original Message-----
From: Chris Olston [mailto:olston@yahoo-inc.com]
Sent: Tue 11/20/2007 9:29 AM
To: pig-dev@incubator.apache.org
Subject: Re: possible use of Pig for OLAP
 
Sounds interesting. Pig is geared toward large-scale aggregation  
operations, in the style of OLAP.

Regarding your 3rd paragraph question, do you mean:

a) there are several interrelated aggregation expressions that you  
want evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can  
do "drill-down" operations in the GUI which require you to look up  
more data in the backend

?

For (a), yes Pig can do that, although currently you have to encode  
it explicitly as a single Pig program (in future versions, we might  
be able to take multiple related Pig programs and execute them in a  
joint fashion). For (b), we don't currently have a mechanism to do  
that without reloading the data, although perhaps the operating  
system's file cache would help with that, under the covers, if the  
file partitions fit in memory and don't get evicted.

-Chris






Re: possible use of Pig for OLAP

Posted by Chris Olston <ol...@yahoo-inc.com>.
Sounds interesting. Pig is geared toward large-scale aggregation  
operations, in the style of OLAP.

Regarding your 3rd paragraph question, do you mean:

a) there are several interrelated aggregation expressions that you  
want evaluated in just one pass over the data, or
b) you do some initial aggregation, display it to the user, who can  
do "drill-down" operations in the GUI which require you to look up  
more data in the backend

?

For (a), yes Pig can do that, although currently you have to encode  
it explicitly as a single Pig program (in future versions, we might  
be able to take multiple related Pig programs and execute them in a  
joint fashion). For (b), we don't currently have a mechanism to do  
that without reloading the data, although perhaps the operating  
system's file cache would help with that, under the covers, if the  
file partitions fit in memory and don't get evicted.

-Chris
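As an illustration of case (a), several related aggregates can be folded into one program and evaluated over a single pass of the grouped data (schema and paths are invented for the sketch):

```pig
-- hypothetical: three aggregates computed together per (region, product)
facts = LOAD 'sales.txt' AS (region, product, amount);
cube  = GROUP facts BY (region, product);
stats = FOREACH cube GENERATE group, SUM(facts.amount),
                              COUNT(facts), MAX(facts.amount);
STORE stats INTO 'agg_out';
```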


On Nov 20, 2007, at 1:47 AM, Alexandru Toth wrote:

> Hi,
>
> I am developing an Open Source OLAP application called "Cubulus". The
> code is at http://sourceforge.net/projects/cubulus/ , a brief
> presentation material at http://cubulus.sourceforge.net/ , and an
> online demo at: http://alxtoth.webfactional.com
>
> It would be interesting to use Pig instead of relational databases
> as a backend.
>
> The question is: can Pig scripts work in such a manner that the file is
> loaded only once, and subsequent web requests process the same file
> over and over? This becomes relevant if the data file is large and
> there is one data file (or a few) to process. In fact, is repeated
> loading a problem at all :-) ?
>
> -Alex

--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research